CN114678081A - Compound-protein interaction prediction method fusing network topology information - Google Patents

Publication number
CN114678081A
Authority
CN
China
Prior art keywords
compound
protein
node
interaction
network
Prior art date
Legal status
Pending
Application number
CN202210491027.XA
Other languages
Chinese (zh)
Inventor
刘宏生
于笑雪
张力
徐鑫
Current Assignee
Liaoning University
Original Assignee
Liaoning University
Priority date
2022-05-07
Filing date
2022-05-07
Publication date
2022-06-28
Application filed by Liaoning University
Priority to CN202210491027.XA
Publication of CN114678081A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30 Prediction of properties of chemical compounds, compositions or mixtures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Genetics & Genomics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention relates to a compound-protein interaction prediction method fusing network topology information, comprising the following steps. Step 1: preprocess the data. Step 2: construct an interaction network and calculate the centrality measure of each node in the network. Step 3: for each compound-protein pair in the data set, calculate a correlation measure of the compound to the protein using a method based on the number of common neighbors. Step 4: construct a Transformer-based model and add the centrality of each node to the node features. Step 5: use the correlation of each pair of nodes as a bias term in the cross-attention module. Step 6: output the prediction probability using a fully connected layer. The invention takes the topological information of the interaction network into account, fuses the properties of the protein and the compound with that information, and uses it to improve the accuracy of compound-protein interaction prediction.

Description

Compound-protein interaction prediction method fusing network topology information
Technical Field
The invention belongs to the field of bioinformatics, and particularly relates to a compound-protein interaction prediction method fusing network topology information.
Background
Proteins underlie the life activities of organisms and play broad and important roles in them. Drugs are generally compounds with specific properties that alter the function of a protein by binding to that protein in the organism, thereby producing a therapeutic effect. Research on compound-protein interactions is an important part of drug design and is significant for drug development. To improve the efficiency of drug development, many deep-learning-based prediction models have been developed, but existing models do not explicitly fuse network topology information into the model.
Disclosure of Invention
The invention aims to provide a compound-protein interaction prediction method fused with network topology information, which can effectively improve the accuracy of compound-protein interaction prediction.
In order to achieve the purpose, the invention adopts the following technical scheme:
A compound-protein interaction prediction method fusing network topology information, comprising the following steps:
Step 1: preprocess the data;
Step 2: construct a compound-protein interaction network from the data set, and calculate the degree of each node in the interaction network as the centrality measure of the node;
Step 3: for each compound-protein pair in the data set, calculate the number of common neighbors between the protein and each neighboring node of the compound in the interaction network, as a correlation measure of the compound to the protein; obtain the correlation measure of the protein to the compound in the same way;
Step 4: construct a Transformer-based binary classification model, assign each node a real-valued embedding vector according to its centrality measure, and add that vector to the node features;
Step 5: assign a learnable scalar to each possible value of the correlation of each pair of nodes, and use it as a bias term of the cross-attention module in the model of step 4;
Step 6: finally, output the prediction probability using a fully connected layer.
Further, step 1 specifically comprises:
Step 1.1: preprocess the compound-protein interaction data, the protein sequence information and the compound SMILES data, remove outliers, randomly generate negative examples, and randomly split the data set;
Step 1.2: encode the protein sequences using the SeqVec model;
Step 1.3: use RDKit to extract the compound features and the adjacency matrix of the compound graph.
Further, step 2 specifically comprises:
Step 2.1: take each compound and each protein in the raw data set as a node and each positive interaction between a compound and a protein as an edge, thereby building a compound-protein interaction network;
Step 2.2: calculate the number of neighbor nodes of each node in the network as the degree centrality of the node.
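Steps 2.1 and 2.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names `build_network` and `degree_centrality`, the node labels, and the toy edge list are all hypothetical.

```python
from collections import defaultdict

def build_network(positive_pairs):
    """Build an undirected bipartite network from positive (compound, protein) pairs.

    Returns a dict mapping each node to the set of its neighbor nodes.
    """
    neighbors = defaultdict(set)
    for compound, protein in positive_pairs:
        neighbors[compound].add(protein)
        neighbors[protein].add(compound)
    return dict(neighbors)

def degree_centrality(neighbors):
    """Degree of a node = number of its neighbor nodes (step 2.2)."""
    return {node: len(adjacent) for node, adjacent in neighbors.items()}

# Hypothetical toy data: compounds c1..c3 and proteins p1..p2.
toy_pairs = [("c1", "p1"), ("c1", "p2"), ("c2", "p1"), ("c3", "p2")]
network = build_network(toy_pairs)
degrees = degree_centrality(network)
print(degrees)
```

Storing the network as a dictionary of neighbor sets keeps the later common-neighbor counts to a single set intersection.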
Further, step 3 specifically comprises:
Step 3.1: calculate and store the number of common neighbors between every two proteins and between every two compounds in the interaction network;
Step 3.2: for each compound-protein pair in the data set, look up in the results stored in step 3.1 the number of common neighbors between the protein and each neighboring node of the compound in the interaction network, and record the maximum value as the correlation measure of the compound to the protein;
Step 3.3: for each compound-protein pair in the data set, look up in the results stored in step 3.1 the number of common neighbors between the compound and each neighboring node of the protein in the interaction network, and record the maximum value as the correlation measure of the protein to the compound.
Further, step 4 specifically comprises:
Step 4.1: construct a conventional Transformer model, remove the positional encoding of the decoder, and change the mask from a lower triangular matrix to an adjacency matrix so that the decoder can only see adjacent nodes;
Step 4.2: assign each node a real-valued embedding vector according to its centrality measure and add it to the node features, as follows:
$F = X + Z_{\mathrm{deg}}$ (1)
where F is the resulting new feature vector, X is the initial feature vector of an amino acid or atom, and Z is a learnable embedding vector determined by the degree of the protein or compound node.
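Formula (1) can be illustrated with a short sketch. It assumes that one embedding vector is kept per possible degree value and that the vector for a node's degree is added to every amino acid or atom feature vector of that node; the text does not spell out this broadcasting, and the names `make_degree_embeddings` and `add_centrality` are hypothetical. In a trained model the table Z would be learned; here it is only randomly initialised.

```python
import random

def make_degree_embeddings(max_degree, dim, seed=0):
    """One embedding vector per possible degree value 0..max_degree.

    These stand in for learnable parameters; they are random lists of floats.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(max_degree + 1)]

def add_centrality(X, degree, Z):
    """Formula (1): F = X + Z_deg, applied to every row of X.

    X holds one feature vector per amino acid (protein) or atom (compound);
    degree is the degree of that protein or compound node in the network.
    """
    z = Z[degree]
    return [[x + zi for x, zi in zip(row, z)] for row in X]

Z = make_degree_embeddings(max_degree=10, dim=4)
X = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two atoms, four features
F = add_centrality(X, degree=3, Z=Z)
print(F)
```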
Further, step 5 specifically comprises:
Figure BDA0003631825970000021
where
Figure BDA0003631825970000022
is the conventional attention weight calculation, the function φ is defined by the correlation between nodes, and
Figure BDA0003631825970000023
is a learnable scalar, indexed by the output value of the function φ and shared among all layers.
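The formula images are not reproduced in this text. One form consistent with the definitions above, offered as an assumption rather than as the exact notation of the invention, adds the learnable scalar to a conventional scaled dot-product attention score. The symbols $A_{ij}$, $q_i$, $k_j$, $d$ and $b$ are illustrative names only:

```latex
A_{ij} = \frac{q_i k_j^{\top}}{\sqrt{d}} + b_{\phi(v_i, v_j)}
```

Under this reading, $q_i$ and $k_j$ would be the query and key vectors of the cross-attention module, $d$ their dimension, and $\phi(v_i, v_j)$ the correlation between the two nodes, so that $b_{\phi(v_i, v_j)}$ is the scalar looked up for that correlation value.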
Compared with the prior art, the invention has the following beneficial effects:
1. The invention takes network topology information into account and fuses node centrality encoding and correlation encoding into the model, so that the model contains more useful information;
2. A Transformer-based binary classification model is constructed, and its cross-attention mechanism is used to process the relationship between the protein features and the compound features, so as to fuse multi-modal information and improve the accuracy of interaction prediction.
Drawings
FIG. 1 is a schematic diagram of the node correlation calculation method;
FIG. 2 is a schematic diagram of the system;
FIG. 3a is a line graph comparing the performance of an embodiment of the invention with other methods on the human data set;
FIG. 3b is a line graph comparing the performance of an embodiment of the invention with other methods on the C. elegans data set.
Detailed Description
The invention is further described with reference to the following figures and examples.
As shown in FIG. 1, the invention provides a compound-protein interaction prediction method fusing network topology information. First, the centrality of each node and the correlation of each compound-protein pair are calculated from the network topology information of the compounds and proteins; a model is then built to predict potential interactions between compounds and proteins; finally, the performance of the model is evaluated with the corresponding metrics. To present the correlation calculation more intuitively, the invention provides a schematic diagram, shown in FIG. 1. A compound-protein interaction network is constructed from the data set and then projected; for a pair consisting of node 1 and node 2, the number of common neighbors between node 2 and each neighboring node of node 1 in the interaction network is calculated to obtain the correlation measure of node 1 to node 2. The upper half of the diagram shows the original interaction network, which is essentially a bipartite graph. The lower half shows the projections of the original network onto the set of compounds and onto the set of proteins: two protein nodes are connected if they are linked to the same compound, and the weight of the edge is the number of compounds they share. Compound nodes are treated in the same way. The correlation is calculated from these projections. In FIG. 1, to obtain the correlation of x1 to y1, first find, among the neighbors of x1 other than y1, the node that has the most common neighbors with y1, and take that number. Here x1 has only two neighbors, so only y2 qualifies; y2 and y1 have 2 common neighbors, x1 and x2, so the correlation of x1 to y1 is 2.
Then, among the other neighbors of y1, find the node that has the most common neighbors with x1, which is x2 in FIG. 1; the number of common neighbors of x2 and x1, namely 2, is taken as the correlation of y1 to x1. Finally, a learnable scalar is assigned to each possible value obtained and used as a bias term in the cross-attention module; that is, the cross-attention module encodes the correlation, which gives the model more useful information. The invention aims to predict potential interactions using a compound-protein interaction prediction model that fuses centrality encoding and correlation encoding.
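The worked example of FIG. 1 can be reproduced with a short sketch. The toy network below contains only the four edges implied by the example (x1 and x2 each linked to y1 and y2); the full figure may contain more nodes. The function names and the choice of returning 0 when a node has no other neighbor are assumptions made for illustration.

```python
def common_neighbors(network, a, b):
    """Number of nodes adjacent to both a and b."""
    return len(network[a] & network[b])

def correlation(network, source, target):
    """Correlation of `source` to `target` for one compound-protein pair.

    Among the neighbors of `source` other than `target`, take the one that
    shares the most neighbors with `target` and return that count.
    Returns 0 if `source` has no other neighbor (an assumed convention).
    """
    others = network[source] - {target}
    return max((common_neighbors(network, n, target) for n in others), default=0)

# Assumed toy network: only the edges implied by the worked example.
network = {
    "x1": {"y1", "y2"},
    "x2": {"y1", "y2"},
    "y1": {"x1", "x2"},
    "y2": {"x1", "x2"},
}
x1_to_y1 = correlation(network, "x1", "y1")
y1_to_x1 = correlation(network, "y1", "x1")
print(x1_to_y1, y1_to_x1)  # 2 2
```

For x1 to y1 the only other neighbor of x1 is y2, which shares x1 and x2 with y1; for y1 to x1 the only other neighbor of y1 is x2, which shares y1 and y2 with x1. Both values are 2, matching the description.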
As shown in fig. 2, the flow of the embodiment of the present invention is as follows:
Step 1: First, compound-protein interaction pairs are obtained from the human and Caenorhabditis elegans data sets. The human data set contains 3369 positive interactions between 1052 compounds and 852 proteins; the Caenorhabditis elegans data set contains 4000 positive interactions between 1434 compounds and 2504 proteins. After outliers are removed, negative samples equal in number to the positive samples are randomly generated, and the data are randomly split into a training set, a validation set and a test set. Amino acid embeddings are obtained with the SeqVec model, an ELMo-based model pre-trained on a large protein database. Atom embeddings are obtained from the molecular SMILES descriptors using RDKit, an open-source cheminformatics toolkit.
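The negative sampling and splitting in step 1 might look like the sketch below. The text does not state how negatives are drawn or which split ratio is used, so drawing uniformly from unobserved compound-protein pairs and the 8:1:1 ratio are assumptions, as are the function names and the toy data.

```python
import random

def sample_negatives(positives, seed=0):
    """Draw as many unobserved (compound, protein) pairs as there are positives."""
    rng = random.Random(seed)
    positive_set = set(positives)
    compounds = sorted({c for c, _ in positive_set})
    proteins = sorted({p for _, p in positive_set})
    if len(compounds) * len(proteins) < 2 * len(positive_set):
        raise ValueError("not enough unobserved pairs to balance the positives")
    negatives = set()
    while len(negatives) < len(positive_set):
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)

def split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and cut into training, validation and test sets (assumed ratio)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(ratios[0] * len(shuffled))
    n_val = int(ratios[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical toy data.
positives = [("c1", "p1"), ("c2", "p2"), ("c3", "p3")]
negatives = sample_negatives(positives)
samples = [(c, p, 1) for c, p in positives] + [(c, p, 0) for c, p in negatives]
train_set, val_set, test_set = split(samples)
print(len(train_set), len(val_set), len(test_set))
```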
Step 2: For each of the two data sets, each compound and each protein is taken as a node and each positive interaction between a compound and a protein as an edge to construct a compound-protein interaction network, and the number of neighbor nodes of each node in the network is calculated as the degree centrality of the node.
Step 3: For each compound-protein pair in the data set, the number of common neighbors between the protein and each neighboring node of the compound in the interaction network is calculated and used as the correlation measure of the compound to the protein. This measure represents the number of compounds shared between the target protein and another protein that can interact with the compound. Intuitively, two protein nodes that already share many compounds have a stronger tendency to share more compounds in the future. The correlation measure of the protein to the compound is obtained in the same way.
Step 4: A Transformer-based binary classification model is constructed; the positional encoding of the decoder is removed and the mask is changed from a lower triangular matrix to an adjacency matrix so that the decoder can only see adjacent nodes. Each node is assigned a real-valued embedding vector according to its centrality measure; this vector is added to the node's original feature matrix, and the sum is input to the model as the new feature matrix, according to formula (1).
$F = X + Z_{\mathrm{deg}}$ (1)
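The change of mask described in step 4 can be sketched as follows. The sketch assumes that the decoder input is the sequence of atoms of the compound and that each atom may also attend to itself; neither detail is stated explicitly, and the function names and the three-atom chain are hypothetical.

```python
def causal_mask(n):
    """Standard decoder mask: position i may attend to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def adjacency_mask(n, bonds, self_loops=True):
    """Graph mask: atom i may attend to atom j only if the two are bonded.

    self_loops=True also lets each atom attend to itself (an assumption).
    """
    mask = [[self_loops and i == j for j in range(n)] for i in range(n)]
    for i, j in bonds:
        mask[i][j] = mask[j][i] = True
    return mask

def apply_mask(scores, mask):
    """Set disallowed attention scores to -inf so softmax gives them zero weight."""
    return [[s if allowed else float("-inf") for s, allowed in zip(row, mask_row)]
            for row, mask_row in zip(scores, mask)]

# Hypothetical three-atom chain 0-1-2.
mask = adjacency_mask(3, bonds=[(0, 1), (1, 2)])
scores = [[0.1, 0.2, 0.3] for _ in range(3)]
masked = apply_mask(scores, mask)
print(masked)
```

With the lower triangular mask, atom 0 would see only itself and atom 2 would see everything before it; with the adjacency mask, atoms 0 and 2 no longer see each other because they are not bonded.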
Step 5: A learnable scalar is assigned to each possible value of the correlation of each pair of nodes and is used as a bias term of the cross-attention module in the model of step 4, as in formula (2).
Figure BDA0003631825970000041
where
Figure BDA0003631825970000042
is the conventional attention weight calculation, the function φ is defined by the correlation between nodes, and
Figure BDA0003631825970000043
is a learnable scalar, indexed by the output value of the function φ and shared among all layers.
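The lookup of the learnable scalar can be illustrated with a small sketch. It shows only the indexing by the correlation value and the addition to the attention scores; how the bias enters each individual query-key score is fixed by formula (2), which is not reproduced in this text, so adding the same scalar to every score of the pair is an assumption. The table values and names are hypothetical, and in training each table entry would be a learned parameter reused by every layer.

```python
def biased_scores(scores, phi, bias_table):
    """Add the scalar indexed by the correlation value phi to raw attention scores."""
    b = bias_table[phi]
    return [[s + b for s in row] for row in scores]

# Hypothetical table: one scalar for each possible correlation value 0, 1, 2.
bias_table = [0.0, 0.1, 0.25]
raw = [[1.0, 2.0]]  # raw cross-attention scores for one query over two keys
out = biased_scores(raw, phi=2, bias_table=bias_table)
print(out)  # [[1.25, 2.25]]
```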
Step 6: Finally, the prediction probability is output using a fully connected layer.
Verification of the effectiveness of the invention:
In comparative experiments, the performance of the invention was evaluated on five metrics. The comparison with other methods is shown in FIG. 3a and FIG. 3b. The best values achieved by the invention on the test set were precision 0.997, recall 1, accuracy 0.999, F1 score 0.998 and AUC 1. The results show that the method outperforms the other methods.
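For reference, four of the five metrics can be computed from binary labels as sketched below. The labels are hypothetical and unrelated to the reported results; AUC is left out because it requires predicted probabilities rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, accuracy and F1 score for binary labels (1 = interaction)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical labels: 2 true positives, 1 false negative, 2 true negatives, 1 false positive.
metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(metrics)
```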

Claims (6)

1. A compound-protein interaction prediction method fusing network topology information, characterized by comprising the following steps:
Step 1: preprocessing the data;
Step 2: constructing a compound-protein interaction network from the data set, and calculating the degree of each node in the interaction network as the centrality measure of the node;
Step 3: for each compound-protein pair in the data set, calculating the number of common neighbors between the protein and each neighboring node of the compound in the interaction network, as a correlation measure of the compound to the protein; obtaining the correlation measure of the protein to the compound in the same way;
Step 4: constructing a Transformer-based binary classification model, assigning each node a real-valued embedding vector according to its centrality measure, and adding that vector to the node features;
Step 5: assigning a learnable scalar to each possible value of the correlation of each pair of nodes, and using it as a bias term of the cross-attention module in the model of step 4;
Step 6: finally, outputting the prediction probability using a fully connected layer.
2. The compound-protein interaction prediction method fusing network topology information according to claim 1, wherein step 1 specifically comprises:
Step 1.1: preprocessing the compound-protein interaction data, the protein sequence information and the compound SMILES data, removing outliers, randomly generating negative examples, and randomly splitting the data set;
Step 1.2: encoding the protein sequences using the SeqVec model;
Step 1.3: using RDKit to extract the compound features and the adjacency matrix of the compound graph.
3. The compound-protein interaction prediction method fusing network topology information according to claim 1, wherein step 2 specifically comprises:
Step 2.1: constructing a compound-protein interaction network by taking each compound and each protein in the raw data set as a node and each positive interaction between a compound and a protein as an edge;
Step 2.2: calculating the number of neighbor nodes of each node in the network as the degree centrality of the node.
4. The compound-protein interaction prediction method fusing network topology information according to claim 1, wherein step 3 specifically comprises:
Step 3.1: calculating and storing the number of common neighbors between every two proteins and between every two compounds in the interaction network;
Step 3.2: for each compound-protein pair in the data set, looking up in the results stored in step 3.1 the number of common neighbors between the protein and each neighboring node of the compound in the interaction network, and recording the maximum value as the correlation measure of the compound to the protein;
Step 3.3: for each compound-protein pair in the data set, looking up in the results stored in step 3.1 the number of common neighbors between the compound and each neighboring node of the protein in the interaction network, and recording the maximum value as the correlation measure of the protein to the compound.
5. The compound-protein interaction prediction method fusing network topology information according to claim 1, wherein step 4 specifically comprises:
Step 4.1: constructing a conventional Transformer model, removing the positional encoding of the decoder, and changing the mask from a lower triangular matrix to an adjacency matrix so that the decoder can only see adjacent nodes;
Step 4.2: assigning each node a real-valued embedding vector according to its centrality measure and adding it to the node features, as follows:
$F = X + Z_{\mathrm{deg}}$ (1)
where F is the resulting new feature vector; X is the initial feature vector of an amino acid or atom; and Z is a learnable embedding vector determined by the degree of the protein or compound node.
6. The compound-protein interaction prediction method fusing network topology information according to claim 1, wherein step 5 specifically comprises:
Figure FDA0003631825960000021
where
Figure FDA0003631825960000022
is the conventional attention weight calculation, the function φ is defined by the correlation between nodes, and
Figure FDA0003631825960000023
is a learnable scalar, indexed by the output value of the function φ and shared among all layers.
CN202210491027.XA 2022-05-07 2022-05-07 Compound-protein interaction prediction method fusing network topology information Pending CN114678081A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210491027.XA CN114678081A (en) 2022-05-07 2022-05-07 Compound-protein interaction prediction method fusing network topology information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210491027.XA CN114678081A (en) 2022-05-07 2022-05-07 Compound-protein interaction prediction method fusing network topology information

Publications (1)

Publication Number Publication Date
CN114678081A (en) 2022-06-28

Family

ID=82080097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210491027.XA Pending CN114678081A (en) 2022-05-07 2022-05-07 Compound-protein interaction prediction method fusing network topology information

Country Status (1)

Country Link
CN (1) CN114678081A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116486900A (en) * 2023-04-25 2023-07-25 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion
CN116486900B (en) * 2023-04-25 2024-05-03 徐州医科大学 Drug target affinity prediction method based on depth mode data fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination