CN114765063A - Protein and nucleic acid binding site prediction method based on graph neural network characterization

Info

Publication number
CN114765063A
Authority
CN
China
Prior art keywords
protein
graph
node
residue
residues
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110037110.5A
Other languages
Chinese (zh)
Inventor
夏莹
沈红斌
潘小勇
夏春秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202110037110.5A
Publication of CN114765063A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for predicting protein and nucleic acid binding sites based on graph neural network representation comprises: constructing a protein and nucleic acid interaction data set; after sample fusion processing, extracting the position and feature information and the structural context of each residue in a protein; constructing a graph representation of the structural context of each residue; and feeding the graph representation of the protein to be predicted to a graph neural network to obtain the probability that each residue binds DNA/RNA, thereby predicting the protein and nucleic acid binding sites. By combining a structure-context-based graph representation of residues with a hierarchical graph neural network model, the invention learns the key structural and feature patterns of binding sites from the graph representation.

Description

Protein and nucleic acid binding site prediction method based on graph neural network characterization
Technical Field
The invention relates to a technology in the field of biological engineering, in particular to a protein and nucleic acid binding site prediction method based on graph neural network representation of protein local structure context.
Background
Protein-nucleic acid interactions play an important role in many life activities, such as DNA replication, transcription, translation, gene expression, and signal transduction and recognition, so studying them is of great importance for the analysis of gene and protein function and for drug design. Because experimental analysis of protein-nucleic acid interactions is expensive and time-consuming, it cannot meet the current demand for large-scale protein analysis, and computational methods are becoming increasingly important. Current computational methods can be divided into methods based on protein sequence and methods based on protein structure. Sequence-based methods learn the local pattern of binding sites from protein sequences, but because binding sites tend to be conserved in spatial structure, sequence-based methods may not capture enough of the characteristics of binding sites, which limits prediction accuracy. Structure-based methods attempt to learn the local tertiary structure pattern of binding sites from the structural information of proteins, and since tertiary structure directly determines protein function, structure-based binding site prediction algorithms tend to be more accurate. The main challenge of structure-based approaches is how to encode structural information and learn from it the key structural and physicochemical properties of the binding site. Some current methods use manually designed protein structure descriptors, which are predefined and therefore cannot extract the information critical to the downstream task in a targeted way (Li, S., Kazuo, Y., Mar, A.K. and Standley, D.M. (2014) Quantifying sequence and structural features of protein-RNA interactions. Nucleic Acids Research, 42, 10086-10098.).
Other methods map protein position information onto a three-dimensional grid and use a three-dimensional convolutional neural network to learn the structural pattern of binding sites. Their problems are that the atoms of a protein are sparsely distributed in three-dimensional space, which makes them ill-suited to mapping onto a Euclidean grid, and that rotational and translational invariance of the protein is difficult to guarantee (Jimenez, J., Doerr, S., Martinez-Rosell, G., Rose, A.S. and De Fabritiis, G. (2017) DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33, 3036-3042.). How to better represent the structural information and physicochemical properties of proteins and accurately predict nucleic acid binding sites remains a challenge.
Disclosure of Invention
To address the low accuracy of existing methods for identifying protein-nucleic acid binding sites, the invention provides a method for predicting protein and nucleic acid binding sites based on graph neural network representation, in which the key structural and feature patterns of binding sites are learned from a structure-context-based graph representation of residues by a hierarchical graph neural network model.
The invention is realized by the following technical scheme:
The invention relates to a method for predicting protein and nucleic acid binding sites based on graph neural network representation, which comprises: constructing a protein and nucleic acid interaction data set; after sample fusion processing, extracting the position and feature information and the structural context of each residue in a protein; constructing a graph representation of the structural context of each residue from the position and feature information and the structural context; and feeding the graph representation of the protein to be predicted to a hierarchical graph neural network to obtain the probability that each residue binds DNA/RNA, thereby predicting the protein and nucleic acid binding sites.
The protein data set, namely the data set of proteins that interact with DNA/RNA, is constructed as follows: a set of protein-nucleic acid complexes and the labels of the protein-nucleic acid binding sites are extracted from BioLip, and a set of DNA-binding proteins and a set of RNA-binding proteins are extracted according to the type of bases in each complex.
Preferably, the proteins are split by release time: sequences released before January 6, 2016 are used as the training set and the rest as the test set.
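The release-time split can be illustrated with a minimal Python sketch. It assumes the cutoff is 6 January 2016 and that each entry carries a release date; the function name, the identifiers `P1` to `P3` and the tuple layout are hypothetical and are not taken from the patent or from BioLip.

```python
from datetime import date

# Assumed cutoff: entries released before 6 January 2016 go to the training set.
CUTOFF = date(2016, 1, 6)

def split_by_release_date(entries, cutoff=CUTOFF):
    """Split (protein_id, release_date) pairs into training and test ID lists."""
    train = [pid for pid, released in entries if released < cutoff]
    test = [pid for pid, released in entries if released >= cutoff]
    return train, test

# Hypothetical entries used only to exercise the function.
entries = [
    ("P1", date(2015, 12, 30)),
    ("P2", date(2016, 1, 6)),
    ("P3", date(2017, 3, 1)),
]
train_ids, test_ids = split_by_release_date(entries)
```

Splitting by time rather than at random mimics prediction on proteins that were unknown when the model was built.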
The sample fusion processing means: to address the imbalance between positive and negative samples in the data set, data enhancement is performed on the training set by fusing the binding-site labels of protein clusters with high sequence and structural similarity, which raises the proportion of positive samples in the training set. It specifically comprises the following steps:
Step 1, clustering the proteins in the training set: the sequence similarity and structural similarity of each pair of proteins are calculated with bl2seq and TM-align; proteins with sequence similarity greater than 0.8 and TM-score greater than 0.5 are grouped into a cluster; the labels of the true binding sites of the proteins in the same cluster are transferred to the protein with the most residues in the cluster; and proteins with sequence similarity above 30% are removed from the training set;
Step 2, using CD-HIT to remove from the test set the proteins whose sequence similarity to the training set exceeds 30%, and to ensure that sequence similarity within the test set itself is below 30%.
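The clustering and label transfer of the first step can be sketched as follows. This is a simplified illustration under stated assumptions: the pairwise similarities are taken as precomputed (in the patent they come from bl2seq and TM-align), proteins are linked by single linkage with a union-find structure (the patent does not name the linkage rule), and `residue_maps`, a per-pair mapping from residue indices of one protein to aligned indices of the cluster representative, is a hypothetical input.

```python
def cluster_proteins(ids, pair_scores, seq_cut=0.8, tm_cut=0.5):
    """Link two proteins when both similarities exceed the cutoffs (single linkage)."""
    parent = {p: p for p in ids}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for (a, b), (seq_sim, tm_score) in pair_scores.items():
        if seq_sim > seq_cut and tm_score > tm_cut:
            parent[find(a)] = find(b)

    clusters = {}
    for p in ids:
        clusters.setdefault(find(p), []).append(p)
    return list(clusters.values())


def fuse_labels(cluster, labels, residue_maps):
    """Transfer binding labels of all cluster members onto the longest member."""
    rep = max(cluster, key=lambda p: len(labels[p]))
    fused = list(labels[rep])
    for p in cluster:
        if p == rep:
            continue
        for i, is_binding in enumerate(labels[p]):
            j = residue_maps[(p, rep)].get(i)  # aligned position in the representative
            if is_binding and j is not None:
                fused[j] = 1
    return rep, fused


# Hypothetical toy data: (sequence similarity, TM-score) per protein pair.
ids = ["A", "B", "C"]
pair_scores = {("A", "B"): (0.9, 0.7), ("A", "C"): (0.4, 0.3), ("B", "C"): (0.85, 0.45)}
clusters = cluster_proteins(ids, pair_scores)

labels = {"A": [0, 1, 0, 0], "B": [0, 0, 1], "C": [0, 0]}
residue_maps = {("B", "A"): {0: 1, 1: 2, 2: 3}}
rep, fused = fuse_labels(["A", "B"], labels, residue_maps)
```

In this toy case protein B contributes one extra positive label to the representative A, which is the effect the fusion step relies on to raise the share of positive samples.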
The position and feature information of the residues includes atomic features, secondary structure information, bond and torsion angle information, solvent accessibility, and evolutionary conservation information, which are extracted as follows:
Step 1, for each residue, the position is defined as the coordinates of the centroid of the residue;
Step 2, the features of the atoms belonging to each residue are extracted, and each atomic feature is averaged over all atoms of the residue to obtain the atomic feature matrix, of size L × 7, where L is the number of residues of the protein;
The atomic features include: atomic mass, B-factor, whether the atom is a side-chain atom, charge, number of bonded hydrogen atoms, and whether the atom is in a ring.
Step 3, for a protein with L residues, DSSP is used to calculate from the protein structure a secondary structure feature matrix of size L × 14;
The secondary structure feature matrix includes: the solvent-accessible surface area of the residue, bond and torsion angles, and the one-hot encoded eight-state secondary structure.
Step 4, for a protein with L residues, PSI-BLAST is used to search the NCBI NR database and calculate an evolutionary conservation scoring matrix (PSSM) of size L × 20;
The PSSM represents the probability of an amino acid mutating into each of the 20 amino acids; preferably, each element x of the PSSM is normalized with a sigmoid function:
$$x' = \frac{1}{1 + e^{-x}}$$
Step 5, for a protein with L residues, HHblits is used to search the uniclust30 database and calculate another evolutionary conservation matrix (HMM) of size L × 30;
In the HMM matrix, columns 1-20 indicate the probability of an amino acid mutating into each of the 20 amino acids, columns 21-27 indicate transition frequencies, and the last three columns indicate local diversity; preferably, each element x of the matrix is normalized by:
Figure BDA0002893623830000032
Step 6, for a protein with L residues, the atomic feature matrix, the secondary structure feature matrix, the PSSM and the HMM matrix are concatenated to obtain an L × 71 feature matrix (7 + 14 + 20 + 30 = 71).
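The sigmoid normalization and the final concatenation can be sketched in plain Python. The feature values below are placeholders, not real DSSP, PSI-BLAST or HHblits output, and the helper names are hypothetical; only the block widths (7, 14, 20, 30) and the sigmoid normalization of the PSSM come from the text.

```python
import math

def sigmoid(x):
    """Map a raw score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def concat_features(atom, dssp, pssm, hmm):
    """Concatenate per-residue feature rows from the four blocks."""
    n = len(atom)
    if not (len(dssp) == len(pssm) == len(hmm) == n):
        raise ValueError("all feature blocks must describe the same L residues")
    return [atom[i] + dssp[i] + pssm[i] + hmm[i] for i in range(n)]

L = 5  # number of residues in a toy protein
atom = [[0.5] * 7 for _ in range(L)]               # L x 7 atomic features (placeholder)
dssp = [[0.1] * 14 for _ in range(L)]              # L x 14 DSSP features (placeholder)
raw_pssm = [[-3, 0, 4] + [1] * 17 for _ in range(L)]
pssm = [[sigmoid(x) for x in row] for row in raw_pssm]  # L x 20, sigmoid-normalized
hmm = [[0.2] * 30 for _ in range(L)]               # L x 30 HMM features (placeholder)

features = concat_features(atom, dssp, pssm, hmm)  # L x 71
```

Each row of `features` then describes one residue, with the columns ordered atom, DSSP, PSSM, HMM.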
The structural context is obtained for each residue of the protein with a structure-based sliding sphere, according to the position information of the residues in the protein. Specifically, based on the position coordinates of the residues in the tertiary structure, a sphere sliding along the polypeptide chain extracts the structural context of each residue: for a target residue, a sphere of radius $r_g$ is drawn centered on that residue, and all residues within the sphere, together with their positional relationships, form the structural context of the target residue.
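A minimal sketch of the sliding-sphere extraction, assuming residue positions are given as centroid coordinates and that the sphere boundary is inclusive (the text does not say whether a residue exactly at distance $r_g$ is included). The function name and the toy coordinates are hypothetical.

```python
import math

def structural_contexts(centroids, r_g):
    """For each residue, list the indices of residues whose centroid lies within r_g."""
    contexts = []
    for center in centroids:
        members = [j for j, p in enumerate(centroids) if math.dist(center, p) <= r_g]
        contexts.append(members)
    return contexts

# Toy centroids: three residues close together and one far away.
centroids = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (30.0, 0.0, 0.0)]
contexts = structural_contexts(centroids, r_g=5.0)
```

The brute-force search is quadratic in the number of residues; for large proteins a spatial index such as a k-d tree would be the usual replacement.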
The graph representation defines the residues in the structural context as nodes of the graph, constructs an adjacency matrix from the relative positions of the nodes, and defines the node and edge features from the relative positions and the features of the residues. It is constructed as follows:
Step i) The residues within the structural context are taken as the nodes of the graph, and the distance from each node to the sphere center is calculated as a 1-dimensional position code of the residue. The position code is concatenated with the extracted feature information of the residue to obtain node features of size 72, and the node feature set is expressed as
$$V = \{v_i \mid i = 1, \dots, N\}$$
where $v_i$ denotes the node feature of node i and N is the number of nodes in the graph.
Step ii) The distance between every pair of nodes is calculated to obtain the distance matrix D of the nodes, and D is binarized with a threshold $r_v$ to obtain the adjacency matrix A of the nodes. Specifically, for node i and node j,
$$A_{ij} = \begin{cases} 1, & D_{ij} \le r_v \\ 0, & \text{otherwise} \end{cases}$$
When $A_{ij} = 1$, there is an edge between node i and node j, namely edge ij;
Step iii) The edge feature set is denoted as $E = \{e_{ij} \mid A_{ij} = 1\}$, where the feature $e_{ij}$ of edge ij includes: 1) the distance between node i and node j; 2) the cosine of the angle $\theta_{ij}$ formed by the vector $\vec{p}_i$ from node i to the sphere center and the vector $\vec{p}_j$ from node j to the sphere center,
$$\cos\theta_{ij} = \frac{\vec{p}_i \cdot \vec{p}_j}{\lVert \vec{p}_i \rVert \, \lVert \vec{p}_j \rVert}$$
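The adjacency matrix and the two edge features can be sketched as below. Several choices are assumptions made for the sketch rather than details taken from the patent: the threshold comparison is inclusive, self-loops are skipped, and the cosine is set to 0 when one endpoint coincides with the sphere center (where the angle is undefined). The function name and the toy coordinates are hypothetical.

```python
import math

def build_graph(coords, center, r_v):
    """Return adjacency matrix A and edge features {(i, j): (distance, cos_theta)}."""
    n = len(coords)
    A = [[0] * n for _ in range(n)]
    edges = {}
    # Vector from each node to the sphere center.
    to_center = [tuple(c - p for c, p in zip(center, pt)) for pt in coords]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops
            d = math.dist(coords[i], coords[j])
            if d <= r_v:
                A[i][j] = 1
                norm_i = math.hypot(*to_center[i])
                norm_j = math.hypot(*to_center[j])
                dot = sum(a * b for a, b in zip(to_center[i], to_center[j]))
                # Assumption: cosine is 0 when a node sits exactly at the center.
                cos_theta = dot / (norm_i * norm_j) if norm_i > 0 and norm_j > 0 else 0.0
                edges[(i, j)] = (d, cos_theta)
    return A, edges

# Toy context: node 0 is the target residue at the sphere center.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (4.0, 0.0, 0.0)]
A, edges = build_graph(coords, center=coords[0], r_v=3.0)
```

Both edge features depend only on distances and angles relative to the sphere center, so they do not change when the whole protein is rotated or translated.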
The hierarchical graph neural network learns, in a supervised manner from the training set samples, the local geometric and feature patterns of binding sites from the graph representation of the residue structural context. It comprises a graph network encoding unit, K stacked graph neuron blocks based on gated recurrent units (GRU), and a binary classifier based on a multilayer perceptron, wherein: the graph network encoding unit performs edge encoding, node encoding and graph encoding in sequence; each of the K stacked GRU-based graph neuron blocks performs edge updating, node updating and graph updating in sequence; and the multilayer-perceptron-based binary classifier predicts the probability that a residue binds DNA/RNA from the updated graph features $\{u^k \mid k = 1, \dots, K\}$.
Edge encoding means: the edge feature $e_{ij}$ is concatenated with the node features $v_i$ and $v_j$ of its two end nodes and passed through a nonlinear transformation layer to compute the encoded edge feature $e_{ij}^{enc}$.
Node encoding means: the node feature $v_i$ is concatenated with the sum of all encoded edge features whose starting point is that node and passed through a nonlinear transformation layer to compute the encoded node feature $v_i^{enc}$.
Graph encoding means: the sum of all encoded node features on the graph is calculated and passed through a nonlinear transformation layer to compute the encoded graph feature $u^{enc}$. Each nonlinear transformation layer is implemented as $\mathrm{MLP}(X) = W_2 \max(0, W_1 X + b_1) + b_2$, where X is the input and $W_1$, $W_2$, $b_1$, $b_2$ are parameters to be learned by the network.
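The three encoding stages can be sketched in plain Python with randomly initialized, untrained parameters. The feature sizes (2 per node and edge, 4 after encoding, 8 hidden units) are toy values chosen for readability, and the helper names `make_mlp`, `mlp`, `vec_sum` and `encode` are hypothetical; only the form $W_2 \max(0, W_1 X + b_1) + b_2$ and the order edge, node, graph come from the text.

```python
import random

def make_mlp(n_in, n_hidden, n_out, rng):
    """Random parameters (W1, b1, W2, b2) for MLP(X) = W2 max(0, W1 X + b1) + b2."""
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    return W1, [0.0] * n_hidden, W2, [0.0] * n_out

def mlp(params, x):
    """Apply the two-layer nonlinear transformation to a feature vector."""
    W1, b1, W2, b2 = params
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def vec_sum(vectors, size):
    """Element-wise sum of vectors; a zero vector when the list is empty."""
    total = [0.0] * size
    for vec in vectors:
        total = [t + x for t, x in zip(total, vec)]
    return total

def encode(nodes, edges, enc_e, enc_v, enc_u, dim):
    """Edge encoding, then node encoding, then graph encoding."""
    e_enc = {(i, j): mlp(enc_e, feat + nodes[i] + nodes[j]) for (i, j), feat in edges.items()}
    v_enc = []
    for i, feat in enumerate(nodes):
        outgoing = [e for (src, _), e in e_enc.items() if src == i]
        v_enc.append(mlp(enc_v, feat + vec_sum(outgoing, dim)))
    u_enc = mlp(enc_u, vec_sum(v_enc, dim))
    return e_enc, v_enc, u_enc

rng = random.Random(0)
nodes = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.5]]
edges = {(0, 1): [2.0, 0.0], (1, 0): [2.0, 0.0], (1, 2): [1.5, 0.8], (2, 1): [1.5, 0.8]}
dim = 4
enc_e = make_mlp(2 + 2 + 2, 8, dim, rng)  # edge feature + two endpoint features
enc_v = make_mlp(2 + dim, 8, dim, rng)    # node feature + summed encoded edges
enc_u = make_mlp(dim, 8, dim, rng)        # summed encoded nodes
e_enc, v_enc, u_enc = encode(nodes, edges, enc_e, enc_v, enc_u, dim)
```

Summation is used for aggregation so that the result does not depend on the order in which edges or nodes are listed.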
Edge updating means: the input edge feature $e_{ij}^{k-1}$, the node features $v_i^{k-1}$ and $v_j^{k-1}$ of its two end nodes, and the graph feature $u^{k-1}$ are first concatenated and passed through a nonlinear transformation layer to compute the intermediate-state edge feature $e_{ij}^{k'}$; a gated recurrent unit then fuses the features of the two states to obtain the updated edge feature $e_{ij}^{k}$.
Node updating means: the input node feature $v_i^{k-1}$, the sum of all updated edge features whose starting point is that node, and the graph feature $u^{k-1}$ are first concatenated and passed through a nonlinear transformation layer to compute the intermediate-state node feature $v_i^{k'}$; a gated recurrent unit then fuses the features of the two states to obtain the updated node feature $v_i^{k}$.
Graph updating means: the sum of the updated features of all nodes on the graph and the graph feature $u^{k-1}$ are first concatenated and passed through a nonlinear transformation layer to compute the intermediate-state graph feature $u^{k'}$; a gated recurrent unit then fuses the graph features of the two states to obtain the updated graph feature $u^{k}$.
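The fusion of an intermediate state with the previous state by a gated recurrent unit can be sketched with a textbook GRU cell. This is an assumption-laden illustration: the patent does not give the gate equations, so the standard formulation is used, the intermediate state is treated as the GRU input and the previous state as the GRU hidden state, bias terms are omitted, and the weights below are arbitrary toy values rather than trained parameters.

```python
import math

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_fuse(params, x, h):
    """Standard GRU cell: x is the intermediate state, h the previous state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = [sigmoid(a + b) for a, b in zip(matvec(Wz, x), matvec(Uz, h))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(Wr, x), matvec(Ur, h))]  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(Wh, x), matvec(Uh, reset_h))]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

# Toy graph-level update with 2-dimensional features.
zeros = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
params = (zeros, zeros, zeros, zeros, eye, zeros)
u_prev = [0.4, -0.8]        # graph feature from block k-1
u_mid = [1.0, -1.0]         # intermediate state produced by the nonlinear layer
u_next = gru_fuse(params, u_mid, u_prev)
```

The update gate lets each block decide, per feature, how much of the previous state to keep, which is one plausible reason for stacking K such blocks instead of overwriting features at every layer.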
The classifier is defined as
Figure BDA00028936238300000418
where
Figure BDA00028936238300000419
The hierarchical graph neural network is trained in a supervised manner with 80% of the DNA/RNA-binding protein data set as the training set; the remaining 20% serves as the validation set for tuning the hyper-parameters of the model and determining the binarization threshold T.
The hyper-parameters include: $r_g$, $r_v$, K, and the dimensions of W and b in the nonlinear transformations.
The tuning specifically comprises: performing a grid search over the hyper-parameters and selecting the combination that gives the highest Matthews correlation coefficient (MCC) on the validation set as the hyper-parameters of the model, and selecting the binarization threshold that gives the highest MCC on the validation set as the binarization threshold T of the model.
To predict the protein and nucleic acid binding sites, the probability of each residue binding DNA/RNA predicted by the hierarchical graph neural network is compared with the threshold T to classify the residue as binding or non-binding.
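Selecting the binarization threshold on the validation set can be sketched as a one-dimensional search that maximizes the Matthews correlation coefficient (MCC). The candidate grid, the toy probabilities and the tie-breaking rule (the first best threshold wins) are assumptions of this sketch; the patent only states that the threshold with the highest validation coefficient is chosen.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels; 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def select_threshold(probs, y_true, candidates):
    """Return the candidate threshold with the highest validation MCC."""
    best_t, best_score = None, -2.0
    for t in candidates:
        preds = [1 if p > t else 0 for p in probs]  # a residue binds if P > T
        score = mcc(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy validation predictions and labels.
probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 0, 1, 1, 1]
candidates = [i / 10 for i in range(1, 10)]
T, score = select_threshold(probs, labels, candidates)
```

The same loop structure extends to the grid search over $r_g$, $r_v$ and K, with a full training run replacing the thresholding step inside the loop.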
The invention also relates to a system implementing the method, comprising a feature extraction module, a structural context construction module, a graph representation module and a hierarchical graph neural network module, wherein: the feature extraction module extracts the position and feature information of a protein from its sequence and structure; the structural context construction module constructs the local context of each residue from the position information of the protein; the feature information extracted by the feature extraction module and the structural context constructed by the structural context construction module are passed to the graph representation module, which constructs the graph representation of the residue structural context; and the hierarchical graph neural network module takes this graph representation as input and predicts the probability that the residue is a protein and nucleic acid binding residue.
Technical effects
The invention addresses the low accuracy of prior-art prediction of binding sites between proteins and DNA/RNA. Compared with the prior art, the local neighborhood of a residue in three-dimensional space is represented as a graph in non-Euclidean space, and the local geometric information and various biochemical features of the protein are encoded in the node and edge features of the graph. This form of representation requires no manually designed descriptors of protein structure; compared with Euclidean-space structural representations, the graph representation is better suited to the sparse and irregular distribution of protein residues and is invariant to rotation and translation; and the designed hierarchical graph neural network can automatically learn, end to end, the important structural and biochemical features of binding sites, thereby improving the accuracy of predicting binding sites between proteins and DNA/RNA. The method markedly improves the prediction accuracy of protein and nucleic acid binding sites, can readily be transferred to predicting binding sites between proteins and other ligands, and generalizes well.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of constructing the graph representation of residues: (A) feature extraction module; (B) structural context construction module; (C) graph representation module;
FIG. 3 is a flow chart of a hierarchical neural network;
FIG. 4 is a flow diagram of a graph neuron block in a hierarchical neural network.
Detailed Description
In this example, data sets of protein-DNA and protein-RNA binding sites are used as the training and test sets. For the DNA-binding data set, the training set contains 573 proteins, which after data enhancement yield 14479 binding residues and 145404 non-binding residues, and the test set contains 129 proteins with 2249 binding residues and 35275 non-binding residues. For the RNA-binding data set, the training set contains 495 proteins, which after data enhancement yield 14609 binding residues and 122290 non-binding residues, and the test set contains 117 proteins with 2031 binding residues and 35314 non-binding residues. The data enhancement procedure calculates the sequence and structural similarity of the proteins, groups proteins with sequence similarity greater than 0.8 and TM-score greater than 0.5 into clusters, and transfers the binding-site labels of the proteins in each cluster to the protein with the most residues in that cluster.
As shown in FIG. 1, this example relates to a method for predicting protein and nucleic acid binding sites based on graph neural network characterization, which comprises the following steps:
Step 1, based on the sequence and structure of the proteins in the database, the three-dimensional position of each residue and the features of its atoms are extracted; the secondary structure information, bond and torsion angle information, and solvent accessibility of the protein are extracted with the DSSP algorithm; the evolutionary conservation information of the protein is calculated with the PSI-BLAST and HHblits algorithms; and all features other than the position information are normalized to [0, 1], giving an L × 3 position matrix and an L × 71 feature matrix for a protein with L residues.
Step 2, a three-dimensional sphere of radius $r_g$ is placed with the target residue as its center; the residues within the sphere and the positional relationships between them together constitute the structural context of the target residue.
Step 3, a graph representation of the local neighborhood of the residue is constructed: the residues within the structural context are taken as the nodes of the graph; the distance from each node to the sphere center is calculated as a 1-dimensional position code of the residue; the position code is concatenated with the feature matrix of the protein to obtain node features of size 72, the node feature of node i being denoted $v_i$; the distance between every pair of nodes is calculated to obtain the distance matrix D of the nodes; D is binarized with the threshold $r_v$ to obtain the adjacency matrix A of the nodes; and the edge feature $e_{ij}$ is calculated, which includes two position-dependent features: 1) the distance between node i and node j; 2) the cosine of the angle formed by the vector $\vec{p}_i$ from node i to the sphere center and the vector $\vec{p}_j$ from node j to the sphere center.
Step 4, 80% of the proteins in the training set are used for training and the remaining 20% as the validation set, ensuring that the sequence similarity between the two is below 30% according to the CD-HIT algorithm. The model parameters are learned by supervised training on the training samples: the training data are fed to the constructed hierarchical graph neural network in mini-batches, the difference between the forward-propagation output and the true labels on the training set is measured with a cross-entropy loss function, and the parameters are optimized iteratively with the Adam optimizer until the Matthews correlation coefficient (MCC) on the validation set has not increased for 10 consecutive iterations. The network hyper-parameters are optimized by grid search: for each group of hyper-parameters, the MCC of the trained network on the validation set is calculated, and the combination with the largest value is selected as the final hyper-parameters of the model. The specific parameters are mini-batch = 64, $r_g$ = 20, $r_v$ = …, a dimension of 128 for the graph encoding unit and the graph neuron blocks, and K = 4. This completes the training of the model of the invention.
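The stopping rule of the training procedure can be sketched as a generic early-stopping loop. It is a simplified illustration: `run_epoch` is a hypothetical callback that would train for one pass and return the validation score, "iterations" in the text are interpreted here as epochs, and the score sequence below is simulated rather than produced by a real model.

```python
def train_with_early_stopping(run_epoch, patience=10, max_epochs=1000):
    """Stop when the validation score has not improved for `patience` epochs."""
    best_score, best_epoch, stale = float("-inf"), -1, 0
    epochs_run = 0
    for epoch in range(max_epochs):
        score = run_epoch(epoch)  # train one epoch, return the validation score
        epochs_run = epoch + 1
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_score, best_epoch, epochs_run

# Simulated validation scores: improvement for three epochs, then a plateau.
simulated = [0.10, 0.20, 0.30] + [0.25] * 20
best_score, best_epoch, epochs_run = train_with_early_stopping(
    lambda epoch: simulated[epoch], patience=10, max_epochs=len(simulated)
)
```

In practice the parameters from `best_epoch` would be saved and restored after stopping, so that the final model corresponds to the best validation score rather than the last epoch.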
After model training is finished, the same features are extracted for the proteins in the test set and the same graph representation is constructed; the graph representation is input to the trained graph neural network to obtain the probability P that a residue is a binding site; and the residue is classified according to the binarization threshold T optimized on the validation set: if P is greater than T, the residue is judged to be a binding site, otherwise it is judged to be a non-binding site.
The evaluation indexes adopted in this embodiment include recall
$$\mathrm{Rec} = \frac{TP}{TP + FN},$$
precision
$$\mathrm{Pre} = \frac{TP}{TP + FP},$$
F1 value
$$F1 = \frac{2 \times \mathrm{Rec} \times \mathrm{Pre}}{\mathrm{Rec} + \mathrm{Pre}},$$
Matthews correlation coefficient
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},$$
and the area under the ROC curve (AUC), where TP, FP, TN and FN are the numbers of true positive, false positive, true negative and false negative results, respectively.
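The count-based indexes can be computed directly from the confusion-matrix entries, as in the following sketch. The counts are invented for illustration and are not results from the patent; returning 0.0 when a denominator is zero is a convention chosen here. AUC is omitted because it needs the ranked probabilities rather than the four counts.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Recall, precision, F1 and MCC from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "mcc": mcc}

# Invented counts for an imbalanced test set (few binding residues).
metrics = binary_metrics(tp=40, fp=10, tn=930, fn=20)
```

MCC uses all four counts, which makes it more informative than accuracy when binding residues are a small minority, as they are in the data sets described above.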
In the experimental phase, this example is compared with other representative protein and nucleic acid binding site recognition methods:
1) Hu, J., Li, Y., Zhang, M., Yang, X., Shen, H.B. and Yu, D.J. (2017) Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs. IEEE/ACM Trans Comput Biol Bioinform, 14, 1389-1398.
2) Yu, D.-J., Hu, J., Yang, J., Shen, H.-B., Tang, J. and Yang, J.-Y. (2013) Designing template-free predictor for targeting protein-ligand binding sites with classifier ensemble and spatial clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10, 994-1008.
3) Li, S., Kazuo, Y., Mar, A.K. and Standley, D.M. (2014) Quantifying sequence and structural features of protein-RNA interactions. Nucleic Acids Research, 42, 10086-10098.
4) Lam, J.H., Li, Y., Zhu, L., Umarov, R., Jiang, H., Héliou, A., Sheong, F.K., Liu, T., Long, Y. and Li, Y. (2019) A deep learning framework to predict binding preference of RNA constituents on protein surface. Nature Communications, 10, 1-13.
5) Su, H., Liu, M., Sun, S., Peng, Z. and Yang, J. (2019) Improving the prediction of protein-nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics, 35, 930-936.
6) Zhu, Y.-H., Hu, J., Song, X.-N. and Yu, D.-J. (2019) DNAPred: accurate identification of DNA-binding sites from protein sequence by ensembled hyperplane-distance-based support vector machines. Journal of Chemical Information and Modeling, 59, 3057-3071.
7) Walia, R.R., Xue, L.C., Wilkins, K., El-Manzalawy, Y., Dobbs, D. and Honavar, V. (2014) RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins. PLoS One, 9, e97725.
8) Liu, R. and Hu, J. (2013) DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins: Structure, Function, and Bioinformatics, 81, 1885-1899.
The results are shown in the following table. The method outperforms the other methods on every index, which shows that it identifies more true binding residues with a lower false positive rate and has a clear advantage in identifying protein and nucleic acid binding sites.
Figure BDA0002893623830000071
Compared with the prior art, the method encodes the structural neighborhood of a residue as a graph representation in non-Euclidean space, which avoids manually designed structure descriptors and is better suited to characterizing the sparse structure of proteins; by training an end-to-end neural network, it learns from the graph representation the local structural and feature patterns closely related to the binding site prediction task. For protein-DNA binding sites, the prediction is improved by 8.2%, 8.8% and 6.9% on F1, MCC and AUC respectively; for protein-RNA binding sites, it is improved by 9.7%, 10.6% and 6.6% on F1, MCC and AUC respectively. Specifically, the prediction accuracy of the method is higher than that of other sequence-based or structure-based methods in the field, including the DNABind and aaRNA methods, which characterize protein structure with manually designed descriptors, and the NucleicNet method, which uses a convolutional neural network. On the one hand, the method retains more complete topological information of the protein and is better suited to characterizing local protein structure; on the other hand, the end-to-end hierarchical graph neural network can learn from the graph representation of local protein structure the local structural and biochemical patterns most relevant to the downstream binding site prediction task.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the principle and spirit of the invention. The scope of the invention is defined by the appended claims and not by the preceding embodiments, and all implementations within that scope are bound by the invention.

Claims (10)

1. A method for predicting protein-nucleic acid binding sites based on graph neural network representation, characterized in that a protein data set is constructed; after sample fusion processing, the position information and the structural context of each residue in a protein are extracted; a graph representation of the structural context of the residue is constructed from the position information and the structural context; and the graph representation of a protein to be predicted is passed through a hierarchical graph neural network to obtain the probability that each residue binds DNA/RNA, thereby predicting the protein-nucleic acid binding sites;
the protein data set, i.e. the data set of proteins that interact with DNA/RNA, is constructed as follows: a set of protein-nucleic acid complexes and the labels of their binding sites are extracted from BioLiP, and a set of DNA-binding proteins and a set of RNA-binding proteins are extracted according to the types of bases in the complexes;
the hierarchical graph neural network learns, in a supervised manner from training-set samples, the local geometric and feature patterns of binding sites from the graph representation of the residue structural context, and specifically comprises: a graph network encoding unit, K stacked graph neuron blocks based on gated recurrent units (GRUs), and a classifier based on a multilayer perceptron, wherein: the graph network encoding unit performs edge encoding, node encoding and graph encoding in sequence; the K stacked GRU-based graph neuron blocks perform edge update, node update and graph update in sequence; and the multilayer-perceptron-based binary classifier predicts the probability that the residue binds DNA/RNA from the updated graph features $\{u^k \mid k = 1, \ldots, K\}$;
in predicting the protein-nucleic acid binding sites, the probability that each residue binds DNA/RNA is predicted by the hierarchical graph neural network and compared with a threshold T to classify the residues as binding or non-binding.
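As an illustration only, the final thresholding step of claim 1 could be sketched in Python as follows. The function name `classify_residues` and the use of `>=` at the boundary are assumptions; the claim states only that the probability is compared with a threshold T.

```python
import numpy as np

def classify_residues(probabilities, threshold):
    """Label each residue as binding (1) or non-binding (0).

    `probabilities` holds one predicted binding probability per residue.
    Treating a probability equal to the threshold as binding is an
    assumption of this sketch, not something the claim specifies.
    """
    probs = np.asarray(probabilities, dtype=float)
    return (probs >= threshold).astype(int)

# Hypothetical per-residue probabilities for a four-residue protein.
labels = classify_residues([0.91, 0.12, 0.50, 0.49], threshold=0.5)
```

In practice T would be the value selected on the validation set rather than a fixed 0.5.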
2. The prediction method according to claim 1, wherein the sample fusion processing is: to address the imbalance between positive and negative samples in the data set, data augmentation is applied to the training set by merging the binding site labels of protein clusters with high sequence and structural similarity, which raises the proportion of positive samples in the training set, and specifically comprises:
step one, clustering the proteins in the training set: computing the sequence similarity and the structural similarity of each pair of proteins using bl2seq and TM-align; grouping proteins with sequence similarity greater than 0.8 and TM-score greater than 0.5 into a cluster; transferring the true binding site labels of the proteins in the same cluster to the protein with the most residues in that cluster; and removing proteins with sequence similarity greater than 30% within the training set;
step two, using CD-HIT to remove proteins whose sequence similarity between the test set and the training set exceeds 30%, and to ensure that the sequence similarity within the test set is below 30%.
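The clustering in step one can be sketched as below, assuming the pairwise sequence similarities and TM-scores have already been computed (for example with bl2seq and TM-align) and stored in dictionaries keyed by protein pairs. The single-linkage grouping via union-find, the function names and the input layout are assumptions of this sketch; the claim does not say how pairs are merged into clusters, and the residue-level label transfer (which needs an alignment between cluster members) is not shown.

```python
def cluster_proteins(names, seq_sim, tm_score, seq_cut=0.8, tm_cut=0.5):
    """Group proteins, linking a pair when both cutoffs are exceeded."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in seq_sim.items():
        if s > seq_cut and tm_score[(a, b)] > tm_cut:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())

def pick_representative(cluster, lengths):
    """Return the cluster member with the most residues."""
    return max(cluster, key=lambda n: lengths[n])

# Hypothetical toy input: three proteins with precomputed pair scores.
names = ["A", "B", "C"]
seq_sim = {("A", "B"): 0.90, ("A", "C"): 0.40, ("B", "C"): 0.85}
tm_score = {("A", "B"): 0.70, ("A", "C"): 0.60, ("B", "C"): 0.30}
clusters = cluster_proteins(names, seq_sim, tm_score)
```

Here only the pair (A, B) passes both cutoffs, so C stays in its own cluster.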
3. The prediction method of claim 1, wherein the position and feature information of the residues comprises: atomic features, secondary structure information, bond and torsion angle information, solvent accessibility and evolutionary conservation information, which are extracted as follows:
step 1, for each residue, defining its position as the coordinates of the centroid of the residue;
step 2, extracting the features of the atoms belonging to each residue, and then, for each atomic feature, averaging over all atoms of the residue to obtain the atomic feature matrix of the residues, of size L × 7, where L is the number of residues of the protein;
step 3, for a protein with L residues, using DSSP to compute a secondary structure feature matrix of size L × 14 from the protein structure;
step 4, for a protein with L residues, using PSI-BLAST to search the NCBI NR database and computing an evolutionary conservation scoring matrix PSSM of size L × 20;
step 5, for a protein with L residues, using HHblits to search the Uniclust30 database and computing another evolutionary conservation matrix HMM of size L × 30;
step 6, for a protein with L residues, concatenating the atomic feature matrix, the secondary structure feature matrix, the evolutionary conservation scoring matrix PSSM and the evolutionary conservation matrix HMM to obtain an L × 71 feature matrix.
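Steps 2 and 6 reduce to a per-residue average followed by a column-wise concatenation. A minimal NumPy sketch follows, assuming the DSSP, PSSM and HMM matrices have already been produced by the external tools; the function names and the `residue_of_atom` index array are hypothetical.

```python
import numpy as np

def average_atom_features(atom_feats, residue_of_atom, n_residues):
    """Average each atomic feature over the atoms of every residue."""
    atom_feats = np.asarray(atom_feats, dtype=float)
    idx = np.asarray(residue_of_atom, dtype=int)
    sums = np.zeros((n_residues, atom_feats.shape[1]))
    counts = np.zeros(n_residues)
    np.add.at(sums, idx, atom_feats)   # accumulate atoms per residue
    np.add.at(counts, idx, 1)
    return sums / np.maximum(counts, 1)[:, None]

def build_residue_features(atom_mat, dssp_mat, pssm, hmm):
    """Concatenate the four blocks into an L x 71 feature matrix."""
    blocks = [np.asarray(b, dtype=float) for b in (atom_mat, dssp_mat, pssm, hmm)]
    n_res = blocks[0].shape[0]
    for block, width in zip(blocks, (7, 14, 20, 30)):
        if block.shape != (n_res, width):
            raise ValueError(f"expected shape {(n_res, width)}, got {block.shape}")
    return np.concatenate(blocks, axis=1)

# Toy example: three atoms belonging to two residues, placeholder matrices.
atom_mat = average_atom_features(
    [[1.0] * 7, [3.0] * 7, [5.0] * 7], residue_of_atom=[0, 0, 1], n_residues=2
)
features = build_residue_features(
    atom_mat, np.zeros((2, 14)), np.zeros((2, 20)), np.zeros((2, 30))
)
```

The widths 7 + 14 + 20 + 30 add up to the 71 columns stated in step 6.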
4. The prediction method of claim 1, wherein the structural context is obtained by applying a structure-based sliding sphere to each residue of the protein according to the position information of the residues, specifically: based on the position coordinates of the residues in the tertiary structure, a sphere sliding along the polypeptide chain is used to extract the structural context of each residue; that is, for a target residue, a sphere of radius $r_g$ is drawn centered on that residue, and all residues inside the sphere, together with their positional relationships, form the structural context of the residue.
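The sliding sphere of claim 4 amounts to a radius query around each residue centroid. The sketch below uses a dense pairwise distance matrix for clarity; the function name and the choice to include residues lying exactly on the sphere surface are assumptions.

```python
import numpy as np

def structural_contexts(centroids, r_g):
    """For every residue, return the indices of residues within radius r_g.

    `centroids` is an (L, 3) array of residue centroid coordinates.
    The target residue itself is always part of its own context.
    """
    coords = np.asarray(centroids, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # (L, L) pairwise distances
    return [np.flatnonzero(row <= r_g) for row in dist]

# Toy chain of three residues along the x axis.
contexts = structural_contexts([[0, 0, 0], [3, 0, 0], [10, 0, 0]], r_g=5.0)
```

For long chains a spatial index such as a k-d tree would avoid the quadratic distance matrix; the dense version is kept here for readability.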
5. The prediction method of claim 1, wherein the graph representation defines the residues in the structural context as the nodes of the graph, constructs the adjacency matrix from the relative position information of the nodes, and defines the node and edge features of the graph from the relative positional relationships and the residue features, as follows:
step i) the residues within the structural context are taken as the nodes of the graph; the distance from each node to the sphere center is computed as a 1-dimensional position code of the residue; the position code is concatenated with the extracted position and feature information of the residue to obtain a node feature of size 72; the node feature set is denoted $V = \{v_i\}$, where $v_i$ is the node feature of node i;
step ii) the distance between every two nodes is computed to obtain the distance matrix D of the nodes, which is binarized with a threshold $r_v$ to obtain the adjacency matrix A of the nodes; specifically, for node i and node j, $A_{ij} = 1$ when the distance $D_{ij}$ is within the threshold $r_v$ and $A_{ij} = 0$ otherwise; $A_{ij} = 1$ indicates that there is an edge between node i and node j, i.e. edge ij;
step iii) the edge feature set is denoted $E = \{e_{ij} \mid A_{ij} = 1\}$, where the feature $e_{ij}$ of edge ij comprises: 1) the distance between node i and node j, and 2) the cosine of the angle $\theta_{ij}$ between the vector from node i to the sphere center and the vector from node j to the sphere center.
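The graph construction of claim 5 can be illustrated with the NumPy sketch below. Several details are assumptions rather than statements of the claim: the boundary case uses `<=`, self-loops are removed, and the cosine for a node that coincides with the sphere center (whose vector to the center is zero) is set to 0. The function name is hypothetical.

```python
import numpy as np

def build_context_graph(coords, center, r_v):
    """Build position codes, adjacency matrix and edge features.

    coords: (N, 3) centroids of the residues inside one sphere.
    center: (3,) coordinates of the sphere center (the target residue).
    """
    coords = np.asarray(coords, dtype=float)
    to_center = np.asarray(center, dtype=float) - coords
    pos_code = np.linalg.norm(to_center, axis=1)      # 1-dim position code

    D = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    A = (D <= r_v).astype(int)                        # binarize with r_v
    np.fill_diagonal(A, 0)                            # assumption: no self-loops

    # Cosine of the angle between the node-to-center vectors.
    unit = to_center / np.maximum(pos_code, 1e-12)[:, None]
    cos_theta = unit @ unit.T

    src, dst = np.nonzero(A)
    edge_feats = np.stack([D[src, dst], cos_theta[src, dst]], axis=1)
    edges = np.stack([src, dst], axis=1)
    return pos_code, A, edges, edge_feats

pos_code, A, edges, edge_feats = build_context_graph(
    [[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 0, 0]], center=[0, 0, 0], r_v=2.0
)
```

The position code would then be appended to the 71 residue features to give the 72-dimensional node feature of step i.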
6. The prediction method of claim 1, wherein the edge encoding is: the edge feature $e_{ij}$ is concatenated with the node features $v_i$ and $v_j$ of its two end nodes and passed through a nonlinear transformation layer to compute the encoded edge feature;
the node encoding is: the node feature $v_i$ is concatenated with the sum of all encoded edge features starting from that node and passed through a nonlinear transformation layer to compute the encoded node feature;
the graph encoding is: the sum of all encoded node features on the graph is computed and passed through a nonlinear transformation layer to compute the encoded graph feature $u^{enc}$;
each nonlinear transformation layer is implemented as $\mathrm{MLP}(X) = W_2 \max(0, W_1 X + b_1) + b_2$, where X is the input and $W_1$, $W_2$, $b_1$, $b_2$ are the network parameters to be learned.
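The three encoding steps and the MLP layer of claim 6 could be sketched in NumPy as follows. This is a forward pass only, with randomly initialized weights; the hidden width of 16, the helper names and the parameter dictionary layout are assumptions, and "edges starting from the node" is read as edges whose source index is that node.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """MLP(X) = W2 max(0, W1 X + b1) + b2, applied to each row of x."""
    return np.maximum(0.0, x @ W1.T + b1) @ W2.T + b2

def init_mlp(rng, d_in, d_hidden, d_out):
    return (rng.normal(0, 0.1, (d_hidden, d_in)), np.zeros(d_hidden),
            rng.normal(0, 0.1, (d_out, d_hidden)), np.zeros(d_out))

def encode_graph(V, E, edges, params):
    """Edge, node and graph encoding, performed in that order."""
    src, dst = edges[:, 0], edges[:, 1]
    # Edge encoding: concatenate edge feature with both end-node features.
    e_enc = mlp(np.concatenate([E, V[src], V[dst]], axis=1), *params["edge"])
    # Node encoding: node feature plus sum of encoded edges leaving the node.
    agg = np.zeros((V.shape[0], e_enc.shape[1]))
    np.add.at(agg, src, e_enc)
    v_enc = mlp(np.concatenate([V, agg], axis=1), *params["node"])
    # Graph encoding: sum of all encoded node features.
    u_enc = mlp(v_enc.sum(axis=0, keepdims=True), *params["graph"])[0]
    return e_enc, v_enc, u_enc

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 72))          # 72-dim node features
E = rng.normal(size=(6, 2))           # distance and cosine per edge
edges = np.array([[0, 1], [0, 2], [1, 0], [1, 2], [2, 0], [2, 1]])
params = {
    "edge": init_mlp(rng, 2 + 2 * 72, 16, 16),
    "node": init_mlp(rng, 72 + 16, 16, 16),
    "graph": init_mlp(rng, 16, 16, 16),
}
e_enc, v_enc, u_enc = encode_graph(V, E, edges, params)
```

A trainable version would express the same computation in a deep learning framework so that the weights can be learned by backpropagation.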
7. The prediction method of claim 1, wherein the edge update is: the input edge feature $e_{ij}^{k-1}$, the node features $v_i^{k-1}$ and $v_j^{k-1}$ of its two end nodes, and the graph feature $u^{k-1}$ are first concatenated and passed through a nonlinear transformation layer to compute the intermediate-state edge feature $e_{ij}^{k'}$; the features of the two states are then fused by a gated recurrent unit to obtain the updated edge feature $e_{ij}^{k}$;
the node update is: the input node feature $v_i^{k-1}$, the sum of all updated edge features starting from that node, and the graph feature $u^{k-1}$ are first concatenated and passed through a nonlinear transformation layer to compute the intermediate-state node feature $v_i^{k'}$; the features of the two states are then fused by a gated recurrent unit to obtain the updated node feature $v_i^{k}$;
the graph update is: the sum of the updated features of all nodes on the graph and the graph feature $u^{k-1}$ are first concatenated and passed through a nonlinear transformation layer to compute the intermediate-state graph feature $u^{k'}$; the graph features of the two states are then fused by a gated recurrent unit to obtain the updated graph feature $u^{k}$.
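The edge update of claim 7 is sketched below in NumPy; the node and graph updates follow the same pattern with their own inputs. The sketch assumes that the intermediate-state feature is fed to the GRU as its input and the previous-layer feature as its hidden state, which the claim does not spell out. Bias terms are omitted for brevity, and all names and shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1.T + b1) @ W2.T + b2

def gru_cell(x, h, p):
    """Standard GRU gating (biases omitted): fuse input x with state h."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])          # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])          # reset gate
    n = np.tanh(x @ p["Wn"] + (r * h) @ p["Un"])    # candidate state
    return (1.0 - z) * n + z * h

def update_edges(E_prev, V_prev, u_prev, edges, mlp_params, gru_params):
    """One edge update: concatenate, transform, then GRU fusion."""
    src, dst = edges[:, 0], edges[:, 1]
    u_rep = np.broadcast_to(u_prev, (E_prev.shape[0], u_prev.shape[0]))
    x = np.concatenate([E_prev, V_prev[src], V_prev[dst], u_rep], axis=1)
    E_mid = mlp(x, *mlp_params)                     # intermediate state
    return gru_cell(E_mid, E_prev, gru_params)      # updated edge features

d = 4
rng = np.random.default_rng(1)
E_prev = rng.normal(size=(3, d))
V_prev = rng.normal(size=(2, d))
u_prev = rng.normal(size=d)
edges = np.array([[0, 1], [1, 0], [0, 1]])
mlp_params = (rng.normal(0, 0.1, (8, 4 * d)), np.zeros(8),
              rng.normal(0, 0.1, (d, 8)), np.zeros(d))
gru_params = {k: rng.normal(0, 0.1, (d, d))
              for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
E_next = update_edges(E_prev, V_prev, u_prev, edges, mlp_params, gru_params)
```

Stacking K such blocks, each followed by the analogous node and graph updates, yields the sequence of graph features that the classifier consumes.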
8. The prediction method of claim 1, wherein the classifier is given by [formula images FDA00028936238200000318 and FDA00028936238200000319], wherein [formula image FDA00028936238200000320].
9. The prediction method according to claim 1, wherein the hyperparameter tuning is specifically: a grid search is performed over the hyperparameters, the hyperparameter combination that gives the highest Matthews correlation coefficient (MCC) on the validation set is selected as the hyperparameters of the model, and the binarization threshold T that gives the highest MCC on the validation set is selected as the binarization threshold.
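Selecting the binarization threshold by validation-set MCC can be sketched as follows. The candidate grid, the function names and the convention of returning 0 when the MCC denominator vanishes are assumptions; the same scoring function could be reused inside the hyperparameter grid search.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def select_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest MCC and its score."""
    probs = np.asarray(probs, dtype=float)
    scores = [mcc(y_true, probs >= t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Hypothetical validation labels and predicted probabilities.
best_t, best_score = select_threshold(
    [1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1], candidates=[0.3, 0.5, 0.7]
)
```

With these toy values the threshold 0.5 separates the two classes perfectly, so it is selected with an MCC of 1.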
10. A system for implementing the method of any preceding claim, comprising: a feature extraction module, a structural context construction module, a graph representation module and a hierarchical graph neural network module, wherein: the feature extraction module extracts the position and feature information of a protein from its sequence and structure; the structural context construction module constructs the local context of each residue from the position information of the protein; the protein information extracted by the feature extraction module and the residue structural contexts constructed by the structural context construction module are passed to the graph representation module to construct the graph representation of each residue's structural context; and the hierarchical graph neural network module takes the graph representation of the residue structural context as its input and predicts the probability that the residue is a protein-nucleic acid binding residue.
CN202110037110.5A 2021-01-12 2021-01-12 Protein and nucleic acid binding site prediction method based on graph neural network characterization Pending CN114765063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110037110.5A CN114765063A (en) 2021-01-12 2021-01-12 Protein and nucleic acid binding site prediction method based on graph neural network characterization


Publications (1)

Publication Number Publication Date
CN114765063A true CN114765063A (en) 2022-07-19

Family

ID=82363680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110037110.5A Pending CN114765063A (en) 2021-01-12 2021-01-12 Protein and nucleic acid binding site prediction method based on graph neural network characterization

Country Status (1)

Country Link
CN (1) CN114765063A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114927165A (en) * 2022-07-20 2022-08-19 深圳大学 Method, device, system and storage medium for identifying ubiquitination sites
CN114927165B (en) * 2022-07-20 2022-12-02 深圳大学 Method, device, system and storage medium for identifying ubiquitination sites
WO2023217290A1 (en) * 2022-10-11 2023-11-16 之江实验室 Genophenotypic prediction based on graph neural network
CN116705194A (en) * 2023-06-06 2023-09-05 之江实验室 Method and device for predicting drug cancer suppression sensitivity based on graph neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination