CN114765063A

CN114765063A - Protein and nucleic acid binding site prediction method based on graph neural network characterization

Info

Publication number: CN114765063A
Application number: CN202110037110.5A
Authority: CN
Inventors: 夏莹; 沈红斌; 潘小勇; 夏春秋
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2021-01-12
Filing date: 2021-01-12
Publication date: 2022-07-19

Abstract

A method for predicting protein-nucleic acid binding sites based on graph neural network representation comprises: constructing a protein-nucleic acid interaction data set; after sample fusion processing, extracting the position information, feature information and structural context of each residue in a protein; constructing a graph representation of the structural context of each residue; and feeding the graph representation of the protein to be predicted into a graph neural network to obtain the probability that each residue binds DNA/RNA, thereby predicting the protein-nucleic acid binding sites. By combining a structural-context-based graph representation of residues with a hierarchical graph neural network model, the invention learns the key structural and feature patterns of binding sites from the graph representation.

Description

Protein and nucleic acid binding site prediction method based on graph neural network characterization

Technical Field

The invention relates to a technology in the field of biological engineering, in particular to a protein and nucleic acid binding site prediction method based on graph neural network representation of protein local structure context.

Background

Protein-nucleic acid interactions play an important role in many life activities, such as DNA replication, transcription, translation, gene expression, and signal transduction and recognition, so studying these interactions is important for analyzing genes and protein functions and for drug design. Experimental analysis of protein-nucleic acid interactions is expensive and time-consuming and cannot meet the current demand for large-scale protein analysis, so computational methods are becoming increasingly important. Current computational methods can be divided into sequence-based methods and structure-based methods. Sequence-based methods learn local patterns of binding sites from protein sequences, but because binding sites tend to be conserved in spatial structure, sequence-based methods may fail to capture sufficient features of the binding sites, which limits prediction accuracy. Structure-based methods attempt to learn the local tertiary structure patterns of binding sites from the structural information of proteins, and since tertiary structure directly determines protein function, structure-based binding site prediction algorithms tend to be more accurate. The main challenge for structure-based approaches is how to encode structural information and learn from it the key structural and physicochemical properties of the binding site. Some current methods use manually designed protein structure descriptors, which are predefined and therefore cannot extract information targeted to the downstream task (Li, S., Kazuo, Y., Mar, A.K. and Standley, D.M. (2014) Quantifying sequence and structural features of protein–RNA interactions. Nucleic Acids Research, 42, 10086-10098).
Other methods map protein position information onto a three-dimensional grid and learn the structural patterns of binding sites with a three-dimensional convolutional neural network. Their problems are that the distribution of protein atoms in three-dimensional space is sparse, which makes it ill-suited to mapping into Euclidean space, and that rotational and translational invariance of the protein is difficult to guarantee (Jiménez, J., Doerr, S., Martínez-Rosell, G., Rose, A.S. and De Fabritiis, G. (2017) DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33, 3036-3042). How to better represent the structural information and physicochemical properties of proteins and accurately predict nucleic acid binding sites remains a challenge.

Disclosure of Invention

To address the low accuracy of existing methods for identifying protein-nucleic acid binding sites, the invention provides a method for predicting protein-nucleic acid binding sites based on graph neural network representation, in which the key structural and feature patterns of binding sites are learned from a structural-context-based graph representation of residues by a hierarchical graph neural network model.

The invention is realized by the following technical scheme:

The invention relates to a method for predicting protein-nucleic acid binding sites based on graph neural network representation, which comprises: constructing a protein-nucleic acid interaction data set; after sample fusion processing, extracting the position and feature information and the structural context of each residue in a protein; constructing a graph representation of the structural context of each residue from the position and feature information and the structural context; and feeding the graph representation of the protein to be predicted into a hierarchical graph neural network to obtain the probability that each residue binds DNA/RNA, thereby predicting the protein-nucleic acid binding sites.

The protein data set, namely the data set of proteins that interact with DNA/RNA, is constructed as follows: a set of protein-nucleic acid complexes and the labels of the protein-nucleic acid binding sites are extracted from BioLip, and a set of DNA-binding proteins and a set of RNA-binding proteins are extracted according to the type of bases in the complex.

Preferably, according to release date, the sequences released before January 6, 2016 are used as the training set and the rest as the test set.

The sample fusion processing refers to: to address the imbalance between positive and negative samples in the data set, data augmentation is performed on the training set by fusing the binding site labels of protein clusters with high sequence and structural similarity, which raises the proportion of positive samples in the training set. It specifically comprises:

Step one, clustering the proteins in the training set: calculating the sequence similarity and the structural similarity of each pair of proteins using bl2seq and TM-align; grouping proteins with sequence similarity greater than 0.8 and TM-score greater than 0.5 into a cluster; migrating the true binding site labels of the proteins in the same cluster to the protein with the most residues in the cluster; and removing proteins with sequence similarity of more than 30% from the training set;

Step two, using CD-HIT to remove from the test set the proteins with sequence similarity of more than 30% to the training set, and ensuring that the sequence similarity within the test set itself is below 30%.
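The clustering in step one can be sketched as follows. This is a minimal illustration that assumes the pairwise sequence similarities and TM-scores have already been computed (for example with bl2seq and TM-align) and are supplied as matrices. The function names `cluster_proteins` and `pick_representatives` are hypothetical, and single-linkage grouping via union-find is an assumption, since the text does not state how pairwise matches are merged into clusters. Label migration itself needs a residue-level alignment and is not shown.

```python
def cluster_proteins(seq_sim, tm_score, seq_cut=0.8, tm_cut=0.5):
    """Group proteins whose pairwise sequence similarity AND TM-score
    both exceed the cut-offs (single linkage; an assumption)."""
    n = len(seq_sim)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if seq_sim[i][j] > seq_cut and tm_score[i][j] > tm_cut:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def pick_representatives(clusters, lengths):
    """For each cluster, return the index of the protein with the most residues."""
    return [max(cluster, key=lambda i: lengths[i]) for cluster in clusters]
```

For example, three proteins of which only the first two are similar yield the clusters `[[0, 1], [2]]`, and the longer protein in each cluster receives the migrated labels.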

The position and feature information of the residues includes atomic features, secondary structure information, bond and torsion angle information, solvent accessibility and evolutionary conservation information, and is extracted as follows:

Step 1, for each residue, defining its position as the coordinates of the centroid of the residue;

Step 2, extracting the features of the atoms belonging to each residue, and then, for each atomic feature, averaging over all atoms of the residue to obtain the atomic feature matrix of size L × 7, wherein L is the number of residues of the protein;

The atomic features include: atomic mass, B-factor, whether the atom is a side chain atom, charge, number of bonded hydrogen atoms, and whether the atom is in a ring.

Step 3, for a protein with L residues, using DSSP to calculate from the protein structure a secondary structure feature matrix of size L × 14;

The secondary structure feature matrix comprises: the solvent-accessible surface area of the residues, the bond and torsion angles, and the one-hot encoded eight-state secondary structure.

Step 4, for a protein with L residues, using PSI-BLAST to search the NCBI NR database and calculating an evolutionary conservation scoring matrix (PSSM) of size L × 20;

The PSSM represents, for each position, the probability of the amino acid mutating into each of the 20 amino acids; preferably, each element x of the PSSM is normalized with the sigmoid function: $x' = \frac{1}{1 + e^{-x}}$;

Step 5, for a protein with L residues, using HHblits to search the UniClust30 database and calculating another evolutionary conservation matrix (HMM) of size L × 30;

In the HMM matrix, columns 1-20 represent the probability of the amino acid mutating into each of the 20 amino acids, columns 21-27 represent transition frequencies, and the last three columns represent local diversity; preferably, each element x of the matrix is normalized.

Step 6, concatenating the atomic feature matrix, the secondary structure feature matrix, the PSSM and the HMM matrix to obtain, for a protein with L residues, a feature matrix of size L × 71.
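The assembly of the L × 71 feature matrix can be sketched as below. This is an illustrative sketch only: the function name `build_feature_matrix` is hypothetical, the four input matrices are assumed to have been produced beforehand by the tools named above, and the HMM matrix is assumed to arrive already normalized because its normalization formula is not given in the text.

```python
import numpy as np


def sigmoid(x):
    """Element-wise sigmoid, used here to squash raw PSSM scores into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def build_feature_matrix(atom_feats, dssp_feats, pssm, hmm):
    """Concatenate per-residue features into an L x 71 matrix.

    atom_feats: (L, 7)  averaged atomic features
    dssp_feats: (L, 14) DSSP-derived features
    pssm:       (L, 20) raw PSI-BLAST scores (normalized here by sigmoid)
    hmm:        (L, 30) HHblits profile, assumed already normalized
    """
    L = atom_feats.shape[0]
    assert atom_feats.shape == (L, 7)
    assert dssp_feats.shape == (L, 14)
    assert pssm.shape == (L, 20)
    assert hmm.shape == (L, 30)
    return np.concatenate([atom_feats, dssp_feats, sigmoid(pssm), hmm], axis=1)
```

The column order (atomic, DSSP, PSSM, HMM) follows the order in which step 6 lists the matrices; 7 + 14 + 20 + 30 = 71.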

The structural context is obtained for each residue of the protein by a structure-based sliding sphere, using the position information of the residues in the protein. Specifically: based on the position coordinates of the residues in the tertiary structure, a sphere is slid along the polypeptide chain to extract the structural context of each residue; for a target residue, a sphere of radius $r_g$ centered on that residue is drawn, and all residues within the sphere, together with their positional relationships, form the structural context of the target residue.
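A minimal sketch of the sliding-sphere extraction, assuming residue positions are given as an (L, 3) array of centroid coordinates. The function names are hypothetical, and treating residues exactly on the sphere boundary as inside is an assumption.

```python
import numpy as np


def structural_context(coords, target, r_g):
    """Indices of all residues within distance r_g of the target residue."""
    dist = np.linalg.norm(coords - coords[target], axis=1)
    return np.flatnonzero(dist <= r_g)  # boundary counted as inside (assumption)


def all_structural_contexts(coords, r_g):
    """Slide the sphere along the chain: one context per residue."""
    return [structural_context(coords, i, r_g) for i in range(len(coords))]
```

Because membership depends only on pairwise distances, the extracted context does not change when the whole protein is rotated or translated.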

The graph representation defines the residues in the structural context as nodes of a graph, constructs an adjacency matrix from the relative positions of the nodes, and defines the node and edge features of the graph from the relative positional relationships and the residue features. It is constructed as follows:

Step i) the residues within the structural context are taken as the nodes of the graph; the distance from each node to the sphere center is calculated as a 1-dimensional position code of the residue; the position code is concatenated with the extracted residue features to obtain a node feature of size 72; the node feature set is denoted $V = \{v_i\}$, wherein $v_i$ is the node feature of node $i$;

Step ii) the distance between every two nodes is calculated to obtain the distance matrix $D$ of the nodes, and $D$ is binarized with the threshold $r_v$ to obtain the adjacency matrix $A$ of the nodes; specifically, for node $i$ and node $j$, $A_{ij} = 1$ when $D_{ij} \le r_v$ and $A_{ij} = 0$ otherwise; $A_{ij} = 1$ indicates that there is an edge between node $i$ and node $j$, namely edge $ij$;

Step iii) the edge feature set is denoted $E = \{e_{ij} \mid A_{ij} = 1\}$, wherein the feature $e_{ij}$ of edge $ij$ comprises: 1) the distance between node $i$ and node $j$; 2) the cosine of the angle $\theta_{ij}$ formed by the vector from node $i$ to the sphere center and the vector from node $j$ to the sphere center.
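Steps i) to iii) can be sketched as below for a single structural context. This is an illustrative sketch rather than the patented implementation: the function name `build_residue_graph` is hypothetical, placing the position code in the first column is an assumption, self-loops are excluded by assumption, and the cosine is set to 0 for the target residue itself, whose vector to the sphere center has zero length (the text does not say how this case is handled).

```python
import numpy as np


def build_residue_graph(coords, feats, center, r_v):
    """Build node features, adjacency matrix, edge list and edge features.

    coords: (N, 3) positions of the residues inside the sphere
    feats:  (N, 71) residue feature matrix rows for those residues
    center: (3,) sphere center (position of the target residue)
    """
    to_center = center - coords                    # vector from each node to the center
    d_center = np.linalg.norm(to_center, axis=1)   # 1-D position code
    node_feats = np.concatenate([d_center[:, None], feats], axis=1)  # (N, 72)

    # Pairwise distance matrix D, binarized with threshold r_v.
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = (D <= r_v).astype(int)
    np.fill_diagonal(A, 0)                         # no self-loops (assumption)

    src, dst = np.nonzero(A)
    dots = np.einsum("ij,ij->i", to_center[src], to_center[dst])
    denom = d_center[src] * d_center[dst]
    # cos(theta_ij); defined as 0 when either vector has zero length (assumption).
    cos = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    edge_feats = np.stack([D[src, dst], cos], axis=1)   # (M, 2)
    return node_feats, A, np.stack([src, dst], axis=1), edge_feats
```

Both edge features are built from distances and angles only, so the representation is unchanged under rotation and translation of the protein.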

The hierarchical graph neural network learns, in a supervised manner from the training set samples, the local geometric and feature patterns of binding sites from the graph representation of the residue structural context. It comprises a graph network encoding unit, K stacked graph neuron blocks based on gated recurrent units (GRU), and a binary classifier based on a multi-layer perceptron, wherein: the graph network encoding unit performs edge encoding, node encoding and graph encoding in sequence; each of the K stacked GRU-based graph neuron blocks performs edge updating, node updating and graph updating in sequence; and the multi-layer-perceptron-based binary classifier predicts the probability of a residue binding DNA/RNA from the updated graph features $\{u^k \mid k = 1, \dots, K\}$.

The edge encoding means: concatenating the edge feature $e_{ij}$ with the node features $v_i$ and $v_j$ of its two end nodes, and passing the result through a nonlinear transformation layer to compute the encoded edge feature $e_{ij}^{enc}$.

The node encoding means: concatenating the node feature $v_i$ with the sum of all encoded edge features starting from that node, and passing the result through a nonlinear transformation layer to compute the encoded node feature $v_i^{enc}$.

The graph encoding means: calculating the sum of all encoded node features on the graph and passing it through a nonlinear transformation layer to compute the encoded graph feature $u^{enc}$. Each nonlinear transformation layer is implemented as $\mathrm{MLP}(X) = W_2 \max(0, W_1 X + b_1) + b_2$, wherein $X$ is the input and $W_1$, $W_2$, $b_1$, $b_2$ are parameters to be learned by the network.
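The encoding unit described above can be sketched in NumPy as follows. This is a forward-pass illustration only, with randomly initialized rather than trained weights. The names `init_mlp`, `mlp` and `encode_graph`, the hidden width, and the representation of edges as an (M, 2) array of (start, end) node indices are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def init_mlp(d_in, d_hidden, d_out):
    """Random parameters W1, b1, W2, b2 for one nonlinear transformation layer."""
    return {"W1": rng.normal(0, 0.1, (d_hidden, d_in)), "b1": np.zeros(d_hidden),
            "W2": rng.normal(0, 0.1, (d_out, d_hidden)), "b2": np.zeros(d_out)}


def mlp(p, X):
    """MLP(X) = W2 max(0, W1 X + b1) + b2, applied to every row of X."""
    return np.maximum(0.0, X @ p["W1"].T + p["b1"]) @ p["W2"].T + p["b2"]


def encode_graph(V, E, edges, p_edge, p_node, p_graph):
    """Edge encoding, then node encoding, then graph encoding."""
    src, dst = edges[:, 0], edges[:, 1]
    # Edge encoding: [e_ij, v_i, v_j] -> encoded edge feature.
    E_enc = mlp(p_edge, np.concatenate([E, V[src], V[dst]], axis=1))
    # Node encoding: [v_i, sum of encoded edges starting at i] -> encoded node feature.
    agg = np.zeros((V.shape[0], E_enc.shape[1]))
    np.add.at(agg, src, E_enc)
    V_enc = mlp(p_node, np.concatenate([V, agg], axis=1))
    # Graph encoding: sum of encoded node features -> encoded graph feature.
    u_enc = mlp(p_graph, V_enc.sum(axis=0, keepdims=True))[0]
    return V_enc, E_enc, u_enc
```

Summation is used for both aggregations, as in the text, which keeps the encoding independent of the order in which nodes and edges are listed.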

The edge updating means: first concatenating the input edge feature $e_{ij}^{k-1}$, the node features $v_i^{k-1}$ and $v_j^{k-1}$ of its two end nodes, and the graph feature $u^{k-1}$, and computing the intermediate-state edge feature $e_{ij}^{k'}$ through a nonlinear transformation layer; the features of the two states are then fused by a gated recurrent unit to obtain the updated edge feature $e_{ij}^{k}$.

The node updating means: first concatenating the input node feature $v_i^{k-1}$, the sum of all updated edge features starting from the node, and the graph feature $u^{k-1}$, and computing the intermediate-state node feature $v_i^{k'}$ through a nonlinear transformation layer; the features of the two states are then fused by a gated recurrent unit to obtain the updated node feature $v_i^{k}$.

The graph updating means: first concatenating the sum of the updated features of all nodes on the graph with the graph feature $u^{k-1}$, and computing the intermediate-state graph feature $u^{k'}$ through a nonlinear transformation layer; the graph features of the two states are then fused by a gated recurrent unit to obtain the updated graph feature $u^{k}$.
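One graph neuron block and the stacking of K blocks can be sketched as below. This is a hedged NumPy forward pass with random weights, not the trained model. All function names are hypothetical. The GRU cell follows the common formulation without bias terms, and feeding the intermediate state as the GRU input and the previous state as the GRU hidden state is an assumption, since the text only says the two states are fused.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(d_in, d_out):
    return rng.normal(0, 0.1, (d_in, d_out))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_mlp(d_in, d):
    return {"W1": dense(d_in, d), "b1": np.zeros(d), "W2": dense(d, d), "b2": np.zeros(d)}


def mlp(p, X):
    # Row-wise form of W2 max(0, W1 X + b1) + b2.
    return np.maximum(0.0, X @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]


def init_gru(d):
    return {k: dense(d, d) for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}


def gru(p, x, h):
    """GRU cell: x is the intermediate state, h the previous state."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"])            # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"])            # reset gate
    n = np.tanh(x @ p["Wn"] + (r * h) @ p["Un"])      # candidate state
    return (1.0 - z) * n + z * h


def init_block(d):
    return {"edge": init_mlp(4 * d, d), "node": init_mlp(3 * d, d),
            "graph": init_mlp(2 * d, d),
            "gru_e": init_gru(d), "gru_v": init_gru(d), "gru_u": init_gru(d)}


def graph_neuron_block(p, V, E, u, edges):
    src, dst = edges[:, 0], edges[:, 1]
    # Edge update: [e_ij, v_i, v_j, u] -> intermediate state, fused with e_ij by a GRU.
    u_e = np.broadcast_to(u, (E.shape[0], u.shape[0]))
    E_mid = mlp(p["edge"], np.concatenate([E, V[src], V[dst], u_e], axis=1))
    E_new = gru(p["gru_e"], E_mid, E)
    # Node update: [v_i, sum of updated edges starting at i, u].
    agg = np.zeros_like(V)
    np.add.at(agg, src, E_new)
    u_v = np.broadcast_to(u, V.shape)
    V_mid = mlp(p["node"], np.concatenate([V, agg, u_v], axis=1))
    V_new = gru(p["gru_v"], V_mid, V)
    # Graph update: [sum of updated nodes, u].
    u_mid = mlp(p["graph"], np.concatenate([V_new.sum(axis=0), u])[None, :])
    u_new = gru(p["gru_u"], u_mid, u[None, :])[0]
    return V_new, E_new, u_new


def run_blocks(blocks, V, E, u, edges):
    """Apply K stacked blocks and collect the graph features u^1 ... u^K."""
    graph_feats = []
    for p in blocks:
        V, E, u = graph_neuron_block(p, V, E, u, edges)
        graph_feats.append(u)
    return graph_feats
```

The list returned by `run_blocks` corresponds to the set of graph features from all K blocks that the classifier consumes.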

The classifier is a multi-layer perceptron that maps the updated graph features $\{u^k \mid k = 1, \dots, K\}$ to the probability that the residue binds DNA/RNA.

The hierarchical graph neural network is trained in a supervised manner on 80% of the DNA/RNA-binding protein training data, and the remaining 20% is used as a validation set to tune the hyper-parameters of the model and to determine the binarization threshold T.

The hyper-parameters comprise: $r_g$, $r_v$, $K$, and the dimensions of $W$ and $b$ in the nonlinear transformations.

The tuning specifically comprises: performing a grid search over the hyper-parameters and selecting the combination that gives the highest Matthews correlation coefficient (MCC) on the validation set as the hyper-parameters of the model, and selecting the binarization threshold that gives the highest MCC on the validation set as the binarization threshold T of the model.
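Selecting the binarization threshold T by validation-set MCC can be sketched as follows. The function names and the candidate grid of thresholds (0.05 to 0.95 in steps of 0.05) are assumptions made for illustration; the text does not specify which thresholds are searched. The same pattern, with an outer loop over hyper-parameter combinations and a retraining step inside it, applies to the grid search over hyper-parameters.

```python
import numpy as np


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


def select_threshold(y_true, probs, candidates=None):
    """Return the candidate threshold with the highest validation MCC."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 19)   # hypothetical search grid
    scores = [mcc(y_true, (probs > t).astype(int)) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])
```

A residue is then called binding when its predicted probability is greater than the selected T.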

To predict the protein-nucleic acid binding sites, the probability of each residue binding DNA/RNA, as predicted by the hierarchical graph neural network, is compared with the threshold T to classify the residue as binding or non-binding.

The invention also relates to a system for implementing the method, comprising a feature extraction module, a structural context construction module, a graph representation module and a hierarchical graph neural network module, wherein: the feature extraction module extracts the position and feature information of a protein from its sequence and structure; the structural context construction module constructs the local context of each residue from the position information of the protein; the protein information extracted by the feature extraction module and the residue structural context built by the structural context construction module are passed to the graph representation module, which constructs the graph representation of the residue structural context; and the hierarchical graph neural network module takes this graph representation as input and predicts the probability that the residue is a protein-nucleic acid binding residue.

Technical effects

The invention as a whole addresses the low accuracy of existing methods for predicting the binding sites of proteins with DNA/RNA. Compared with the prior art, the invention represents the local neighborhood of a residue in three-dimensional space as a non-Euclidean graph, and encodes the local geometric information and various biochemical features of the protein in the node and edge features of the graph. This representation requires no manually designed descriptors of protein structure; compared with Euclidean structural representations, the graph representation is better suited to the sparse and irregular distribution of protein residues and is invariant to rotation and translation. The designed hierarchical graph neural network automatically learns, end to end, the important structural and biochemical features of binding sites, thereby improving the accuracy of predicting the binding sites of proteins with DNA/RNA. Compared with the prior art, the method markedly improves the prediction accuracy for protein-nucleic acid binding sites, can readily be transferred to predicting the binding sites of proteins with other ligands, and generalizes well.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a schematic diagram of constructing the residue graph representation: (A) feature extraction module; (B) structural context construction module; (C) graph representation module;

FIG. 3 is a flow chart of the hierarchical graph neural network;

FIG. 4 is a flow chart of a graph neuron block in the hierarchical graph neural network.

Detailed Description

This embodiment uses data sets of protein-DNA/RNA binding sites as the training and test sets. For the DNA-binding data set, the training set contains 573 proteins, which after data augmentation give 14479 binding residues and 145404 non-binding residues, and the test set contains 129 proteins, with 2249 binding residues and 35275 non-binding residues. For the RNA-binding data set, the training set contains 495 proteins, which after data augmentation give 14609 binding residues and 122290 non-binding residues, and the test set contains 117 proteins, with 2031 binding residues and 35314 non-binding residues. The data augmentation procedure calculates the sequence and structural similarity of the proteins, groups proteins with sequence similarity greater than 0.8 and TM-score greater than 0.5 into clusters, and migrates the binding site labels of the proteins in each cluster to the protein with the most residues in that cluster.

As shown in FIG. 1, this example relates to a method for predicting protein and nucleic acid binding sites based on graph neural network characterization, which comprises the following steps:

Step one, based on the sequence and structure of each protein in the data set, the three-dimensional position of each residue and the features of its atoms are extracted; the secondary structure information, bond and torsion angle information and solvent accessibility of the protein are extracted with the DSSP algorithm; the evolutionary conservation information of the protein is calculated with the PSI-BLAST and HHblits algorithms; and all features other than the position information are normalized to [0, 1]. For a protein with L residues, this yields a position matrix of size L × 3 and a feature matrix of size L × 71.

Step two, a three-dimensional sphere of radius $r_g$ is defined with the target residue as its center; the residues inside the sphere, together with the positional relationships between them, constitute the structural context of the target residue.

Step three, a graph representation of the local neighborhood of the residue is constructed: the residues within the structural context are taken as the nodes of the graph; the distance from each node to the sphere center is calculated as a 1-dimensional position code of the residue; the position code is concatenated with the feature matrix of the protein to obtain node features of size 72, the node feature of node $i$ being denoted $v_i$; the distance between every two nodes is calculated to obtain the distance matrix $D$, which is binarized with the threshold $r_v$ to obtain the adjacency matrix $A$; and the edge feature $e_{ij}$ is computed, which contains two position-dependent features: 1) the distance between node $i$ and node $j$; 2) the cosine of the angle formed by the vector from node $i$ to the sphere center and the vector from node $j$ to the sphere center.

Step four, 80% of the proteins in the training set are used for training and the remaining 20% as a validation set, ensuring that the sequence similarity between the two, as computed by CD-HIT, is below 30%. The model parameters are learned by supervised training on the training samples: the training data are fed into the constructed hierarchical graph neural network in mini-batches, the difference between the forward-propagation output and the true labels is measured with a cross-entropy loss function, and the parameters are optimized iteratively with the Adam optimizer until the Matthews correlation coefficient (MCC) on the validation set fails to improve for 10 consecutive iterations. The network hyper-parameters are tuned by grid search: for each hyper-parameter combination, the MCC of the trained network on the validation set is computed, and the combination with the highest value is selected as the final hyper-parameters of the model. The selected values are: mini-batch size 64, $r_g$ = 20, $r_v$ (value missing in the source text), a feature dimension of 128 for the graph encoding unit and the graph neuron blocks, and $K$ = 4. This completes the training of the model of the invention.

Step five, after training, the same features are extracted and the same graph representation is constructed for the proteins in the test set, and these are fed into the trained hierarchical graph neural network to obtain the probability P that each residue is a binding site. Each residue is then classified using the binarization threshold T optimized on the validation set: if P is greater than T, the residue is judged to be a binding site; otherwise, it is judged to be a non-binding site.

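The evaluation metrics used in this embodiment are:

Recall: $\mathrm{Rec} = \frac{TP}{TP + FN}$

Precision: $\mathrm{Pre} = \frac{TP}{TP + FP}$

F1 score: $F1 = \frac{2 \times \mathrm{Pre} \times \mathrm{Rec}}{\mathrm{Pre} + \mathrm{Rec}}$

Matthews correlation coefficient: $\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$

AUC: the area enclosed by the ROC curve and the coordinate axes; wherein TP, FP, TN and FN are the numbers of true positive, false positive, true negative and false negative results, respectively.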
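The metrics above can be computed as in the following sketch, which assumes binary labels in {0, 1}. The function names are hypothetical, the metrics follow the standard definitions, and AUC is computed here by direct pairwise comparison of positive and negative scores, which is equivalent to the area under the ROC curve but only practical for small sets.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Recall, precision, F1 and MCC from hard 0/1 predictions."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "mcc": mcc}


def auc(y_true, probs):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    pos = probs[y_true == 1]
    neg = probs[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))
```

Recall, precision, F1 and MCC depend on the binarization threshold T, whereas AUC is computed from the raw probabilities and is threshold-free.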

In the experimental phase, this example is compared with other representative protein and nucleic acid binding site recognition methods:

1) Hu, J., Li, Y., Zhang, M., Yang, X., Shen, H.B. and Yu, D.J. (2017) Predicting Protein-DNA Binding Residues by Weightedly Combining Sequence-Based Features and Boosting Multiple SVMs. IEEE/ACM Trans Comput Biol Bioinform, 14, 1389-1398.

2) Yu, D.-J., Hu, J., Yang, J., Shen, H.-B., Tang, J. and Yang, J.-Y. (2013) Designing template-free predictor for targeting protein-ligand binding sites with classifier ensemble and spatial clustering. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10, 994-1008.

3) Li, S., Kazuo, Y., Mar, A.K. and Standley, D.M. (2014) Quantifying sequence and structural features of protein–RNA interactions. Nucleic Acids Research, 42, 10086-10098.

4) Lam, J.H., Li, Y., Zhu, L., Umarov, R., Jiang, H., Héliou, A., Sheong, F.K., Liu, T., Long, Y. and Li, Y. (2019) A deep learning framework to predict binding preference of RNA constituents on protein surface. Nature Communications, 10, 1-13.

5) Su, H., Liu, M., Sun, S., Peng, Z. and Yang, J. (2019) Improving the prediction of protein–nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics, 35, 930-936.

6) Zhu, Y.-H., Hu, J., Song, X.-N. and Yu, D.-J. (2019) DNAPred: accurate identification of DNA-binding sites from protein sequence by ensembled hyperplane-distance-based support vector machines. Journal of Chemical Information and Modeling, 59, 3057-3071.

7) Walia, R.R., Xue, L.C., Wilkins, K., El-Manzalawy, Y., Dobbs, D. and Honavar, V. (2014) RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins. PLoS One, 9, e97725.

8) Liu, R. and Hu, J. (2013) DNABind: A hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins: Structure, Function, and Bioinformatics, 81, 1885-1899.

The results are shown in the following table. The method outperforms the other methods on every metric, which shows that it identifies more true binding residues with a lower false positive rate than the other methods, and that it has a clear advantage in identifying protein-nucleic acid binding sites.

Compared with the prior art, the method encodes the structural neighborhood of each residue as a non-Euclidean graph representation, which avoids manually designed structure descriptors and is better suited to characterizing the sparse structure of proteins; through end-to-end training of the neural network, the local structural and feature patterns most relevant to the binding site prediction task can be learned from the graph representation. For protein-DNA binding site prediction, F1, MCC and AUC are improved by 8.2%, 8.8% and 6.9% respectively; for protein-RNA binding site prediction, F1, MCC and AUC are improved by 9.7%, 10.6% and 6.6% respectively. Specifically, the prediction accuracy of the method is higher than that of other sequence-based or structure-based methods in the field, including the DNABind and aaRNA methods, which characterize protein structure with manually designed descriptors, and the NucleicNet method, which uses a convolutional neural network. On the one hand, the method retains more complete topological information of the protein and is better suited to characterizing local protein structure; on the other hand, the end-to-end hierarchical graph neural network can learn, from the graph representation of local protein structure, the local structural and biochemical patterns most relevant to the downstream binding site prediction task.

Those skilled in the art may modify the foregoing embodiments in various ways without departing from the principle and spirit of the invention. The scope of protection of the invention is defined by the appended claims and is not limited by the foregoing embodiments, and every implementation within that scope is bound by the invention.

Claims

1. A method for predicting protein-nucleic acid binding sites based on graph neural network representation, characterized in that a protein data set is constructed; after sample fusion processing, the position information and structural context of each residue in a protein are extracted; a graph representation of the structural context of each residue is constructed from the position information and the structural context; and the graph representation of the protein to be predicted is fed into a hierarchical graph neural network to obtain the probability that each residue binds DNA/RNA, thereby predicting the protein-nucleic acid binding sites;

the protein data set, namely the data set of proteins that interact with DNA/RNA, is constructed as follows: a set of protein-nucleic acid complexes and the labels of the protein-nucleic acid binding sites are extracted from BioLip, and a set of DNA-binding proteins and a set of RNA-binding proteins are extracted according to the type of bases in the complex;

the hierarchical graph neural network learns, in a supervised manner from the training set samples, the local geometric and feature patterns of binding sites from the graph representation of the residue structural context, and comprises a graph network encoding unit, K stacked graph neuron blocks based on gated recurrent units, and a binary classifier based on a multi-layer perceptron, wherein: the graph network encoding unit performs edge encoding, node encoding and graph encoding in sequence; each of the K stacked graph neuron blocks performs edge updating, node updating and graph updating in sequence; and the multi-layer-perceptron-based binary classifier predicts the probability of a residue binding DNA/RNA from the updated graph features $\{u^k \mid k = 1, \dots, K\}$;

to predict the protein-nucleic acid binding sites, the probability of each residue binding DNA/RNA predicted by the hierarchical graph neural network is compared with a threshold T to classify the residue as binding or non-binding.

2. The prediction method according to claim 1, wherein the sample fusion processing refers to: to address the imbalance between positive and negative samples in the data set, data augmentation is performed on the training set by fusing the binding site labels of protein clusters with high sequence and structural similarity, which raises the proportion of positive samples in the training set, and specifically comprises:

step one, clustering the proteins in the training set: calculating the sequence similarity and the structural similarity of each pair of proteins using bl2seq and TM-align; grouping proteins with sequence similarity greater than 0.8 and TM-score greater than 0.5 into a cluster; migrating the true binding site labels of the proteins in the same cluster to the protein with the most residues in the cluster; and removing proteins with sequence similarity of more than 30% from the training set;

step two, using CD-HIT to remove from the test set the proteins with sequence similarity of more than 30% to the training set, and ensuring that the sequence similarity within the test set is below 30%.

3. The prediction method of claim 1, wherein the position and feature information of the residues includes atomic features, secondary structure information, bond and torsion angle information, solvent accessibility and evolutionary conservation information, and is extracted as follows:

4. The prediction method of claim 1, wherein the structural context is obtained for each residue of the protein by a structure-based sliding sphere, using the position information of the residues in the protein, specifically: based on the position coordinates of the residues in the tertiary structure, a sphere is slid along the polypeptide chain to extract the structural context of each residue; for a target residue, a sphere of radius $r_g$ centered on that residue is drawn, and all residues within the sphere, together with their positional relationships, form the structural context of the target residue.

5. The prediction method of claim 1, wherein the graph representation defines the residues in the structural context as nodes of the graph, constructs the adjacency matrix from the relative position information of the nodes, and defines the features of the nodes and edges of the graph from the relative positional relationships and the features of the residues, by the following steps:

step i) using the residues within the structural context as the nodes of the graph, calculating the distance from each node to the sphere center as the 1-dimensional position code of the residue, and concatenating the position code of each residue with the extracted position information of the residue to obtain node features of size 72, wherein the node features form a node feature set in which the node feature of node i is denoted v_i;

step ii) calculating the distance between every two nodes to obtain a distance matrix D of the nodes, and binarizing the distance matrix with a threshold r_v to obtain an adjacency matrix A of the nodes, wherein, for node i and node j, the entry A_ij is determined by comparing the distance D_ij with r_v;

step iii) the edge feature set is denoted E = {e_ij | A_ij = 1}, where the feature e_ij of edge ij comprises: 1) the distance between node i and node j; and 2) the cosine of the included angle θ_ij formed by the vector from node i to the sphere center and the vector from node j to the sphere center.
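Steps i) to iii) can be sketched in a few lines. The sketch below is illustrative only: the function name is hypothetical, the binarization rule (connect distinct nodes when D_ij <= r_v) is an assumption because the exact formula is not reproduced in this text, and the 71 non-positional node features are omitted.

```python
import math


def build_graph(coords, center, r_v):
    """Build position codes, adjacency matrix, and edge features for one structural context."""
    n = len(coords)
    # Step i: 1-D position code, the distance from each node to the sphere center.
    pos_code = [math.dist(p, center) for p in coords]
    # Step ii: pairwise distance matrix D, binarized with threshold r_v.
    D = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    # Assumed rule: A_ij = 1 when i != j and D_ij <= r_v, otherwise 0.
    A = [[1 if i != j and D[i][j] <= r_v else 0 for j in range(n)] for i in range(n)]

    # Step iii: edge features (distance, cos(theta_ij)) for every edge with A_ij = 1.
    edges = {}
    for i in range(n):
        for j in range(n):
            if A[i][j] == 1:
                vi = [c - p for p, c in zip(coords[i], center)]  # node i -> center
                vj = [c - p for p, c in zip(coords[j], center)]  # node j -> center
                norm = math.hypot(*vi) * math.hypot(*vj)
                # The cosine is undefined when a node lies on the center; 0.0 is a placeholder.
                cos_theta = sum(a * b for a, b in zip(vi, vj)) / norm if norm > 0 else 0.0
                edges[(i, j)] = (D[i][j], cos_theta)
    return pos_code, A, edges


# Hypothetical toy context with the sphere center at the origin.
center = (0.0, 0.0, 0.0)
coords = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (-4.0, 0.0, 0.0)]
pos_code, A, edges = build_graph(coords, center, r_v=2.0)
```

Note that the target residue itself sits at the sphere center, so its vector to the center has zero length; how that case is handled is a design choice the sketch resolves with a placeholder value.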

6. The prediction method of claim 1, wherein the edge coding is: concatenating the edge feature e_ij with the node features v_i and v_j of its two end nodes, and passing the result through a nonlinear transformation layer to compute the encoded edge feature; the node coding is: concatenating the node feature v_i with the sum of all encoded edge features starting from the node, and passing the result through a nonlinear transformation layer to compute the encoded node feature; and the encoded graph feature u^enc is computed; the nonlinear transformation layer is implemented as MLP(X) = W₂ max(0, W₁X + b₁) + b₂, where X is the input and W₁, W₂, b₁, b₂ are the network parameters to be learned.
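The nonlinear transformation layer and the edge coding can be illustrated with a small pure-Python sketch. The helper names, the concatenation order, and the toy weights are assumptions; a practical implementation would use a tensor library and learned parameters.

```python
def linear(W, x, b):
    """Compute W x + b for a weight matrix W (list of rows) and vectors x, b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def mlp(x, W1, b1, W2, b2):
    """MLP(X) = W2 max(0, W1 X + b1) + b2, i.e. a linear layer, ReLU, then a linear layer."""
    hidden = [max(0.0, h) for h in linear(W1, x, b1)]
    return linear(W2, hidden, b2)


def encode_edge(e_ij, v_i, v_j, params):
    """Concatenate the edge feature with both end-node features, then apply the MLP.

    The concatenation order (edge, node i, node j) is an assumption.
    """
    return mlp(e_ij + v_i + v_j, *params)


# Hypothetical toy parameters: 4 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, -1.0, 0.5, 0.0], [-1.0, 0.0, 0.0, 1.0]]
b1 = [0.0, -1.0]
W2 = [[2.0, 1.0]]
b2 = [0.5]
encoded = encode_edge([1.0, 0.5], [2.0], [0.0], (W1, b1, W2, b2))
```

Here the second hidden unit is negative before the max(0, ·) step and is clipped to zero, which shows the role of the nonlinearity. Node coding follows the same pattern with the node feature and the summed encoded edge features as input.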

7. The prediction method of claim 1, wherein the edge update is: first concatenating the input edge feature with the node features of its two end nodes, and fusing the features of the two states through a gated recurrent unit to obtain the updated edge feature; the node update is: first concatenating the input node feature, the sum of all updated edge features starting from the node, and the graph feature u^(k-1), and computing the intermediate-state node feature through a nonlinear transformation layer; the graph update is: first concatenating the sum of the updated features of all nodes on the graph with the graph feature u^(k-1), computing the intermediate-state graph feature u^k′ through a nonlinear transformation layer, and fusing the graph features of the two states through a gated recurrent unit to obtain the updated graph feature u^k.
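The fusion of two states by a gated recurrent unit can be sketched with a standard GRU cell. The sketch assumes the intermediate-state feature (for example u^k′) plays the role of the GRU input and the previous-state feature (for example u^(k-1)) plays the role of the hidden state; the parameter names are hypothetical, bias terms are omitted for brevity, and the gate convention shown is one common formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gru_fuse(x, h, p):
    """Fuse an intermediate-state feature x with a previous-state feature h via a GRU cell."""
    # Update gate z and reset gate r.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    # Candidate state computed from x and the reset-gated previous state.
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    # Convex combination of the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


# Hypothetical toy parameters: all-zero 2x2 weights, so z = 0.5 and the candidate is 0.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {name: zeros for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
u_prev = [2.0, -4.0]          # stands in for u^(k-1)
u_intermediate = [1.0, 1.0]   # stands in for u^k′
u_next = gru_fuse(u_intermediate, u_prev, params)
```

With all-zero weights the gate is exactly one half and the candidate is zero, so the output is half of the previous state, which makes the gating behaviour easy to check by hand.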

8. The prediction method of claim 1, wherein the classifier

wherein,

9. The prediction method according to claim 1, wherein the adjustment specifically comprises: performing a grid search over the hyperparameters, selecting the hyperparameter combination that yields the highest Matthews correlation coefficient on the validation set as the hyperparameters of the model, and selecting the binarization threshold T that yields the highest Matthews correlation coefficient on the validation set as the binarization threshold.
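The threshold selection can be sketched as follows. The function names, the candidate grid, and the toy validation data are assumptions; the same pattern (score every candidate, keep the best) applies to the grid search over hyperparameter combinations, with model training inside the loop.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (returns 0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(y_true, probs, candidates):
    """Return the candidate threshold T (and its MCC) that maximizes MCC on the given set."""
    scored = [(mcc(y_true, [1 if p >= t else 0 for p in probs]), t) for t in candidates]
    best_score, best_t = max(scored, key=lambda s: s[0])
    return best_t, best_score


# Hypothetical validation labels and predicted binding probabilities.
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.45, 0.8, 0.2, 0.9]
best_t, best_score = best_threshold(y_true, probs, candidates=[0.3, 0.42, 0.7])
```

The Matthews correlation coefficient is a reasonable selection criterion here because binding residues are a small minority of all residues, and it accounts for all four cells of the confusion matrix.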

10. A system for implementing the method of any preceding claim, comprising: a feature extraction module, a structure context construction module, a graph representation module, and a hierarchical graph neural network module, wherein the feature extraction module extracts the position and feature information of a protein from its sequence and structure; the structure context construction module constructs the local structural context of each residue from the position information of the protein; the protein information extracted by the feature extraction module and the residue structural contexts constructed by the structure context construction module are passed to the graph representation module, which constructs the graph representation of each residue's structural context; and the hierarchical graph neural network module takes the graph representation of the residue structural context as its input and predicts the probability that the residue is a protein-nucleic acid binding residue.