CN110335640B

CN110335640B - Prediction method of drug-DBPs binding sites

Info

Publication number: CN110335640B
Application number: CN201910616620.0A
Authority: CN
Inventors: 王伟; 吕贺贺; 赵远; 王世勋; 王亚茹; 黄军伟
Original assignee: Henan Normal University
Current assignee: Henan Normal University
Priority date: 2019-07-09
Filing date: 2019-07-09
Publication date: 2022-01-25
Anticipated expiration: 2039-07-09
Also published as: CN110335640A

Abstract

The invention relates to a method for predicting drug-DBPs binding sites and belongs to the technical field of drug-protein binding site prediction. The invention provides a bipartite-network-based method for predicting drug-DBPs binding sites. It takes binding site fragments consisting of three amino acids as the object of study, clusters them according to their physicochemical properties, and establishes a drug-cluster network. It then calculates the drug-cluster interaction scores with the Common Neighbor (CN) method and selects the drug-cluster pairs whose score is greater than the mean standard error; the binding site fragments in those clusters interact significantly with the drug and are the predicted drug-DBPs binding sites. Because no feature information is required, a structure-based link prediction algorithm can be used to predict novel interactions between drugs and DBP binding site fragments. On this basis, the binding mechanism between the drug and the binding site fragments in the prediction results can be analyzed further.

Description

Prediction method of drug-DBPs binding sites

Technical Field

The invention relates to a prediction method of drug-DBPs binding sites, belonging to the technical field of prediction of drug-protein binding sites.

Background

The study of drug-DNA Binding Proteins (DBPs) interaction opens up a new approach for the treatment of genetic diseases and cancers. The binding mechanism of the drug and the binding site of DBPs has important significance for drug development of DBPs and research of related diseases. Therefore, researchers hope to develop new methods for predicting drug-DBPs binding site interactions.

Currently, machine learning methods such as autoencoders (AE) and support vector machines (SVM) are widely used for drug-protein prediction. While these methods perform well, they require feature information about both proteins and drugs. For example, a deep-learning autoencoder has been used to predict drug-target interactions (DTIs); although it requires only the protein sequence, the structural information of drugs is not easy to describe. Such a model also does not produce satisfactory results when the interaction pattern between the drug and the protein is unknown. This is a disadvantage because databases containing DTIs may be only partially annotated or not annotated at all. For example, DrugBank provides annotations only for proteins, not for drug-protein interactions. Furthermore, DTI prediction based on the similarity of chemical structures or protein sequences has limitations, because the assumption that similar drugs share similar proteins is not necessarily correct.

In traditional drug-protein networks, most methods treat the protein as a whole. Few take the drug binding sites of the protein as the object of study, so the binding mechanism between the drug and the protein binding site is ignored.

Disclosure of Invention

The object of the present invention is to provide a method for predicting binding sites of drug-DBPs, which uses a binding site fragment consisting of three amino acids as a subject of study, and which can be used to predict novel drug-DBPs binding site fragment interactions in the absence of characteristic information.

In order to achieve the purpose, the invention adopts the technical scheme that:

a method for predicting binding sites of drug-DBPs, comprising the steps of:

1) establishing a data set consisting of drugs and the DNA binding proteins that bind to them; extracting the binding sites on each DNA binding protein sequence that can bind to the drug, wherein every three binding sites are taken as one binding site fragment;

2) dividing the binding site fragments into different clusters by a hierarchical clustering method according to the amino acid physicochemical properties of the binding site fragments, and constructing a bipartite network of the interactions between the drugs and the clusters according to the binding relationships between the drugs and the DNA binding protein binding site fragments;

3) calculating the strength of the interaction between the drugs and the clusters from the bipartite network, and selecting the drug-cluster pairs with a strong interaction as the prediction group, wherein, in the prediction group, the binding site fragments in a cluster that are not yet known to bind the drug are the predicted drug-DBPs binding sites.

The invention provides a bipartite-network-based method for predicting drug-DBPs binding sites. It takes binding site fragments consisting of three amino acids as the object of study, clusters them according to their physicochemical properties, and establishes a drug-cluster network. It then calculates the drug-cluster interaction scores with the Common Neighbor (CN) method and selects the drug-cluster pairs whose score is greater than or equal to the mean standard error. The binding site fragments in such a cluster interact significantly with the drug, and those fragments that have not previously been described as interacting with the drug are the predicted drug-DBPs binding sites.

Network analysis in the present invention shows that the drug tends to bind to positively charged binding site fragments and that the binding process is more likely to occur within DBPs. Network analysis also reveals some binding characteristics of the drug-DBPs binding site fragments, e.g., the propensity of the drug to bind hydrophobic fragments. In the absence of characteristic information, structure-based link prediction algorithms are used to predict novel drug-DBPs binding site fragment interactions. Based on this, the mechanism of binding of the drug and the binding site fragment in the prediction outcome can be further analyzed.

The binding sites can be partitioned into binding site fragments in various ways. Preferably, three adjacent binding sites on the DNA binding protein sequence in step 1) are used as one binding site fragment. More preferably, starting from the N-terminus of the DNA binding protein sequence, every three successive binding sites are used as one binding site fragment.

Preferably, the amino acid physicochemical properties of the binding site fragment in step 2) are quantified using the following method:

$$\alpha_{tri} = X(\alpha_{00}) + \frac{X(\alpha_{01}) + X(\alpha_{10})}{k};$$

wherein $\alpha_{tri}$ represents the amino acid physicochemical property value of the binding site fragment; $X(\cdot)$ is the physicochemical property value of one amino acid of the binding site fragment, its argument being $\alpha_{00}$, $\alpha_{01}$ or $\alpha_{10}$; $\alpha_{00}$ is the central amino acid, and $\alpha_{01}$ and $\alpha_{10}$ are the flanking amino acids; k is a correction coefficient for the physicochemical property values of the flanking amino acids, with $2 \le k \le 6$;

$X(\cdot)$ is calculated as follows:

$$X(\alpha) = \left(E_1\sqrt{\lambda_1},\ E_2\sqrt{\lambda_2},\ \ldots,\ E_n\sqrt{\lambda_n}\right);$$

wherein $E_1$ to $E_n$ are the feature values of the amino acid for the different physicochemical properties, and $\lambda_1$ to $\lambda_n$ are the weights of the different physicochemical properties; n is the number of leading principal components retained when the amino acid physicochemical properties are analyzed by principal component analysis, and $n \ge 5$.

PCA (Principal Component Analysis) is one of the most widely used dimensionality reduction algorithms. At present, amino acids are represented as vectors of 237 features derived from the public databases SWISSPROT and dbGET. To reduce dimensionality and simplify subsequent analysis, principal component analysis is performed on the 237 features and the first n principal components are retained. Each amino acid can then be expressed as an n-dimensional vector, from which the n-dimensional vector of each binding site fragment is calculated using the above formulas. In the above formula, k is preferably 4.

Preferably, n = 5, and the first 5 principal components of the amino acid physicochemical properties relate to hydrophobicity, amino acid size, the preference of the amino acid for the α-helix, the number of degenerate triplet codons, and the frequency of occurrence of amino acid residues in the β-strand. Specifically, in calculating the 5-dimensional vectors of the amino acids, the weights λ₁, λ₂, λ₃, λ₄, λ₅ of the five principal components are 1961.504, 788.2, 539.776, 276.624 and 244.1, respectively.

Preferably, the number of clusters in step 2) is 80%-120% of the number of drug species in the data set. More preferably, the number of clusters in step 2) is the same as the number of drug species in the data set. In the present invention it is desirable that each drug corresponds to one class of physicochemical groups, so the number of clusters is set equal to the number of drug species in the data set.

Specifically, when constructing the bipartite network of drug interaction with the clusters in step 2), first find clusters where each binding site fragment of a certain DNA binding protein is located, then establish interaction relationships between the drug bound with the DNA binding protein and the clusters, and construct the bipartite network according to the interaction relationships.

The drug set is denoted $D = \{d_1, d_2, \ldots, d_n\}$ and the cluster set $C = \{c_1, c_2, \ldots, c_n\}$; the network is described as a bipartite DC graph $G(D, C, E)$, where $E = \{e_{ij} : d_i \in D,\ c_j \in C\}$. When drug $d_i$ binds cluster $c_j$, there is a link between $d_i$ and $c_j$. The DC bipartite network can be represented by an adjacency matrix $\{a_{ij}\}$: if $d_i$ and $c_j$ are linked, then $a_{ij} = 1$; otherwise $a_{ij} = 0$. For example, the binding site fragments bound by the drug MRC are TLG, PPY and HMG, which are located in cluster 2, cluster 45 and cluster 74, respectively, so the drug MRC is linked to cluster 2, cluster 45 and cluster 74.

Preferably, the interaction score between the drug and the cluster is calculated by the Common Neighbor (CN) method in step 3), with the formula:

$$s_{ij}^{CN} = \left|\Gamma(i) \cap \hat{\Gamma}(j)\right|;$$

wherein $s_{ij}^{CN}$ represents the number of paths connecting drug i and cluster j through two intermediate nodes; $\hat{\Gamma}(j)$ is defined as the set of clusters linked to cluster j through a drug, and $\Gamma(i)$ represents the set of clusters on which drug i acts.

In an actual calculation, for example, if cluster 96 is connected to drug 5JZ and the set of clusters connected to 5JZ is {95,91,7,5}, then $\hat{\Gamma}(96)$ is the set {95,91,7,5}; likewise, if the set of clusters to which the drug MRC is connected is {2,5,6,84,85}, then $\Gamma(\mathrm{MRC})$ is the set {2,5,6,84,85}.

In the significant-interaction analysis, only elements whose value is greater than the mean standard error are considered significant interactions, and those ranking in the top 20% are considered important interactions. Thus, specifically, the drug-cluster pairs with a score greater than the mean standard error are selected in step 3) as the prediction group. Preferably, the drug-cluster pairs whose scores rank in the top 20% are selected in step 3) as the prediction group.

Drawings

FIG. 1 is a flowchart illustrating the entire operation of the method for predicting binding sites of drug-DBPs according to the present invention;

FIG. 2 is a diagram showing the process of generating an amino acid trimer binding site fragment according to the present invention;

FIG. 3 is a schematic diagram of hierarchical clustering of binding site fragments according to the present invention;

FIG. 4 is a schematic diagram of a drug-cluster interaction network constructed in accordance with the present invention;

FIG. 5 is a schematic diagram showing the communication between drug i and protein j via two nodes in the present invention;

FIG. 6 is a graph of the extent of overlap of binding site fragments in different proteins in the data set of the present invention;

FIG. 7 is a graph of hydrophobicity and charge intensity analysis of clusters of the present invention;

FIG. 8 is a graph of the degree of drug distribution in the drug-cluster interaction network of the present invention;

FIG. 9 is a graph of the degree of clustering in a drug-cluster interaction network according to the present invention;

FIG. 10 is a graph showing ROC curves and base lines for similarity indices in three link prediction methods according to the present invention;

FIG. 11 is a graph of a prediction score matrix obtained by the CN method of the present invention;

FIG. 12 is a graph of a predicted drug-cluster interaction network according to the present invention;

FIG. 13 is a representation of the binding mechanism of the drug-binding site fragment in the prediction results of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples.

Although machine learning methods have limitations, the study of drug-protein interaction networks has made significant progress. Network-based inference (NBI) has been successfully applied to the discovery of new drug targets. Similarity algorithms from social network analysis have proven applicable to drug-protein interaction prediction. Similarity-based prediction on heterogeneous drug-protein networks has successfully identified new effects of drugs. Recommendation techniques based on bipartite network projection have been applied to resource transfer within drug-protein networks. Network-based approaches have also been used to relocate specific targets for specific diseases. These network methods provide a new approach to predicting drug-protein interactions. The invention therefore uses a network prediction method to study the interactions at drug-DBPs binding sites.

Example 1 method for predicting binding sites of drug-DBPs

The process of the method for predicting the binding sites of the drug-DBPs in this example is shown in fig. 1, and includes the following steps:

1. The drug-DBPs complexes were obtained from the SC-PDB database (http://bioinfo-pharma.u-strasbg.fr/scPDB/). As of June 2019, 16034 entries, 4782 proteins and 6326 ligands had been published on the SC-PDB website. After all the protein-small molecule complexes were downloaded, 110 drug-DBPs complexes were obtained by screening and used as the data set, involving 110 DNA binding proteins and 97 drugs.

1) DNA binding proteins (110)

1al7 1al8 1cd2 1gg5 1gg5 1jty 1jup 1jus 1kbo 1kbo 1kbq 1kbq 1mq0 1ozq 1p0b 1p0e 1qu2 1qu3 1qvt 1s6q 1s9e 1s9g 1sv5 1w13 1w5v 1w5w 1w5x 1w5y 2b5j 2ban 2be2 2brg 2brh 2brm 2brn 2bro 2c3k 2c3l 2cci 2cgu 2cgv 2cgw 2cgx 2gdo 2h42 2hk9 2hk9 2hkj 2j0d 2vf0 2vg5 2vg7 2vuk 2wkm 2wkz 2wl0 2x0v 2x0w 2x6o 2x9d 2xp2 2xye 2xyf 2ynf 2ynh 2zd1 2ze2 2zoz 3bgr 3bt9 3bti 3btj 3bvb 3cku 3cyw 3d20 3d6y 3d70 3ey4 3gbk 3ha8 3k1o 3ml8 3ml9 3nb5 3o8g 3o8h 3pm1 3qps 3qqa 3s2o 3sfi 3tdl 3zv7 4agc 4agd 4agl 4ago 4agq 4ase 4c8e 4cee 4cef 4ceo 4duh 4gqs 4i22 4i23 4qt3 4u0i。

2) Drugs (97)

X0W,1C9,017,E09,2TC,11D,HST,TCH,RHQ,P74,DFY,NNC,3SF,AZA,G40,P84,TPB,DHP,3B3,3A3,CXG,MRC,ATR,ITC,3C3,WHU,ERY,ML9,FAD,T27,DEQ,RDC,AV9,2PQ,VGH,ET,VIA,F89,BE3,C5P,BRD,O8H,X0V,PFY,ATP,SM1,DFW,NNI,G0T,P83,5AH,IDZ,O8G,0XV,R21,CHD,12C,ADB,DF1,B49,BER,POZ,BE5,340,R22,352,0LI,357,DFZ,ML8,EV6,3AC,PQ0,P96,RAP,IMD,IRE,B0T,D0T,BE4,3D3,ABO,PRL,936,FOL,BE6,5JZ,PRF,DF2,NHG,AXI,65B,MGR,NAP,RLI,ABZ,EUR。

2. Generation of binding site fragments

Drug-DBPs binding is the binding of a drug to binding sites on the DBP sequence, so amino acid trimers are used to represent binding site fragments. First, the sequence of each DBP is obtained and its drug binding sites are labeled. Although the drug binding sites are discontinuous in the DBP sequence, they are close together in the spatial structure; they are therefore treated as approximately continuous and are concatenated into a continuous binding site sequence. A window with a length of 3 amino acids is then slid along this sequence to generate the binding site fragments. For example, the sequence NGMGNG generates the two binding site fragments NGM and GNG. In total, the binding sites of the DBPs generate 3219 binding site fragments (as shown in FIG. 2).
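The windowing step can be illustrated with a minimal sketch. It assumes that the window advances three residues at a time (which is what the NGMGNG example implies) and that a trailing remainder shorter than three residues is discarded; the latter is an assumption, since the text does not say how leftovers are handled. The function name `make_fragments` is hypothetical.

```python
def make_fragments(binding_residues, size=3):
    """Cut a concatenated binding-site sequence into non-overlapping windows.

    Assumption: a trailing remainder shorter than `size` is dropped.
    """
    fragments = []
    for start in range(0, len(binding_residues) - size + 1, size):
        fragments.append(binding_residues[start:start + size])
    return fragments


# Example taken from the text: NGMGNG yields the fragments NGM and GNG.
print(make_fragments("NGMGNG"))
```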

3. Physicochemical Properties of the binding site fragment

Currently, amino acids are represented in vector form by 237 features derived from the public databases SWISSPROT and dbGET. To reduce dimensionality and simplify subsequent analysis, principal component analysis was performed on the 237 features, and the first five principal components were retained (as shown in table 1). The amino acids can be expressed as 5-dimensional vectors.

Specifically, the 237 dimensions are reduced to 5 using PCA. PCA (Principal Component Analysis) is one of the most widely used dimensionality reduction algorithms. These five principal components do not each correspond to a single chemical property. The properties associated with the five principal components are hydrophobicity, amino acid size, the preference of amino acids for the α-helix, the number of degenerate triplet codons, and the frequency of occurrence of amino acid residues in the β-strand.

TABLE 1 vectors and eigenvalues of the top five principal components

Each binding site fragment is represented by a single combination of its amino acids. The representation emphasizes the core position of the middle amino acid by distinguishing it from the flanking amino acids (the order of the flanking amino acids is not distinguished). The physicochemical properties of a binding site fragment are calculated as follows:

$$\alpha_{tri}(\alpha_{01}, \alpha_{00}, \alpha_{10}) = X(\alpha_{00}) + \frac{X(\alpha_{01}) + X(\alpha_{10})}{k};$$

wherein $\alpha_{tri}$ is the 5-dimensional vector representing a binding site fragment; $X(\cdot)$ is the 5-dimensional vector of an amino acid, its argument being $\alpha_{00}$, $\alpha_{01}$ or $\alpha_{10}$; $\alpha_{00}$ is the central (main) amino acid, and $\alpha_{01}$ and $\alpha_{10}$ are the left and right (dependent) amino acids, respectively; k is a coefficient for correcting the physicochemical property values of the flanking amino acids, and k = 4.

Wherein $X(\cdot)$ is calculated as follows:

$$X(\alpha) = \left(E_1\sqrt{\lambda_1},\ E_2\sqrt{\lambda_2},\ \ldots,\ E_5\sqrt{\lambda_5}\right);$$

wherein λ represents the weight of the different amino acid physicochemical properties, and E represents the feature value of the amino acid for each property.

For example, for alanine, E₁ = 0.008, E₂ = 0.134, E₃ = -0.475, E₄ = -0.039, E₅ = -0.181, and λ₁ = 1961.504, λ₂ = 788.2, λ₃ = 539.776, λ₄ = 276.624, λ₅ = 244.10.

Alanine can therefore be expressed as the five-dimensional vector (0.354, 3.762, -11.036, -0.649, -2.828); its first component, for instance, is $0.008 \times \sqrt{1961.504} \approx 0.354$.
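The alanine numbers can be checked with a short sketch. The relation used here, component i equals E_i multiplied by the square root of λ_i, is inferred from the alanine values quoted in the text (it reproduces all five published components) and should be read as an assumption about the omitted formula. The function names are hypothetical, and the trimer of three alanines is only an illustration of the fragment formula with k = 4.

```python
import math

# Weights of the first five principal components, as quoted in the text.
LAMBDAS = [1961.504, 788.2, 539.776, 276.624, 244.10]
# Alanine's values on the five principal components, as quoted in the text.
E_ALANINE = [0.008, 0.134, -0.475, -0.039, -0.181]


def amino_acid_vector(e_values, lambdas=LAMBDAS):
    """Assumed relation: component i is E_i * sqrt(lambda_i)."""
    return [e * math.sqrt(lam) for e, lam in zip(e_values, lambdas)]


def fragment_vector(left, center, right, k=4):
    """alpha_tri = X(center) + (X(left) + X(right)) / k, taken per component."""
    return [c + (l + r) / k for l, c, r in zip(left, center, right)]


alanine = amino_acid_vector(E_ALANINE)
print([round(v, 3) for v in alanine])

# Illustration only: a fragment made of three alanines.
print(fragment_vector(alanine, alanine, alanine))
```

Because the flanking residues enter only through their sum, swapping the left and right amino acids leaves the fragment vector unchanged, which matches the statement that the order of the flanking amino acids is not distinguished.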

4. Clustering of binding site fragments

In order to study the physicochemical properties involved in the binding of drugs to the DBP binding sites in the network, the binding site fragments with similar physicochemical properties are grouped into clusters by a hierarchical clustering method. Each cluster represents a category of binding site fragments, and the clustering is carried out in the five-dimensional space. Since there are 97 drugs and it is desirable that each drug corresponds to one class of physicochemical groups, the binding site fragments are divided into 97 clusters.

Each amino acid is represented as a five-dimensional vector, and each amino acid trimer can likewise be represented as a five-dimensional vector computed from its individual amino acids. The trimers are thus placed in a five-dimensional space, and trimers that lie close to each other are grouped into one class by the hierarchical clustering algorithm. Hierarchical clustering examines points in space and groups them into "clusters" according to a distance measure (here, the Euclidean distance), as shown in FIG. 3. The goal of clustering is to make the distances between points within the same cluster small and the distances between points in different clusters large. The computation uses an existing algorithm implementation, which is publicly available; the code can also be provided.

The hierarchical clustering method is as follows: first, each binding site fragment is treated as a separate class and the distance between every two fragments is calculated. The two classes with the smallest distance are then merged into one class. This is iterated until the preset number of classes is reached. Through clustering, 3210 binding site fragments were represented as 97 clusters. The binding site fragments within each cluster have similar physicochemical properties. The drug-binding site fragment interactions were thus expressed as drug-cluster interactions (1993 drug-cluster interactions), and a drug-cluster interaction network was constructed from these interactions.
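The merge loop described above can be sketched in a few lines. This is a minimal illustration, not the program used for the invention: the linkage rule is not stated in the text, so average linkage is assumed here, and the two-dimensional toy points stand in for the five-dimensional fragment vectors. The function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(c1, c2, points):
    """Average linkage (an assumption): mean pairwise distance between members."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters):
    """Start with one class per point and merge the closest pair of classes
    until only n_clusters classes remain. Returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Toy data: two tight pairs and one isolated point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
print(agglomerate(toy_points, 3))
```

This naive loop recomputes all pairwise distances at every merge, which is acceptable for a toy example; for a few thousand fragments a library implementation would be the practical choice.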

5. Establishment of drug-cluster interaction network

All binding site fragments are first clustered into 97 clusters, each containing fragments of the same type. The cluster to which each binding site fragment of each DNA binding protein belongs is then looked up, and the corresponding drug is linked to those clusters. For example, the binding site fragments bound by the drug MRC are TLG, PPY and HMG, which are located in cluster 2, cluster 45 and cluster 74, respectively, so the drug MRC is linked to cluster 2, cluster 45 and cluster 74 (as shown in FIG. 4).

That is, the clusters in which the protein's binding site fragments are located are found first, and the drug is then linked to these clusters. Likewise, the binding site of protein 1cd2 for the drug FOL is divided into the fragments TSI, PFR, LKR, and so on. These fragments are found in clusters 3, 7 and 8, so an interaction relationship is established between the drug FOL and the corresponding clusters 3, 7 and 8.

Network description:

The drug set is denoted $D = \{d_1, d_2, \ldots, d_n\}$ and the cluster set $C = \{c_1, c_2, \ldots, c_n\}$; the network is described as a bipartite DC graph $G(D, C, E)$, where $E = \{e_{ij} : d_i \in D,\ c_j \in C\}$. When drug $d_i$ binds cluster $c_j$, there is a link between $d_i$ and $c_j$. The DC bipartite network can be represented by an adjacency matrix $\{a_{ij}\}$: if $d_i$ and $c_j$ are linked, then $a_{ij} = 1$; otherwise $a_{ij} = 0$.
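As an illustration of this description, the sketch below builds the drug-cluster links and the 0/1 adjacency matrix from two lookup tables. The MRC fragments and their clusters are taken from the example in the text; the function names, the dictionary layout and the extra cluster 96 column are assumptions made for the example.

```python
# Fragment-to-cluster assignments from the MRC example in the text.
fragment_to_cluster = {"TLG": 2, "PPY": 45, "HMG": 74}
# Binding site fragments bound by each drug (only MRC is shown here).
drug_fragments = {"MRC": ["TLG", "PPY", "HMG"]}


def build_edges(drug_fragments, fragment_to_cluster):
    """Link each drug to every cluster that contains one of its fragments."""
    edges = set()
    for drug, fragments in drug_fragments.items():
        for fragment in fragments:
            edges.add((drug, fragment_to_cluster[fragment]))
    return edges


def adjacency_matrix(edges, drugs, clusters):
    """a[i][j] = 1 if drug i is linked to cluster j, otherwise 0."""
    return [[1 if (d, c) in edges else 0 for c in clusters] for d in drugs]


edges = build_edges(drug_fragments, fragment_to_cluster)
# Cluster 96 is included only to show a zero entry.
matrix = adjacency_matrix(edges, drugs=["MRC"], clusters=[2, 45, 74, 96])
print(sorted(edges))
print(matrix)
```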

6. Structure-based link prediction method

Since only the drug-cluster interactions are known and no other feature information is available, link prediction methods based on structural similarity in the network are used to predict drug-cluster interactions. These methods perform link prediction from the topology of the network alone and need no other feature information.

Based on successful experience in the relevant network, the following three link prediction methods were chosen:

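Common Neighbor (CN) method:

$$s_{ij}^{CN} = \left|\Gamma(i) \cap \hat{\Gamma}(j)\right|$$

Jaccard (JA) method:

$$s_{ij}^{JA} = \frac{\left|\Gamma(i) \cap \hat{\Gamma}(j)\right|}{\left|\Gamma(i) \cup \hat{\Gamma}(j)\right|}$$

Preferential Attachment (PA) method:

$$s_{ij}^{PA} = k_i \times k_j$$

wherein $\hat{\Gamma}(j)$ is defined as the set of clusters linked to cluster j through a drug; e.g., if cluster 96 is connected to drug 5JZ and the set of clusters connected to 5JZ is {95,91,7,5}, then $\hat{\Gamma}(96)$ is the set {95,91,7,5}. $\Gamma(i)$ represents the set of clusters on which drug i acts; e.g., if the set of clusters to which the drug MRC is linked is {2,5,6,84,85}, then $\Gamma(\mathrm{MRC})$ is the set {2,5,6,84,85}. $k_i$ is the degree of drug i; e.g., if the number of clusters to which the drug MRC is linked is 5, then $k_{MRC}$ is 5. $k_j$ is the degree of cluster j; e.g., if cluster 96 is linked only to the drug 5JZ, then $k_{96}$ is 1.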

The CN algorithm used here is a modified Common Neighbor algorithm. In a network with a single type of node, the algorithm takes the number of nodes to which two unconnected nodes are both connected as the basis for the similarity of the two nodes. The drug-target interaction network is composed of two types of nodes, so the Common Neighbor index is modified accordingly (the formula is $s_{ij}^{CN} = |\Gamma(i) \cap \hat{\Gamma}(j)|$, where $\hat{\Gamma}(j)$ is the set of clusters linked to cluster j through a drug).

$s_{ij}^{CN}$ indicates the number of paths through which drug i and cluster j communicate via two intermediate nodes. As shown in FIG. 5, suppose for example that the set of drugs connected to cluster j is {a, b}, that the set of clusters connected to drug a is {j, A}, and that the set of clusters connected to drug b is {j, A}. The set of clusters connected to drug i is {A, B}. Cluster j is connected to A through a and to A through b, so $\hat{\Gamma}(j)$ is {A, A}. $\Gamma(i)$ is the set {A, B}, so $s_{ij}^{CN}$ equals 2: as in the figure, i and j communicate through two paths.

In an actual calculation, for example, cluster 95 is linked to the drugs CHD, 5JZ and P83. The set of clusters connected to CHD is {95,29,7,2}; the set connected to 5JZ is {95,91,7,5}; the set connected to P83 is {95,75,12,1}. Then $\hat{\Gamma}(95)$ is {95,95,95,91,75,29,12,7,7,5,2,1}. If the set of clusters to which the drug MRC is linked is {2,5,6,7,80,87}, then $\Gamma(\mathrm{MRC})$ is the set {2,5,6,7,80,87}. The common elements are 2, 5 and 7; because 7 occurs twice in $\hat{\Gamma}(95)$, the score $s^{CN}$ for MRC and cluster 95 is 4.
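The cluster 95 example can be reproduced with a short sketch. It treats the set of clusters reached from cluster j through its drugs as a multiset (a `Counter`), which is what the repeated entries in the worked example suggest, and it counts each cluster of drug i as many times as it occurs in that multiset. The Preferential Attachment score is written in its standard product-of-degrees form, which is an assumption here. All function names are hypothetical.

```python
from collections import Counter

# Drug -> set of linked clusters, taken from the worked example in the text.
drug_clusters = {
    "CHD": {95, 29, 7, 2},
    "5JZ": {95, 91, 7, 5},
    "P83": {95, 75, 12, 1},
    "MRC": {2, 5, 6, 7, 80, 87},
}


def reached_clusters(drug_clusters, cluster):
    """Multiset of clusters reachable from `cluster` through any drug linked to it."""
    reached = Counter()
    for clusters in drug_clusters.values():
        if cluster in clusters:
            reached.update(clusters)
    return reached


def cn_score(drug_clusters, drug, cluster):
    """Number of drug-cluster-drug-cluster paths between `drug` and `cluster`."""
    reached = reached_clusters(drug_clusters, cluster)
    return sum(reached[c] for c in drug_clusters[drug])


def pa_score(drug_clusters, drug, cluster):
    """Degree of the drug times degree of the cluster (standard form, assumed)."""
    cluster_degree = sum(cluster in cs for cs in drug_clusters.values())
    return len(drug_clusters[drug]) * cluster_degree


print(sorted(reached_clusters(drug_clusters, 95).elements(), reverse=True))
print(cn_score(drug_clusters, "MRC", 95))
```

Equivalently, the CN score is the sum, over the drugs linked to cluster j, of the number of clusters each of them shares with drug i: CHD shares {2, 7} with MRC, 5JZ shares {5, 7}, and P83 shares none, giving 2 + 2 + 0 = 4.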

7. Evaluation of link prediction methods

10-fold cross-validation is used, as it is known to give the lowest bias and variance on sub-datasets. The data set is randomly divided into 10 non-overlapping subsets of equal size. Each time, one subset is selected and the same number of non-interactions is randomly sampled to form the test set (the test set comprises 199 interactions and 199 non-interactions), while the remaining 9 subsets are used to build the network. This process is repeated 10 times, and the false positive rate and true positive rate calculated in each iteration are averaged to produce the final result. In the prediction process, a score is calculated for each drug-cluster pair. Each score value is then used as a threshold: when a pair's score is greater than or equal to the threshold, an interaction is predicted; otherwise, no interaction is predicted.
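The splitting procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the interactions are shuffled with a fixed seed, folds are formed by striding (so their sizes differ by at most one when the total is not divisible by 10), and the negatives are drawn uniformly from the unlinked pairs. The function name and the toy data are hypothetical.

```python
import random


def ten_fold_splits(interactions, non_interactions, n_folds=10, seed=0):
    """Yield (training_links, test_links, test_non_links) for each fold."""
    rng = random.Random(seed)
    shuffled = list(interactions)
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    for index, held_out in enumerate(folds):
        # The other nine folds are the links used to build the network.
        training = [pair for j, fold in enumerate(folds) if j != index
                    for pair in fold]
        # Sample as many non-interactions as there are held-out interactions.
        negatives = rng.sample(non_interactions, len(held_out))
        yield training, held_out, negatives


# Toy data: 100 known drug-cluster links and 200 unlinked pairs.
known = [(d, c) for d in range(20) for c in range(5)]
unknown = [(d, c) for d in range(20) for c in range(5, 15)]
splits = list(ten_fold_splits(known, unknown))
print(len(splits), len(splits[0][0]), len(splits[0][1]), len(splits[0][2]))
```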

The false positive rate (FPR) is defined as:

$$FPR = \frac{FP}{FP + TN}$$

The true positive rate (TPR) is defined as:

$$TPR = \frac{TP}{TP + FN}$$

wherein FP is the number of pairs predicted to interact that do not actually interact; TN is the number of pairs predicted not to interact that indeed do not interact; FN is the number of pairs predicted not to interact that actually do interact; and TP is the number of pairs predicted to interact that indeed interact.
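A minimal sketch of how the two rates trace out a ROC curve is given below. It applies the thresholding rule from the text (a pair is predicted to interact when its score is greater than or equal to the threshold) and uses each distinct score in turn as the threshold. The function names and the four toy scores are assumptions made for the example.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (FPR, TPR) when pairs with score >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if fp + tn else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    return fpr, tpr


def roc_points(scores, labels):
    """One (FPR, TPR) point per distinct score, from the strictest threshold down."""
    thresholds = sorted(set(scores), reverse=True)
    return [rates_at_threshold(scores, labels, t) for t in thresholds]


# Toy test set: label 1 marks a true interaction, 0 a sampled non-interaction.
toy_scores = [4, 3, 2, 1]
toy_labels = [1, 0, 1, 0]
print(roc_points(toy_scores, toy_labels))
```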

Results and discussion

1. Investigating interactive data

X-ray and other biological studies have shown that many proteins contain more than one drug binding site and that the binding sites for these drugs overlap locally. The binding site fragments were therefore analyzed to check their degree of overlap across different proteins (as shown in FIG. 6). The values on the abscissa of FIG. 6 mean the following: 1 represents the number of binding site fragments present in only one cluster; 2 represents the number of binding site fragments present in two clusters; and so on, up to 15, which represents the number of binding site fragments present in fifteen clusters. As can be seen from the figure, more than 65% of the binding site fragments are located on multiple DBPs, which is consistent with the fact that the drug binding sites of proteins partially overlap.

The hydrophobicity and charge strength of proteins play an important role in drug-protein binding processes. The hydrophobicity and charge intensity of the clusters were therefore analyzed (as shown in figure 7). As can be seen from the figure, the drug tends to act on both hydrophobic and positively charged clusters. It is presumed that the drug-DBP binding process occurs inside proteins and DBPs tend to bind negatively charged drug molecules. This provides a guide for the study of drug-DBPs binding processes.

The degree distribution of the network reflects the sparsity of the drug-cluster links. The network was therefore analyzed to examine the degree distributions of the drugs and the clusters (as shown in FIGS. 8-9). As can be seen from FIG. 8, over 87% of the drugs interact with between 15 and 30 clusters. FIG. 9 shows that more than 66% of the binding site clusters interact with fewer than 20 drugs. This indicates that the drug-cluster bipartite network is sparse.
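The degree statistics can be computed directly from the list of links, as in the sketch below. The ten links are taken from examples elsewhere in the text (MRC, FOL and 5JZ) and are far fewer than the 1993 links of the full network, so the printed numbers only illustrate the calculation. The helper name `fraction_in_range` is hypothetical.

```python
from collections import Counter

# A few drug-cluster links from examples in the text (not the full network).
edges = [("MRC", 2), ("MRC", 45), ("MRC", 74),
         ("FOL", 3), ("FOL", 7), ("FOL", 8),
         ("5JZ", 95), ("5JZ", 91), ("5JZ", 7), ("5JZ", 5)]

# Degree of a drug = number of clusters it links to, and vice versa.
drug_degree = Counter(drug for drug, _ in edges)
cluster_degree = Counter(cluster for _, cluster in edges)


def fraction_in_range(degrees, low, high):
    """Share of nodes whose degree lies in the closed interval [low, high]."""
    values = list(degrees.values())
    return sum(low <= v <= high for v in values) / len(values)


print(dict(drug_degree))
print(cluster_degree[7])
print(fraction_in_range(drug_degree, 3, 3))
```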

2. Link prediction method comparison

First, the performance of the three link prediction methods is evaluated; then the prediction mechanisms of the three methods in the network are briefly analyzed; finally, the best method is selected for network prediction.

The three methods perform differently in different networks, so their performance is compared in this network to select the best one. A baseline is obtained by randomly creating invalid interactions to mislead the prediction; the resulting curve serves as the baseline. Compared against the baseline, the CN method shows the best prediction performance of the three methods (as shown in FIG. 10).

The differences in performance result from the different prediction mechanisms of the three methods. The CN method considers only the common neighbors of nodes in the network. The JA method considers not only the common neighbors of a node but also its other neighbors, yet it does not perform as well in this network as the CN method. A possible reason is that the JA method brings in irrelevant nodes linked beyond the second-order paths, which degrades its prediction performance. The PA method considers only the degree of a node and cannot make effective use of the structural information in the network. By comparison, the prediction mechanism of the CN method is the most reliable in this network.

Based on this analysis of performance and prediction mechanism, the CN method was adopted to predict drug-cluster interactions in the network. The score calculated by the CN method is used as the basis of network prediction, and the likelihood of a drug-cluster interaction is judged according to the score.

3. Network prediction

First, a prediction matrix of drug-cluster interactions is established from the link prediction scores; the prediction matrix is then analyzed; finally, the prediction results are verified.

Following the method comparison, the CN method is used to perform link prediction. A prediction matrix of drug-cluster interactions was constructed from the scores calculated by the CN method (as shown in FIG. 11); the values in the matrix are the prediction scores. The prediction matrix contains 7416 non-zero elements, and only those with values greater than 101 (the mean standard error) are considered significant interactions. As a result, there are 3468 significant interactions in the network. Among the significant interactions, those with values greater than 274 (the top 20%) are considered important.
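The two-stage selection can be sketched as below. The significance threshold is passed in as a parameter (101 in the text) because the text does not spell out how the mean standard error is computed; the top 20% is taken among the significant pairs, as in the paragraph above. The function name and the twelve toy scores are assumptions made for the example.

```python
def select_predictions(scores, significance_threshold, top_fraction=0.2):
    """Return (significant, important).

    significant: pairs whose score exceeds the threshold.
    important:   the top `top_fraction` of the significant pairs by score.
    """
    significant = {pair: s for pair, s in scores.items()
                   if s > significance_threshold}
    ranked = sorted(significant, key=significant.get, reverse=True)
    n_top = int(len(ranked) * top_fraction)
    return significant, ranked[:n_top]


# Toy scores for twelve hypothetical drug-cluster pairs.
toy_values = [50, 101, 110, 120, 130, 140, 150, 160, 170, 180, 300, 400]
toy_scores = {("drug_%d" % i, i): v for i, v in enumerate(toy_values)}

significant, important = select_predictions(toy_scores, significance_threshold=101)
print(len(significant))
print(important)
```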

At present, the top-ranked prediction results are verified by cross-validation, random tests, and chemical analysis against published chemical experiment results. The prediction results are judged according to existing chemical knowledge. For example, when a hydrogen atom is covalently bonded to a strongly electronegative N atom and a strongly electronegative F atom of small radius approaches, the hydrogen acts as a bridge between N and F, forming a hydrogen bond of the form N-H…F.

To judge drug-cluster interactions from the scores, it is necessary to check whether the drug-cluster interactions in the prediction results actually occur. Clusters are composed of binding site fragments, so a drug-cluster interaction is validated by validating the interaction between the drug and the fragments within the cluster. The prediction results are visualized in FIG. 12, which shows the predicted drug-cluster interactions. Each cluster consists of binding site fragments; for each fragment, the first letter represents the central amino acid and the letters in parentheses represent the dependent amino acids.

4. Binding mechanism of drug-binding site fragments in the prediction results

The binding mechanisms of the drug-binding site fragments in the prediction results are analyzed. For example, the hydroxyl group in MRC (mupirocin) reacts with the carboxyl group of aspartic acid to form an ester group and water. In some cases, the primary amino acid cannot react chemically with the drug to a significant extent, but instead interacts with the drug through hydrogen bonds. For example, a hydrogen atom in glycine is covalently bonded to a strongly electronegative N atom; when the strongly electronegative, small-radius F atom in VGH (crizotinib, marketed as Xalkori) approaches, the hydrogen acts as a bridge between N and F and forms an N-H…F hydrogen bond. Other interactions (as shown in fig. 13) may be analyzed similarly.

In the present invention, the interaction relationship is determined from the chemical properties of the fragment and of the drug, and the binding mechanism is predicted based on existing chemical reaction mechanisms, such as the reaction between hydrogen ions and hydroxide ions.

Conclusion

In the present invention, a drug-DBPs interaction is described as the interaction between a drug and a binding site fragment of DBPs. To analyze the physicochemical properties of drug binding to the binding site fragments, similar binding site fragments were grouped into clusters, forming drug-cluster interaction relationships. Since only the drug-cluster interaction relationships are known and no other characteristic information is available, a link prediction method based on network structure was selected to predict new drug-cluster interactions. After comparing three link prediction methods, the CN method was used to perform link prediction.
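The network construction and scoring summarized above can be sketched as follows. The scoring formula is not reproduced in this text, so the sketch assumes the usual common-neighbor form, score(i, j) = |Γ(i) ∩ Γ(j)|, where Γ(i) is the set of clusters on which drug i acts and Γ(j) is taken to be the set of clusters that share at least one drug with cluster j. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_network(complexes, fragment_cluster):
    """Build the drug-cluster binary network.

    complexes: iterable of (drug, fragments) pairs, one per drug-DBP complex.
    fragment_cluster: dict mapping fragment id -> cluster id.
    Returns a dict mapping each drug to the set of clusters it acts on.
    """
    drug_clusters = defaultdict(set)
    for drug, fragments in complexes:
        for frag in fragments:
            drug_clusters[drug].add(fragment_cluster[frag])
    return dict(drug_clusters)


def cluster_neighbors(drug_clusters, cluster):
    """Clusters sharing at least one drug with `cluster` (itself excluded)."""
    neighbors = set()
    for clusters in drug_clusters.values():
        if cluster in clusters:
            neighbors |= clusters
    neighbors.discard(cluster)
    return neighbors


def cn_score(drug_clusters, drug, cluster):
    """Assumed common-neighbor score: |Gamma(drug) & Gamma(cluster)|."""
    return len(drug_clusters[drug] & cluster_neighbors(drug_clusters, cluster))


# Toy data (hypothetical): three complexes, four fragments, four clusters.
complexes = [
    ("drug_A", ["f1", "f2"]),
    ("drug_B", ["f2", "f3"]),
    ("drug_C", ["f3", "f4"]),
]
fragment_cluster = {"f1": "c1", "f2": "c2", "f3": "c3", "f4": "c4"}

network = build_network(complexes, fragment_cluster)
score_a_c3 = cn_score(network, "drug_A", "c3")  # c2 links drug_A to c3
score_a_c4 = cn_score(network, "drug_A", "c4")  # no shared cluster
```

In the toy network, drug_A and cluster c3 are connected through cluster c2 (which drug_B also acts on), so that pair receives a non-zero score, while drug_A and c4 have no such connection.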

In addition, the binding mechanisms of the 5 drug-binding site fragments in the prediction results were also analyzed.

Compared to traditional drug-protein networks, the proposed network prediction model enables the discovery of candidate binding site fragments. Drug-DBPs binding sites are extracted from the drug-DBPs complexes, and binding site fragments are used to describe the binding sites. In this way, the mechanism of interaction between the drug and the binding site fragment can be clearly understood.

Claims

1. A method for predicting binding sites of drug-DBPs, comprising the following steps:

3) calculating the degree of interaction relation between the drugs and the clusters according to a binary network, and selecting drug-cluster combinations with strong interaction relation as a prediction group, wherein in the prediction group, binding site fragments which are not combined with the drugs and contained in the clusters are used as predicted drug-DBPs binding sites;

three binding sites which are adjacent in sequence from the N end of the DNA binding protein sequence in the step 1) are taken as a binding site fragment;

the amino acid physicochemical properties of the binding site fragments in step 2) are quantified by the following method:

$\alpha_{tri} = X(\alpha_{00}) + \dfrac{X(\alpha_{01}) + X(\alpha_{10})}{k}$;

wherein $\alpha_{tri}$ represents the amino acid physicochemical property value of the binding site fragment; X(#), where # is $\alpha_{00}$, $\alpha_{01}$ or $\alpha_{10}$, is the physicochemical property value of the corresponding amino acid of the binding site fragment; $\alpha_{00}$ is the central amino acid, and $\alpha_{01}$ and $\alpha_{10}$ are the flanking amino acids; k is a correction coefficient for the physicochemical property values of the flanking amino acids, and 2 ≤ k ≤ 6;

X(#) is calculated as follows:

wherein $E_1$ to $E_n$ are the characteristic values of the physicochemical properties of the different amino acids of the binding site, and $\lambda_1$ to $\lambda_n$ are the weights of those physicochemical properties, with $\lambda_1 = 1961.504$, $\lambda_2 = 788.2$, $\lambda_3 = 539.776$, $\lambda_4 = 276.624$ and $\lambda_5 = 244.10$; n is the number of principal components obtained by principal component analysis of the amino acid physicochemical properties, n = 5, and the first 5 principal components of the amino acid physicochemical properties are hydrophobicity, amino acid size, the preference of the amino acid for the α-helix, the number of degenerate triplet codons, and the frequency of occurrence of amino acid residues in β-strands;

the number of clusters in step 2) is the same as the number of drug species in the data set;

when constructing a binary network of drug and cluster interaction in step 2), firstly finding clusters where each binding site fragment of a certain DNA binding protein is located, then establishing interaction relations between the drug combined with the DNA binding protein and the clusters, and constructing the binary network according to the interaction relations;

in step 3), the interaction relationship score between the drug and the cluster is calculated by the common neighbor (CN) method, wherein the calculation formula is as follows:

wherein Γ(j) is defined as the set of clusters to which cluster j is linked through a drug, and Γ(i) represents the set of clusters on which drug i acts.

2. The method for predicting binding sites of drug-DBPs according to claim 1, wherein: in step 3), drug-cluster combinations with scores greater than the mean standard error are selected as the prediction group.

3. The method for predicting binding sites of drug-DBPs according to claim 1, wherein: in step 3), drug-cluster combinations whose scores rank in the top 20% are selected as the prediction group.
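As an illustration of the fragment property value defined in claim 1 (the central amino acid's value plus the sum of the two flanking values divided by k), a minimal sketch is given below. The function name `alpha_tri`, its argument names, and the numeric inputs are hypothetical; the per-amino-acid values X are taken as given inputs because the formula for X is not reproduced in this text.

```python
def alpha_tri(x_center, x_flank_1, x_flank_2, k=2.0):
    """Fragment property value: X(center) + (X(flank_1) + X(flank_2)) / k.

    x_center, x_flank_1, x_flank_2: per-amino-acid property values X
        (assumed to be computed elsewhere).
    k: correction coefficient for the flanking amino acids, 2 <= k <= 6.
    """
    if not 2 <= k <= 6:
        raise ValueError("k must satisfy 2 <= k <= 6")
    return x_center + (x_flank_1 + x_flank_2) / k


# Toy values (hypothetical): 2.0 + (1.0 + 3.0) / 4
value = alpha_tri(2.0, 1.0, 3.0, k=4)
```

A larger k reduces the contribution of the flanking amino acids relative to the central amino acid.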