CN110335640B - Prediction method of drug-DBPs binding sites - Google Patents

Prediction method of drug-DBPs binding sites

Info

Publication number
CN110335640B
Authority
CN
China
Prior art keywords
drug
binding site
cluster
binding
amino acid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910616620.0A
Other languages
Chinese (zh)
Other versions
CN110335640A (en
Inventor
王伟
吕贺贺
赵远
王世勋
王亚茹
黄军伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Normal University
Original Assignee
Henan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Normal University filed Critical Henan Normal University
Priority to CN201910616620.0A priority Critical patent/CN110335640B/en
Publication of CN110335640A publication Critical patent/CN110335640A/en
Application granted granted Critical
Publication of CN110335640B publication Critical patent/CN110335640B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs

Abstract

The invention relates to a method for predicting drug-DBPs binding sites and belongs to the technical field of drug-protein binding site prediction. The method is based on a bipartite network. It takes binding site fragments consisting of three amino acids as the object of study, clusters the fragments according to their physicochemical properties, and establishes a drug-cluster network. It then calculates the drug-cluster interaction scores with the common neighbor (CN) method and selects the drug-cluster pairs whose score is greater than the mean standard error. The binding site fragments in the selected clusters interact significantly with the drug and are the predicted drug-DBPs binding sites. Because the structure-based link prediction algorithm needs no feature information, it can predict novel interactions between drugs and DBP binding site fragments. On this basis, the binding mechanism of the drug and the binding site fragments in the prediction results can be analyzed further.

Description

Prediction method of drug-DBPs binding sites
Technical Field
The invention relates to a prediction method of drug-DBPs binding sites, belonging to the technical field of prediction of drug-protein binding sites.
Background
The study of drug-DNA Binding Proteins (DBPs) interaction opens up a new approach for the treatment of genetic diseases and cancers. The binding mechanism of the drug and the binding site of DBPs has important significance for drug development of DBPs and research of related diseases. Therefore, researchers hope to develop new methods for predicting drug-DBPs binding site interactions.
Currently, machine learning methods such as autoencoders (AE) and support vector machines (SVM) are widely used for drug-protein prediction. These methods perform well, but they require feature information about both the proteins and the drugs. For example, an autoencoder has been used in deep learning to predict drug-target interactions (DTIs); although it requires only the protein sequence, the structural information of the drug is not easy to describe. Such a model also does not produce satisfactory results when the interaction pattern of the drug and the protein is unknown. This is a drawback because a database containing DTIs may be only partially annotated or not annotated at all. For example, DrugBank provides annotations for proteins but not for drug-protein interactions. Furthermore, DTI predictions based on the similarity of chemical structures or protein sequences have limitations, because the assumption that similar drugs share similar proteins is not necessarily correct.
Many methods built on the traditional drug-protein network treat the protein as a whole. Few take the drug binding sites of the protein as the object of study, so the binding mechanism of the drug and the protein binding site is overlooked.
Disclosure of Invention
The object of the present invention is to provide a method for predicting binding sites of drug-DBPs, which uses a binding site fragment consisting of three amino acids as a subject of study, and which can be used to predict novel drug-DBPs binding site fragment interactions in the absence of characteristic information.
To achieve this object, the invention adopts the following technical solution:
A method for predicting drug-DBPs binding sites, comprising the steps of:
1) establishing a data set consisting of drugs and the DNA binding proteins that bind to them; extracting the binding sites on each DNA binding protein sequence that bind the drug, wherein every three binding sites form one binding site fragment;
2) dividing the binding site fragments into different clusters by a hierarchical clustering method according to the amino acid physicochemical properties of the binding site fragments, and constructing a bipartite network of drug-cluster interactions according to the binding relationships between the drugs and the binding site fragments of the DNA binding proteins;
3) calculating the strength of the interaction between the drugs and the clusters from the bipartite network, and selecting the drug-cluster pairs with strong interactions as the prediction group, wherein, within the prediction group, the binding site fragments in a cluster that are not yet known to bind the drug are taken as the predicted drug-DBPs binding sites.
The invention provides a method for predicting drug-DBPs binding sites based on a bipartite network. The method takes binding site fragments consisting of three amino acids as the object of study, clusters the fragments according to their physicochemical properties, establishes a drug-cluster network, calculates the drug-cluster interaction scores with the CN method, and selects the drug-cluster pairs whose score is greater than or equal to the mean standard error. The binding site fragments in the selected clusters interact significantly with the drug, and those fragments not previously described as interacting with the drug are the predicted drug-DBPs binding sites.
Network analysis in the present invention shows that drugs tend to bind to positively charged binding site fragments and that the binding process is more likely to occur inside the DBPs. Network analysis also reveals some binding characteristics of the drug-DBPs binding site fragments, e.g., the propensity of drugs to bind hydrophobic fragments. Because the structure-based link prediction algorithm needs no feature information, it can predict novel drug-DBPs binding site fragment interactions. On this basis, the binding mechanism of the drug and the binding site fragments in the prediction results can be analyzed further.
The binding sites can be partitioned into fragments in various ways. Preferably, three adjacent binding sites on the DNA binding protein sequence in step 1) are used as one binding site fragment. More preferably, every three binding sites taken in sequence order, starting from the N-terminus of the DNA binding protein sequence, are used as one binding site fragment.
Preferably, the amino acid physicochemical properties of the binding site fragment in step 2) are quantified using the following method:
$\alpha_{tri} = X(\alpha_{00}) + \frac{X(\alpha_{01}) + X(\alpha_{10})}{k}$
wherein $\alpha_{tri}$ represents the amino acid physicochemical property value of the binding site fragment; $X(\cdot)$ is the physicochemical property value of an amino acid, evaluated at $\alpha_{00}$, $\alpha_{01}$ or $\alpha_{10}$; $\alpha_{00}$ is the central amino acid, and $\alpha_{01}$ and $\alpha_{10}$ are the flanking amino acids; k is a correction coefficient for the physicochemical property values of the flanking amino acids, and $2 \le k \le 6$;
$X(\cdot)$ is calculated as follows:
$X(\alpha) = \left(E_1\sqrt{\lambda_1},\ E_2\sqrt{\lambda_2},\ \ldots,\ E_n\sqrt{\lambda_n}\right)$
wherein $E_1 \sim E_n$ are the characteristic values of the different physicochemical properties of the binding site amino acid, and $\lambda_1 \sim \lambda_n$ are the weights of those physicochemical properties; n is the number of principal components retained when the amino acid physicochemical properties are processed by principal component analysis, and $n \ge 5$.
PCA (principal component analysis) is one of the most widely used dimensionality reduction algorithms. Currently, amino acids are represented in vector form by 237 features derived from the public databases SWISSPROT and dbGET. To reduce dimensionality and simplify subsequent analysis, principal component analysis is performed on the 237 features, and the first n principal components are retained. Each amino acid can then be expressed as an n-dimensional vector, from which the n-dimensional vectors of the binding site fragments are calculated with the above formula. In the above formula, k is preferably 4.
Preferably, n = 5, and the first 5 principal components of the amino acid physicochemical properties relate to hydrophobicity, amino acid size, the preference of the amino acid for the α-helix, the number of degenerate triplet codons, and the frequency of occurrence of amino acid residues in the β-strand. Specifically, in the calculation of the 5-dimensional vectors of amino acids, the values of $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_4$, $\lambda_5$ are, respectively: 1961.504, 788.2, 539.776, 276.624, 244.1.
Preferably, the number of clusters in step 2) is 80% -120% of the number of drug species in the data set. More preferably, the number of clusters in step 2) is the same as the number of drug species in the data set. In the present invention it is desirable that each drug is clustered to a class of physicochemical groups, so the number of clusters is the same as the number of drug classes in the data set.
Specifically, when constructing the bipartite network of drug interaction with the clusters in step 2), first find clusters where each binding site fragment of a certain DNA binding protein is located, then establish interaction relationships between the drug bound with the DNA binding protein and the clusters, and construct the bipartite network according to the interaction relationships.
The drug set is denoted as $D = \{d_1, d_2, \ldots, d_n\}$ and the cluster set as $C = \{c_1, c_2, \ldots, c_n\}$; together they are described as a bipartite DC graph $G(D, C, E)$, where $E = \{e_{ij} : d_i \in D, c_j \in C\}$. When drug $d_i$ binds to cluster $c_j$, there is a link between $d_i$ and $c_j$. The DC bipartite network can be represented by an adjacency matrix $\{a_{ij}\}$: if $d_i$ and $c_j$ are linked, then $a_{ij} = 1$; otherwise $a_{ij} = 0$. For example, the binding site fragments bound by the drug MRC are TLG, PPY and HMG, which are located in cluster 2, cluster 45 and cluster 74, respectively, so the drug MRC is linked to cluster 2, cluster 45 and cluster 74.
Preferably, the interaction score between the drug and the cluster in step 3) is calculated by the common neighbor (CN) method, with the formula:
$S_{ij}^{CN} = |\Gamma(i) \cap \hat{\Gamma}(j)|$
wherein $S_{ij}^{CN}$ represents the number of paths that connect drug i and cluster j through two intermediate nodes; $\hat{\Gamma}(j)$ is defined as the set of clusters linked to cluster j through drugs, and $\Gamma(i)$ represents the set of clusters on which drug i acts.
In an actual calculation, for example, cluster 96 is linked to drug 5JZ, and the set of clusters linked to 5JZ is {95,91,7,5}; then $\hat{\Gamma}(96)$ is the set {95,91,7,5}. Likewise, if the set of clusters linked to the drug MRC is {2,5,6,84,85}, then $\Gamma(\mathrm{MRC})$ represents the set {2,5,6,84,85}.
In the significant interaction analysis, only those elements whose values are greater than the mean standard error are considered significant interactions, while those ranking in the top 20% are considered important interactions. Thus, in particular, the drug-cluster pairs with a score greater than the mean standard error are selected in step 3) as the prediction group. Preferably, the drug-cluster pairs whose scores rank in the top 20% are selected in step 3) as the prediction group.
Drawings
FIG. 1 is a flowchart illustrating the entire operation of the method for predicting binding sites of drug-DBPs according to the present invention;
FIG. 2 is a diagram showing the process of generating an amino acid trimer binding site fragment according to the present invention;
FIG. 3 is a schematic diagram of hierarchical clustering of binding site fragments according to the present invention;
FIG. 4 is a schematic diagram of a drug-cluster interaction network constructed in accordance with the present invention;
FIG. 5 is a schematic diagram showing the communication between drug i and protein j via two nodes in the present invention;
FIG. 6 is a graph of the extent of overlap of binding site fragments in different proteins in the data set of the present invention;
FIG. 7 is a graph of hydrophobicity and charge intensity analysis of clusters of the present invention;
FIG. 8 is a graph of the degree distribution of drugs in the drug-cluster interaction network of the present invention;
FIG. 9 is a graph of the degree distribution of clusters in the drug-cluster interaction network of the present invention;
FIG. 10 is a graph showing the ROC curves of the similarity indices of the three link prediction methods, together with the baseline, according to the present invention;
FIG. 11 is a graph of a prediction score matrix obtained by the CN method of the present invention;
FIG. 12 is a graph of a predicted drug-cluster interaction network according to the present invention;
FIG. 13 is a representation of the binding mechanism of the drug-binding site fragment in the prediction results of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples.
Although machine learning methods have limitations, the study of drug-protein interaction networks has made significant progress. Network-based inference (NBI) has been applied successfully to the discovery of new drug targets. Similarity algorithms from social networks have proven applicable to drug-protein interaction prediction. Similarity-based drug-protein prediction in heterogeneous networks has successfully identified targets for novel drug effects. Recommendation techniques based on bipartite network projection have been applied to resource transfer within drug-protein networks. Network-based approaches have repositioned specific targets for specific diseases. These network methods provide a new way of predicting drug-protein interactions. The invention therefore uses a network prediction method to study the interactions of drug-DBPs binding sites.
Example 1 method for predicting binding sites of drug-DBPs
The process of the method for predicting the binding sites of the drug-DBPs in this example is shown in fig. 1, and includes the following steps:
1. The drug-DBPs complexes were obtained from the SC-PDB database (http://bioinfo-pharma.u-strasbg.fr/scPDB/). As of June 2019, the SC-PDB website listed 16034 entries, 4782 proteins and 6326 ligands. After all protein-small molecule complexes were downloaded, screening yielded 110 drug-DBPs complexes as the data set, involving 110 DNA binding proteins and 97 drugs.
1) DNA binding proteins (110)
1al7 1al8 1cd2 1gg5 1gg5 1jty 1jup 1jus 1kbo 1kbo 1kbq 1kbq 1mq0 1ozq 1p0b 1p0e 1qu2 1qu3 1qvt 1s6q 1s9e 1s9g 1sv5 1w13 1w5v 1w5w 1w5x 1w5y 2b5j 2ban 2be2 2brg 2brh 2brm 2brn 2bro 2c3k 2c3l 2cci 2cgu 2cgv 2cgw 2cgx 2gdo 2h42 2hk9 2hk9 2hkj 2j0d 2vf0 2vg5 2vg7 2vuk 2wkm 2wkz 2wl0 2x0v 2x0w 2x6o 2x9d 2xp2 2xye 2xyf 2ynf 2ynh 2zd1 2ze2 2zoz 3bgr 3bt9 3bti 3btj 3bvb 3cku 3cyw 3d20 3d6y 3d70 3ey4 3gbk 3ha8 3k1o 3ml8 3ml9 3nb5 3o8g 3o8h 3pm1 3qps 3qqa 3s2o 3sfi 3tdl 3zv7 4agc 4agd 4agl 4ago 4agq 4ase 4c8e 4cee 4cef 4ceo 4duh 4gqs 4i22 4i23 4qt3 4u0i.
2) Drugs (97)
X0W, 1C9, 017, E09, 2TC, 11D, HST, TCH, RHQ, P74, DFY, NNC, 3SF, AZA, G40, P84, TPB, DHP, 3B3, 3A3, CXG, MRC, ATR, ITC, 3C3, WHU, ERY, ML9, FAD, T27, DEQ, RDC, AV9, 2PQ, VGH, ET, VIA, F89, BE3, C5P, BRD, O8H, X0V, PFY, ATP, SM1, DFW, NNI, G0T, P83, 5AH, IDZ, O8G, 0XV, R21, CHD, 12C, ADB, DF1, B49, BER, POZ, BE5, 340, R22, 352, 0LI, 357, DFZ, ML8, EV6, 3AC, PQ0, P96, RAP, IMD, IRE, B0T, D0T, BE4, 3D3, ABO, PRL, 936, FOL, BE6, 5JZ, PRF, DF2, NHG, AXI, 65B, MGR, NAP, RLI, ABZ, EUR.
2. Generation of binding site fragments
Drug-DBPs binding is the binding of a drug to binding site fragments on the DBP sequence, so amino acid trimers are used to represent binding site fragments. First, the sequences of the DBPs are obtained and the drug binding sites are labeled. Although the drug binding sites are discontinuous in the DBP sequence, they are close together in the spatial structure; the binding sites are therefore treated as approximately continuous and are concatenated into a continuous binding site sequence. A window with a length of 3 amino acids is then slid along this sequence to generate the binding site fragments. For example, the sequence NGMGNG generates two binding site fragments, NGM and GNG. In total, the binding sites of the DBPs generate 3219 binding site fragments (as shown in FIG. 2).
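The windowing step can be illustrated with a short Python sketch. It assumes, based on the NGMGNG example, that the window advances by three residues so that fragments do not overlap; the function name and the decision to drop trailing residues that do not fill a window are illustrative assumptions, not details stated in the patent.

```python
def binding_site_fragments(site_sequence, width=3):
    """Split a concatenated binding-site sequence into non-overlapping fragments.

    Assumption: the window moves by `width` residues, and a trailing
    remainder shorter than `width` is discarded.
    """
    fragments = []
    for start in range(0, len(site_sequence) - width + 1, width):
        fragments.append(site_sequence[start:start + width])
    return fragments


# The example from the text: NGMGNG yields NGM and GNG.
print(binding_site_fragments("NGMGNG"))
```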
3. Physicochemical Properties of the binding site fragment
Currently, amino acids are represented in vector form by 237 features derived from the public databases SWISSPROT and dbGET. To reduce dimensionality and simplify subsequent analysis, principal component analysis was performed on the 237 features, and the first five principal components were retained (as shown in table 1). The amino acids can be expressed as 5-dimensional vectors.
Specifically, PCA (principal component analysis), one of the most widely used dimensionality reduction algorithms, is used to reduce the 237 dimensions to 5. None of these five principal components corresponds to a single chemical property. The properties associated with the five principal components are hydrophobicity, amino acid size, the preference of amino acids for the α-helix, the number of degenerate triplet codons, and the frequency of occurrence of amino acid residues in the β-strand.
TABLE 1. Vectors and eigenvalues of the first five principal components
[Table 1 is supplied as an image in the original publication.]
Each binding site fragment is represented by a combination of its amino acids. The representation emphasizes the core position of the middle amino acid by distinguishing it from the flanking amino acids (the order of the flanking amino acids is not distinguished). The physicochemical properties of a binding site fragment are calculated as follows:
$\alpha_{tri}(\alpha_{01}\alpha_{00}\alpha_{10}) = X(\alpha_{00}) + \frac{X(\alpha_{01}) + X(\alpha_{10})}{k}$
wherein $\alpha_{tri}$ is the 5-dimensional vector of the binding site fragment, and $X(\cdot)$ is the 5-dimensional vector of an amino acid, evaluated at $\alpha_{00}$, $\alpha_{01}$ or $\alpha_{10}$; $\alpha_{00}$ is the central (main) amino acid, and $\alpha_{01}$ and $\alpha_{10}$ are the left and right (dependent) amino acids, respectively; k is a coefficient that corrects the physicochemical property values of the flanking amino acids, and k = 4.
$X(\cdot)$ is calculated as follows:
$X(\alpha) = \left(E_1\sqrt{\lambda_1},\ E_2\sqrt{\lambda_2},\ \ldots,\ E_5\sqrt{\lambda_5}\right)$
wherein λ represents the weights of the different amino acid physicochemical properties, and E represents the characteristic values of those properties.
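For example, for alanine, $E_1 = 0.008$, $E_2 = 0.134$, $E_3 = -0.475$, $E_4 = -0.039$, $E_5 = -0.181$, and $\lambda_1 = 1961.504$, $\lambda_2 = 788.2$, $\lambda_3 = 539.776$, $\lambda_4 = 276.624$, $\lambda_5 = 244.10$.
$X(\mathrm{Ala}) = \left(0.008\sqrt{1961.504},\ 0.134\sqrt{788.2},\ -0.475\sqrt{539.776},\ -0.039\sqrt{276.624},\ -0.181\sqrt{244.10}\right)$
Alanine can therefore be expressed as the five-dimensional vector (0.354, 3.762, -11.036, -0.649, -2.828).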
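The alanine figures can be checked numerically. The Python sketch below assumes that each component is the E value multiplied by the square root of the matching λ; this reading is inferred from the alanine numbers quoted above, which it reproduces to three decimals, and the function names are illustrative. The fragment function applies the trimer formula with k = 4.

```python
import math

# Values quoted in the text for alanine (E) and for the five components (lambda).
LAMBDAS = [1961.504, 788.2, 539.776, 276.624, 244.10]
ALANINE_E = [0.008, 0.134, -0.475, -0.039, -0.181]


def amino_acid_vector(e_values, lambdas=LAMBDAS):
    """Assumed form: component i is E_i * sqrt(lambda_i)."""
    return [e * math.sqrt(lam) for e, lam in zip(e_values, lambdas)]


def fragment_vector(center, left, right, k=4):
    """Trimer vector: X(center) + (X(left) + X(right)) / k, component-wise."""
    return [c + (l + r) / k for c, l, r in zip(center, left, right)]


alanine = amino_acid_vector(ALANINE_E)
print([round(x, 3) for x in alanine])

# Illustrative only: a trimer made of three alanines.
print([round(x, 3) for x in fragment_vector(alanine, alanine, alanine)])
```

Because the flanking residues enter only through their sum, swapping the left and right residues gives the same fragment vector, which matches the statement that their order is not distinguished.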
4. Clustering of binding site fragments
To study the physicochemical properties involved when drugs bind to the binding sites of DBPs in the network, binding site fragments with similar physicochemical properties are grouped into clusters by hierarchical clustering in the five-dimensional space, and each cluster represents one category of binding site fragments. Since there are 97 drugs, and it is generally desirable that each drug correspond to one class of physicochemical groups, the binding site fragments are divided into 97 clusters.
Each amino acid is represented as a five-dimensional vector, and each amino acid trimer is represented as a five-dimensional vector computed from its individual amino acids and placed in the five-dimensional space. Amino acid trimers that are close to each other are grouped into one class by the hierarchical clustering algorithm. Hierarchical clustering examines points in space and groups them into "clusters" according to a distance measure (here, the Euclidean distance), as shown in FIG. 3. The goal of clustering is to make the distances between points within the same cluster small and the distances between points in different clusters large. The calculation is carried out with an existing algorithm implementation; such programs are available online, and the code can also be provided.
The hierarchical clustering method proceeds as follows. First, each binding site fragment is treated as a separate class, and the distance between every pair of fragments is calculated. The two fragments with the smallest distance are then merged into the same class. Finally, this step is iterated until the preset number of classes is reached. By clustering, the 3210 binding site fragments were grouped into 97 clusters. The binding site fragments within each cluster have similar physicochemical properties. The drug-binding site fragment interactions were therefore expressed as drug-cluster interactions (1993 drug-cluster interactions), and a drug-cluster interaction network was constructed from these interactions.
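The merge loop described above can be sketched in plain Python. The text does not say how the distance between two multi-member classes is measured, so average linkage is assumed here purely for illustration. The function name and the toy 2-D points are also hypothetical; the patent's vectors are five-dimensional, and `math.dist` handles either case. This naive version is far too slow for thousands of fragments and only shows the logic.

```python
import math


def agglomerative_clusters(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Assumption: average linkage (mean pairwise Euclidean distance).
    Returns clusters as lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters


# Toy data: two tight pairs and one isolated point.
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
print(agglomerative_clusters(toy_points, 3))
```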
5. Establishment of drug-cluster interaction network
All binding site fragments are first clustered into 97 clusters, each containing fragments of the same type. The cluster to which each binding site fragment of each DNA binding protein belongs is then looked up, and the corresponding drug is linked to those clusters. For example, the binding site fragments bound by the drug MRC are TLG, PPY and HMG, which are located in cluster 2, cluster 45 and cluster 74, respectively, so the drug MRC is linked to cluster 2, cluster 45 and cluster 74 (as shown in FIG. 4).
That is, the clusters containing the protein binding site fragments are found first, and the drug is then linked to these clusters. Similarly, the binding site of protein 1cd2 and drug FOL is divided into the fragments TSI, PFR, LKR, …. These fragments are found in clusters 3, 7 and 8, so an interaction is established between the drug FOL and the corresponding clusters 3, 7 and 8.
Network description:
The drug set is denoted as $D = \{d_1, d_2, \ldots, d_n\}$ and the cluster set as $C = \{c_1, c_2, \ldots, c_n\}$; together they are described as a bipartite DC graph $G(D, C, E)$, where $E = \{e_{ij} : d_i \in D, c_j \in C\}$. When drug $d_i$ binds to cluster $c_j$, there is a link between $d_i$ and $c_j$. The DC bipartite network can be represented by an adjacency matrix $\{a_{ij}\}$: if $d_i$ and $c_j$ are linked, then $a_{ij} = 1$; otherwise $a_{ij} = 0$.
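As an illustration of how the links and the adjacency matrix might be assembled, the Python sketch below uses the MRC example from the text. The dictionary layout and function names are assumptions made for the example, not structures described in the patent.

```python
def build_bipartite_network(drug_fragments, fragment_cluster):
    """Map each drug to the set of clusters that contain its binding site fragments."""
    network = {}
    for drug, fragments in drug_fragments.items():
        network[drug] = {fragment_cluster[f] for f in fragments}
    return network


def adjacency_matrix(network, drugs, clusters):
    """a[i][j] = 1 if drug i is linked to cluster j, otherwise 0."""
    return [[1 if c in network.get(d, set()) else 0 for c in clusters]
            for d in drugs]


# Example from the text: MRC binds TLG, PPY and HMG (clusters 2, 45 and 74).
drug_fragments = {"MRC": ["TLG", "PPY", "HMG"]}
fragment_cluster = {"TLG": 2, "PPY": 45, "HMG": 74}

network = build_bipartite_network(drug_fragments, fragment_cluster)
print(network)
print(adjacency_matrix(network, ["MRC"], [2, 3, 45, 74]))
```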
6. Structure-based link prediction method
Since only the drug-cluster interactions are known and no other feature information is available, link prediction methods based on structural similarity in the network are used to predict drug-cluster interactions. These methods rely only on the topology of the network and need no other feature information for link prediction.
Based on their successful use in related networks, the following three link prediction methods were chosen:
Common Neighbor (CN) method:
$S_{ij}^{CN} = |\Gamma(i) \cap \hat{\Gamma}(j)|$
Jaccard (JA) method:
$S_{ij}^{JA} = \frac{|\Gamma(i) \cap \hat{\Gamma}(j)|}{|\Gamma(i) \cup \hat{\Gamma}(j)|}$
Preferential Attachment (PA) method:
$S_{ij}^{PA} = k_i \cdot k_j$
wherein $\hat{\Gamma}(j)$ is defined as the set of clusters linked to cluster j through drugs. For example, cluster 96 is linked to drug 5JZ, and the set of clusters linked to 5JZ is {95,91,7,5}; then $\hat{\Gamma}(96)$ is the set {95,91,7,5}. $\Gamma(i)$ represents the set of clusters on which drug i acts; for example, if the set of clusters linked to the drug MRC is {2,5,6,84,85}, then $\Gamma(\mathrm{MRC})$ represents the set {2,5,6,84,85}. $k_i$ is the degree of drug i; for example, if the drug MRC is linked to 5 clusters, then $k_{MRC}$ is 5. $k_j$ is the degree of cluster j; for example, if cluster 96 is linked only to the drug 5JZ, then $k_{96}$ is 1.
The CN algorithm used here is a modified common neighbor algorithm. In a network with a single type of node, the algorithm takes the number of neighbors shared by two unconnected nodes as the basis for the similarity of the two nodes. Because the drug-target interaction network is composed of two types of nodes, the common neighbor index is modified (the formula is $S_{ij}^{CN} = |\Gamma(i) \cap \hat{\Gamma}(j)|$, where $\Gamma(i)$ is the set of clusters linked to drug i and $\hat{\Gamma}(j)$ is the set of clusters linked to cluster j through drugs).
$S_{ij}^{CN}$ indicates the number of paths that connect drug i and cluster j through two intermediate nodes. As shown in FIG. 5, for example, the set of drugs linked to cluster j is {a, b}, where the set of clusters linked to drug a is {j, A} and the set of clusters linked to drug b is {j, A}. The set of clusters linked to drug i is {A, B}. Since j is connected to A through a and also through b, $\hat{\Gamma}(j)$ is {A, A}. $\Gamma(i)$ is the set {A, B}, so $S_{ij}^{CN}$ is equal to 2. As FIG. 5 shows, i and j are connected through two paths.
In an actual calculation, for example, cluster 95 is linked to the drugs CHD, 5JZ and P83. The set of clusters linked to CHD is {95,29,7,2}; the set of clusters linked to 5JZ is {95,91,7,5}; the set of clusters linked to P83 is {95,75,12,1}. Then $\hat{\Gamma}(95)$ is {95,95,95,91,75,29,12,7,7,5,2,1}. If the set of clusters linked to the drug MRC is {2,5,6,7,80,87}, then $\Gamma(\mathrm{MRC})$ represents the set {2,5,6,7,80,87}. The intersection is {2,5,7}; because cluster 7 is reached through two drugs, it is counted twice, and the score $S_{\mathrm{MRC},95}^{CN}$ is 4.
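The worked example for cluster 95 and the drug MRC can be reproduced with a few lines of Python. The sketch treats the clusters reachable from cluster j as a multiset, since the example counts cluster 7 twice. It also computes the degree-product (PA) score from the degrees $k_i$ and $k_j$ described earlier. The function names and the dictionary layout are illustrative assumptions, and the sketch is meant for drug-cluster pairs that are not already linked.

```python
from collections import Counter


def second_order_clusters(cluster, drug_clusters):
    """Multiset of clusters reached from `cluster` through each drug linked to it."""
    reached = Counter()
    for clusters in drug_clusters.values():
        if cluster in clusters:
            reached.update(clusters)
    return reached


def cn_score(drug, cluster, drug_clusters):
    """Number of drug-cluster-drug-cluster paths between `drug` and `cluster`."""
    reached = second_order_clusters(cluster, drug_clusters)
    return sum(reached[c] for c in drug_clusters[drug])


def pa_score(drug, cluster, drug_clusters):
    """Degree of the drug multiplied by the degree of the cluster."""
    k_drug = len(drug_clusters[drug])
    k_cluster = sum(1 for clusters in drug_clusters.values() if cluster in clusters)
    return k_drug * k_cluster


# Cluster sets from the worked example in the text.
drug_clusters = {
    "CHD": {95, 29, 7, 2},
    "5JZ": {95, 91, 7, 5},
    "P83": {95, 75, 12, 1},
    "MRC": {2, 5, 6, 7, 80, 87},
}

print(cn_score("MRC", 95, drug_clusters))  # 4, as in the text
```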
7. Evaluation of link prediction methods
10-fold cross-validation is used because it is known to give the lowest bias and variance on the sub-datasets. The data set is randomly divided into 10 non-overlapping subsets of equal size. In each round, one subset is held out and the same number of non-interactions is randomly sampled to form the test set (each test set comprises 199 interactions and 199 non-interactions). The remaining 9 subsets are used to build the network. This process is repeated 10 times, and the false positive rates and true positive rates calculated in each iteration are averaged to produce the final score. In the prediction process, a score is calculated for each drug-cluster pair, and the scores are then used as thresholds: when a pair's score is greater than or equal to the threshold, an interaction is predicted; otherwise, no interaction is predicted.
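The splitting scheme can be sketched as follows. This is a minimal Python illustration under stated assumptions: interactions are hashable (drug, cluster) pairs, negatives are drawn from candidate pairs with no known link, and the function name, seed and toy data are hypothetical.

```python
import random


def ten_fold_splits(interactions, candidate_pairs, n_folds=10, seed=0):
    """Yield (training_links, test_positives, test_negatives) for each fold."""
    rng = random.Random(seed)
    positives = list(interactions)
    rng.shuffle(positives)
    known = set(positives)
    negatives_pool = [p for p in candidate_pairs if p not in known]
    # Non-overlapping folds of (near-)equal size.
    folds = [positives[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        held = set(held_out)
        training = [p for p in positives if p not in held]
        # Sample as many non-interactions as held-out interactions.
        negatives = rng.sample(negatives_pool, len(held_out))
        yield training, held_out, negatives


# Toy data: 10 drugs x 6 clusters with 20 known links (illustrative only).
toy_links = [(d, c) for d in range(10) for c in range(2)]
toy_pairs = [(d, c) for d in range(10) for c in range(6)]
splits = list(ten_fold_splits(toy_links, toy_pairs))
print(len(splits), len(splits[0][0]), len(splits[0][1]), len(splits[0][2]))  # 10 18 2 2
```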
The false positive rate (FPR) is defined as:
$FPR = \frac{FP}{FP + TN}$
The true positive rate (TPR) is defined as:
$TPR = \frac{TP}{TP + FN}$
wherein FP is the number of pairs predicted to interact that do not actually interact; TN is the number of pairs predicted not to interact that indeed do not interact; FN is the number of pairs predicted not to interact that actually do interact; and TP is the number of pairs predicted to interact that indeed interact.
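The two rates can be computed for any threshold, and sweeping the threshold over the observed scores gives the points of an ROC curve. The Python sketch below uses the standard definitions FPR = FP / (FP + TN) and TPR = TP / (TP + FN) and follows the text in predicting an interaction when the score is greater than or equal to the threshold. The function names and the toy scores are illustrative.

```python
def fpr_tpr(scores, labels, threshold):
    """Return (FPR, TPR) when pairs scoring >= threshold are predicted to interact."""
    tp = fp = tn = fn = 0
    for score, interacts in zip(scores, labels):
        predicted = score >= threshold
        if predicted and interacts:
            tp += 1
        elif predicted:
            fp += 1
        elif interacts:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, tpr


def roc_points(scores, labels):
    """Use every distinct score as a threshold, from highest to lowest."""
    return [fpr_tpr(scores, labels, t) for t in sorted(set(scores), reverse=True)]


# Toy test set: two interactions and two non-interactions.
toy_scores = [4, 0, 2, 1]
toy_labels = [True, False, True, False]
print(roc_points(toy_scores, toy_labels))
```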
Results and discussion
1. Investigating interactive data
X-ray and other biological studies have shown that many proteins contain more than one drug binding site and that the binding sites for these drugs overlap locally. The binding site fragments were analyzed to check the degree of overlap of binding site fragments in different proteins (as shown in FIG. 6). The numerical values on the abscissa in FIG. 6 are read as follows: 1 represents the number of binding site fragments present in only one cluster; 2 represents the number of binding site fragments present in two clusters; and so on, up to 15, which represents the number of binding site fragments present in fifteen clusters. As can be seen from the figure, more than 65% of the binding site fragments are located on multiple DBPs, which is consistent with the fact that the drug binding sites of proteins partially overlap.
The hydrophobicity and charge strength of proteins play an important role in drug-protein binding processes. The hydrophobicity and charge intensity of the clusters were therefore analyzed (as shown in figure 7). As can be seen from the figure, the drug tends to act on both hydrophobic and positively charged clusters. It is presumed that the drug-DBP binding process occurs inside proteins and DBPs tend to bind negatively charged drug molecules. This provides a guide for the study of drug-DBPs binding processes.
The degree distribution of the network reflects the sparsity of the drug-cluster links. The network was therefore analyzed to examine the degree distributions of clusters and drugs (as shown in FIGS. 8-9). As can be seen from FIG. 8, over 87% of the drugs interact with between 15 and 30 clusters. FIG. 9 shows that more than 66% of the binding site clusters interact with fewer than 20 drugs. This indicates that the drug-cluster bipartite network is sparse.
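Degree statistics of this kind can be tallied directly from the drug-to-cluster links. A minimal Python sketch, assuming the network is stored as a dictionary from each drug to its set of clusters; the names and the toy network are illustrative and are not the patent's data.

```python
from collections import Counter


def degrees(drug_clusters):
    """Return (drug_degree, cluster_degree) dictionaries for a bipartite network."""
    drug_degree = {drug: len(clusters) for drug, clusters in drug_clusters.items()}
    cluster_degree = Counter(
        c for clusters in drug_clusters.values() for c in clusters
    )
    return drug_degree, dict(cluster_degree)


def share_in_range(degree_map, low, high):
    """Fraction of nodes whose degree lies in [low, high]."""
    if not degree_map:
        return 0.0
    hits = sum(1 for k in degree_map.values() if low <= k <= high)
    return hits / len(degree_map)


# Toy network with three drugs (illustrative only).
toy_network = {"D1": {1, 2, 3}, "D2": {2, 3}, "D3": {3}}
drug_degree, cluster_degree = degrees(toy_network)
print(drug_degree, cluster_degree, share_in_range(drug_degree, 2, 3))
```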
2. Link prediction method comparison
Firstly, evaluating the performances of the three link prediction methods, then briefly analyzing the prediction mechanisms of the three methods in the network, and finally selecting the optimal prediction method for network prediction.
The three methods perform differently in different networks, so their performance is compared in this network to select the best one. Randomly created invalid interactions are used to mislead the prediction, and the resulting curve serves as the baseline. Compared with the other two methods and with the baseline, the CN method shows the best prediction performance (as shown in FIG. 10).
The three methods differ in their prediction mechanisms, which leads to the differences in performance. The CN method considers only the common neighbors of nodes in the network. The JA method considers not only the common neighbors of the nodes but also their other neighbors, yet it does not perform as well in this network as the CN method. A possible reason is that irrelevant nodes linked beyond the second-order path degrade the prediction performance of the JA method. The PA method considers only the degree of a node and cannot make effective use of the structural information in the network. By comparison, the prediction mechanism of the CN method is more reliable in this network.
Based on the analysis of performance and prediction mechanisms, the CN method was adopted to predict drug-cluster interactions in the network. The score calculated by the CN method is used as the basis of network prediction, and the likelihood of a drug-cluster interaction is judged from the score.
3. Network prediction
First, a prediction matrix of drug-cluster interactions is built from the link prediction scores; then the prediction matrix is analyzed; finally, the prediction results are verified.
Following the method comparison, the CN method is used for link prediction. A prediction matrix of drug-cluster interactions was constructed from the scores calculated by the CN method (as shown in fig. 11); the values in the matrix are the prediction scores. The prediction matrix contains 7416 non-zero elements, and only those with values greater than 101 (the mean standard error) are considered significant interactions. This gives 3468 significant interactions in the network. Among the significant interactions, those with values greater than 274 (the top 20%) are considered important.
At present, the top-ranked prediction results are checked by the statistical methods of cross-validation and random testing, and by chemical analysis against published chemical experiment results. The prediction results are then judged according to existing chemical knowledge. For example, when a hydrogen atom is covalently bonded to a strongly electronegative N atom and a strongly electronegative, small-radius F atom approaches, the hydrogen acts as a bridge between N and F and forms a hydrogen bond of the form N-H…F.
To judge drug-cluster interactions from the scores, it is necessary to examine whether the drug-cluster interactions in the prediction results actually hold. Clusters are composed of binding site fragments, so a drug-cluster interaction is validated by validating the interactions between the drug and the fragments within the cluster. The prediction results are visualized in fig. 12, which shows the predicted drug-cluster interactions. Each cluster consists of binding site fragments; in a fragment label, the first letter represents the central amino acid of the fragment and the letters in parentheses represent the flanking amino acids.
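The two-level filtering of the prediction matrix can be sketched as follows. This is an illustrative Python sketch only: it assumes the non-zero scores are held in a dictionary keyed by (drug, cluster), takes the significance threshold as a given input rather than deriving it, and selects the top 20% by rank among the significant entries. The function name and the toy scores are hypothetical.

```python
def split_by_score(scores, significance_threshold, top_fraction=0.2):
    """Return (significant, important) subsets of a {(drug, cluster): score} dict.

    significant: entries with score strictly above the threshold.
    important:   the highest-scoring `top_fraction` of the significant entries.
    """
    significant = {pair: s for pair, s in scores.items() if s > significance_threshold}
    ranked = sorted(significant.items(), key=lambda item: item[1], reverse=True)
    n_top = int(len(ranked) * top_fraction)
    important = dict(ranked[:n_top])
    return significant, important

# Toy scores with hypothetical pair labels: 90 and 101 fall at or below the threshold.
toy_scores = {("D%d" % i, "C%d" % i): value
              for i, value in enumerate([90, 101] + list(range(110, 210, 10)))}
significant, important = split_by_score(toy_scores, significance_threshold=101)
```

With the values reported in the description, the threshold argument would be 101 and the resulting cutoff for the important group would correspond to a score of 274.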
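As an illustration of how a predicted drug-cluster pair translates into candidate binding site fragments, the sketch below lists, for each predicted pair, the fragments of the cluster that are not already known to bind the drug. The data layout, the function name, and the fragment labels (written in the "central(flanking)" style of fig. 12) are assumptions made for this example.

```python
def candidate_fragments(predicted_pairs, cluster_members, known_bindings):
    """Map each predicted (drug, cluster) pair to its not-yet-bound fragments.

    predicted_pairs: iterable of (drug, cluster) pairs selected by score.
    cluster_members: dict cluster -> set of fragment labels in that cluster.
    known_bindings:  dict drug -> set of fragment labels known to bind it.
    """
    candidates = {}
    for drug, cluster in predicted_pairs:
        known = known_bindings.get(drug, set())
        members = cluster_members.get(cluster, set())
        candidates[(drug, cluster)] = sorted(f for f in members if f not in known)
    return candidates

# Toy example with hypothetical drug, cluster, and fragment labels.
result = candidate_fragments(
    predicted_pairs=[("drugA", "C7")],
    cluster_members={"C7": {"D(GA)", "G(AL)"}},
    known_bindings={"drugA": {"D(GA)"}},
)
```

Each candidate returned this way would still need the chemical plausibility check described in the text.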
4. Binding mechanism of drug-binding site fragments in the prediction results
The binding mechanisms of the drug-binding site fragment pairs in the prediction results are analyzed. For example, the hydroxyl group in MRC (mupirocin) reacts with the carboxyl group of aspartic acid to form an ester group and water. In some cases, the central amino acid cannot react chemically with the drug in any significant way but instead interacts with it through hydrogen bonds. For example, in glycine a hydrogen atom is covalently bonded to a strongly electronegative N atom; when the strongly electronegative, small-radius F atom in VGH (crizotinib) comes close, the hydrogen acts as a bridge between N and F and forms an N-H…F hydrogen bond. The other interactions (as shown in fig. 13) can be analyzed similarly.
In the present invention, the interaction relationship is determined from the chemical properties of the fragment and of the drug, and the binding mechanism is predicted from known chemical reaction mechanisms, such as the reaction between hydrogen ions and hydroxide ions.
Conclusion
In the present invention, a drug-DBPs interaction is described as the interaction between a drug and the binding site fragments of DBPs. To analyze the physicochemical properties involved when drugs bind to the binding site fragments, similar binding site fragments were grouped into clusters, giving drug-cluster interaction relationships. Since only the drug-cluster interaction relationships are known and no other feature information is available, a link prediction method based on network structure was chosen to predict new drug-cluster interactions. After comparing the three link prediction methods, the CN method was used to perform link prediction.
In addition, the binding mechanisms of 5 drug-binding site fragment pairs in the prediction results were analyzed.
Compared with traditional drug-protein networks, the proposed network prediction model makes it possible to discover candidate binding site fragments. Drug-DBPs binding sites are extracted from the drug-DBPs complexes, and binding site fragments are used to describe the binding sites. In this way, the mechanism of interaction between the drug and the binding site fragment can be understood clearly.
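The step from drug-fragment bindings to drug-cluster links summarized above can be sketched in a few lines of Python. The sketch assumes two inputs that an earlier stage would have produced: the fragments each drug binds and the cluster assigned to each fragment. Names and labels are hypothetical.

```python
def build_drug_cluster_edges(drug_fragments, fragment_cluster):
    """Derive bipartite drug-cluster links from drug-fragment bindings.

    drug_fragments:   dict drug -> iterable of fragment labels bound by the drug.
    fragment_cluster: dict fragment label -> cluster label.
    Fragments without a cluster assignment are skipped.
    """
    edges = set()
    for drug, fragments in drug_fragments.items():
        for fragment in fragments:
            cluster = fragment_cluster.get(fragment)
            if cluster is not None:
                edges.add((drug, cluster))
    return sorted(edges)

# Toy example with hypothetical labels.
edges = build_drug_cluster_edges(
    drug_fragments={"drugA": ["G(AL)", "D(GA)"], "drugB": ["D(GA)", "K(RT)"]},
    fragment_cluster={"G(AL)": "C1", "D(GA)": "C2"},
)
```

Because only the resulting links are kept, any link prediction applied afterwards sees network structure alone, which matches the setting described in the conclusion.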

Claims (3)

1. A method for predicting drug-DBPs binding sites, characterized by comprising the following steps:
1) establishing a data set consisting of drugs and the DNA-binding proteins that bind to the drugs; extracting the binding sites on the DNA-binding protein sequence that can bind the drug, with every three binding sites taken as one binding site fragment;
2) dividing the binding site fragments into different clusters by hierarchical clustering according to the amino acid physicochemical properties of the binding site fragments, and constructing a binary network of drug-cluster interactions according to the binding relationships between the drugs and the binding site fragments of the DNA-binding proteins;
3) calculating the strength of the interaction relationship between drugs and clusters from the binary network, and selecting drug-cluster combinations with strong interaction relationships as a prediction group, wherein, in the prediction group, the binding site fragments contained in the cluster that are not bound to the drug are taken as predicted drug-DBPs binding sites;
in step 1), three binding sites that are adjacent in sequence, counted from the N-terminus of the DNA-binding protein sequence, are taken as one binding site fragment;
in step 2), the amino acid physicochemical properties of the binding site fragments are quantified as follows:
$$\alpha_{tri} = X(\alpha_{00}) + \frac{X(\alpha_{01}) + X(\alpha_{10})}{k};$$
wherein $\alpha_{tri}$ represents the amino acid physicochemical property value of the binding site fragment; $X(*)$ is the physicochemical property value of amino acid $*$, where $*$ is $\alpha_{00}$, $\alpha_{01}$ or $\alpha_{10}$; $\alpha_{00}$ is the central amino acid, and $\alpha_{01}$ and $\alpha_{10}$ are the flanking amino acids; $k$ is a correction coefficient for the physicochemical property values of the flanking amino acids, with $2 \le k \le 6$;
$X(*)$ is calculated as follows:
Figure FDA0003311277180000011
wherein $E_1 \sim E_n$ are the feature values of the different amino acid physicochemical properties of the binding site, and $\lambda_1 \sim \lambda_n$ are the weights of those properties, with $\lambda_1 = 1961.504$, $\lambda_2 = 788.2$, $\lambda_3 = 539.776$, $\lambda_4 = 276.624$, $\lambda_5 = 244.10$; $n$ is the number of principal components retained from principal component analysis of the amino acid physicochemical properties, $n = 5$, and the first 5 principal components are hydrophobicity, amino acid size, the preference of the amino acid for the α-helix, the number of degenerate triplet codons, and the frequency of occurrence of amino acid residues in β-strands;
the number of clusters in step 2) is the same as the number of drug species in the data set;
when constructing the binary network of drug-cluster interactions in step 2), the clusters containing each binding site fragment of a given DNA-binding protein are found first; interaction relationships are then established between the drug bound to that DNA-binding protein and those clusters, and the binary network is constructed from these interaction relationships;
in step 3), the interaction relationship score between a drug and a cluster is calculated with the common neighbors (CN) method, using the following formula:
Figure FDA0003311277180000021
Figure FDA0003311277180000022
represents the number of paths connecting drug i and protein j through two nodes; wherein
Figure FDA0003311277180000023
is defined as the set of clusters linked to cluster j through a drug, and Γ(i) represents the set of clusters on which drug i acts.
2. The method for predicting drug-DBPs binding sites according to claim 1, characterized in that: in step 3), the drug-cluster combinations with scores greater than the mean standard error are selected as the prediction group.
3. The method for predicting drug-DBPs binding sites according to claim 1, characterized in that: in step 3), the drug-cluster combinations with scores in the top 20% are selected as the prediction group.
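For illustration only, the fragment quantification recited in claim 1 (the central amino acid value plus the sum of the two flanking values divided by k) can be written as a small Python function. The per-amino-acid values X are taken as inputs here, because the formula for X is given as a figure; the function name and the example numbers are hypothetical.

```python
def fragment_property(x_center, x_flank_a, x_flank_b, k=2.0):
    """Physicochemical value of a three-residue binding site fragment.

    x_center:  X value of the central amino acid.
    x_flank_a: X value of one flanking amino acid.
    x_flank_b: X value of the other flanking amino acid.
    k:         correction coefficient for the flanking values, 2 <= k <= 6.
    """
    if not 2 <= k <= 6:
        raise ValueError("k must satisfy 2 <= k <= 6")
    return x_center + (x_flank_a + x_flank_b) / k

# Hypothetical X values, chosen only to make the arithmetic easy to follow.
value = fragment_property(1.0, 0.5, 1.5, k=4)  # 1.0 + (0.5 + 1.5) / 4
```

A larger k reduces the contribution of the flanking amino acids relative to the central one, which is the role the claim assigns to the correction coefficient.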

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910616620.0A CN110335640B (en) 2019-07-09 2019-07-09 Prediction method of drug-DBPs binding sites

Publications (2)

Publication Number Publication Date
CN110335640A CN110335640A (en) 2019-10-15
CN110335640B true CN110335640B (en) 2022-01-25

Family

ID=68144986

Country Status (1)

Country Link
CN (1) CN110335640B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant