CN104992078B - A kind of protein network complex recognizing method based on semantic density - Google Patents


Info

Publication number
CN104992078B
Authority
CN
China
Prior art keywords
protein
semantic
proteins
network
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510338321.7A
Other languages
Chinese (zh)
Other versions
CN104992078A (en
Inventor
周红芳
段文聪
郭杰
王心怡
何馨依
刘杰
李锦�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN201510338321.7A
Publication of CN104992078A
Application granted
Publication of CN104992078B
Legal status: Expired - Fee Related


Abstract

The invention discloses a protein network complex identification method based on semantic density, implemented according to the following steps: for an unweighted protein-protein interaction network data set, the attributes of all proteins in the data set are looked up in the Gene Ontology (GO) database; based on the lookup result, the similarity between connected proteins in the data set is calculated with a gene-ontology-based semantic similarity calculation method; according to the similarity result, the given protein-protein interaction network data set is converted into a weighted, undirected network data set, in which nodes represent proteins, edges represent interactions between proteins, and the similarity between proteins is the edge weight. The method can identify protein complexes from a protein-protein interaction network with high identification accuracy and low time complexity.

Description

Protein network complex identification method based on semantic density
Technical Field
The invention belongs to the technical field of data mining methods, and relates to a protein network complex identification method based on semantic density.
Background
Empirical studies and theoretical simulations of complex networks have a long history, and many related techniques and methods derived from statistical physics and applied mathematics have been proposed. The network view of systems has also been applied successfully in molecular biology research. Proteins in biological systems interact with each other to perform a wide variety of molecular biological functions; these interactions are referred to as PPIs (Protein-Protein Interactions). A biological system composed of proteins and their interactions can be formally depicted as an undirected graph, i.e., a protein interaction network (PPI network), or simply a protein network. In a protein network, each node represents a protein and each edge represents an interaction between proteins. By analyzing protein networks, researchers can further understand the structure and properties of molecular biological systems, for example by identifying protein complexes and assessing protein criticality.
Proteins that have the same molecular function at a specific time and place are considered to constitute a single biomolecule, i.e., a protein complex, if there are many interactions among them. In the past, protein complexes were found primarily through biochemical experiments, such as affinity purification followed by mass spectrometry (AP/MS). However, most experimental methods are neither very reliable nor efficient. In recent years, a number of data mining methods based on clustering techniques have been proposed and successfully applied to the identification of protein complexes. These methods can identify many protein complexes in a protein interaction network that had not been found experimentally. According to their characteristics, these clustering methods can be classified into hierarchical clustering, objective function clustering, and density-based clustering.
Hierarchical clustering techniques have been widely applied to analyze various types of complex networks, such as online social networks and protein interaction networks. The main idea of such methods is to divide the network into several sub-networks based on the similarity between connected nodes. Hierarchical clustering can be further divided into agglomerative and divisive methods. The best-known divisive method is the GN algorithm, while the most representative agglomerative method is the CNM algorithm.
Both objective function clustering and density-based clustering build on graph partitioning techniques. The former partitions the graph by optimizing an objective function, and the latter identifies subgraphs of maximum density, such as cliques, based on the topological characteristics of the network. The well-known RNSC algorithm identifies complexes in a protein interaction network by optimizing a specific cost function. In recent years, many similar multi-objective methods have been proposed, most of which solve the multi-objective optimization problem with evolutionary computation, such as genetic algorithms and firefly algorithms. RANCoC is a co-clustering method for searching for dense subgraphs in protein interaction networks. Given the unweighted graph of a protein interaction network, a dense subgraph is defined as a submatrix of higher quality. RANCoC essentially discovers dense subgraphs by optimizing a quality function that must satisfy several conditions. Furthermore, RANCoC applies a new heuristic to avoid being trapped in local optima.
It is generally recognized that a dense subgraph should have dense internal connections, i.e., most pairs of nodes inside the subgraph are joined by an edge. The denser a subgraph is, the more likely it is to be a community in a social network or a complex in a protein interaction network. The goal of density-based approaches is to find dense regions in the graph and treat connected dense regions as dense subgraphs. Different methods quantify density in different ways. The classical MCODE algorithm uses k-cores and the core clustering coefficient to discover complexes. A k-core is a subgraph in which the degree of every node is greater than or equal to k, and the k-core with the largest k is considered the densest subgraph. Another well-known definition of a dense subgraph is the clique, in which every pair of nodes is connected by an edge. Two cliques of k nodes (k-cliques) are considered adjacent if they share k-1 nodes, and a k-clique community is a set of adjacent k-cliques. The pseudo-cliques employed by the DME algorithm extend cliques by removing a certain number of edges from them: a pseudo-clique is a subgraph with slightly fewer edges than a clique with the same number of nodes, where the ratio of the two edge counts must exceed a given threshold. In addition, there are many other types of methods for finding dense subgraphs, such as the flow simulation employed by the MCL algorithm.
Both hierarchical clustering methods and objective-function-based clustering methods require optimizing one or more functions. However, evaluation functions for network structure often have limitations; for example, the modularity adopted by hierarchical methods suffers from the resolution limit problem. Computing a global optimization function may also increase the time complexity of the algorithm, and multi-objective function optimization remains a difficult research problem. Density-based approaches do not require optimizing multiple functions and have lower time complexity. On balance, density-based methods are superior to the other two classes.
Disclosure of Invention
The invention aims to provide a protein network complex identification method based on semantic density, which can identify a protein complex from a protein interaction network, and has high identification accuracy and low time complexity.
The technical scheme adopted by the invention is a protein network complex identification method based on semantic density, implemented according to the following steps:
step 1, for an unweighted protein interaction network data set, searching the gene ontology database GO for the attributes of all proteins in the network data set;
step 2, based on the search result of step 1, calculating the similarity between the connected proteins in the network data set of step 1 with a gene-ontology-based semantic similarity calculation method;
step 3, according to the similarity result obtained in step 2, converting the protein interaction network data set given in step 1 into a weighted undirected network data set, wherein nodes represent proteins, edges represent the interactions between proteins, and the similarity between proteins is the weight of the edges;
step 4, finding dense subgraphs in the weighted undirected graph obtained in step 3 with a density-based graph partitioning algorithm, called DBGPWN; the dense subgraphs obtained are the protein network complexes identified based on semantic density.
The present invention is also characterized in that,
wherein the semantic similarity calculation method based on the gene ontology in the step 2 comprises the following specific steps,
step 2.1, taking a protein A and a protein B as analysis objects, constructing three joint DAGs for each of protein A and protein B from the three categories of GO attributes, namely biological process P, molecular function F and cellular component C, and calculating the semantic contribution (S-value) of the attributes in the three joint DAGs of each protein to that protein;
step 2.2, according to the S-values obtained in step 2.1, calculating the similarity between protein A and protein B within each of the three categories of GO attributes, i.e., calculating $\mathrm{Sim}_p(a,b)$, $\mathrm{Sim}_f(a,b)$ and $\mathrm{Sim}_c(a,b)$;
step 2.3, computing the mean square value of the similarity results obtained in step 2.2, which gives the similarity between the connected proteins in the network data set of step 1.
Wherein the specific process of step 2.1 is to obtain, through formula (1), the S-value of the attributes in the three joint DAGs of protein A to protein A and the S-value of the attributes in the three joint DAGs of protein B to protein B,
where $w_e$ denotes the semantic inheritance weight of edge $e$, and $e \in E$ connects the attribute $t$ with its child attribute $t'$.
Wherein the specific process of step 2.2 is to obtain $\mathrm{Sim}_p(a,b)$, $\mathrm{Sim}_f(a,b)$ and $\mathrm{Sim}_c(a,b)$ according to formula (2),
where $S_A(t)$ and $S_B(t)$ denote the S-values of attribute $t$ for A and B.
Wherein the specific process of step 2.3 is
to find the mean square value by formula (4).
wherein the specific process of the density-based graph partitioning algorithm in the step 4 is,
step 4.1, calculating semantic aggregation coefficients among all the edge-connected proteins in the weighted undirected network data set obtained in the step 3;
step 4.2, marking all proteins in the weighted undirected network data set obtained in step 3 as 'not clustered';
step 4.3, selecting one protein p from the proteins marked in step 4.2, selecting from the weighted undirected network obtained in step 3, according to the calculation result of step 4.1, all proteins that are density-connectable with protein p, and taking the selected proteins together with protein p as one cluster C, wherein the proteins in cluster C are marked as 'clustered';
step 4.4, repeating step 4.3 until all proteins meeting the clustering requirement are marked as 'clustered', and deleting the proteins that do not meet the clustering requirement from the weighted undirected network as noise points, wherein the resulting clusters are the dense subgraphs found by the DBGPWN algorithm.
The specific process of step 4.1 is that the semantic aggregation coefficient is obtained by formula (3),
where $A_{i,k} > 0 \;\&\; A_{k,j} > 0$ indicates that node k is connected by an edge to both node i and node j, and $\sum_{k \in V}\{A_{i,k} + A_{k,j} \mid A_{i,k} > 0 \;\&\; A_{k,j} > 0\}$ is the sum of the weights of the edges joining nodes i and j to the nodes connected to both of them.
The advantages of the method are as follows. The graph partitioning algorithm provided by the invention is a density-based clustering algorithm; it is insensitive to the shape and size of the clusters in a data set and does not need to optimize a function. It adopts the semantic aggregation coefficient as its quantitative definition of density, which is suitable for weighted graphs. The time complexity of the algorithm is low, and the precision of the clustering result is high. Compared with the original method, the improved gene-ontology-based semantic similarity calculation method takes less time and can be used on large-scale networks.
Drawings
FIG. 1 is a DAG of GO attribute relationships in the protein network complex identification method based on semantic density of the present invention;
FIG. 2 is the joint DAG of the annotated attributes of protein P56524 in the protein network complex identification method based on semantic density of the present invention;
FIG. 3 shows the experimental results of DBGPWN and the comparison methods on the human network in the protein network complex identification method based on semantic density of the present invention;
FIG. 4 shows the experimental results of DBGPWN and the comparison methods on the yeast network in the protein network complex identification method based on semantic density of the present invention;
FIG. 5 shows the number of highly matched complexes found by DBGPWN and DME in the protein network complex identification method based on semantic density of the present invention;
FIG. 6 shows the aggregation scores of DBGPWN and the comparison algorithms on the human network and the yeast network in the protein network complex identification method based on semantic density of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
The protein network complex identification method based on semantic density of the invention involves the following concepts and definitions:
The Gene Ontology database (GO) is a large-scale bioinformatics database whose aim is to unify the way the attributes of genes and proteins are described across all organisms. GO collects biological and biochemical attributes that describe the function of a protein or its location in the cell. All attributes are classified into three categories: biological process, molecular function, and cellular component, abbreviated as P, F, and C, respectively. As an ontology, GO contains two semantic relations, "is_a" and "part_of". GO is typically represented as a directed acyclic graph (DAG), in which nodes represent individual attributes and edges represent semantic relations between attributes. "is_a" indicates a subclass relationship, and "part_of" indicates a part-whole relationship. FIG. 1 shows the DAG consisting of the GO attribute 'intracellular membrane-bound organelle' and its semantically related attributes, with dashed arrows representing "part_of" and solid arrows representing "is_a".
The similarity of two GO attributes can be inferred from their positions in the DAG. If two GO attributes have the same parent attribute, i.e., both are subclasses of a certain GO attribute, the two can be considered relatively similar, e.g., 'intracellular organelle' (GO:0043229) and 'membrane-bound organelle' (GO:0043227) in FIG. 1, which are both subclasses of 'organelle' (GO:0043226). A semantic value is defined to represent how much of each ancestor attribute in $\mathrm{DAG}_A$ is semantically inherited by attribute A. The closer an ancestor is to attribute A in $\mathrm{DAG}_A$, the more is inherited. The semantic inheritance of a GO attribute t to an attribute A can be quantitatively expressed as the semantic value of t to A, abbreviated as S-value.
Definition 1, S-value of an attribute set: for a joint DAG $\mathrm{DAG}_\mu = (A, T_\mu, E_\mu)$ built from an attribute set $\mu$, the S-value of any attribute $t \in T_\mu$ for the attribute set $\mu$ can be calculated by equation (1),
$$S_\mu(t) = \begin{cases} 1, & t \in \mu \\ \max\{\, w_e \times S_\mu(t') \mid t' \in \text{children of } t \,\}, & t \notin \mu \end{cases} \tag{1}$$
where $w_e$ denotes the semantic inheritance weight of edge $e$, and $e \in E_\mu$ connects the attribute $t$ with its child attribute $t'$. Calculating the S-value of attribute $t$ for $\mu$ essentially amounts to finding the shortest path between $t$ and $\mu$ in the DAG; since $\mu$ contains several attributes, the shortest path between $t$ and each attribute of $\mu$ is found and the shortest among them is selected. The optimal values of $w_e$ for the semantic relations "is_a" and "part_of" are 0.8 and 0.6, respectively. The semantic similarity of two GO attributes can then be calculated from the semantic inheritance of their ancestor attributes. Taking FIG. 1 as an example, since the relation is "is_a", the S-value of the GO attribute 'membrane-bound organelle' (GO:0043227) for 'intracellular membrane-bound organelle' (GO:0043231) is 0.8, i.e., $S_{GO:0043231}(GO:0043227) = 0.8$. Since 'membrane-bound organelle' is a subclass of 'organelle', equation (1) gives the S-value of 'organelle' (GO:0043226) for 'intracellular membrane-bound organelle' (GO:0043231) as $S_{GO:0043231}(GO:0043226) = S_{GO:0043231}(GO:0043227) \times S_{GO:0043227}(GO:0043226) = 0.64$.
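The S-value propagation described in Definition 1 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes that every annotated attribute has S-value 1 and that an ancestor takes the largest weight-times-child value over its children. The function name `s_values` and the `parents` mapping are hypothetical, and the toy DAG contains only the two "is_a" edges mentioned in the text.

```python
# Sketch only: the recursion is an assumed reading of equation (1).
IS_A, PART_OF = 0.8, 0.6  # edge weights quoted in the text


def s_values(annotated, parents):
    """Return {attribute: S-value} for the joint DAG of `annotated`.

    `parents` maps a child attribute to a list of (parent, weight) edges.
    Annotated attributes get S-value 1 (assumption); every ancestor keeps
    the largest weight * S-value reachable through any of its children.
    """
    s = {t: 1.0 for t in annotated}
    stack = list(annotated)
    while stack:
        child = stack.pop()
        for parent, weight in parents.get(child, []):
            candidate = weight * s[child]
            if candidate > s.get(parent, 0.0):  # keep the strongest path
                s[parent] = candidate
                stack.append(parent)
    return s


# Toy fragment of FIG. 1: only the edges named in the worked example.
parents = {
    "GO:0043231": [("GO:0043227", IS_A)],
    "GO:0043227": [("GO:0043226", IS_A)],
}
s = s_values({"GO:0043231"}, parents)
print(s)
```

With these two edges the sketch reproduces the values of the worked example, 0.8 for GO:0043227 and 0.64 (up to floating-point rounding) for GO:0043226.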
Definition 2, semantic similarity of two proteins: given two proteins A and B whose joint DAGs are $\mathrm{DAG}_A = (A, T_A, E_A)$ and $\mathrm{DAG}_B = (B, T_B, E_B)$, the similarity between the two proteins is calculated as follows:
$$\mathrm{Sim}(A, B) = \frac{\sum_{t \in T_A \cap T_B} \left( S_A(t) + S_B(t) \right)}{\sum_{t \in T_A} S_A(t) + \sum_{t \in T_B} S_B(t)} \tag{2}$$
where $S_A(t)$ and $S_B(t)$ denote the S-values of attribute $t$ for A and B, calculated according to equation (1).
Equation (2) essentially computes the ratio of the S-values of the attributes common to $\mathrm{DAG}_A$ and $\mathrm{DAG}_B$ to the sum of all S-values. If $\mathrm{DAG}_A$ and $\mathrm{DAG}_B$ share many attributes and those attributes have large S-values, the ratio is large and proteins A and B are considered similar. It is generally considered that two proteins are similar if their similarity is greater than 0.5.
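The ratio described for equation (2) can be illustrated with a short Python sketch. It assumes the S-values of each protein are already available as dictionaries; the function name `semantic_similarity` and the attribute labels `t1` to `t4` are hypothetical placeholders, not real GO identifiers.

```python
def semantic_similarity(s_a, s_b):
    """Ratio of the S-values of shared attributes to the sum of all S-values.

    `s_a` and `s_b` map attribute -> S-value for the joint DAGs of two
    proteins (assumed to be precomputed).
    """
    common = s_a.keys() & s_b.keys()
    shared = sum(s_a[t] + s_b[t] for t in common)
    total = sum(s_a.values()) + sum(s_b.values())
    return shared / total if total else 0.0


# Hypothetical S-values: the proteins share the ancestors t2 and t3.
s_a = {"t1": 1.0, "t2": 0.8, "t3": 0.64}
s_b = {"t4": 1.0, "t2": 0.8, "t3": 0.64}
sim = semantic_similarity(s_a, s_b)
print(round(sim, 3))
```

Here the shared S-values sum to 2.88 and all S-values sum to 4.88, so the two hypothetical proteins would count as similar under the 0.5 rule of thumb.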
Definition 3, semantic aggregation coefficient: given a semantically weighted protein interaction network G(E, V, W) and its adjacency matrix $A_{i,j}$ (if node i is connected to node j by an edge in G, $A_{i,j}$ equals the edge weight $W_{i,j}$; if i and j are not connected by an edge, $A_{i,j} = 0$; if i = j, $A_{i,j} = 1$; $A_{i,i} = A_{j,j} = 0$), the semantic aggregation coefficient SCC(i, j) of two proteins can be calculated according to equation (3),
where $A_{i,k} > 0 \;\&\; A_{k,j} > 0$ indicates that node k is connected by an edge to both node i and node j, and $\sum_{k \in V}\{A_{i,k} + A_{k,j} \mid A_{i,k} > 0 \;\&\; A_{k,j} > 0\}$ is the sum of the weights of the edges joining nodes i and j to the nodes connected to both of them. The weight of the edge between nodes i and j must also be taken into account, so $A_{i,j}$ and $A_{j,i}$ are added to the numerator. The denominator of equation (3) is the sum of the weights of all edges incident to node i or node j. If the network contains more proteins that are similar to both protein i and protein j, then proteins i and j are more likely to be similar to each other.
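One possible reading of the semantic aggregation coefficient is sketched below in Python. The exact form of equation (3) is not reproduced in this text, so the formula in the code is an assumption assembled from the verbal description: common-neighbor edge weights plus $A_{i,j} + A_{j,i}$ in the numerator, and all edge weights incident to i or j in the denominator, with a zero diagonal. The helper names `build_adjacency` and `scc` and the toy weights are hypothetical.

```python
def build_adjacency(edges):
    """Turn {(u, v): weight} into a symmetric dict-of-dicts adjacency."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


def scc(adj, i, j):
    """Assumed form of the semantic aggregation coefficient of i and j."""
    ni, nj = adj.get(i, {}), adj.get(j, {})
    common = (ni.keys() & nj.keys()) - {i, j}       # neighbors of both
    numerator = sum(ni[k] + nj[k] for k in common)
    numerator += 2 * ni.get(j, 0.0)                 # A_ij + A_ji
    denominator = sum(ni.values()) + sum(nj.values())
    return numerator / denominator if denominator else 0.0


# Toy weighted network: a triangle a-b-c plus a pendant node d.
edges = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7, ("b", "d"): 0.5}
adj = build_adjacency(edges)
print(round(scc(adj, "a", "b"), 3), round(scc(adj, "b", "d"), 3))
```

Under this reading the coefficient lies between 0 and 1, and the pair inside the triangle scores much higher than the pendant pair, which has no common neighbor.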
Definition 4, directly density-reachable: given a semantically weighted protein interaction network G(E, V, W), two proteins i ∈ V and j ∈ V, and a parameter θ, if SCC(i, j) ≥ θ, then proteins i and j are considered directly density-reachable. The parameter θ is the only parameter of the density-based graph partitioning algorithm (DBGPWN).
Definition 5, density-reachable: given a semantically weighted protein interaction network G(E, V, W) and two proteins i ∈ V and j ∈ V, if there is a chain of proteins $p_1, \ldots, p_n$ in V ($p_1 = i$, $p_n = j$) in which each $p_{m+1}$ is directly density-reachable from $p_m$, then proteins i and j are considered density-reachable.
Definition 6, density-connectable: given a semantically weighted protein interaction network G(E, V, W) and two proteins i ∈ V and j ∈ V, if there is a protein k that is density-reachable to both protein i and protein j, then proteins i and j are considered density-connectable. Furthermore, if proteins i and j are directly density-reachable from each other, they are also considered density-connectable even when no third protein is density-reachable to both i and j.
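Definitions 4 to 6 can be illustrated with a small breadth-first search in Python. The sketch assumes the semantic aggregation coefficients of the edge-connected pairs are already known; the function name `density_reachable`, the dictionary layout and the coefficient values are hypothetical. Because SCC(i, j) is symmetric, the proteins density-reachable from a start protein are, under this reading, also the ones density-connectable with it.

```python
from collections import deque


def density_reachable(scc_values, theta, start):
    """Proteins reachable from `start` through directly density-reachable pairs.

    `scc_values` maps frozenset({i, j}) -> coefficient for edge-connected pairs.
    """
    direct = {}
    for pair, value in scc_values.items():
        if value >= theta:                      # Definition 4
            i, j = tuple(pair)
            direct.setdefault(i, set()).add(j)
            direct.setdefault(j, set()).add(i)
    seen, queue = {start}, deque([start])
    while queue:                                # Definition 5: follow the chain
        node = queue.popleft()
        for nxt in direct.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


# Hypothetical coefficients for a path a-b-c-d.
scc_values = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"b", "c"}): 0.7,
    frozenset({"c", "d"}): 0.3,
}
reachable = density_reachable(scc_values, theta=0.5, start="a")
print(sorted(reachable))
```

With θ = 0.5 the weak pair (c, d) breaks the chain, so d is not reachable from a.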
The protein network complex identification method based on semantic density of the invention is implemented according to the following steps:
step 1, for an unweighted protein interaction network data set, searching the gene ontology database GO for the attributes of all proteins in the network data set;
step 2, based on the search result of step 1, calculating the similarity between the connected proteins in the network data set of step 1 with a gene-ontology-based semantic similarity calculation method.
A GO attribute A and its related information in the whole of GO can be represented as a directed acyclic graph $\mathrm{DAG}_A = (A, T_A, E_A)$, where $T_A$ is the set consisting of attribute A and its ancestor attributes, and $E_A$ is the set of edges in the graph, i.e., the edges connecting the attributes in $T_A$. The semantic inheritance of a GO attribute t to attribute A can be quantitatively expressed as the S-value of t to A. In FIG. 2, the annotated attributes of protein P56524 are indicated by thick black arrows, and their basic functional semantics is 'binding'. Comparing the joint DAGs of two proteins allows their similarity in molecular function to be inferred. If another protein also has binding-related attributes, it can be considered similar to P56524 in molecular function.
The specific steps of the step 2 are as follows,
step 2.1, taking a protein A and a protein B as analysis objects, constructing three joint DAGs for each of protein A and protein B from the three categories of GO attributes (biological process P, molecular function F and cellular component C), and calculating by formula (1) the semantic contribution, i.e., the S-value, of the attributes in the three joint DAGs of each protein to that protein, where a larger S-value of a GO attribute means a higher semantic contribution;
step 2.2, according to the S-values obtained in step 2.1, calculating by formula (2) the similarity between protein A and protein B within each of the three categories of GO attributes, i.e., the similarity between the 'biological process' attributes of protein A and those of protein B, the similarity between the 'molecular function' attributes of protein A and those of protein B, and the similarity between the 'cellular component' attributes of protein A and those of protein B. This yields $\mathrm{Sim}_p(a,b)$ (the similarity of protein A and protein B on the class P attributes, biological process), $\mathrm{Sim}_f(a,b)$ (the similarity of protein A and protein B on the class F attributes, molecular function) and $\mathrm{Sim}_c(a,b)$ (the similarity of protein A and protein B on the class C attributes, cellular component);
step 2.3, computing by formula (4) the mean square value of the similarity results obtained in step 2.2, which gives the similarity between the connected proteins in the network data set of step 1;
step 3, according to the protein similarity obtained in the step 2 (namely the mean square value obtained in the step 2.3), converting the protein interaction network data set given in the step 1 into a weighted undirected network data set, wherein nodes represent proteins, edges represent the interaction between the proteins, and the similarity between the proteins is taken as the weight of the edges;
We assume that proteins in the same complex are similar in all three categories of attributes. Some previous complex identification algorithms rely solely on the similarity in molecular function of the proteins in a complex, which is not the case in practice. The similarity between the proteins of a known complex, the 13S condensin complex, was calculated with the method of the present invention, i.e., equations (1), (2) and (4). This complex has five proteins, and the mean square values of the semantic similarity between them are shown in Table 1.
TABLE 1 Mean square value of similarity between proteins in the 13S condensin complex
          Q15003  Q15021  Q9BPX3  O95347  Q9NTJ3
Q15003    1       0.88    0.97    0.69    0.69
Q15021    0.88    1       0.87    0.72    0.72
Q9BPX3    0.97    0.87    1       0.67    0.67
O95347    0.69    0.72    0.67    1       0.98
Q9NTJ3    0.69    0.72    0.67    0.98    1
As shown in Table 1, the proteins in the complex can all be considered similar to each other based on the combined similarity measure. In the present invention, the joint similarity metric will be used to measure the strength of each interaction in the protein interaction network, transforming the network into a weighted graph, which will help the graph partitioning algorithm identify the complex.
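The conversion of an unweighted network into a weighted one (steps 2.3 and 3) might look like the following Python sketch. Formula (4) is not reproduced in this text, so the combination rule here is an assumption: the root mean square of the three per-category similarities (the plain mean of squares is the other possible reading of "mean square value"). The names `combined_similarity` and `weight_network`, the protein labels and the per-category values are all hypothetical.

```python
import math


def combined_similarity(sim_p, sim_f, sim_c):
    """Assumed formula (4): root mean square of the three similarities."""
    return math.sqrt((sim_p ** 2 + sim_f ** 2 + sim_c ** 2) / 3)


def weight_network(edges, sims):
    """Attach a semantic weight to every interaction.

    `edges` is an iterable of (a, b) pairs; `sims` maps each pair to its
    (Sim_p, Sim_f, Sim_c) triple. Returns {frozenset({a, b}): weight}.
    """
    return {frozenset(e): combined_similarity(*sims[e]) for e in edges}


# Hypothetical interactions and per-category similarities.
edges = [("P1", "P2"), ("P2", "P3")]
sims = {("P1", "P2"): (0.6, 0.8, 1.0), ("P2", "P3"): (1.0, 1.0, 1.0)}
weights = weight_network(edges, sims)
print({tuple(sorted(k)): round(v, 3) for k, v in weights.items()})
```

Keying the result by `frozenset` keeps the network undirected, so a lookup does not depend on the order in which the two proteins are given.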
Step 4, finding dense subgraphs in the weighted undirected graph obtained in step 3 with a density-based graph partitioning algorithm called DBGPWN (Density Based Graphical Partitioning for Weighted Networks); the dense subgraphs obtained are the complexes identified by the invention. In a density-based clustering algorithm, two data points that are directly density-reachable are placed in the same cluster. In the invention, the semantic aggregation coefficient is therefore the basis on which the DBGPWN algorithm decides whether two proteins are directly density-reachable: if the semantic aggregation coefficient of two proteins reaches the given threshold θ (Definition 4), the two proteins are considered directly density-reachable.
Step 4.1, calculating by formula (3) the semantic aggregation coefficients between all edge-connected proteins in the weighted undirected network data set obtained in step 3;
step 4.2, marking all proteins in the weighted undirected network data set obtained in step 3 as 'not clustered';
step 4.3, selecting one protein p from the proteins marked in step 4.2, selecting from the weighted undirected network obtained in step 3, according to the calculation result of step 4.1, all proteins that are density-connectable with protein p (as determined by Definitions 4, 5 and 6), and taking the selected proteins together with protein p as one cluster C, wherein the proteins in cluster C are marked as 'clustered';
step 4.4, repeating step 4.3 until all proteins meeting the clustering requirement are marked as 'clustered' (a protein meets the clustering requirement if a protein density-connectable with it can be found in the weighted undirected network data set obtained in step 3), and deleting the proteins that do not meet the clustering requirement from the weighted undirected network as noise points, wherein the resulting clusters are the dense subgraphs found by the DBGPWN algorithm, and these dense subgraphs are the protein network complexes identified based on semantic density.
The DBGPWN algorithm pseudo-code is shown below.
Input: a semantically weighted protein interaction network G(E, V, W) and a parameter θ
Output: clusters C_1, ..., C_n
Begin
  calculate the semantic aggregation coefficient between all edge-connected proteins
  // i and j denote proteins
  For (i = 0; i < |V|; i++)
    If (protein i does not belong to any cluster) then
      create a new cluster C and add i to C;
      For (j = 0; j < |V|; j++)
        If (protein j does not belong to any cluster) then
          If (i and j are density-connectable) then
            assign j to C;
      End For
  End For
End
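The pseudo-code above can be turned into a compact runnable Python sketch. It is an illustration under stated assumptions, not the patented implementation: the semantic aggregation coefficient uses the assumed form described for Definition 3, density-connectable proteins are collected by breadth-first search over directly density-reachable pairs, and proteins with no such partner are dropped as noise. The function name `dbgpwn`, the edge dictionary format and the toy weights are hypothetical, and the input is assumed to contain no self-loops.

```python
from collections import deque


def dbgpwn(edges, theta):
    """Cluster a weighted undirected network given as {(u, v): weight}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    def scc(i, j):
        # Assumed form of equation (3); see Definition 3.
        common = adj[i].keys() & adj[j].keys()
        num = sum(adj[i][k] + adj[j][k] for k in common) + 2 * adj[i][j]
        den = sum(adj[i].values()) + sum(adj[j].values())
        return num / den

    # Step 4.1: keep only the directly density-reachable pairs.
    direct = {u: set() for u in adj}
    for u, v in edges:
        if scc(u, v) >= theta:
            direct[u].add(v)
            direct[v].add(u)

    # Steps 4.2 to 4.4: grow one cluster per unclustered, non-noise protein.
    clusters, clustered = [], set()
    for p in sorted(adj):
        if p in clustered or not direct[p]:
            continue                      # already clustered, or a noise point
        cluster, queue = {p}, deque([p])
        while queue:
            node = queue.popleft()
            for nxt in direct[node] - cluster:
                cluster.add(nxt)
                queue.append(nxt)
        clustered |= cluster
        clusters.append(cluster)
    return clusters


# Two tight triangles joined by a weak bridge, plus a weakly attached node g.
edges = {
    ("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9, ("c", "d"): 0.2,
    ("d", "e"): 0.9, ("d", "f"): 0.9, ("e", "f"): 0.9, ("f", "g"): 0.1,
}
clusters = dbgpwn(edges, theta=0.5)
print([sorted(c) for c in clusters])
```

The breadth-first search visits each directly density-reachable pair a bounded number of times, which avoids the nested scan over all protein pairs in the pseudo-code while producing the same grouping under these assumptions.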
To verify the ability of the present invention to identify complexes, the identified complexes were compared with the known complexes in MIPS. The MIPS database contains 1140 yeast complexes and 1845 human complexes. Some of these complexes overlap, and most are small. It should be noted that only half of the proteins in the two protein interaction network data sets can be found in these known complexes. To measure how well the identified clusters match the known complexes, we used several evaluation metrics, defined as follows.
Definition 7, overlap ratio: given a cluster P and a known complex K, the overlap ratio of P and K is calculated as shown in equation (5),
$$\mathrm{OS}(P, K) = \frac{|V_P \cap V_K|^2}{|V_P| \times |V_K|} \tag{5}$$
where $|V_P \cap V_K|$ is the number of proteins shared by cluster P and known complex K, $|V_P|$ is the total number of proteins in cluster P, and $|V_K|$ is the total number of proteins in complex K.
Given a threshold σ, a cluster P and a known complex K are considered a match if OS(P, K) ≥ σ. Previous experiments have shown that a reasonable threshold is 0.2. It should be noted that when P and K have only one protein in common, their overlap ratio carries no real significance. However, if $|V_P| = 2$, $|V_K| = 3$ and $|V_P \cap V_K| = 1$, then OS(P, K) = 1/6 ≈ 0.17, which is close to the threshold and is likely to mislead the matching decision. Therefore, a condition is added to the original definition: if $|V_P \cap V_K| = 1$, then OS(P, K) = 0.
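The overlap ratio with its added single-protein rule can be written as a few lines of Python. This sketch assumes the squared-intersection form, which agrees with the numerical example in the text (one shared protein between a cluster of 2 and a complex of 3 gives 1/6 before the extra rule is applied). The function name `overlap_score` and the protein labels are hypothetical.

```python
def overlap_score(cluster, known_complex):
    """Overlap ratio OS(P, K) between two sets of protein identifiers."""
    shared = len(cluster & known_complex)
    if shared <= 1:          # a single shared protein is not counted as overlap
        return 0.0
    return shared ** 2 / (len(cluster) * len(known_complex))


# Two shared proteins out of 3 and 4 members.
score = overlap_score({"p1", "p2", "p3"}, {"p2", "p3", "p4", "p5"})
# The case from the text: 1/6 without the extra rule, 0 with it.
edge_case = overlap_score({"p1", "p2"}, {"p2", "p3", "p4"})
print(round(score, 3), edge_case)
```

Returning early when at most one protein is shared also covers empty inputs, so the division is never reached with a zero denominator.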
On the basis of the overlap ratio, the ability of one method to find complexes can be assessed using the following four metrics.
Definition 8, sensitivity and specificity: given a matching threshold $\sigma$, TP (true positives) is the number of identified clusters that match a known complex, i.e., clusters P for which $OS(P, K) \ge \sigma$ for some known complex K. FP (false positives) equals the total number of identified clusters minus TP. FN (false negatives) is the number of known complexes that are not matched by any identified cluster. Sensitivity ($S_n$) and specificity ($S_p$) can be expressed as:
$$S_n = \frac{TP}{TP + FN}, \qquad S_p = \frac{TP}{TP + FP}$$
Sensitivity is thus the proportion of known complexes that are correctly predicted, and specificity is the proportion of all predictions that are correct.
For a further, more comprehensive comparison, two combined metrics are adopted: the F-measure and the Matthews correlation coefficient (MCC). The F-measure takes both sensitivity and specificity into account. MCC is essentially a coefficient that measures the correlation between the observed and predicted classes. They can be formally represented as:
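The following Python sketch shows how sensitivity, specificity and the F-measure could be computed from the counts TP, FP and FN. It assumes the forms $S_n = TP/(TP+FN)$ and $S_p = TP/(TP+FP)$, which agree with the verbal description in Definition 8, and the usual harmonic-mean form of the F-measure. MCC is left out because its formula depends on how true negatives are counted. All function names are illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Assumed form S_n = TP / (TP + FN); 0.0 when the denominator is zero."""
    total = tp + fn
    return tp / total if total else 0.0


def specificity(tp: int, fp: int) -> float:
    """Assumed form S_p = TP / (TP + FP); 0.0 when the denominator is zero."""
    total = tp + fp
    return tp / total if total else 0.0


def f_measure(s_n: float, s_p: float) -> float:
    """Harmonic mean of sensitivity and specificity (assumed form)."""
    total = s_n + s_p
    return 2 * s_n * s_p / total if total else 0.0


if __name__ == "__main__":
    tp, fp, fn = 30, 10, 20          # toy counts, not experimental data
    s_n = sensitivity(tp, fn)
    s_p = specificity(tp, fp)
    print(s_n, s_p, f_measure(s_n, s_p))
```

The counts in the usage example are made up purely to exercise the functions.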
In addition, we used p-values to measure the biological relevance of individual clusters. Given a set of proteins and their GO attribute annotations, the p-value gives the statistical significance of these proteins sharing the same GO attribute. If a group of proteins has a very low p-value for some attribute, these proteins can be considered biologically homogeneous and are more easily identified as a complex by methods based on functional enrichment analysis. The lower the p-values of the complexes found by a method, the better its performance.
Definition 9, p-value: given a cluster of n proteins, of which m share an annotation attribute x, the p-value is the probability of observing m or more of the n proteins having annotation attribute x. Under a hypergeometric distribution, it is calculated as shown in equation (14).
$$p\text{-value} = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \qquad (14)$$
where N is the total number of proteins in the database, of which M are annotated with attribute x. The lower the p-value, the more significant the enrichment of GO attribute x in the cluster. In general, the threshold for deciding whether a p-value is significant is 0.05. The p-value essentially estimates the probability that a cluster arose by chance. To measure the ability of a method to identify complexes using p-values, the aggregate score function can be used.
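As an illustration of Definition 9, the hypergeometric tail probability can be computed with the Python standard library alone. The function name `hypergeom_p_value` is illustrative; the argument names follow the symbols N, M, n and m used in the definition, and the toy numbers in the usage example are not experimental data.

```python
from math import comb


def hypergeom_p_value(N: int, M: int, n: int, m: int) -> float:
    """Probability that at least m of n sampled proteins carry attribute x.

    N: total proteins in the database
    M: proteins in the database annotated with x
    n: proteins in the cluster
    m: proteins in the cluster annotated with x
    """
    upper = min(n, M)                # cannot observe more hits than this
    favourable = sum(
        comb(M, i) * comb(N - M, n - i)   # comb returns 0 when n - i > N - M
        for i in range(m, upper + 1)
    )
    return favourable / comb(N, n)


if __name__ == "__main__":
    # Toy database: 10 proteins, 4 annotated; cluster of 3 with 3 annotated.
    p = hypergeom_p_value(N=10, M=4, n=3, m=3)
    print(p, p < 0.05)
```

Summing the upper tail directly is equivalent to one minus the lower tail and avoids cancellation when the p-value is very small.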
Definition 10, aggregate score: given a clustering result and the p-value of each cluster, the aggregate score of the result is calculated as shown in equation (15).
where $\min(p_i)$ is the minimum p-value of cluster i, cutoff is the threshold used to decide whether a p-value is significant, $n_S$ denotes the number of significant clusters, and $n_I$ denotes the number of insignificant clusters. In the experiments, cutoff was set to 0.05. The aggregate score reflects the probability that the clustering result generated by a method occurred at random. Lower values indicate better clustering accuracy for the method.
We set the parameter θ of DBGPWN (the parameter θ in Definition 4) to 0.4 for the experiments on the yeast network and to 0.25 for the human network. The results obtained with DBGPWN were compared with those of other well-established methods: RANCoC, RNSC, MCODE, MCL and DME. These comparison methods have all been shown to be effective in identifying complexes from protein interaction networks, and their authors have made the source code available. The parameters of each method were set as in the earlier experiments, and the results obtained are consistent with those experiments.
We ran DBGPWN, MCL and DME on the weighted networks, where the weights were calculated using the joint similarity metric. RANCoC, MCODE and RNSC were run on the unweighted networks. The complexes identified by each method were compared with the complexes in the MIPS database. Performance was assessed using sensitivity, specificity, F-measure, MCC, the number of highly matched complexes, and the aggregate score. The experimental results are shown in the figures.
a. Sensitivity: as shown in FIGS. 3(a) and 4(a), the DBGPWN algorithm is more sensitive than the MCODE algorithm on both data sets.
b. Specificity: as shown in FIGS. 3(b) and 4(b), the DBGPWN algorithm is more specific than the MCL, RANCoC and RNSC algorithms on the human network. For the yeast network, when $\sigma_{OS}$ equals 0.35 or 0.4, the DBGPWN algorithm is more specific than the other comparison algorithms.
c. F-measure: as shown in FIG. 3(c), when $\sigma_{OS}$ is above 0.25, the DBGPWN algorithm has a higher F-measure than the other comparison algorithms. Note that the larger the value of $\sigma_{OS}$, the more reliable each evaluation metric is. Since the F-measure represents the overall performance of an algorithm, the result shown in FIG. 3(c) reflects the superiority of the DBGPWN algorithm. As shown in FIG. 4(c), for the yeast network, when $\sigma_{OS}$ equals 0.35 or 0.4, the DBGPWN algorithm has an F-measure equivalent to that of the MCL algorithm and higher than those of the MCODE, RANCoC and RNSC algorithms.
d. MCC: as shown in FIGS. 3(d) and 4(d), for the human network, when $\sigma_{OS} > 0.2$, the DBGPWN algorithm outperforms the MCL, RANCoC and RNSC algorithms. For the yeast network, when $\sigma_{OS} \ge 0.3$, DBGPWN performs best. Because results obtained at higher overlap ratios are more meaningful, DBGPWN can be considered able to efficiently identify potential complexes in the protein interaction network.
e. Number of highly matched complexes: we further compared DBGPWN with DME in their ability to find highly matched complexes. The clusters found by DME mostly overlap with one another, and tens or even hundreds of clusters may match only one known complex. Since the complexes in MIPS mostly overlap, DME finds a much larger number of matching complexes (TNs) than the other algorithms. To find as many TNs as possible, we chose a lower density threshold. When the threshold was 0.88, DME found 3137 clusters in the yeast network in 10.7 hours. As shown in FIG. 5, DBGPWN matches more complexes for the yeast network.
f. Aggregate score: we calculated p-values for the clusters found by each method with cutoff set to 0.05, and then compared the aggregate scores of the methods. A lower aggregate score indicates better performance. In FIG. 6, (a) shows the aggregate score of each algorithm on the human network and (b) shows the aggregate score of each algorithm on the yeast network. As can be seen from FIG. 6, DBGPWN achieves better clustering accuracy than the MCL, RANCoC and RNSC algorithms and ensures higher biological significance.

Claims (6)

1. A protein network complex identification method based on semantic density, characterized in that the method is implemented according to the following steps:
step 1, for an unweighted protein interaction network data set, searching the gene ontology database GO for the attributes of all proteins in the network data set;
step 2, based on the search result of step 1, adopting a semantic similarity calculation method based on the gene ontology to calculate the similarity between connected proteins in the network data set of step 1;
step 3, converting the protein interaction network data set given in step 1 into a weighted undirected network data set according to the similarity result obtained in step 2, wherein nodes represent proteins, edges represent the interactions between proteins, and the similarity between proteins is the weight of the edges;
step 4, adopting a density-based graph partitioning algorithm to find dense subgraphs in the weighted undirected graph obtained in step 3, wherein the graph partitioning algorithm is called DBGPWN, and the dense subgraphs obtained are the protein network complexes identified based on semantic density;
the semantic similarity calculation method based on the gene ontology in step 2 specifically comprises the following steps:
step 2.1, taking a protein A and a protein B as the analysis objects, constructing three joint DAGs for protein A and for protein B respectively, using the three types of GO attributes, namely biological process P, molecular function F and cellular component C, and calculating the semantic contribution (S-value) of the attributes in each of the three joint DAGs of each protein to the corresponding protein;
step 2.2, according to the S-values obtained in step 2.1, calculating the similarity between attributes of the same type among the three types of GO attributes adopted by protein A and protein B, namely calculating $Sim_p(a,b)$, $Sim_f(a,b)$ and $Sim_c(a,b)$ respectively;
step 2.3, solving the mean square value of the similarity results obtained in step 2.2, and obtaining the similarity between connected proteins in the network data set of step 1 from the mean square value.
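Step 3 of the claim can be pictured with a short Python sketch that turns an unweighted edge list into a weighted undirected network. The function name `build_weighted_network` and the `similarity` callback are hypothetical; the callback stands in for the gene-ontology-based similarity of steps 2.1 to 2.3, which is not implemented here, and the protein names and weights in the usage example are toy values.

```python
from typing import Callable, Dict, Iterable, Tuple

WeightedNetwork = Dict[str, Dict[str, float]]


def build_weighted_network(
    edges: Iterable[Tuple[str, str]],
    similarity: Callable[[str, str], float],
) -> WeightedNetwork:
    """Attach a similarity weight to every interaction.

    Nodes are proteins, edges are interactions, and the weight of an edge
    is the similarity of its two proteins. similarity(a, b) is a
    caller-supplied stand-in for the semantic similarity of step 2.
    """
    network: WeightedNetwork = {}
    for a, b in edges:
        weight = similarity(a, b)
        # store the weight in both directions because the network is undirected
        network.setdefault(a, {})[b] = weight
        network.setdefault(b, {})[a] = weight
    return network


if __name__ == "__main__":
    toy_similarity = {frozenset(("P1", "P2")): 0.8, frozenset(("P2", "P3")): 0.3}
    net = build_weighted_network(
        [("P1", "P2"), ("P2", "P3")],
        lambda a, b: toy_similarity[frozenset((a, b))],
    )
    print(net)
```

A nested dictionary keeps neighbour lookups cheap, which suits the density-based partitioning of step 4 where neighbours of each protein are scanned repeatedly.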
2. The protein network complex identification method based on semantic density as claimed in claim 1, wherein the specific process of step 2.1 is as follows: the semantic contribution (S-value) of the attributes in the three joint DAGs of protein A to protein A, and the semantic contribution (S-value) of the attributes in the three joint DAGs of protein B to protein B, are obtained through formula (1),
wherein $w_e$ represents the semantic genetic weight of edge e, $e \in E$, connecting the attribute t and its sub-attribute t'; $S_\mu(t)$ represents the semantic genetic value, i.e., the S-value, of a GO attribute t for the attribute group $\mu$; $\mu$ represents a set of attributes; childrenof represents the sub-attributes of an attribute; E represents the set of edges in the graph; and $S_\mu(t')$ represents the semantic genetic value, i.e., the S-value, of the sub-attribute t' of the attribute t for the attribute group $\mu$.
3. The protein network complex identification method based on semantic density as claimed in claim 2, wherein the specific process of step 2.2 is as follows: $Sim_p(a,b)$, $Sim_f(a,b)$ and $Sim_c(a,b)$ are obtained according to the following formula (2):
wherein $S_A(t)$ and $S_B(t)$ represent the S-values of attribute t for A and B respectively; $T_A$ represents the set consisting of attribute A and its ancestor attributes; and $T_B$ represents the set consisting of attribute B and its ancestor attributes.
4. The protein network complex identification method based on semantic density as claimed in claim 3, wherein: the specific process of step 2.3 is to solve the mean square value by the following formula (4):
5. The protein network complex identification method based on semantic density as claimed in claim 1, wherein the specific process of the density-based graph partitioning algorithm in step 4 is as follows:
step 4.1, calculating the semantic aggregation coefficients between all edge-connected proteins in the weighted undirected network data set obtained in step 3;
step 4.2, marking all proteins in the weighted undirected network data set obtained in step 3 as "not clustered";
step 4.3, selecting one protein p from the proteins marked in step 4.2, selecting from the weighted undirected network obtained in step 3, according to the calculation result of step 4.1, all proteins that are density connectable with the protein p, and taking the selected proteins together with the protein p as one cluster C, wherein the proteins in cluster C are marked as "clustered";
step 4.4, repeating step 4.3 until all proteins meeting the clustering requirements are marked as "clustered", and deleting the proteins that do not meet the clustering requirements from the weighted undirected network as noise points, wherein the resulting clusters are the dense subgraphs discovered by the DBGPWN algorithm.
6. The protein network complex identification method based on semantic density as claimed in claim 5, wherein the specific process of step 4.1 is as follows: the semantic aggregation coefficient is obtained by the following formula (3):
wherein $A_{i,k} > 0 \,\&\, A_{k,j} > 0$ indicates that node k is connected by an edge to both node i and node j, and $\sum_{k \in V} \{A_{i,k} + A_{k,j} \mid A_{i,k} > 0 \,\&\, A_{k,j} > 0\}$ represents the sum of the weights of the edges joining node i and node j to the nodes connected to both of them;
given a semantically weighted protein interaction network G, its adjacency matrix is denoted $A_{i,j}$; if the nodes i and j in G are connected by an edge, $A_{i,j}$ equals the weight of that edge, otherwise it is 0; if $i = j$, $A_{i,j} = 1$.
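The adjacency matrix convention and the common-neighbour sum described in this claim can be sketched in Python as below. Only the sum $\sum_{k \in V} \{A_{i,k} + A_{k,j} \mid A_{i,k} > 0 \,\&\, A_{k,j} > 0\}$ is computed; the complete semantic aggregation coefficient of formula (3) is not reproduced. The function names and the toy weights are illustrative assumptions.

```python
from typing import Iterable, List, Tuple


def adjacency_matrix(
    num_nodes: int, weighted_edges: Iterable[Tuple[int, int, float]]
) -> List[List[float]]:
    """A[i][j] is the edge weight, 0 when there is no edge, and 1 when i == j."""
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0                      # diagonal convention from the claim
    for i, j, weight in weighted_edges:
        a[i][j] = weight                   # undirected: fill both directions
        a[j][i] = weight
    return a


def common_neighbor_weight_sum(a: List[List[float]], i: int, j: int) -> float:
    """Sum of A[i][k] + A[k][j] over every k linked to both i and j."""
    return sum(
        a[i][k] + a[k][j]
        for k in range(len(a))
        if a[i][k] > 0 and a[k][j] > 0
    )


if __name__ == "__main__":
    # Toy triangle: 0-1 (0.5), 0-2 (0.4), 1-2 (0.8)
    a = adjacency_matrix(3, [(0, 1, 0.5), (0, 2, 0.4), (1, 2, 0.8)])
    print(common_neighbor_weight_sum(a, 0, 1))
```

Because the diagonal is 1, the terms k = i and k = j also satisfy the condition whenever i and j are directly connected, so the direct edge weight contributes to the sum alongside the true common neighbours.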

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510338321.7A CN104992078B (en) 2015-06-17 2015-06-17 A kind of protein network complex recognizing method based on semantic density

Publications (2)

Publication Number Publication Date
CN104992078A CN104992078A (en) 2015-10-21
CN104992078B true CN104992078B (en) 2018-02-16

Family

ID=54303891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510338321.7A Expired - Fee Related CN104992078B (en) 2015-06-17 2015-06-17 A kind of protein network complex recognizing method based on semantic density

Country Status (1)

Country Link
CN (1) CN104992078B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180216

Termination date: 20200617
