CN104992078B

CN104992078B - A protein network complex identification method based on semantic density

Info

Publication number: CN104992078B
Application number: CN201510338321.7A
Authority: CN
Inventors: 周红芳; 段文聪; 郭杰; 王心怡; 何馨依; 刘杰; 李锦�
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2015-06-17
Filing date: 2015-06-17
Publication date: 2018-02-16
Anticipated expiration: 2035-06-17
Also published as: CN104992078A

Abstract

The invention discloses a protein network complex identification method based on semantic density, implemented according to the following steps: for an unweighted protein-protein interaction network data set, look up the attributes of all proteins in the data set in the Gene Ontology (GO) database; based on the lookup result, calculate the similarity between the connected proteins in the data set using a semantic similarity calculation method based on the gene ontology; according to the similarity results, convert the given protein-protein interaction network data set into a weighted undirected network data set, in which nodes represent proteins, edges represent interactions between proteins, and the similarity between proteins is the edge weight. The method can identify protein complexes from a protein-protein interaction network with high identification accuracy and low time complexity.

Description

Protein network complex identification method based on semantic density

Technical Field

The invention belongs to the technical field of data mining methods and relates to a protein network complex identification method based on semantic density.

Background

Empirical studies and theoretical simulations of complex networks have a long history, and many related techniques and methods derived from statistical physics and applied mathematics have been proposed. Network-based concepts have also been applied successfully in molecular biology research. Proteins in biological systems interact with each other to perform a wide variety of molecular biological functions; these interactions are referred to as PPIs (Protein-Protein Interactions). A biological system composed of proteins and their interactions can be formally represented as an undirected graph, i.e., a protein interaction network (PPI network), or simply a protein network. In a protein network, each node represents a protein and each edge represents an interaction between two proteins. By analyzing protein networks, researchers can better understand the structure and properties of molecular biological systems, for example by identifying protein complexes and assessing the criticality of individual proteins.

Proteins that have the same molecular function at a specific time and place are considered to constitute a single biomolecule, i.e., a protein complex, if there are many interactions between them. In the past, protein complexes were found primarily by biochemical experiments, such as affinity purification followed by mass spectrometry (AP/MS). However, most experimental methods are neither very reliable nor efficient. In recent years, a number of data mining methods based on clustering techniques have been proposed and successfully applied to the identification of protein complexes. These methods allow the identification of numerous protein complexes in protein interaction networks that could not have been found experimentally. According to their characteristics, these clustering methods can be classified into hierarchical clustering, objective function clustering, and density-based clustering.

Hierarchical clustering techniques have been widely applied to analyze various types of complex networks, such as online social networks and protein interaction networks. The main idea is to divide the network into several sub-networks based on the similarities between connected nodes. Hierarchical clustering can be further divided into agglomerative and divisive methods. The best-known divisive method is the GN algorithm, while the most representative agglomerative method is the CNM algorithm.

Both objective function clustering and density-based clustering are based on graph partitioning. The former partitions the graph by optimizing an objective function, and the latter finds subgraphs of maximum density, such as cliques, based on the topological characteristics of the network. The well-known RNSC algorithm identifies complexes in a protein interaction network by optimizing a specific cost function. In recent years, many similar multi-objective methods have been proposed, most of which solve the multi-objective optimization problem with evolutionary computation methods such as genetic algorithms and firefly algorithms. RANCoC is a co-clustering method for searching for dense subgraphs in protein interaction networks. Given an unweighted graph of a protein interaction network, a dense subgraph is defined as a submatrix of higher quality. RANCoC essentially discovers dense subgraphs by optimizing a quality function, which requires several conditions to be met. In addition, RANCoC applies a new heuristic to avoid being trapped in local optima.

It is generally accepted that a dense subgraph should have dense internal edges, i.e., edges exist between most pairs of nodes inside the subgraph. The denser a subgraph is, the more likely it is to be a community in a social network or a complex in a protein interaction network. The goal of density-based approaches is to find dense regions in the graph and treat connected dense regions as dense subgraphs. Different methods quantify density in different ways. The classical MCODE algorithm uses k-cores and the core clustering coefficient to discover complexes. A k-core is a subgraph in which the degree of every node is greater than or equal to k; the k-core with the largest k is considered the densest subgraph. Another well-known definition of a dense subgraph is the clique, in which all nodes are connected to each other by edges. Two cliques with k nodes are considered adjacent if they share k-1 nodes, and a k-clique community is a set of adjacent k-cliques. The pseudo-cliques employed by the DME algorithm are extensions of cliques obtained by removing a certain number of edges from a clique. A pseudo-clique is a subgraph whose number of edges is slightly smaller than that of a clique with the same number of nodes, where the ratio of the two must be greater than a given threshold. There are also many other kinds of methods for finding dense subgraphs, such as the flow simulation employed by the MCL algorithm.

Both hierarchical clustering methods and objective-function-based clustering methods require the optimization of one or more functions. However, evaluation functions for network structure often have limitations; for example, the modularity adopted by hierarchical methods suffers from the resolution limit problem. In addition, computing a global optimization function may increase the time complexity of the algorithm, and multi-objective optimization is itself a difficult research problem. Density-based approaches do not require the optimization of multiple functions and have lower time complexity. On balance, density-based methods are superior to the other two kinds of methods.

Disclosure of Invention

The invention aims to provide a protein network complex identification method based on semantic density, which can identify a protein complex from a protein interaction network, and has high identification accuracy and low time complexity.

The technical scheme adopted by the invention is a protein network complex identification method based on semantic density, implemented according to the following steps:

Step 1: for an unweighted protein interaction network data set, look up the attributes of all proteins in the network data set in the Gene Ontology (GO) database;

Step 2: based on the lookup result of step 1, calculate the similarity between the connected proteins in the network data set of step 1 using a semantic similarity calculation method based on the gene ontology;

Step 3: according to the similarity results obtained in step 2, convert the protein interaction network data set given in step 1 into a weighted undirected network data set, in which nodes represent proteins, edges represent interactions between proteins, and the similarity between proteins is the edge weight;

Step 4: find dense subgraphs in the weighted undirected graph obtained in step 3 using a density-based graph partitioning algorithm called DBGPWN; the dense subgraphs obtained are the protein network complexes identified on the basis of semantic density.

The invention is further characterized as follows.

The semantic similarity calculation method based on the gene ontology in step 2 comprises the following specific steps:

Step 2.1: take a protein A and a protein B as the objects of analysis; construct three joint DAGs for each of protein A and protein B, one for each of the three classes of GO attributes (biological process P, molecular function F, and cellular component C); and calculate the semantic contribution (S-value) of the attributes in each of the three joint DAGs to the corresponding protein;

Step 2.2: according to the S-values obtained in step 2.1, calculate the similarity between attributes of the same class for protein A and protein B, i.e., calculate Sim_p(a,b), Sim_f(a,b), and Sim_c(a,b) respectively;

Step 2.3: compute the mean square value of the similarities obtained in step 2.2; this mean square value is the similarity between the connected proteins in the network data set of step 1.

The specific process of step 2.1 is to obtain, by the following formula (1), the S-values (semantic contributions) to protein A of the attributes in the three joint DAGs of protein A, and the S-values to protein B of the attributes in the three joint DAGs of protein B,

where w_e denotes the semantic inheritance weight of edge e, e ∈ E_μ, and e connects attribute t with its child attribute t'.

The specific process of step 2.2 is to obtain Sim_p(a,b), Sim_f(a,b), and Sim_c(a,b) according to the following formula (2):

where S_A(t) and S_B(t) denote the S-values of attribute t for A and B.

The specific process of step 2.3 is to find the mean square value by the following formula (4):

The specific process of the density-based graph partitioning algorithm in step 4 is as follows:

Step 4.1: calculate the semantic clustering coefficients between all edge-connected proteins in the weighted undirected network data set obtained in step 3;

Step 4.2: mark all proteins in the weighted undirected network data set obtained in step 3 as 'not clustered';

Step 4.3: select one protein p from the proteins marked in step 4.2; according to the result of step 4.1, select from the weighted undirected network obtained in step 3 all proteins that are density-connectable with protein p; take the selected proteins together with protein p as one cluster C, and mark the proteins in cluster C as 'clustered';

Step 4.4: repeat step 4.3 until all proteins that meet the clustering requirement are marked as 'clustered', and delete the proteins that do not meet the clustering requirement from the weighted undirected network as noise points. The resulting clusters are the dense subgraphs found by the DBGPWN algorithm.

The specific process of step 4.1 is to obtain the semantic clustering coefficient by the following formula (3):

where the condition A_{i,k} > 0 & A_{k,j} > 0 indicates that node k is connected by an edge to both node i and node j, and Σ_{k∈V} {A_{i,k} + A_{k,j} | A_{i,k} > 0 & A_{k,j} > 0} is the sum of the weights of the edges connecting nodes i and j to their common neighbors.

The advantages of the invention are as follows. The proposed graph partitioning algorithm is a density-based clustering algorithm; it is insensitive to the shape and size of the clusters in a data set and does not need to optimize any function. Density is quantified by the semantic clustering coefficient, a definition that is suitable for weighted graphs. The algorithm has low time complexity and high clustering accuracy. Compared with the original method, the improved semantic similarity calculation method based on the gene ontology takes less time and can be applied to large-scale networks.

Drawings

FIG. 1 is a DAG of GO attribute relationships used in the method of the present invention;

FIG. 2 is the joint DAG of the annotated attributes of protein P56524 in the method of the present invention;

FIG. 3 shows the experimental results of DBGPWN and the comparison methods on the human network;

FIG. 4 shows the experimental results of DBGPWN and the comparison methods on the yeast network;

FIG. 5 shows the number of highly matched complexes found by DBGPWN and DME;

FIG. 6 shows the aggregate scores of DBGPWN and the comparison algorithms on the human network and the yeast network.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

The protein network complex identification method based on semantic density of the present invention involves the following concepts and definitions:

The Gene Ontology database (GO) is a large-scale bioinformatics database whose aim is to unify the way the attributes of genes and proteins are described across all organisms. GO collects biological and biochemical attributes that describe the function of a protein or its location in the cell. All attributes fall into three classes: biological process, molecular function, and cellular component, abbreviated as P, F, and C respectively. As an ontology, GO contains two semantic relations, "is_a" and "part_of". GO is typically represented as a directed acyclic graph (DAG), in which nodes represent individual attributes and edges represent semantic relations between attributes. "is_a" denotes a subclass relation, and "part_of" denotes a part-whole relation. FIG. 1 shows the DAG formed by the GO attribute 'intracellular membrane-bound organelle' and its semantically related attributes, with dashed arrows representing "part_of" and solid arrows representing "is_a".

The similarity of two GO attributes can be inferred from their positions in the DAG. If two GO attributes have the same parent attribute, i.e., both are subclasses of some GO attribute, they can be considered relatively similar; examples in FIG. 1 are 'intracellular organelle' (GO:0043229) and 'membrane-bound organelle' (GO:0043227), which are both subclasses of 'organelle' (GO:0043226). A semantic value is defined to represent how much each ancestor attribute in DAG_A is semantically inherited by attribute A: the closer an ancestor attribute is to attribute A in DAG_A, the more is inherited. The semantic inheritance of a GO attribute t by an attribute A can be expressed quantitatively as the semantic value of t with respect to A, abbreviated as S-value.

Definition 1, S-value for an attribute set: for a joint DAG_μ = (A, T_μ, E_μ), where μ is an attribute set, the S-value of any attribute t in T_μ with respect to the attribute set μ can be calculated by equation (1),

where w_e denotes the semantic inheritance weight of edge e, e ∈ E_μ, and e connects attribute t with its child attribute t'. Computing the S-value of t with respect to μ essentially amounts to finding the shortest path between t and μ in the DAG; because μ contains several attributes, the shortest path between t and each attribute of μ must be found and the shortest of these selected. The optimal values of w_e for the semantic relations "is_a" and "part_of" are 0.8 and 0.6, respectively. The semantic similarity of two GO attributes can then be calculated from the semantic inheritance of their ancestor attributes. Taking FIG. 1 as an example, because of the "is_a" relation, the S-value of the GO attribute 'membrane-bound organelle' (GO:0043227) with respect to 'intracellular membrane-bound organelle' (GO:0043231) is 0.8, i.e., S_{GO:0043231}(GO:0043227) = 0.8. Since 'membrane-bound organelle' is a subclass of 'organelle', equation (1) gives the S-value of 'organelle' (GO:0043226) with respect to 'intracellular membrane-bound organelle' (GO:0043231) as S_{GO:0043231}(GO:0043226) = S_{GO:0043231}(GO:0043227) × S_{GO:0043227}(GO:0043226) = 0.64.
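The worked example above can be reproduced with a short sketch. It assumes that formula (1) has the usual recursive form: S_μ(t) = 1 if t belongs to μ, and otherwise the maximum over child edges e = (t, t') of w_e × S_μ(t'). The function name `s_values`, the `children` dictionary, and the tiny graph fragment are hypothetical and only contain the edges named in the text.

```python
# Edge weights from the text: 0.8 for "is_a", 0.6 for "part_of".
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}


def s_values(children, targets):
    """Return {attribute: S-value} with respect to the attribute set `targets`.

    children maps an attribute to a list of (child, relation) pairs.
    Assumed form of formula (1): 1 for attributes in the set, otherwise the
    largest w_e * S(child) over all child edges.
    """
    targets = set(targets)
    memo = {}

    def s(term):
        if term in targets:  # attributes of the set itself contribute 1
            return 1.0
        if term not in memo:
            memo[term] = max(
                (WEIGHTS[rel] * s(child) for child, rel in children.get(term, [])),
                default=0.0,
            )
        return memo[term]

    terms = set(children) | targets
    return {t: s(t) for t in terms if s(t) > 0.0}


# Hypothetical fragment of FIG. 1 (only the is_a edges named in the text).
children = {
    "GO:0043226": [("GO:0043227", "is_a"), ("GO:0043229", "is_a")],
    "GO:0043227": [("GO:0043231", "is_a")],
}
sv = s_values(children, {"GO:0043231"})
print(round(sv["GO:0043227"], 2), round(sv["GO:0043226"], 2))  # 0.8 0.64
```

Taking the maximum product of edge weights plays the role of the "shortest path" described above, since every weight is below 1 and longer paths inherit less.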

Definition 2, semantic similarity of two proteins: given two proteins A and B whose joint DAGs are DAG_A = (A, T_A, E_A) and DAG_B = (B, T_B, E_B), the similarity between the two proteins is calculated as follows:

where S_A(t) and S_B(t) denote the S-values of attribute t for A and B, calculated according to equation (1).

Equation (2) essentially computes the ratio of the S-values of the attributes common to DAG_A and DAG_B to the sum of all S-values. If DAG_A and DAG_B share many attributes, and the S-values of those shared attributes are large and account for a large proportion of the sum of all S-values, proteins A and B are considered similar. It is generally considered that two proteins are similar if their similarity is greater than 0.5.
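The ratio described above can be sketched as follows. The sketch assumes that equation (2) sums S_A(t) + S_B(t) over the attributes common to both joint DAGs and divides by the sum of all S-values of both DAGs; the function name and the toy S-value dictionaries are hypothetical.

```python
def semantic_similarity(s_a, s_b):
    """Similarity of two proteins from their S-value dictionaries.

    s_a and s_b map each attribute of the joint DAG of protein A (or B)
    to its S-value. Assumed form of equation (2): S-values of the shared
    attributes divided by the sum of all S-values.
    """
    common = s_a.keys() & s_b.keys()
    shared = sum(s_a[t] + s_b[t] for t in common)
    total = sum(s_a.values()) + sum(s_b.values())
    return shared / total if total else 0.0


# Toy data: the two proteins share attributes t2 and t3.
s_a = {"t1": 1.0, "t2": 0.8, "t3": 0.64}
s_b = {"t4": 1.0, "t2": 0.8, "t3": 0.64}
sim = semantic_similarity(s_a, s_b)  # (1.6 + 1.28) / (2.44 + 2.44)
print(sim > 0.5)  # True: the pair would count as similar under the 0.5 rule
```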

Definition 3, semantic clustering coefficient (SCC): given a semantically weighted protein interaction network G(E, V, W) and its adjacency matrix A (if nodes i and j are connected by an edge in G, A_{i,j} equals the edge weight W_{i,j}; if i and j are not connected by an edge, A_{i,j} = 0; if i = j, A_{i,j} = 1; A_{i,i} = A_{j,j} = 0), the semantic clustering coefficient of two proteins can be calculated according to equation (3),

where the condition A_{i,k} > 0 & A_{k,j} > 0 indicates that node k is connected by an edge to both node i and node j, and Σ_{k∈V} {A_{i,k} + A_{k,j} | A_{i,k} > 0 & A_{k,j} > 0} is the sum of the weights of the edges connecting nodes i and j to their common neighbors. The weight of the edge between nodes i and j must also be taken into account when determining their similarity, so A_{i,j} and A_{j,i} are added to the numerator. The denominator of equation (3) is the sum of the weights of all edges incident to node i or node j. If many proteins in the network are similar to both protein i and protein j, then proteins i and j are likely to be similar to each other.
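A minimal sketch of the coefficient as it is described in words above. It assumes that the numerator of equation (3) is the common-neighbor sum plus A_{i,j} + A_{j,i}, that the denominator is the total weight of all edges incident to i plus that of all edges incident to j, and that there are no self-loops. The dictionary-of-dictionaries graph layout and the name `scc` are illustrative choices.

```python
def scc(adj, i, j):
    """Semantic clustering coefficient of nodes i and j (assumed form).

    adj[u][v] is the weight of the undirected edge u-v, stored in both
    directions; missing entries mean "no edge".
    """
    ni, nj = adj[i], adj[j]
    common = (ni.keys() & nj.keys()) - {i, j}
    # edges to common neighbors, plus the i-j edge counted in both directions
    num = sum(ni[k] + nj[k] for k in common) + ni.get(j, 0.0) + nj.get(i, 0.0)
    # all edge weights incident to i, plus all edge weights incident to j
    den = sum(ni.values()) + sum(nj.values())
    return num / den if den else 0.0


# Toy network: a triangle i-j-k with a weakly attached node m.
adj = {
    "i": {"j": 0.9, "k": 0.8},
    "j": {"i": 0.9, "k": 0.7, "m": 0.2},
    "k": {"i": 0.8, "j": 0.7},
    "m": {"j": 0.2},
}
print(scc(adj, "i", "j") > scc(adj, "j", "m"))  # True
```

Under these assumptions the value lies between 0 and 1 and is large when most of the edge weight around i and j goes to each other or to shared neighbors.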

Definition 4, directly density-reachable: given a semantically weighted protein interaction network G(E, V, W), two proteins i ∈ V and j ∈ V, and a parameter θ, if SCC(i, j) ≥ θ, then proteins i and j are considered directly density-reachable. The parameter θ is the only parameter of the density-based graph partitioning algorithm (DBGPWN).

Definition 5, density-reachable: given a semantically weighted protein interaction network G(E, V, W) and two proteins i ∈ V and j ∈ V, if there is a sequence of proteins p_1, ..., p_n in V (p_1 = i, p_n = j) in which p_{i+1} and p_i are directly density-reachable, then proteins i and j are considered density-reachable.

Definition 6, density-connectable: given a semantically weighted protein interaction network G(E, V, W) and two proteins i ∈ V and j ∈ V, if there is a protein k that is density-reachable from both protein i and protein j, then proteins i and j are considered density-connectable. Furthermore, if proteins i and j are directly density-reachable from each other, they are also considered density-connectable even when no third protein is density-reachable from both i and j.

The protein network complex identification method based on semantic density of the present invention is implemented according to the following steps:

Step 1: for an unweighted protein interaction network data set, look up the attributes of all proteins in the network data set in the Gene Ontology (GO) database;

Step 2: based on the lookup result of step 1, calculate the similarity between the connected proteins in the network data set of step 1 using a semantic similarity calculation method based on the gene ontology.

A GO attribute A and its related information in the whole of GO can be represented as a directed acyclic graph DAG_A = (A, T_A, E_A), where T_A is the set consisting of attribute A and its ancestor attributes, and E_A is the set of edges in the graph, i.e., the edges connecting the attributes in T_A. The semantic inheritance of a GO attribute t by attribute A can be expressed quantitatively as the S-value of t with respect to A. FIG. 2 shows an example: the annotated attributes of protein P56524 are indicated by thick black arrows, and their basic functional semantics is 'binding'. Comparing the joint DAGs of two proteins allows their similarity in molecular function to be inferred. If another protein also has binding-related attributes, it can be considered relatively similar to P56524 in molecular function.

The specific steps of step 2 are as follows:

Step 2.1: take a protein A and a protein B as the objects of analysis; construct three joint DAGs for each of protein A and protein B, one for each of the three classes of GO attributes (biological process P, molecular function F, and cellular component C); and calculate by formula (1) the semantic contribution, i.e., the S-value, of the attributes in each of the three joint DAGs to the corresponding protein. The larger the S-value of a GO attribute, the higher its semantic contribution;

Step 2.2: according to the S-values obtained in step 2.1, use formula (2) to calculate the similarity between attributes of the same class for protein A and protein B: the similarity between the 'biological process' attributes of protein A and those of protein B, the similarity between the 'molecular function' attributes of protein A and those of protein B, and the similarity between the 'cellular component' attributes of protein A and those of protein B. That is, calculate Sim_p(a,b) (the similarity of protein A and protein B with respect to the class P attributes, biological process), Sim_f(a,b) (their similarity with respect to the class F attributes, molecular function), and Sim_c(a,b) (their similarity with respect to the class C attributes, cellular component);

Step 2.3: compute by formula (4) the mean square value of the similarities obtained in step 2.2; this gives the similarity between the connected proteins in the network data set of step 1. Formula (4) is as follows:
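A minimal sketch of step 2.3, assuming that the "mean square value" is the root mean square of the three class similarities. This reading is an assumption; if the plain mean of the squares is intended, the square root should be dropped. The function name is hypothetical.

```python
from math import sqrt


def combined_similarity(sim_p, sim_f, sim_c):
    """Combine the P, F and C similarities into one edge weight.

    Assumption: root mean square of the three values, which keeps the
    result in [0, 1] and returns 1 when all three similarities are 1.
    """
    return sqrt((sim_p ** 2 + sim_f ** 2 + sim_c ** 2) / 3)


weight = combined_similarity(0.6, 0.8, 1.0)
print(round(weight, 3))  # 0.816
```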

Step 3: according to the protein similarities obtained in step 2 (i.e., the mean square values obtained in step 2.3), convert the protein interaction network data set given in step 1 into a weighted undirected network data set, in which nodes represent proteins, edges represent interactions between proteins, and the similarity between proteins is the edge weight;

We assume that proteins in the same complex are similar in all three classes of attributes. Some previous complex identification algorithms rely solely on the similarity of the proteins in a complex in terms of molecular function, which is insufficient. The similarity between the proteins of a known complex, the 13S condensin complex, was calculated with the method of the present invention, i.e., equations (1), (2), and (4). This complex has five proteins, and the mean square values of the semantic similarities between them are shown in Table 1.

Table 1. Mean square values of the similarity between the proteins of the 13S condensin complex

Protein	Q15003	Q15021	Q9BPX3	O95347	Q9NTJ3
Q15003	1	0.88	0.97	0.69	0.69
Q15021	0.88	1	0.87	0.72	0.72
Q9BPX3	0.97	0.87	1	0.67	0.67
O95347	0.69	0.72	0.67	1	0.98
Q9NTJ3	0.69	0.72	0.67	0.98	1

As shown in Table 1, the proteins in the complex can all be considered similar to each other based on the combined similarity measure. In the present invention, the joint similarity metric will be used to measure the strength of each interaction in the protein interaction network, transforming the network into a weighted graph, which will help the graph partitioning algorithm identify the complex.

Step 4: find dense subgraphs in the weighted undirected graph obtained in step 3 using a density-based graph partitioning algorithm called DBGPWN (Density Based Graphical partitioning for Weighted Networks); the dense subgraphs obtained are the complexes identified by the invention. In a density-based clustering algorithm, two data points that are directly density-reachable are assigned to the same cluster. In the DBGPWN algorithm, the semantic clustering coefficient is therefore used as the basis for judging whether two proteins are directly density-reachable: if the semantic clustering coefficient of two proteins is greater than or equal to a given threshold, the two proteins are considered directly density-reachable.

Step 4.1: calculate by formula (3) the semantic clustering coefficients between all edge-connected proteins in the weighted undirected network data set obtained in step 3;

Step 4.2: mark all proteins in the weighted undirected network data set obtained in step 3 as 'not clustered';

Step 4.3: select one protein p from the proteins marked in step 4.2; according to the result of step 4.1, select from the weighted undirected network obtained in step 3 all proteins that are density-connectable with protein p (density connectivity is judged according to Definitions 4, 5, and 6); take the selected proteins together with protein p as one cluster C, and mark the proteins in cluster C as 'clustered';

Step 4.4: repeat step 4.3 until all proteins that meet the clustering requirement are marked as 'clustered' (a protein meets the clustering requirement if a protein density-connectable with it can be found in the weighted undirected network data set obtained in step 3), and delete the proteins that do not meet the clustering requirement from the weighted undirected network as noise points. The resulting clusters are the dense subgraphs found by the DBGPWN algorithm, i.e., the protein network complexes identified on the basis of semantic density.

The DBGPWN algorithm pseudo-code is shown below.

Input: a semantically weighted protein interaction network G(E, V, W) and a parameter θ;

Output: clusters C_1, ..., C_n;

Begin
  Calculate the semantic clustering coefficients between all edge-connected proteins;
  // proteins are denoted by i and j
  For (i = 0; i < |V|; i++)
    If (protein i does not belong to any cluster) then
      Create a new cluster C and assign i to C;
      For (j = 0; j < |V|; j++)
        If (protein j does not belong to any cluster) then
          If (i and j are density-connectable) then
            Assign j to C;
      End For
  End For
End
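The pseudo-code above can be sketched in Python as follows. This is an illustrative reading, not the patent's implementation: it assumes the form of the semantic clustering coefficient described in Definition 3 (common-neighbor weights plus the i-j edge, divided by all incident edge weights), treats direct density reachability as symmetric, and grows each cluster by breadth-first search so that every density-connectable protein ends up in the same cluster. Proteins left alone in their cluster are reported as noise. All function names and the toy graph are hypothetical.

```python
from collections import deque


def scc(adj, i, j):
    """Semantic clustering coefficient of i and j (assumed form)."""
    ni, nj = adj[i], adj[j]
    common = (ni.keys() & nj.keys()) - {i, j}
    num = sum(ni[k] + nj[k] for k in common) + ni.get(j, 0.0) + nj.get(i, 0.0)
    den = sum(ni.values()) + sum(nj.values())
    return num / den if den else 0.0


def dbgpwn(adj, theta):
    """Return (clusters, noise) for a weighted undirected graph.

    adj[u][v] is the weight of edge u-v, stored in both directions.
    Two adjacent proteins are directly density-reachable if SCC >= theta.
    """
    clustered = set()
    clusters, noise = [], []
    for p in adj:
        if p in clustered:
            continue
        clustered.add(p)
        members, queue = [p], deque([p])
        while queue:  # collect everything density-connectable with p
            u = queue.popleft()
            for v in adj[u]:
                if v not in clustered and scc(adj, u, v) >= theta:
                    clustered.add(v)
                    members.append(v)
                    queue.append(v)
        if len(members) > 1:
            clusters.append(sorted(members))
        else:
            noise.append(p)  # no density-connectable partner: noise point
    return clusters, noise


# Toy network: two dense triangles joined by one weak edge, plus an isolated node.
adj = {
    "a": {"b": 0.9, "c": 0.9},
    "b": {"a": 0.9, "c": 0.9},
    "c": {"a": 0.9, "b": 0.9, "d": 0.1},
    "d": {"c": 0.1, "e": 0.9, "f": 0.9},
    "e": {"d": 0.9, "f": 0.9},
    "f": {"d": 0.9, "e": 0.9},
    "g": {},
}
clusters, noise = dbgpwn(adj, theta=0.4)
print(clusters, noise)  # [['a', 'b', 'c'], ['d', 'e', 'f']] ['g']
```

The weak c-d edge has a coefficient far below θ, so the two triangles are kept apart even though the graph is connected.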

To verify the ability of the present invention to identify complexes, the identified complexes were compared with the known complexes in MIPS. The MIPS database contains 1140 yeast complexes and 1845 human complexes. Some of these complexes overlap, and most are small. Note that only half of the proteins in the two protein interaction network data sets can be found in these known complexes. To measure how well the identified clusters match the known complexes, several evaluation metrics were used. The relevant definitions are as follows.

Definition 7, overlap ratio: given a cluster P and a known complex K, the overlap ratio of P and K, OS(P, K), is calculated as shown in equation (5).

where |V_P ∩ V_K| is the number of proteins that cluster P shares with the known complex K, |V_P| is the total number of proteins in cluster P, and |V_K| is the total number of proteins in complex K.

Set a threshold σ. For a cluster P and a known complex K, if OS(P, K) ≥ σ, the two are considered to match. Note that when P and K have only one protein in common, their overlap ratio is not meaningful. However, if |V_P| = 2, |V_K| = 3, and |V_P ∩ V_K| = 1, then OS(P, K) = 0.16. Previous experiments have shown that a reasonable threshold is 0.2, so such cases are likely to mislead the judgment of whether complexes match. Therefore, a condition is added to the original definition: if |V_P ∩ V_K| = 1, then OS(P, K) = 0.
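A minimal sketch of the overlap ratio with the added condition. It assumes that equation (5) has the form OS(P, K) = |V_P ∩ V_K|² / (|V_P| × |V_K|), which agrees with the example above (1 / (2 × 3) ≈ 0.16 before the added condition is applied). The function name and the toy protein identifiers are hypothetical.

```python
def overlap_score(cluster, known_complex):
    """Overlap ratio OS(P, K) with the single-shared-protein rule.

    Assumed form of equation (5): |P & K|**2 / (|P| * |K|).
    A single shared protein (or none) counts as no overlap.
    """
    p, k = set(cluster), set(known_complex)
    shared = len(p & k)
    if shared <= 1:
        return 0.0
    return shared ** 2 / (len(p) * len(k))


# The case from the text: 2 and 3 proteins with one in common.
print(overlap_score({"p1", "p2"}, {"p1", "p3", "p4"}))  # 0.0
# Two shared proteins out of three each: 4 / 9, above a threshold of 0.2.
print(overlap_score({"p1", "p2", "p3"}, {"p1", "p2", "p4"}) >= 0.2)  # True
```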

On the basis of the overlap ratio, the ability of one method to find complexes can be assessed using the following four metrics.

Definition 8, sensitivity and specificity: given a matching threshold σ, TP (true positives) is the number of identified clusters that match a known complex, i.e., for which OS(P, K) ≥ σ. FP (false positives) equals the total number of identified clusters minus TP. FN (false negatives) is the number of known complexes that are not matched by any identified cluster. Sensitivity (S_n) and specificity (S_p) can be expressed as:

Sensitivity is in effect the proportion of correct predictions among all true cases, and specificity is the proportion of correct predictions among all predictions.

For a more comprehensive comparison, two composite metrics are adopted: the F-measure and the Matthews correlation coefficient (MCC). The F-measure takes both sensitivity and specificity into account. MCC is essentially a coefficient that measures the correlation between the observed and predicted classes. They can be formally expressed as:
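The first three metrics can be sketched as follows, assuming the forms that match the verbal descriptions above: S_n = TP / (TP + FN), S_p = TP / (TP + FP), and the F-measure as the harmonic mean of S_n and S_p. MCC is left out of the sketch because the text does not define true negatives. The function names and the counts in the example are made up for illustration.

```python
def sensitivity(tp, fn):
    """Assumed form: share of true cases that were predicted correctly."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp, fp):
    """Assumed form: share of predictions that were correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def f_measure(sn, sp):
    """Harmonic mean of sensitivity and specificity (assumed form)."""
    return 2 * sn * sp / (sn + sp) if sn + sp else 0.0


# Illustrative counts only: 30 matched clusters, 10 unmatched clusters,
# 20 known complexes without a matching cluster.
sn = sensitivity(tp=30, fn=20)
sp = specificity(tp=30, fp=10)
f = f_measure(sn, sp)
print(round(sn, 2), round(sp, 2), round(f, 3))  # 0.6 0.75 0.667
```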

in addition, we used p-vlaues to measure the biological relevance of individual clusters. Given a set of proteins and their GO attribute annotations, p-vlaues can be used to calculate the statistical significance of these proteins sharing the same GO attribute. If a group of proteins has a very high p-vlaues for some property, then these proteins can be considered biologically homogeneous and can be more easily identified as a complex by each method based on functional enrichment analysis. The lower the p-vlaues found for a composite by one method, the better the performance proved.

Definition 9, p-value: given a cluster of n proteins, of which m share an annotation attribute x, the p-value is the probability of observing m or more of the n proteins having annotation attribute x. Under a hypergeometric distribution, it is calculated as shown in equation (14).

$$p\text{-value} = 1 - \sum_{i=0}^{m-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \qquad (14)$$

Where N is the total number of proteins in the database, of which M have attribute x. The lower the p-value, the more significant the enrichment of GO attribute x in the cluster. In general, the threshold for deciding whether a p-value is significant is 0.05. The p-value essentially estimates the probability that a cluster arises by chance. To measure the ability of a method to identify complexes using p-values, the aggregate score function can be used.
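As a rough illustration of Definition 9, the hypergeometric tail probability can be computed with exact integer binomials. The sketch sums the upper tail P(X ≥ m) directly, which is equivalent to one minus the lower-tail sum; the function name `hypergeom_p_value` and its argument order are assumptions made for this example.

```python
from math import comb


def hypergeom_p_value(N, M, n, m):
    """Probability that at least m of n sampled proteins carry attribute x.

    N: total number of proteins in the database
    M: number of proteins in the database annotated with attribute x
    n: number of proteins in the cluster
    m: number of proteins in the cluster annotated with attribute x
    """
    total = comb(N, n)
    upper = min(n, M)  # the cluster cannot contain more than min(n, M) annotated proteins
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return tail / total
```

For instance, with N = 10, M = 5, n = 2 and m = 2, the only favourable outcome is drawing two annotated proteins, giving 10/45 ≈ 0.22, which would not be significant at the 0.05 threshold.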

Definition 10, aggregate score: given a clustering result and the p-values of each cluster, the aggregate score of the result is calculated as shown in equation (15).

Wherein min(p_i) is the minimum p-value of cluster i, cutoff is the threshold used to decide whether a p-value is significant, n_S denotes the number of significant clusters, and n_I denotes the number of insignificant clusters. In the experiment, cutoff was set to 0.05. The aggregate score reflects the probability that the clustering result produced by a method arises by chance. Lower values indicate better clustering accuracy for the method.

We set the parameter θ of DBGPWN (the parameter θ in Definition 4) to 0.4 for experiments on the yeast network and to 0.25 for the human network. The results obtained with DBGPWN were compared with those of other well-established methods: RANCoC, RNSC, MCODE, MCL and DME. These comparison methods have all proven effective in identifying complexes from protein interaction networks, and their authors have made the source code available. The parameters of each method were set as in previous experiments, and the results obtained are consistent with those experiments.

We ran DBGPWN, MCL and DME on weighted networks, where the weights were calculated using the joint similarity metric. RANCoC, MCODE and RNSC were run on unweighted networks. The complexes identified by each method were compared with the complexes in the MIPS database. Performance was assessed using sensitivity, specificity, F-measure, MCC, number of highly matched complexes, and aggregate score. The experimental results are shown in the figures.

a. Sensitivity: as shown in fig. 3 (a) and 4 (a), the DBGPWN algorithm is more sensitive than the MCODE algorithm for both data sets.

b. Specificity: as shown in FIGS. 3(b) and 4(b), for the human network the DBGPWN algorithm has higher specificity than the MCL, RANCoC and RNSC algorithms. For the yeast network, when σ_OS equals 0.35 or 0.4, the DBGPWN algorithm has higher specificity than the other comparison algorithms.

c. F-measure: as shown in FIG. 3(c), when σ_OS is above 0.25, the DBGPWN algorithm has a higher F-measure than the other comparison algorithms. Note that the larger the value of σ_OS, the more reliable each evaluation index is. Since the F-measure represents the overall performance of an algorithm, the result shown in FIG. 3(c) reflects the superiority of the DBGPWN algorithm. As shown in FIG. 4(c), for the yeast network, when σ_OS equals 0.35 or 0.4, the F-measure of the DBGPWN algorithm is comparable to that of the MCL algorithm and higher than those of the MCODE, RANCoC and RNSC algorithms.

d. MCC: as shown in FIGS. 3(d) and 4(d), for the human network, when σ_OS > 0.2, the DBGPWN algorithm outperforms the MCL, RANCoC and RNSC algorithms. For the yeast network, when σ_OS ≥ 0.3, DBGPWN performs best. Because results obtained at a higher overlap threshold are more meaningful, DBGPWN can be considered able to efficiently identify potential complexes in a protein interaction network.

e. Number of highly matched complexes: we further compared DBGPWN with DME in terms of the ability to find highly matched complexes. The clusters found by DME mostly overlap one another, and tens or even hundreds of clusters may match only one known complex. Since the complexes in the MIPS database also largely overlap, DME finds a much larger number of matched complexes (TNs) than the other algorithms. To find as many TNs as possible, we chose a lower density threshold. With a threshold of 0.88, DME found 3137 clusters in the yeast network in 10.7 hours. As shown in FIG. 5, DBGPWN matches more complexes for the yeast network.

f. Aggregate score: we calculated the p-values of the clusters found by each method with cutoff set to 0.05, and then compared the aggregate scores of the methods. A lower aggregate score indicates better performance. In FIG. 6, (a) shows the aggregate score of each algorithm on the human network and (b) shows the aggregate score of each algorithm on the yeast network. As can be seen from FIG. 6, DBGPWN achieves better clustering accuracy than the MCL, RANCoC and RNSC algorithms and yields clusters of higher biological significance.

Claims

1. A protein network complex identification method based on semantic density, characterized in that the method is implemented according to the following steps:

step 4, adopting a density-based graph partitioning algorithm, called DBGPWN, to find dense subgraphs in the weighted undirected graph obtained in step 3, wherein the dense subgraphs obtained are the protein network complexes identified based on semantic density;

the semantic similarity calculation method based on the gene ontology in the step 2 specifically comprises the following steps:

step 2.1, setting a protein A and a protein B as analysis objects, constructing three joint DAGs by the protein A and the protein B respectively by adopting GO attributes of three types including a biological process P, a molecular function F and a cell component C, and respectively calculating semantic contributions S-values of the attributes of the three joint DAGs in each protein to the corresponding protein;

step 2.2, calculating, according to the S-values obtained in step 2.1, the similarity between attributes of the same type among the three types of GO attributes adopted for protein A and protein B, namely calculating Sim_p(a, b), Sim_f(a, b) and Sim_c(a, b) respectively;

2. The protein network complex identification method based on semantic density as claimed in claim 1, wherein the specific process of step 2.1 is as follows: the semantic contribution S-values of the attributes of the three joint DAGs of protein A to protein A, and the semantic contribution S-values of the attributes of the three joint DAGs of protein B to protein B, are obtained through formula (1),

wherein w_e represents the semantic contribution weight of edge e, e ∈ E_ε, where e connects attribute t and its child attribute t'; t represents a GO attribute, and S_μ(t) represents the semantic contribution value of GO attribute t to the attribute set μ, i.e., the S-value; μ represents a set of attributes; childrenof(t) represents the child attributes of t; E_ε represents the set of edges in the graph; t' represents a child attribute of attribute t, and S_μ(t') represents the semantic contribution value of child attribute t' to the attribute set μ, i.e., the S-value.

3. The protein network complex identification method based on semantic density as claimed in claim 2, wherein: the specific process of step 2.2 is as follows: Sim_p(a, b), Sim_f(a, b) and Sim_c(a, b) are obtained according to the following formula (2):

wherein S_A(t) and S_B(t) represent the S-values of attribute t for A and B; T_A represents the set consisting of attribute A and its ancestor attributes; T_B represents the set consisting of attribute B and its ancestor attributes.

4. The protein network complex identification method based on semantic density as claimed in claim 3, wherein: the specific process of step 2.3 is to solve the mean square value by the following formula (4):

5. The protein network complex identification method based on semantic density as claimed in claim 1, wherein: the specific process of the density-based graph partitioning algorithm in step 4 is as follows:

step 4.4, repeating step 4.3 until all proteins meeting the clustering requirements are marked as 'clustered', and removing the proteins that do not meet the clustering requirements from the weighted undirected network as noise points, wherein the resulting clusters are the dense subgraphs discovered by the DBGPWN algorithm.

6. The method for recognizing protein network complexes based on semantic density as claimed in claim 5, wherein: the specific process of the step 4.1 is as follows: the semantic aggregation coefficient is obtained by the following formula (3):

wherein A is _i,k ＞0&A _k,j > 0 indicates that node k is connected to node i or j by an edge, Σ _k∈V {A _i,k +A _k,j |A _i,k ＞0&A _k,j 0 represents the sum of the weights of the node i or j and the node connected with the two points together;

given a semantically weighted protein interaction network G, its adjacency matrix is denoted A_{i,j}; if nodes i and j in G are connected by an edge, A_{i,j} equals the weight of that edge, and otherwise A_{i,j} = 0; if i = j, A_{i,j} = 1.
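For illustration only, the adjacency matrix and the common-neighbor weight sum described above can be sketched as follows. This is not the claimed formula (3) itself, which is not reproduced here; it only builds A as defined and evaluates the sum Σ_{k∈V} {A_{i,k} + A_{k,j} | A_{i,k} > 0 & A_{k,j} > 0}. The function names and the integer node indexing are assumptions made for the example.

```python
def build_adjacency(num_nodes, weighted_edges):
    """Build the symmetric adjacency matrix A of a weighted undirected network.

    weighted_edges: iterable of (i, j, weight) with 0 <= i, j < num_nodes.
    """
    a = [[0.0] * num_nodes for _ in range(num_nodes)]
    for i in range(num_nodes):
        a[i][i] = 1.0  # A[i][j] = 1 when i == j, as stated in the text
    for i, j, weight in weighted_edges:
        if i != j:  # keep the diagonal fixed at 1
            a[i][j] = weight
            a[j][i] = weight
    return a


def common_neighbor_weight_sum(a, i, j):
    """Sum A[i][k] + A[k][j] over every node k linked to both i and j."""
    return sum(
        a[i][k] + a[k][j]
        for k in range(len(a))
        if a[i][k] > 0 and a[k][j] > 0
    )
```

Because the diagonal is 1, when i and j are directly connected the terms k = i and k = j also satisfy the condition as literally stated, so the edge between i and j itself contributes to the sum.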