CN109509509B

CN109509509B - Protein complex mining method based on dynamic weighted protein interaction network

Info

Publication number: CN109509509B
Application number: CN201811145616.2A
Authority: CN
Inventors: 毛伊敏; 朱海湾; 胡健
Original assignee: Jiangxi University of Science and Technology
Current assignee: Jiangxi University of Science and Technology
Priority date: 2018-09-29
Filing date: 2018-09-29
Publication date: 2020-12-22
Anticipated expiration: 2038-09-29
Also published as: CN109509509A

Abstract

The present disclosure provides a protein complex mining method based on a dynamic weighted protein interaction network, comprising the steps of: filtering inactive proteins using gene expression profile data to construct a dynamic protein interaction network, then weighting the dynamic protein interaction network with a comprehensive weight measure and adding new interactions, thereby constructing a dynamic weighted protein interaction network; constructing protein complex cores using the essentiality of proteins and the intrinsic properties of complexes; improving the pick-up rule of the ant colony algorithm with a fuzzy-granularity similarity function and optimizing the put-down rule with compactness, thereby mining protein complexes; using a local weight updating strategy to transfer optimal-solution information among different ant colonies, and a global weight updating strategy to transfer functional information between the dynamic weighted protein interaction networks at adjacent time points; and outputting the mined protein complexes.

Description

Protein complex mining method based on dynamic weighted protein interaction network

Technical Field

The disclosure relates to the field of systems biology, and in particular to a protein complex mining method based on a dynamic weighted protein interaction network.

Background

Proteins are the basis of all vital activities, and their functions are generally carried out through interactions with other proteins. In a living organism, the network formed by the interactions among proteins is called a protein-protein interaction (PPI) network, and a protein complex is a set of proteins that jointly perform a certain function at the same time and in the same place. Studying protein interactions and identifying meaningful modules in the PPI network, such as protein complexes and functional modules, helps to understand the processes of life and to predict proteins of unknown function, and also provides a theoretical basis for disease diagnosis and drug development. Because interaction data generally contain high rates of false positives and false negatives, efficient protein complex detection remains one of the most important challenges of the post-genomic era. Fast and effective mining of protein complexes is therefore of great significance for revealing the basic principles of cell composition and function, studying the roles of proteins in metabolic pathways, understanding the behavior of organisms, and drug design.

Currently, experimental methods for identifying protein complexes are time-consuming and costly, and are not applicable to all species. An effective computational method for mining protein complexes is therefore urgently needed to reduce experimental cost and improve experimental efficiency.

With the growing availability of high-throughput PPI data and protein data, many researchers have turned to computational complex mining, and a number of traditional mining algorithms have been proposed, such as the density-based molecular complex detection algorithm MCODE, the partition-based restricted neighborhood search clustering algorithm RNSC, and the hierarchy-based jerera algorithm. However, these algorithms each have shortcomings: some perform poorly on sparse networks, some cannot detect overlapping complexes, and some are sensitive to noise. In recent years, researchers have proposed new complex detection methods, such as methods based on flow simulation, methods based on the core-attachment structure, spectral clustering algorithms, and swarm intelligence algorithms. However, the clustering results of flow-simulation algorithms depend heavily on the given parameters; clustering methods based on the core-attachment structure have high complexity and are unsuitable for large-scale PPI networks; and spectral clustering algorithms fall back on a traditional clustering method after reducing the dimensionality of the data. Swarm intelligence optimization algorithms have strong global optimization capability and robustness. The ant colony algorithm in particular has a distinct advantage over other swarm intelligence algorithms: it can cluster directly, without relying on another clustering algorithm, and so fully exploits the strengths of swarm intelligence. The ant colony algorithm has been successfully applied to mining complexes and functional modules in PPI networks and has become a new research focus in the field.

Liu Shi et al. proposed NACO-FMD, an ant colony optimization algorithm for detecting functional modules in PPI networks, which designs a more targeted function to guide ant colony optimization and obtains a better clustering result. Liuhongxin proposed ACC-FMD, an ant colony clustering algorithm for functional module detection, which clusters nodes with a pick-up and put-down model, updates the similarity function with the optimal solution, drives the clustering result toward the optimum through repeated iteration, and finally merges and filters the clustering results. When applied to large-scale PPI networks, these ant colony clustering algorithms require a large number of pick-up, put-down, merge and filter operations, so they converge slowly and take too long to solve. Lujiawei et al. proposed MGRACO-FMD, an ant colony optimization algorithm based on a multi-granularity model, in an attempt to improve convergence speed, but the accuracy of its clustering results is not high. Lei et al. proposed an ant colony optimization clustering algorithm for PPI networks based on connection strength, which reduces the time overhead but has low recall. These algorithms improve time performance at the expense of accuracy and recall.

The prediction accuracy of the above algorithms depends on the reliability of the PPI network, yet currently available protein interaction data contain a large number of false positives and false negatives. In addition, these algorithms treat the PPI network as static and unchanging, but a static PPI network cannot truly reflect the dynamic changes within a cell, so mining protein complexes on a dynamic PPI network is more reasonable. As protein biological data and sequence data have accumulated, some researchers have recently tried to build more reliable dynamic PPI networks by incorporating such biological information, and thereby to mine more reliable protein complexes.

Tang et al. used gene expression data and a static PPI network to construct a time-course protein interaction network (TC-PIN) with a single uniform threshold, and successfully applied it to mining protein functional modules. Because the gene expression levels of different proteins are not on a common scale, a uniform threshold can make the constructed PPI network inaccurate, which in turn affects the clustering result. Hu et al. abandoned the uniform threshold, used the average expression level of each protein as the criterion for judging whether the protein is active, constructed a dynamic weighted network by combining complex information and domain information, and proposed the protein function prediction method D-PIN. Su et al. proposed GECluster, a complex mining algorithm based on a dynamic weighted PPI network, which first weights the dynamic network using GO-Slim and then mines protein complexes with a seed-node expansion strategy. That method measures the functional similarity between proteins using gene ontology information alone and does not fuse multiple data sources, so it cannot reflect protein interactions well. Yi et al. proposed DCA, a core-attachment protein complex detection method that weights each protein using edge clustering coefficients and consecutive co-expression length; its weighting scheme incorporates the temporal characteristics of complex evolution and therefore describes the similarity between proteins better. In the same year, Zhao et al. proposed a new complex identification algorithm that combines the temporal function-retention property of complexes with ant colony clustering. That algorithm approaches complex mining from a new perspective rather than innovating only in the clustering method. Its clustering accuracy is high, but its recall is mediocre, which may be related to the weight measure and the ant colony search mode.

Although protein complex mining based on dynamic PPI networks has achieved some success, it remains necessary to study how to filter false positive data effectively using gene expression profiles, how to integrate PPI data with multiple kinds of biological information in a reasonable way, and how to provide an effective weighting method that narrows the gap between the constructed network and the real network. In addition, applying the ant colony algorithm to large-scale PPI network clustering requires a large number of pick-up, put-down and filter operations, so convergence is slow; and because of the algorithm's high randomness, accuracy and recall are generally not high. These problems still need to be solved.

Disclosure of Invention

To address at least one of the above technical problems, the present disclosure provides a protein complex mining method based on a dynamically weighted protein interaction network.

According to one aspect of the present disclosure, a protein complex mining method based on a dynamically weighted protein interaction network includes the steps of:

constructing a dynamic weighted protein interaction network: inputting protein interaction data, gene expression profile data and gene ontology information, removing duplicates from the protein interaction network data, filtering inactive proteins using the gene expression profile data to construct a dynamic protein interaction network, and weighting the dynamic protein interaction network with a comprehensive weight measure and adding new interactions, thereby constructing the dynamic weighted protein interaction network;

constructing protein complex cores: inputting the dynamic weighted protein interaction network and the key protein set at each time point, optimizing the selection of seed nodes using the point-edge clustering coefficient, and constructing protein complex cores using the essentiality of proteins and the intrinsic properties of complexes;

ant colony clustering: improving the pick-up rule of the ant colony algorithm with a fuzzy-granularity similarity function, continuously loading protein nodes to form an initial clustering result, and correcting the initial clustering result with a compactness-optimized put-down rule, thereby mining protein complexes;

global and local weight updating: using a local weight updating strategy to transfer optimal-solution information among different ant colonies, and a global weight updating strategy to transfer functional information between the dynamic weighted protein interaction networks at adjacent time points; and

outputting a result: outputting the mined protein complexes.

According to at least one embodiment of the present disclosure, the step of constructing a dynamically weighted protein interaction network comprises:

the 36 time points of the gene expression profile data are combined into 12 time points by the following formula 1:

where T_u(i) denotes the gene expression value of protein u at time i, 1 ≤ i ≤ 12;

non-co-expressed proteins are filtered according to the following formula 2:

where T'_u denotes the mean gene expression value of protein u;

interactions are added to each dynamic subnetwork: if proteins u and v interact in the static protein interaction network and are co-expressed, their interaction is added to the network at that time point; if proteins u and v do not interact in the static protein interaction network but are co-expressed, whether an interaction is added is determined by the following formula 3:

where CWM(u, v) denotes the comprehensive weight measure of proteins u and v, CE_cc(u, v) denotes the point-edge clustering coefficient, FS(u, v) denotes the gene ontology functional similarity, and Pcc(u, v) denotes the Pearson correlation coefficient;

an interaction is added when CWM(u, v) is greater than 0, and is not added otherwise;

the 12 dynamic subnetworks are weighted with the comprehensive weight measure according to formula 3, yielding the dynamic weighted protein interaction network.

According to at least one embodiment of the present disclosure, the point-edge clustering coefficient CE_cc(u, v) is calculated by the following formula 4:

where tan_u,v denotes the number of triangles jointly formed by network nodes u and v, d_u and d_v denote the degrees of network nodes u and v, respectively, and C_u and C_v denote the node clustering coefficients of network nodes u and v, respectively;

the gene ontology functional similarity FS(u, v) is calculated by the following formula 5:

where |f_u ∩ f_v| denotes the number of gene ontology terms shared by proteins u and v, and |f_u| and |f_v| denote the numbers of gene ontology terms annotated to proteins u and v, respectively;

the Pearson correlation coefficient Pcc(u, v) is calculated by the following formula 6:

where k is the number of samples, i indexes the time points in the gene expression data, Exp(u, i) and Exp(v, i) denote the expression values of proteins u and v at time i, respectively, the mean expression values and the standard deviations σ(u), σ(v) of proteins u and v are taken over all time points, and Pcc(u, v) ∈ [-1, 1].
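For illustration, the gene ontology functional similarity can be sketched in Python. The squared-overlap ratio used here is an assumed form built only from the quantities |f_u ∩ f_v|, |f_u| and |f_v| defined above, and may differ from formula 5. The function name `go_similarity` and the term labels are hypothetical.

```python
def go_similarity(terms_u, terms_v):
    """Assumed form: |f_u & f_v|^2 / (|f_u| * |f_v|), which lies in [0, 1]."""
    f_u, f_v = set(terms_u), set(terms_v)
    if not f_u or not f_v:
        return 0.0  # a protein without annotations shares no terms
    shared = len(f_u & f_v)
    return shared * shared / (len(f_u) * len(f_v))


# Toy annotation sets (placeholder labels, not real GO identifiers).
fs_example = go_similarity({"term_a", "term_b"}, {"term_b", "term_c"})
fs_identical = go_similarity({"term_a"}, {"term_a"})
```

Under this assumed form, two proteins with identical annotation sets score 1 and proteins with no shared term score 0.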

According to at least one embodiment of the present disclosure, the step of constructing the protein complex core includes:

B1: calculating, for each key protein node, the sum SoCE_cc of the point-edge clustering coefficients of all its incident edges, and placing the nodes into an ordered queue Q₁ in descending order of SoCE_cc;

B2: taking from queue Q₁ the key protein node with the largest SoCE_cc to initialize a complex core C, and adding to the complex core C those direct neighbor nodes that satisfy the interaction threshold η and whose number of consecutive co-expressions is greater than or equal to m;

B3: judging whether the complex core C satisfies the density threshold d, and if not, repeatedly deleting the node with the smallest SoCE_cc until the complex core C satisfies the density threshold d;

B4: when the complex core C satisfies the density threshold d, storing the complex core C in the result queue Q₂ and deleting all nodes of the complex core C from the ordered queue Q₁;

B5: repeating steps B2, B3 and B4 until the ordered queue Q₁ is empty.

According to at least one embodiment of the present disclosure, the sum SoCE_cc of the point-edge clustering coefficients of all incident edges of a key protein node is calculated by the following formula 7:

where SoCE_cc(u) denotes the sum of the point-edge clustering coefficients of all edges incident to the key protein node u.

According to at least one embodiment of the present disclosure, the step of ant colony clustering includes:

C1: randomly selecting a complex core C from the result queue Q₂ as the initial position of the ant;

C2: calculating the fuzzy granularity of each node u within the ant's neighborhood, picking up a neighbor node that satisfies the condition, moving to that neighbor node, and updating the complex core and the ant's neighborhood; if no neighbor node satisfies the condition, skipping step C3 and proceeding directly to step C4;

C3: judging whether the ant's load has reached its maximum; if not, repeating step C2 to continue clustering the nodes in the ant's new neighborhood; if so, proceeding to step C4;

C4: obtaining the initial clustering result corresponding to the complex core C, deleting the complex core C from the result queue Q₂, and judging whether the result queue Q₂ is empty; if it is not empty, randomly selecting a complex core as the initial position of the ant and returning to step C2 to start a new round of search; if the result queue Q₂ is empty, proceeding to step C5;

C5: calculating the compactness between each node u and the complex PC, removing the nodes whose compactness is less than 1 to obtain the complex PC, and outputting the complex set CS.

According to at least one embodiment of the present disclosure, the fuzzy granularity is calculated by the following formula 8:

where the value computed for node u is its fuzzy granularity within the ant's neighborhood, |C| is the number of nodes in the complex core C, and α is the dissimilarity factor.

The compactness is calculated by the following formula 9:

where CD(u, PC) denotes the compactness between node u and the complex PC, d^in(u, v₁) denotes the weight of the edge connecting protein u to another protein v₁ inside the complex PC, and d^out(u, v₂) denotes the weight of the edge connecting protein u to a protein v₂ outside the complex PC.

According to at least one embodiment of the present disclosure, the local weight update is performed according to the following formula 10:

CWM(u, v) = (1 + PC_uv) · CWM(u, v)    (formula 10)

where PC_uv, the enhancement coefficient, is the probability that proteins u and v belong to the same complex in the optimal solution of the previous iteration.

According to at least one embodiment of the present disclosure, the enhancement coefficient PC_uv is calculated by the following formula 11:

where C_u and C_v denote the sets of complexes to which proteins u and v belong, respectively, and C_u ∩ C_v denotes the set of complexes that contain both proteins u and v.

According to at least one embodiment of the present disclosure, the global weight update is performed according to the following formula 12:

where the two count terms of formula 12 denote the number of times proteins u and v appear in the same complex in the optimal solutions of the transient networks at times T_i-1 and T_i, respectively, 0 ≤ α, β ≤ 1, and the weighting parameters, including β, are constants.

Drawings

The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.

Fig. 1 is a schematic diagram of the construction of a dynamically weighted protein interaction network according to at least one embodiment of the present disclosure.

Fig. 2 is a flow diagram of a protein complex mining method based on a dynamically weighted protein interaction network in accordance with at least one embodiment of the present disclosure.

FIG. 3 is a graph comparing clustering results of algorithms on a dynamic protein interaction network, in accordance with at least one embodiment of the present disclosure.

FIG. 4 is a graph comparing the results of DNA-directed RNA polymerase II complex detection according to various algorithms in at least one embodiment of the present disclosure.

Detailed Description

The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Building on ant colony clustering for protein complex mining, the disclosure provides an ant colony clustering method for mining complexes in dynamic weighted PPI networks based on fuzzy granularity and compactness (FGCDACC-DPC). First, a dynamic protein interaction network is constructed using gene expression profile data, and each dynamic subnetwork is weighted with the comprehensive weight measure CWM and supplemented with new interactions, yielding a dynamic weighted network. Then, a group of dense and highly co-expressed complex cores is constructed from the basic characteristics of protein complexes, and protein complexes are mined with a pick-up and put-down model (GCM) based on fuzzy granularity and compactness. Meanwhile, local and global weight updating strategies transfer the functional information of the optimal solution among different ant colonies and between networks at different time points.

In an alternative embodiment of the present disclosure, data analysis and experimental validation is preferably performed using yeast proteins as an example.

(I) Constructing a dynamic weighted PPI network:

The yeast protein interaction network is taken from the DIP database and, after deduplication, contains 5093 proteins and 24734 interactions. The gene expression profile data are the dataset numbered GSE3431, which contains expression values for 6777 genes at 36 time points; only 4981 of these genes appear in the yeast PPI network. The standard protein complexes are taken from the CYC2008 set, which contains 408 standard complexes with a maximum size of 81 and a minimum size of 3. Gene Ontology (GO) functional annotations are downloaded from the gene ontology database. Key protein data are obtained by integrating four databases, MIPS, SGD, DEG and SGDP, which together contain 1285 key proteins, of which only 1167 appear in the yeast PPI network. Owing to the limitations of experimental detection conditions and the "scale-free" and "small-world" nature of PPI networks, some of the biological data in protein interaction networks and bioinformatics resources are inaccurate, and the accuracy of protein complex detection is susceptible to false positives and false negatives. To reduce the influence of false positive and false negative data on the results, a dynamic weighted PPI network is constructed from the static PPI network by combining the topological and biological characteristics of the network, thereby improving the accuracy of protein complex mining. The static PPI network is adjusted according to the gene expression profile data to construct a dynamic PPI network; the dynamic PPI network is then weighted using the point-edge clustering coefficient, the Pearson correlation coefficient and the GO functional similarity together, and new interactions are added, to construct the dynamic weighted PPI network. The detailed process of constructing the dynamic weighted PPI network is as follows:

Based on the gene expression profile data, the 36 time points are combined into 12 time points by the following formula 1:

where T_u(i) denotes the gene expression value of protein u at time i, 1 ≤ i ≤ 12.

Non-co-expressed proteins are filtered according to the following formula 2:

where T'_u denotes the average gene expression value of protein u.
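The two preprocessing steps above can be sketched in Python. This is a minimal sketch under two stated assumptions: that the 36 time points consist of three consecutive cycles of 12 time points that are averaged position by position, and that a protein counts as active at time i when its merged expression value is at least its own mean T'_u. Formulas 1 and 2 may differ in detail, and the function names are hypothetical.

```python
from statistics import mean


def merge_cycles(expr36, cycles=3):
    """Average matching positions of the cycles: 36 values -> 12 values (assumed form)."""
    n = len(expr36) // cycles  # number of time points per cycle
    return [mean(expr36[i + c * n] for c in range(cycles)) for i in range(n)]


def active_time_points(expr12):
    """Indices i where the value reaches the protein's own mean (assumed activity rule)."""
    threshold = mean(expr12)
    return [i for i, value in enumerate(expr12) if value >= threshold]


# Toy profile: three identical cycles with values 0..11.
profile = [float(i % 12) for i in range(36)]
merged = merge_cycles(profile)
active = active_time_points(merged)
```

Using a per-protein threshold rather than one global cut-off follows the motivation given in the Background: expression levels of different proteins are not directly comparable.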

Interactions are then added to each dynamic subnetwork according to formula 3, where CWM(u, v) denotes the comprehensive weight measure of proteins u and v, CE_cc(u, v) denotes the point-edge clustering coefficient, FS(u, v) denotes the gene ontology functional similarity, and Pcc(u, v) denotes the Pearson correlation coefficient.

Further, the point-edge clustering coefficient CE_cc(u, v) is calculated by the following formula 4:

where tan_u,v denotes the number of triangles jointly formed by network nodes u and v, d_u and d_v denote the degrees of network nodes u and v, respectively, and C_u and C_v denote the node clustering coefficients of network nodes u and v, respectively.

The Pearson correlation coefficient Pcc(u, v) is calculated by the following formula 6:

where the mean expression values and the standard deviations σ(u), σ(v) of proteins u and v are taken over all time points, and Pcc(u, v) ∈ [-1, 1].
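As an illustration, the Pearson correlation coefficient of two expression profiles can be computed with the standard sample formula, which uses the same ingredients named above (k samples, mean expression values and standard deviations). Whether formula 6 normalizes by k or by k - 1 is an assumption here; the function name `pearson` is hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Sample Pearson correlation of two equally long expression profiles."""
    k = len(x)
    if k != len(y) or k < 2:
        raise ValueError("profiles must have the same length of at least 2")
    mean_x, mean_y = sum(x) / k, sum(y) / k
    # Sample standard deviations (k - 1 in the denominator, an assumed convention).
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x) / (k - 1))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y) / (k - 1))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / (k - 1)


pcc_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
pcc_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Profiles that rise and fall together give a value near 1, and profiles that move in opposite directions give a value near -1.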

When the comprehensive weight measure CWM(u, v) of proteins u and v is greater than 0, an interaction is added; otherwise it is not. The 12 dynamic subnetworks are weighted with the comprehensive weight measure according to formula 3, yielding the dynamic weighted protein interaction network.
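The edge-addition rule for one time point can be sketched as follows. The comprehensive weight measure is passed in as a callable `cwm`, because the exact combination in formula 3 is not assumed here; the function name, the data layout (edges as `frozenset` pairs) and the toy scores are hypothetical.

```python
from itertools import combinations


def weight_subnetwork(static_edges, active, cwm):
    """Build the weighted edge map of one dynamic subnetwork.

    static_edges: set of frozenset({u, v}) from the static PPI network
    active:       proteins that are active (co-expressed) at this time point
    cwm:          callable returning the comprehensive weight measure of (u, v)
    """
    weighted = {}
    for u, v in combinations(sorted(active), 2):
        edge = frozenset((u, v))
        score = cwm(u, v)
        # Keep known interactions; add a new one only when CWM(u, v) > 0.
        if edge in static_edges or score > 0:
            weighted[edge] = score
    return weighted


# Toy data: A-B is a known interaction, A-C and B-C are candidates.
scores = {
    frozenset(("A", "B")): 0.8,
    frozenset(("A", "C")): 0.3,
    frozenset(("B", "C")): -0.2,
}
subnet = weight_subnetwork(
    {frozenset(("A", "B"))},
    {"A", "B", "C"},
    lambda u, v: scores[frozenset((u, v))],
)
```

The sketch enumerates all active pairs for clarity; on a network of several thousand proteins a practical implementation would restrict the candidates, for example to pairs that share a neighbor.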

This method of constructing the dynamic weighted PPI network takes into account that the PPI network contains a large amount of false positive and false negative data, owing to the limitations of experimental conditions and the "scale-free" and "small-world" characteristics of protein networks. It effectively reduces the influence of noisy data on the clustering results of protein complex mining and fuses biological information about the proteins to improve mining accuracy.

Fig. 1 is a schematic diagram of the construction of the dynamic weighted PPI network and illustrates the dynamic character of the yeast PPI network. As Fig. 1 shows, protein activity and the interactions between proteins differ markedly from one time point to another. Because the real protein network changes constantly and a protein must be active in order to interact with other proteins, the interacting proteins in each transient network should be in an active state. Although constructing the dynamic network removes a large amount of false positive data, it inevitably increases false negatives. To reduce the negative influence of these false negatives on the clustering result, the dynamic PPI network is weighted with the comprehensive weight measure and new interactions are added, which improves the reliability of the network. The analysis shows that the dynamic weighted PPI network is closer to the real yeast PPI network, which improves clustering accuracy. Moreover, the distribution of protein functional modules across the dynamic weighted PPI networks shows clear statistical regularity: the modules are mainly enriched in certain interaction subnetworks, which indicates that not all of the dynamic weighted PPI networks are equally useful for mining protein complexes in cells.

(II) Constructing protein complex cores:

The intrinsic and biological properties of protein complexes are used to construct more realistic and reliable complex cores. First, all key proteins in the subnetwork at each time point are selected as the seed node set; then the constructed complex core is checked against the interaction threshold, the density threshold and the number of consecutive co-expressions, and the complex core is built accordingly. The detailed process of constructing the protein complex cores is as follows:

1) calculating, for each key protein node, the sum SoCE_cc of the point-edge clustering coefficients of all its incident edges, and placing the nodes into an ordered queue Q₁ in descending order of SoCE_cc;

SoCE_cc is calculated by the following formula 7:

where SoCE_cc(u) denotes the sum of the point-edge clustering coefficients of all edges incident to the key protein node u;

2) taking from queue Q₁ the key protein node with the largest SoCE_cc to initialize a complex core C, and adding to the complex core C those direct neighbor nodes that satisfy the interaction threshold η and whose number of consecutive co-expressions is greater than or equal to m, where the value range of m can be determined according to actual needs;

3) judging whether the complex core C satisfies the density threshold d, and if not, repeatedly deleting the node with the smallest SoCE_cc until the complex core C satisfies the density threshold d;

4) when the complex core C satisfies the density threshold d, storing the complex core C in the result queue Q₂ and deleting all nodes of the complex core C from the ordered queue Q₁;

5) repeating steps 2), 3) and 4) until the ordered queue Q₁ is empty.
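Steps 1) to 5) can be sketched in Python as follows. Several details are assumptions made for the sketch: density is taken as 2|E| / (|V|(|V| - 1)), a core with fewer than two nodes is treated as dense, the seed itself is never deleted, "satisfies the interaction threshold η" is read as edge weight ≥ η, and SoCE_cc values are supplied precomputed for every node. All names are hypothetical.

```python
from itertools import combinations


def density(nodes, adj):
    """Assumed density: 2|E| / (|V|(|V| - 1)); tiny cores count as dense."""
    n = len(nodes)
    if n < 2:
        return 1.0
    edges = sum(1 for a, b in combinations(nodes, 2) if b in adj[a])
    return 2.0 * edges / (n * (n - 1))


def build_cores(adj, weight, coexp, soce, key_proteins, eta, m, d):
    """Return the list of complex cores (the result queue Q2)."""
    queue = sorted(key_proteins, key=lambda p: soce[p], reverse=True)  # Q1
    cores = []
    while queue:
        seed = queue[0]  # key protein with the largest SoCE_cc
        core = {seed} | {
            v for v in adj[seed]
            if weight[frozenset((seed, v))] >= eta
            and coexp[frozenset((seed, v))] >= m
        }
        # Drop the weakest non-seed node until the density threshold holds.
        while density(core, adj) < d:
            weakest = min((n for n in core if n != seed), key=lambda n: soce[n])
            core.remove(weakest)
        cores.append(core)
        queue = [p for p in queue if p not in core]
    return cores


# Toy network: triangle A-B-C plus a pendant node D attached to A.
adj = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
weight = {frozenset(p): 1.0 for p in pairs}
coexp = {frozenset(p): 3 for p in pairs}
soce = {"A": 3.0, "B": 2.0, "C": 2.0, "D": 0.5}
cores = build_cores(adj, weight, coexp, soce, ["A", "D"], eta=0.5, m=2, d=0.8)
```

In the toy run the first core starts as {A, B, C, D} with density 4/6, loses D (smallest SoCE_cc) and is stored as {A, B, C}; D then seeds a second core {A, D}, so cores may overlap.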

(III) Ant colony clustering based on fuzzy granularity and compactness:

A fuzzy granularity-based pick-up rule (FGP) is used to load data continuously and form an initial clustering result, which is then corrected using compactness. Specifically, an ant randomly selects a complex core to initialize a cluster and searches the nodes within its visible range; if the fuzzy granularity similarity of a node is greater than the initial granularity P, the ant picks up the node and moves to its position. When the ant has traversed all qualifying nodes in the neighborhood of the current complex core or has reached its maximum load, it randomly selects the next complex core and starts the next round of search. This process is repeated until the ants have traversed all complex cores, giving the initial clustering result. The initial clustering result is then corrected according to the compactness-based put-down rule (CDD), which discards nodes that are tightly connected to the outside of a cluster but loosely connected to its inside, thereby mining the protein complexes. The detailed process of ant colony clustering is as follows:

1) randomly selecting a complex core C from the result queue Q₂ as the initial position of the ant;

2) calculating the fuzzy granularity of each node u within the ant's neighborhood (its direct neighbors), picking up a neighbor node that satisfies the condition, moving to that neighbor node, and updating the complex core and the ant's neighborhood; if no neighbor node satisfies the condition, skipping step 3) and proceeding directly to step 4); the fuzzy granularity is calculated by the following formula 8:

3) judging whether the ant's load (bounded by the maximum size of the standard complexes) has reached its maximum; if not, repeating step 2) to continue clustering the nodes in the ant's new neighborhood; if so, proceeding to step 4);

4) obtaining the initial clustering result corresponding to the complex core C, deleting the complex core C from the result queue Q₂, and judging whether the result queue Q₂ is empty; if it is not empty, randomly selecting a complex core as the initial position of the ant and returning to step 2) to start a new round of search; if the result queue Q₂ is empty, proceeding to step 5);

5) calculating the compactness between each node u and the complex PC, removing the nodes whose compactness is less than 1 to obtain the complex PC, and outputting the complex set CS;

the compactness is calculated by the following formula 9:

(IV) global and local weight updating:

Local weight updating uses the functional information transfer mechanism and the optimal solution information within the population. Through information transfer among different ant colonies, the optimal solution information of the previous iteration is passed on via the weights, which increases the probability that similar data are assigned to the same cluster in the next iteration and reduces the probability that dissimilar data are assigned to the same cluster.

Local weight update is performed according to the following equation 10:

CWM(u, v) = (1 + PC_uv) · CWM(u, v)    (formula 10)
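As a minimal sketch of formula 10, the Python snippet below applies the multiplicative update to a dictionary of edge weights. The function name `local_weight_update`, the dictionary layout, and the sample values are assumptions made for illustration; the enhancement factors PC_uv are taken as given inputs because formula 11 is not reproduced in this text, and edges without a factor are assumed to keep their weight.

```python
def local_weight_update(cwm, pc):
    """Return a new weight map with CWM(u, v) <- (1 + PC_uv) * CWM(u, v).

    cwm : dict mapping an edge (u, v) to its comprehensive weight CWM(u, v)
    pc  : dict mapping an edge (u, v) to its enhancement factor PC_uv;
          edges missing from pc are treated as PC_uv = 0 (an assumption)
    """
    return {edge: (1.0 + pc.get(edge, 0.0)) * weight for edge, weight in cwm.items()}

# Hypothetical weights and enhancement factors, for illustration only.
weights = {("u", "v"): 0.50, ("v", "w"): 0.20}
factors = {("u", "v"): 0.4}
updated = local_weight_update(weights, factors)
```

Because PC_uv is non-negative, the update can only raise the weight of an edge whose endpoints shared a complex in the previous optimal solution, which is the positive feedback the text describes.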

PC_uv is calculated by the following formula 11:

Weight updating between the PPI networks at adjacent moments is realized by a global weight updating strategy based on temporal correlation and functional transitivity. This strategy passes the clustering result of the network at the previous moment to the network at the next moment through positive feedback on the CWM, which effectively increases the degree of interaction between two proteins belonging to the same cluster and accelerates convergence.

The global weight update formula is shown in equation 12 below:

wherein,

and

and β is a constant. Preferably, the preceding parameter and β are set to 0.1 and 0.2, respectively.

(V) outputting the result: all protein complexes mined by the above method are output.

FIG. 2 shows a flow chart of the FGCDACC-DPC method. According to FIG. 2, the method can be summarized as follows. First, a dynamic weighting model based on a static PPI network, combined with gene expression profile data and gene ontology information, is used to construct a more realistic and reliable dynamic weighted protein interaction network. Second, a group of dense and highly co-expressed composite cores is constructed, a pick-up and put-down model based on fuzzy granularity and compactness (FGCDM) is then used to mine protein complexes, and the solution quality is evaluated by the modularity M after clustering is finished. Finally, to improve clustering accuracy and speed up clustering, the interactions between proteins are updated using global and local weight updating strategies based on functional information transfer and temporal functional correlation, and all mined protein complexes are output.

To verify the effectiveness and performance advantages of the FGCDACC-DPC method, it was compared with the MCODE, RNSC, MCL, COACH, JSACO, ACC-FMD and ACC-DPC methods in terms of the precision and recall of the mined protein complexes, the clustering performance of functional module mining, and execution efficiency. Preferably, all of these methods are applied to a yeast protein interaction network for experimental validation.

1) Comparison of the precision, recall and F-measure of the protein functional modules mined by FGCDACC-DPC and by other methods:

To verify the effectiveness of the FGCDACC-DPC algorithm on the dynamic PPI network, its clustering performance was evaluated using precision, recall and F-measure. The FGCDACC-DPC method and the other 7 methods were each run independently 20 times, and the average of the experimental results was used for analysis and comparison. As shown in FIG. 3, the comparison of the three metrics shows that the FGCDACC-DPC algorithm has the highest F-measure, an improvement of 144.3%, 61.06%, 19.24%, 37.58%, 17.49%, 42.161% and 25.52% over the MCODE, MCL, COACH, RNSC, ACC-DPC, JSACO and ACC-FMD algorithms, respectively. There are two main reasons for this result: on the one hand, the dynamic weighted PPI network constructed by the FGCDACC-DPC algorithm is closer to a real PPI network, which reduces the influence of false positives and false negatives on clustering accuracy; on the other hand, the improved pick-up strategy and the weight updating strategy effectively raise the F-measure of the algorithm. The FGCDACC-DPC algorithm ranks second in precision, behind only the JSACO algorithm, which indicates that the dynamic network constructed by the FGCDACC-DPC algorithm contains fewer false positives. The FGCDACC-DPC algorithm also performs better on recall, with improvements of 252.2%, 38.025%, 7.08%, 14.01%, 27.17%, 95.758% and 40.157% over the MCODE, MCL, COACH, RNSC, ACC-DPC, JSACO and ACC-FMD algorithms, respectively. Although the dynamic network constructed by the FGCDACC-DPC algorithm lacks a certain number of proteins, which could reduce recall, the effectiveness of the weighting scheme means that the network contains fewer false negatives, so recall improves overall.
Taking the three metrics of precision, recall and F-measure together, the FGCDACC-DPC algorithm performs better than the other algorithms.
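The metrics in the comparison above can be related to each other with a short Python sketch. It uses the standard definition of F-measure as the harmonic mean of precision and recall; the text does not state how the percentage improvements were computed, so the relative-difference formula in `relative_improvement` is an assumption, and the numbers in the example are hypothetical rather than values from FIG. 3.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def relative_improvement(new, baseline):
    """Improvement of `new` over `baseline` in percent (assumed definition)."""
    return 100.0 * (new - baseline) / baseline

# Hypothetical scores, for illustration only.
f_new = f_measure(0.6, 0.6)
f_old = f_measure(0.5, 0.5)
gain = relative_improvement(f_new, f_old)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a method that ranks second in precision can still reach the highest F-measure when its recall is clearly higher.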

2) Comparison of clustering performance of FGCDACC-DPC with protein complexes mined by other methods:

To further evaluate the clustering performance of the FGCDACC-DPC algorithm, four aspects were analyzed: the number of complexes identified by each algorithm, the average cluster size, the number of covered proteins, and the running time.

As can be seen from Table 1 below, the average complex size and the number of covered proteins obtained by the FGCDACC-DPC algorithm are closer to those of the standard complexes than the values obtained by the other algorithms. The number of complexes it identifies, 637, is second only to that of the MCL algorithm; however, the MCL algorithm covers 4096 proteins, so its precision is lower than that of the FGCDACC-DPC algorithm.

To verify the time efficiency of the FGCDACC-DPC algorithm, it was compared experimentally with several ant colony clustering-based algorithms. Table 1 shows that the FGCDACC-DPC algorithm has better time performance. First, because the FGCDACC-DPC algorithm clusters small-scale dynamic weighted PPI networks, it avoids the slow convergence of the ant colony algorithm on a large-scale PPI network. Second, the improved pick-up and put-down rules and the weight updating effectively reduce the amount of computation and the number of visits that do not result in a pick-up, thereby shortening the clustering time. The FGCDACC-DPC algorithm is therefore more time-efficient than the ACC-DPC and ACC-FMD algorithms. Although its running time is slightly longer than that of the JSACO algorithm, it outperforms the JSACO algorithm on the other indicators.

TABLE 1 comparison of Performance of various algorithms for mining protein complexes

The protein complexes identified by the FGCDACC-DPC algorithm are very close to the standard complexes in average size, number of clusters and number of covered proteins, and the clustering time is also low, second only to that of the JSACO algorithm. Overall, the clustering performance of the FGCDACC-DPC algorithm is high and a good optimization effect is achieved.
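The three structural indicators discussed above (number of complexes, average size, and number of covered proteins) can be computed from a set of predicted complexes as in the following Python sketch. The function name `cluster_stats`, the returned dictionary keys, and the sample complexes are assumptions made for illustration and do not reproduce the values of Table 1.

```python
def cluster_stats(complexes):
    """Summarize a list of complexes, each given as a set of protein names."""
    count = len(complexes)
    covered = set().union(*complexes) if complexes else set()
    average_size = sum(len(c) for c in complexes) / count if count else 0.0
    return {
        "count": count,                    # number of identified complexes
        "average_size": average_size,      # mean number of proteins per complex
        "covered_proteins": len(covered),  # distinct proteins across all complexes
    }

# Hypothetical predicted complexes; p3 belongs to both, so it is counted once.
stats = cluster_stats([{"p1", "p2", "p3"}, {"p3", "p4"}])
```

Counting covered proteins as distinct names means that overlapping complexes, which are common in PPI networks, do not inflate the coverage figure.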

3) Comparison of the clustering results of protein complexes mined by FGCDACC-DPC and by other methods:

The clustering results of the FGCDACC-DPC algorithm were analyzed; Table 2 shows 6 protein complexes identified by the algorithm. The clustering effect of the FGCDACC-DPC algorithm is evaluated by analyzing the correctly and incorrectly clustered proteins in the predicted complexes.

As can be seen from Table 2, the predicted complexes 2, 3, 5 and 6 match the standard complexes perfectly, indicating that the protein complexes detected by the FGCDACC-DPC algorithm are closer to the true protein complexes and more biologically significant.

To analyze the clustering results more intuitively, the detection results for the DNA-directed RNA polymerase II complex were visualized. FIG. 4 shows the results of detecting the DNA-directed RNA polymerase II complex with different algorithms, where the grey nodes represent incorrectly clustered proteins. FIG. 4(a) is the standard complex. FIG. 4(b) shows the result of the FGCDACC-DPC algorithm, which correctly detects all proteins of the complex. FIG. 4(c) shows the result of the ACC-DPC algorithm: 11 proteins were correctly detected, and only protein YHR143W-A was missed, because within the cluster this node is linked only to YIL021W while it is more tightly linked to proteins outside the cluster. FIG. 4(d) shows the result of the ACC-FMD algorithm: 10 proteins were detected and two proteins not belonging to the complex were wrongly included; YPL203W wrongly replaced YHR143W-A because YPL203W is tightly linked to all proteins in the cluster. The clustering results of FIG. 4(c) and (d) show that, when the same algorithm is used, the complexes mined from the dynamic network are more accurate. FIG. 4(e) and (f) show the results of the MCL and MCODE algorithms, both of which correctly detected only 9 proteins; in the MCL result YPR110C wrongly replaced YPR187W, and the MCODE algorithm wrongly detected two proteins. The detection result of the FGCDACC-DPC algorithm based on the dynamic weighted PPI network is therefore closer to the standard complex, which further illustrates the effectiveness of the algorithm.

TABLE 2 analysis of the results of 6 complexes identified by the FGCDACC-DPC algorithm

In conclusion, the ant colony clustering-based method for mining protein complexes from a dynamic weighted PPI network significantly improves the accuracy of the mined protein complexes, as well as the precision of their match with standard protein complexes, the recall rate, the clustering effect and so on.

Compared with existing protein complex identification methods based on dynamic PPI networks, the technical solution of the present disclosure is significantly improved in prediction accuracy, recall rate, matching rate with known protein complexes and other aspects, and helps to provide biologists with valuable reference information for experiments predicting unknown protein functions and for further research.

It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims

1. A protein complex mining method based on a dynamic weighting protein interaction network is characterized by comprising the following steps:

constructing a dynamic weighted protein interaction network: inputting protein interaction data, gene expression profile data and gene ontology information, de-duplicating the protein interaction network data, filtering inactive proteins by using the gene expression profile data so as to construct a dynamic protein interaction network, weighting the dynamic protein interaction network by using a comprehensive weight measurement and adding new interactions, thereby constructing the dynamic weighted protein interaction network;

ant colony clustering: improving the pick-up rule of an ant colony algorithm by adopting a similarity function of fuzzy granularity, continuously loading protein nodes to form an initial clustering result, and correcting the initial clustering result by using a compactness-optimized put-down rule so as to mine protein complexes; under the pick-up rule of the ant colony algorithm, ants randomly select a composite core and initialize a cluster to search for nodes within the visible range; if the fuzzy granularity similarity is larger than the initial granularity P, the nodes are picked up and the ants move to the positions of the nodes; when the ants have traversed all nodes meeting the condition in the neighborhood of the current composite core or have reached the maximum load, the ants randomly select the next composite core to start the next round of search; the process is repeated until all the composite cores have been traversed by the ants, and the initial clustering result is obtained; the initial clustering result is corrected by using the compactness-optimized put-down rule, and nodes with tight external connections and loose internal connections are discarded, thereby mining the protein complexes;

global and local weight updating: the transmission of optimal solution information among different ant colonies is realized by utilizing a local weight updating strategy, and the transmission of function information among the dynamic weighted protein interaction networks at adjacent moments is realized by utilizing a global weight updating strategy; and

outputting a result: outputting the mined protein complexes.

2. The method of claim 1, wherein the step of constructing a dynamically weighted protein interaction network comprises:

non-co-expressed proteins were filtered according to the following formula 2:

wherein T'_u represents the mean gene expression value of protein u;

3. The method of claim 2,

the point-edge clustering coefficient CE_cc(u, v) is calculated by the following formula 4:

the Pearson correlation coefficient Pcc(u, v) is calculated using the following formula 6:

4. The method of claim 1, wherein the step of constructing the protein complex core comprises:

B1 calculating the sum SoCE_cc of the point-edge clustering coefficients of all associated edges of each key protein node, and putting the nodes into an ordered queue Q₁ in descending order;

B2 taking from the queue Q₁ the key protein node with the largest sum of point-edge clustering coefficients to initialize a composite core C, and adding to the composite core C the direct neighbor nodes that meet an interaction threshold η and are continuously co-expressed at least m times;

B3 judging whether the composite core C meets the density threshold d, and if not, recursively deleting the nodes with small SoCE_cc values until the composite core C meets the density threshold d;

B4 when the composite core C meets the density threshold d, storing the composite core C in a result queue Q₂ and deleting all nodes of the composite core C from the ordered queue Q₁;

B5 repeating steps B2, B3 and B4 until the ordered queue Q₁ is empty.

5. The method according to claim 4, wherein the sum SoCE_cc of the point-edge clustering coefficients of all associated edges of a key protein node is calculated by the following formula 7:

6. The method of claim 4, wherein the step of ant colony clustering comprises:

C2 calculating the fuzzy granularity of each node u within the ant's neighborhood, picking up a neighbor node that meets the condition, moving to that neighbor node, and updating the composite core and the ant's neighborhood; if no neighbor node meets the condition, skipping step C3 and going directly to step C4;

C4 obtaining the initial clustering result corresponding to the composite core C, deleting the composite core C from the result queue Q₂, and judging whether the result queue Q₂ is empty; if it is not empty, randomly selecting a composite core as the initial position of the ant and returning to step C2 to start a new round of search; if the result queue Q₂ is empty, going to step C5;

7. The method of claim 6,

the fuzzy granularity is calculated by the following formula 8:

wherein CWM(u, v) represents the comprehensive weight measurement of the proteins u and v, _A(u) represents the fuzzy granularity of a node u within the ant's neighborhood, |C| is the number of nodes in the composite core C, and α is a dissimilarity factor;

the compactness is calculated by the following formula 9:

8. The method according to claim 1 or 7,

local weight update is performed according to the following equation 10:

CWM(u, v) = (1 + PC_uv) · CWM(u, v)    (formula 10)

wherein CWM(u, v) represents the comprehensive weight measurement of the proteins u and v, and PC_uv, the enhancement factor, represents the probability that the proteins u and v share a complex in the optimal solution of the previous iteration.

9. The method of claim 8,

the enhancement factor PC_uv is calculated by the following formula 11: