CN102176223B - Protein complex identification method based on key protein and local adaptation - Google Patents

Protein complex identification method based on key protein and local adaptation

Info

Publication number
CN102176223B
Authority
CN
China
Prior art keywords
protein
node
bunch
key
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110006179
Other languages
Chinese (zh)
Other versions
CN102176223A (en)
Inventor
王建新
刘彬彬
李敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN 201110006179 priority Critical patent/CN102176223B/en
Publication of CN102176223A publication Critical patent/CN102176223A/en
Application granted granted Critical
Publication of CN102176223B publication Critical patent/CN102176223B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a protein complex identification method based on key proteins and local adaptation. Based on the importance of key proteins to the life activities of organisms and on the topological properties of the protein interaction network, the invention proposes a protein complex identification method (EPOF) that uses key proteins as seeds and local adaptation as the expansion criterion. The method can be applied to both unweighted and weighted protein interaction networks. Using only protein interaction information and key protein information, it identifies protein complexes more accurately and predicts a large number of protein complexes in a single run, which addresses the high cost and long duration of chemical experiment methods.

Description

Protein complex identification method based on key proteins and local adaptation
Technical field
The invention belongs to the field of systems biology and relates to a protein complex identification method based on key proteins and local adaptation.
Background art
In the post-genome era, systematically analyzing and fully understanding the topology of biological networks, intracellular biochemical processes, and the way proteins accomplish life activities through interaction has become a highly challenging research topic. Proteins are indispensable components of all cells and tissue structures; they execute physiological functions and are the direct agents of biological phenomena. No protein in the cell exists in isolation, and biological phenomena are usually influenced by multiple factors and necessarily involve multiple proteins. Almost all biological processes are carried out precisely through interactions between proteins. A protein in the cell therefore does not perform its assigned function independently. Instead, it interacts with other proteins to form larger complexes that carry out specific functions at specific times and places, and the functions of some proteins can only be exerted after a complex has formed. Identifying these protein complexes is thus of great significance for predicting protein interactions, explaining specific biological processes, and explaining protein function.
At present, the methods used to identify protein complexes are mainly chemical experimental assays and clustering methods based on protein interaction information.
Chemical experimental assays mainly include APMS (Affinity Purification techniques using Mass Spectrometry), TAP (Tandem Affinity Purification), iTAP (TAP and RNAi) and HMS-PCI (High-throughput Mass Spectrometric Protein Complex Identification). Chemical experiments can determine protein complexes accurately, particularly complexes that are relatively stable under a given environment. However, some complexes are unstable: the interactions between their proteins are transient and change dynamically, so experiment-based methods have difficulty capturing them, and the experiments are very expensive.
The common approach today is cluster analysis based on protein interaction information. The protein interaction information is represented as an undirected graph in which protein complexes correspond to dense subgraphs, and various clustering methods are then used to identify these dense subgraphs (also called "clusters", i.e. protein complexes). A number of methods for mining protein complexes have appeared so far, for example the RNSC, G-N, MoNet, MCODE, LCMA, DPClus, CPM and STM methods.
RNSC is a cost-based graph partitioning method. It first randomly partitions the protein interaction network into several independent clusters and defines a cost function. It then repeatedly moves a protein from one cluster to another to reduce the overall cost, until the number of moves that fail to lower the overall cost exceeds a preset threshold. The drawbacks of RNSC are that the quality of its result depends closely on the quality of the initial clusters, and that each protein belongs to exactly one cluster, which is inconsistent with the fact that a protein may participate in several complexes.

The G-N and MoNet methods are two typical hierarchical clustering methods. The G-N method, proposed by Girvan and Newman in 2002, is a typical divisive hierarchical clustering method that splits the network by repeatedly removing the edges with high betweenness. Because computing edge betweenness is very costly, Radicchi et al. proposed a self-contained G-N method in 2004. The MoNet method, proposed by Luo et al. in 2007 on the basis of the G-N method, is a typical agglomerative method; its authors give a definition of a protein complex in order to make the termination condition of the agglomeration process explicit. The advantage of hierarchical clustering methods is that they can mine protein complexes of arbitrary shape and can present the hierarchical organization of the whole protein network as a tree. However, hierarchical clustering is very sensitive to noise, and the protein interaction information available today inevitably contains noise. In addition, like graph partitioning methods, hierarchical clustering methods have difficulty mining overlapping protein complexes, that is, allowing one protein node to belong to several protein complexes.

MCODE is a density-based local search method. It first computes a weight for every vertex from the density of that vertex's neighborhood, takes the vertex with the largest weight as the seed, and expands outward from the seed vertex; a vertex is added only if its weight exceeds a given threshold. Because vertices with large weights are not necessarily densely connected to each other, MCODE cannot guarantee that the clusters it obtains are dense, and sparse subgraphs require further processing. LCMA is a method based on local clique merging. It first expands each protein vertex into a clique and then merges cliques according to how closely they are related. DPClus is similar to MCODE and is also a density-based local search method. It first computes candidate protein complex seeds and then expands outward from the seed vertex; each vertex added must satisfy requirements on density and on the cluster property. CPM is a clique percolation method that maps protein complexes to sets of interconnected k-cliques in the graph. Because CPM must enumerate all maximal cliques in the network, its efficiency becomes a bottleneck that is hard to overcome on large networks. STM is a flow-based method. It first computes the shortest path between every pair of vertices in the network and, on that basis, the signal transduction relation between every pair of vertices. It then selects a cluster representative for each vertex, performs a preliminary clustering, and finally merges the preliminary clusters. The advantage of density-based local search methods is that a protein may be visited repeatedly during the expansion search, so that the same protein can belong to several different complexes.
Among the methods above, clustering methods based on local search and optimization are better suited to identifying small clusters, and most protein complexes are small. The difficulty with such methods lies in choosing the seeds and formulating the expansion condition; the expansion condition in particular has a strong influence on the quality of the protein complexes produced. Identifying protein complexes with conventional methods is therefore quite difficult.
Summary of the invention
The technical problem to be solved by the invention is to provide a protein complex identification method based on key proteins and local adaptation. The method needs only protein interaction information and key protein information to identify protein complexes more accurately, and it can predict a large number of protein complexes in a single run, which addresses the high cost and long duration of chemical experiment methods.
The technical solution of the invention is as follows:
A protein complex identification method based on key proteins and local adaptation comprises the following steps:
Step 1: build the undirected protein interaction graph. Input a set of protein interaction information, filter out duplicate interactions and self-interactions, and build the undirected protein interaction graph G. Here, protein interaction information refers to the set of protein-protein associations and their reliability scores; the two members of each protein-protein association are two proteins that interact directly.
Step 2: search for key protein nodes and sort them. Using a set of input key protein information, search for the key protein nodes in the undirected protein interaction graph G, sort them by their degree in G from largest to smallest, and put them into a candidate seed queue S_q. The degree of a key protein node is the number of edges incident to it in the undirected graph G.
Step 3: identify protein complexes from the candidate seed queue S_q.
If the candidate seed queue S_q is not empty, take its first node A and initialize it as a cluster H (as the initial cluster, H currently contains only the key protein node A), then expand cluster H. Each time a cluster has been fully expanded, delete all key protein nodes in that cluster from the candidate seed queue S_q and store the cluster in the result queue C.
When the candidate seed queue S_q is empty, output the result, that is, output all identified clusters in the result queue C. These clusters are the protein complexes to be identified, and the identification process ends.
According to earlier biomedical research, key proteins are the proteins necessary for sustaining life activity in an organism; removing them has a lethal effect on the organism's survival and growth. Key protein information is a set of key protein nodes obtained from biomedical experiments; it comes from currently public biomedical databases.
In step 3, node A is the key protein node with the largest degree in the candidate seed queue S_q.
The detailed cluster expansion process is as follows:
The node fitness values of all neighbor nodes of cluster H are computed with the node fitness function

$$ f_H^A = f_{H+(A)} - f_{H-(A)} $$

where $f_{H+(A)}$ is the value of the adaptation function of subgraph H when H contains node A, and $f_{H-(A)}$ is its value when H does not contain node A. The adaptation function of subgraph H is

$$ f_H = \frac{d_w^{in}(H,v)}{d_w^{in}(H,v) + d_w^{out}(H,v)} $$

where $d_w^{in}(H,v)$ denotes the weighted in-degree and $d_w^{out}(H,v)$ denotes the weighted out-degree. For a weighted network graph $G(N_v, E)$ with $N_v$ nodes and E edges, and a node v in subgraph H, the weighted in-degree $d_w^{in}(H,v)$ is the sum of the weights w(u, v) of all edges that are incident to v and belong to subgraph H, namely

$$ d_w^{in}(H,v) = \sum_{u \in H} w(u,v) $$

and the weighted out-degree $d_w^{out}(H,v)$ is the sum of the weights w(u, v) of all edges that are incident to v and do not belong to subgraph H, namely

$$ d_w^{out}(H,v) = \sum_{u \notin H} w(u,v) $$

where both sums run over the edges (u, v) of G. For a weighted network, the edge weight w(u, v) is a numerical measure of the reliability of the protein interaction, based on various biochemical experiments; for an unweighted network, all edge weights are 1.
A neighbor node of cluster H is a node that interacts directly with a node belonging to cluster H but does not itself belong to cluster H. If the node fitness values of all neighbor nodes of cluster H are negative, cluster H is a final identified cluster: it is stored in the result queue C, and all key protein nodes in cluster H are deleted from the candidate seed queue S_q.
If the node fitness values of the neighbor nodes of cluster H are not all negative, the following expansion process is carried out:
Step A: select the neighbor node with the highest node fitness value as the current expansion node and add it to cluster H, forming a larger cluster H'.
Step B: for cluster H', recompute the fitness values of all of its internal nodes.
Step C: if a node with a negative node fitness value appears inside cluster H', delete that node from cluster H' and return to step B. If no node with a negative node fitness value appears inside cluster H', return to the very beginning of the cluster expansion step, that is, compute the node fitness values of all neighbor nodes of the cluster with the node fitness function

$$ f_H^A = f_{H+(A)} - f_{H-(A)} $$
For a weighted protein interaction network, the association reliability score is a numerical measure of the reliability of the protein interaction, based on various biochemical experiments; for an unweighted protein interaction network, all association reliability scores are 1.
Beneficial effects:
Earlier biomedical research points out that key proteins are the proteins that sustain life in an organism; removing them has a lethal effect on the organism's survival or growth. Related research has also found that essentiality is a characteristic property of protein complexes, and that key proteins are concentrated in protein complexes to a large extent. Based on the importance of key proteins to the life activities of organisms and on the relation between key proteins and protein complexes, and taking into account the topological properties of the protein interaction network, the invention proposes a protein complex identification method (EPOF) based on key proteins and local adaptation, with key proteins as seeds. The method can be used on unweighted protein interaction networks as well as weighted ones. It needs only protein interaction information and key protein information to identify protein complexes more accurately, and it can predict a large number of protein complexes in a single run, which addresses the high cost and long duration of chemical experiment methods.
Taking into account the topological properties of the protein interaction network, the invention uses key proteins as seeds, expands clusters repeatedly with node fitness as the criterion, and finally identifies overlapping protein complexes. The method identifies protein complexes effectively and provides valuable reference information for biologists carrying out protein complex identification experiments and further research.
Experiments show that EPOF is clearly superior to the other methods in the biological function enrichment of the predicted protein complexes, in precision, and in comprehensive evaluation (particularly in its performance on weighted protein networks). The specific experimental parameters and comparison curves are given in the embodiment.
Description of drawings
Fig. 1: flow chart of the EPOF method of the invention;
Fig. 2: example of an undirected protein interaction graph;
Fig. 3: example of EPOF identifying protein complexes;
Fig. 4: comparison of how well the protein complexes identified by EPOF and by other methods match known protein complexes under different matching thresholds: a) results on the unweighted protein interaction network, b) results on the weighted protein interaction network;
Fig. 5: verification of the importance of key proteins in identifying protein complexes: a) comparison of matching with known protein complexes; b) comparison of sensitivity, specificity and comprehensive evaluation; c) comparison of function enrichment; d) comparison of precision.
Embodiment
The invention is described in further detail below with reference to the drawings and a specific embodiment:
Embodiment 1:
I. Protein complex identification model (EPOF) based on key proteins and local adaptation
The invention defines a protein complex as the cluster that corresponds to the maximum local adaptation value and is obtained by taking a key protein as the seed node and expanding repeatedly with node fitness as the criterion.
To describe the protein complex identification model based on key proteins and local adaptation clearly, the inventors give the relevant definitions of the model as follows:
The inventors propose a local adaptation function f_H that is consistent with the weak module model. It is expressed as follows:
$$ f_H = \frac{d_w^{in}(H,v)}{d_w^{in}(H,v) + d_w^{out}(H,v)} \qquad (1) $$

where $d_w^{in}(H,v)$ denotes the weighted in-degree and $d_w^{out}(H,v)$ denotes the weighted out-degree. For a node v in subgraph H of the weighted network graph $G(N_v, E)$, the weighted in-degree $d_w^{in}(H,v)$ is the sum of the weights of all edges that are incident to v and belong to subgraph H; the weighted out-degree $d_w^{out}(H,v)$ is the sum of the weights of all edges that are incident to v and do not belong to subgraph H. For an unweighted network, all edge weights are 1.
Based on the local adaptation function f_H above, node fitness is defined as follows: given a node A and a subgraph H, the fitness $f_H^A$ of node A with respect to subgraph H is the difference between the values of the adaptation function f_H when subgraph H contains node A and when it does not, that is:

$$ f_H^A = f_{H+(A)} - f_{H-(A)} \qquad (2) $$

where $f_{H+(A)}$ is the value of the adaptation function f_H when subgraph H contains node A, and $f_{H-(A)}$ is the value of the adaptation function f_H when subgraph H does not contain node A.
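Formulas (1) and (2) can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the source: the graph is stored as a hypothetical adjacency map `node -> {neighbor: weight}`, the function names are illustrative, and f_H is computed by summing the weighted in-degree and out-degree over all nodes v of H, since formula (1) is written for a single node v and the aggregation over the subgraph is not spelled out.

```python
from typing import Dict, Hashable, Set

# Hypothetical representation: node -> {neighbor: edge weight}; all weights are 1 when unweighted.
Graph = Dict[Hashable, Dict[Hashable, float]]


def local_fitness(graph: Graph, cluster: Set[Hashable]) -> float:
    """f_H of formula (1), aggregated over every node v in the cluster (assumption)."""
    d_in = 0.0   # weight of edge endpoints that stay inside the cluster
    d_out = 0.0  # weight of edge endpoints that leave the cluster
    for v in cluster:
        for u, w in graph[v].items():
            if u in cluster:
                d_in += w
            else:
                d_out += w
    total = d_in + d_out
    return d_in / total if total > 0 else 0.0


def node_fitness(graph: Graph, cluster: Set[Hashable], a: Hashable) -> float:
    """f_H^A of formula (2): f_H with node a included minus f_H with node a excluded."""
    return local_fitness(graph, cluster | {a}) - local_fitness(graph, cluster - {a})


# Toy unweighted network: triangle A-B-C with a pendant node D attached to C.
toy = {
    "A": {"B": 1.0, "C": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"A": 1.0, "B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
print(local_fitness(toy, {"A", "B"}))      # 2 internal and 2 external endpoints
print(node_fitness(toy, {"A", "B"}, "C"))  # positive: adding C closes the triangle
```

A positive node fitness means the node raises f_H when it is part of the cluster; a negative value means the cluster scores higher without it.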
The goal of the protein complex identification model based on key proteins and local adaptation is to identify clusters that can represent protein complexes. The model takes a key protein as the seed node, initializes it as a cluster, and repeatedly expands the cluster with node fitness as the criterion. For example, when a neighbor node can raise the value of the local adaptation function f_H, it is added to the cluster; when a node inside the cluster no longer raises the value of f_H but lowers it, that node is deleted from the cluster. Iterating in this way finally determines the clusters that can represent protein complexes.
The overall flow of the protein complex identification method EPOF based on key proteins and local adaptation is shown in Fig. 1. First, a set of protein interaction information (the set of protein-protein associations and their reliability scores) and a set of key protein information (the set of key protein nodes) are input. The EPOF method can be divided into 5 subprocesses: building the undirected protein interaction graph, searching for key protein nodes and sorting them, selecting the seed node, expanding the cluster, and outputting the result.
Subprocess 1: build the undirected protein interaction graph. Input a set of protein interaction information, filter out duplicate interactions and self-interactions, and build the undirected protein interaction graph G (as shown in Fig. 2). Here, protein interaction information refers to the set of protein-protein associations and their reliability scores. The two members of each protein-protein association are two proteins that interact directly. For a weighted protein interaction network, the association reliability score is a numerical measure of the reliability of the protein interaction, based on various biochemical experiments; for an unweighted protein interaction network, all association reliability scores are 1. When the undirected graph G contains n nodes and m edges, the time complexity of this process is O(m).
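Subprocess 1 can be sketched in Python as below. The input format (an iterable of `(protein_u, protein_v, score)` triples), the function name, and the rule of keeping the first score seen for a duplicated pair are assumptions made for illustration; the source only states that duplicate interactions and self-interactions are filtered out.

```python
from typing import Dict, Iterable, Tuple


def build_graph(interactions: Iterable[Tuple[str, str, float]]) -> Dict[str, Dict[str, float]]:
    """Build an undirected weighted graph as an adjacency map: node -> {neighbor: score}."""
    graph: Dict[str, Dict[str, float]] = {}
    for u, v, score in interactions:
        if u == v:
            continue  # filter self-interactions
        graph.setdefault(u, {})
        graph.setdefault(v, {})
        if v in graph[u]:
            continue  # filter duplicates (A-B and B-A count as one); first score kept (assumption)
        graph[u][v] = score
        graph[v][u] = score
    return graph


# For an unweighted network every reliability score is simply 1.0.
edges = [("A", "B", 1.0), ("B", "A", 1.0), ("A", "A", 1.0), ("B", "C", 1.0)]
print(build_graph(edges))
```

Each input record is handled once with expected constant-time dictionary operations, which is in line with the O(m) cost stated above.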
Subprocess 2: search for key protein nodes and sort them. In the invention, the seed nodes are the key protein nodes in the protein interaction information. First, the key protein nodes in graph G are found using a set of input key protein information (the currently known set of key protein nodes). Then the key protein nodes are sorted by their degree in graph G (the number of edges incident to the node) from largest to smallest and put into a candidate seed queue S_q. The first step, searching for the key protein nodes in graph G, has time complexity O(n · n_E), where n is the number of nodes in graph G and n_E is the number of key protein nodes in the key protein information. The second step, sorting the key protein nodes by degree from largest to smallest, has time complexity O(n_e log n_e), where n_e is the number of key protein nodes in graph G.
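A minimal Python sketch of subprocess 2 follows. The adjacency-map graph format, the function name, and breaking degree ties by protein name (so that the order is deterministic) are assumptions; the source specifies only the ordering by descending degree.

```python
from collections import deque
from typing import Deque, Dict, Iterable


def candidate_seed_queue(graph: Dict[str, Dict[str, float]],
                         key_proteins: Iterable[str]) -> Deque[str]:
    """Return the key proteins present in the graph, ordered by descending degree."""
    found = [p for p in set(key_proteins) if p in graph]  # search step
    found.sort(key=lambda p: (-len(graph[p]), p))         # sort step; ties by name (assumption)
    return deque(found)


toy = {
    "A": {"B": 1.0, "C": 1.0, "D": 1.0},
    "B": {"A": 1.0},
    "C": {"A": 1.0, "D": 1.0},
    "D": {"A": 1.0, "C": 1.0},
}
# "Z" is listed as a key protein but does not occur in the network, so it is skipped.
print(candidate_seed_queue(toy, ["C", "A", "Z"]))
```

With a hash-based adjacency map the search step takes expected O(n_E) time, which is no worse than the O(n · n_E) bound given in the text for a plain scan.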
Subprocess 3: select the seed node. If the candidate seed queue S_q is not empty, select its first node A (the key protein node with the largest degree in S_q) and initialize it as a cluster H (as the initial cluster, H currently contains only the key protein node A), then expand cluster H. Each time a cluster has been fully expanded, delete all key protein nodes in that cluster from the candidate seed queue S_q and store the cluster in the result queue C. When the candidate seed queue S_q is empty, the identification process ends.
Subprocess 4: expand the cluster. Compute the node fitness values of all neighbor nodes of cluster H with the node fitness function (formula (2)); a neighbor node of cluster H is a node that interacts directly with a node belonging to cluster H but does not itself belong to cluster H. If the node fitness values of all neighbor nodes of cluster H are negative, cluster H is a final identified cluster and is stored in the result queue C. Otherwise the following expansion process is carried out: (1) select the neighbor node with the highest node fitness value as the current expansion node and add it to cluster H, forming a larger cluster H'; (2) for cluster H', recompute the fitness values of all of its internal nodes; (3) if a node with a negative node fitness value appears inside cluster H', delete that node from cluster H', producing a new cluster H''; (4) if the situation described in step (3) occurred, the expansion process jumps to step (2), that is, the fitness values of all nodes in the new cluster are recomputed; otherwise the expansion process is carried out again starting from this cluster, until the node fitness values of all neighbor nodes of the current cluster are negative, and the current cluster is then stored in the result queue C.
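Subprocesses 3 to 5 can be sketched together in Python. This is an illustration under stated assumptions, not the patented implementation: the graph is a hypothetical adjacency map, f_H is aggregated over all nodes of the cluster, a small tolerance `EPS` decides what counts as negative so that floating-point noise cannot cause spurious moves, ties are broken by node name, and when several internal nodes are negative the lowest-scoring one is removed first.

```python
from collections import deque

EPS = 1e-12  # tolerance for "negative" fitness (assumption, guards against rounding noise)


def local_fitness(graph, cluster):
    """f_H: internal weight over total weight, summed over all cluster nodes (assumption)."""
    d_in = sum(w for v in cluster for u, w in graph[v].items() if u in cluster)
    d_all = sum(w for v in cluster for w in graph[v].values())
    return d_in / d_all if d_all > 0 else 0.0


def node_fitness(graph, cluster, a):
    """f_H^A of formula (2)."""
    return local_fitness(graph, cluster | {a}) - local_fitness(graph, cluster - {a})


def expand_cluster(graph, seed):
    """Grow a cluster from one seed following steps (1) to (4) of subprocess 4."""
    cluster = {seed}
    while True:
        neighbors = {u for v in cluster for u in graph[v]} - cluster
        if not neighbors:
            return cluster
        # Step (1): pick the neighbor with the highest node fitness.
        best = max(neighbors, key=lambda u: (node_fitness(graph, cluster, u), str(u)))
        if node_fitness(graph, cluster, best) < -EPS:
            return cluster  # every neighbor is negative: the cluster is final
        cluster.add(best)
        # Steps (2) to (4): recompute internal fitness and drop negative nodes one at a time.
        while True:
            worst = min(cluster, key=lambda v: (node_fitness(graph, cluster, v), str(v)))
            if node_fitness(graph, cluster, worst) >= -EPS:
                break
            cluster.remove(worst)


def epof(graph, key_proteins):
    """Seed selection loop: returns the result queue C as a list of node sets."""
    seeds = sorted((p for p in set(key_proteins) if p in graph),
                   key=lambda p: (-len(graph[p]), str(p)))
    queue, results = deque(seeds), []
    while queue:
        cluster = expand_cluster(graph, queue.popleft())
        results.append(cluster)
        # Key proteins absorbed by this cluster are no longer used as seeds.
        queue = deque(p for p in queue if p not in cluster)
    return results


# Two triangles joined by the single edge C-D, unweighted.
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("D", "F")]
toy = {}
for u, v in pairs:
    toy.setdefault(u, {})[v] = 1.0
    toy.setdefault(v, {})[u] = 1.0
print(epof(toy, ["A", "E"]))  # one cluster per triangle
```

Non-key proteins are never removed from the network after a cluster is stored, so a protein can be absorbed by clusters grown from different seeds, which is how overlapping complexes can arise in this sketch.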
Cluster expansion is the critical process in identifying protein complexes in the protein interaction network. Because of the backtracking involved, identifying a protein complex of size s starting from a key protein seed node has time complexity O(s^2). Therefore, when the number of protein complexes identified in the protein interaction network is n_e, the time complexity of the cluster expansion subprocess is approximately O(n_e · s^2). When the number of nodes n in the protein interaction network and the size of the node set of the identified protein complexes are of the same order of magnitude, this time complexity is approximately O(n^2).
Subprocess 5: output the result. Output all identified clusters in the result queue C, i.e. the protein complexes.
A simplified example following the EPOF flow is given here (see Fig. 3). Fig. 3 is an undirected protein interaction graph in which every interaction has weight 1. Suppose the key protein nodes, sorted by their degree in the graph from largest to smallest, are S_q = {A, G, J}. First, node A is selected as the seed node and initialized as the cluster {A}. Then the node fitness values of its neighbor nodes {B, C, D, E, F, G} are computed according to the node fitness definition (formula (2)). The neighbor node B, which has the highest node fitness value, is selected and added to the current cluster, forming the cluster {A, B}. Because adding a node has changed the topology of the cluster {A, B}, the node fitness values of all of its internal nodes are recomputed. Since the node fitness values of the internal nodes of {A, B} are positive, expansion continues with {A, B} as the current cluster. When the cluster has been expanded to {A, B, C, D, E, F, G, H, I}, the node fitness values of the neighbor nodes {J, K, L, M, N, O} of the current cluster are all negative. The current cluster is therefore one protein complex identified by EPOF. All key protein nodes in this cluster are deleted from S_q (S_q = {J}) and the cluster is stored in the result queue. After that, node J is selected as the seed node and the expansion process above is repeated.
II. Validation of the protein complex identification method (EPOF) based on key proteins and local adaptation
To verify the effectiveness of the EPOF method, the inventors applied it to the yeast protein interaction network in the DIP database. EPOF was compared with 10 methods, namely EAGLE, NFC, HC-PIN, MCODE, DPClus, IPCA, CPM, MCL, CMC and Core-Attachment, in terms of the function enrichment of the predicted protein complexes, their precision, their matching with known protein complexes, and their sensitivity, specificity and comprehensive evaluation. In addition, this embodiment also analyzes the importance of using key proteins as seed nodes when identifying protein complexes.
1. Function enrichment comparison of protein complexes predicted by EPOF and other methods
Function enrichment is the most important measure for assessing the strength of the biological function of a protein complex. It uses a P-value computed from the hypergeometric distribution to assess whether an identified protein complex is a random set of protein nodes. The smaller the P-value of a protein complex, the smaller the probability that it is a random set of proteins, and the more significant the biological function of the identified complex. The P-value is computed as:
$$ P\text{-value} = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}} \qquad (3) $$

where N is the total number of nodes in the protein interaction network, C is the size of the identified protein complex, F is the size of the protein group with a known function, and k is the number of proteins shared by the identified protein complex and the protein group with the known function.
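Formula (3) can be evaluated with a short Python sketch. The function and parameter names are illustrative assumptions. Exact rational arithmetic is used for the subtraction because very small P-values would otherwise be lost to floating-point cancellation when computing 1 minus a number close to 1.

```python
from fractions import Fraction
from math import comb


def enrichment_p_value(n_total: int, f_size: int, c_size: int, k: int) -> float:
    """Formula (3): probability of at least k shared proteins under the hypergeometric model.

    n_total: N, number of proteins in the network
    f_size:  F, size of the protein group with the known function
    c_size:  C, size of the identified complex
    k:       number of proteins shared by the complex and the functional group
    """
    lower_tail = sum(comb(f_size, i) * comb(n_total - f_size, c_size - i) for i in range(k))
    return float(1 - Fraction(lower_tail, comb(n_total, c_size)))


# Small check by hand: N=10, F=4, C=3, k=2 gives 1 - (20 + 60) / 120 = 1/3.
print(enrichment_p_value(10, 4, 3, 2))
```

`math.comb` requires Python 3.8 or later and returns 0 when the lower argument exceeds the upper one, so terms that are combinatorially impossible contribute nothing to the sum.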
Table 1: Comparison of the biological function enrichment of the protein complexes of size no less than 3 predicted by each method
[Table 1 is provided only as images in the original publication.]
As can be seen from Table 1, both the number and the proportion of protein complexes identified by EPOF whose P-value falls in the < E-15 interval are clearly higher than for every other method; in terms of number in particular, EPOF yields at least 3 times as many as the other methods. The number and proportion of EPOF complexes whose P-value falls in the [E-15, E-10] interval are also higher than for most other methods. Conversely, the proportion of EPOF complexes whose P-value falls in the > 0.01 interval (protein complexes without biological significance) is far lower than for every other method. In terms of biological function enrichment, EPOF performs better on the weighted protein network than on the unweighted one. The analysis of Table 1 shows that EPOF is clearly superior to the other methods in the biological function enrichment of the predicted protein complexes.
2. Precision comparison of protein complexes predicted by EPOF and other methods
Recall and precision are the basic tools used in information retrieval theory to assess the correctness of a retrieval method. In protein complex identification on protein interaction networks, they are used to evaluate the correctness of the identified protein complexes. Recall and precision are computed as follows:
recall = | C ∩ F i | | F i | - - - ( 7 )
precision = | C ∩ F i | | C | - - - ( 8 )
Wherein, C represents certain protein complex of identifying, F iA histone matter that has a certain function in the expression protein network.Only consider the minimum P-value functional information corresponding of this protein complex under one situation.The correctness of the protein complex of an identification of assessment should be taken all factors into consideration recall ratio and precision ratio.The f-measure module that comes to this, its computing formula is as follows:
$$f\text{-measure} = \frac{2 \times \mathrm{recall} \times \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}} \tag{9}$$
The methods produce different numbers of protein complexes, and the distributions of recall R, precision P and f-measure over those complexes also differ. To take these factors into account together, the inventors define the accuracy (Accuracy) with which a method identifies protein complexes from the protein network as the mean f-measure of all significant protein complexes identified by that method. The accuracy of a method is validated under three kinds of GO annotation: Biological Process (B.P.), Molecular Function (M.F.) and Cellular Component (C.C.).
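As an illustration of formulas (7) to (9) and of the accuracy definition above, the following minimal Python sketch may help. The function names `f_measure` and `accuracy`, the toy protein identifiers, and the convention of returning zeros when a complex shares no protein with the function group are assumptions made for this example and are not taken from the patent.

```python
from statistics import mean


def f_measure(complex_proteins, function_proteins):
    """Return (recall, precision, f-measure) following formulas (7)-(9)."""
    c, f = set(complex_proteins), set(function_proteins)
    overlap = len(c & f)
    if overlap == 0:
        # Assumed convention: no shared protein gives zero for all three values.
        return 0.0, 0.0, 0.0
    recall = overlap / len(f)       # formula (7)
    precision = overlap / len(c)    # formula (8)
    return recall, precision, 2 * recall * precision / (recall + precision)


def accuracy(pairs):
    """Mean f-measure over (complex, function group) pairs.

    Each pair is assumed to hold a significant complex and the function
    group that gave it its smallest P-value.
    """
    return mean(f_measure(c, f)[2] for c, f in pairs)


# Hypothetical toy data, for illustration only.
pairs = [
    ({"P1", "P2", "P3", "P4"}, {"P1", "P2", "P3", "P5", "P6"}),
    ({"P7", "P8"}, {"P7", "P8"}),
]
print(f_measure(*pairs[0]))
print(accuracy(pairs))
```

In practice the pairs would be built once per GO annotation category (B.P., M.F., C.C.), giving one accuracy value per category.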
Table 2. Comparison of the accuracy of protein complexes predicted by EPOF and the other methods
Figure BDA0000043600480000111
As Table 2 shows, validation under the three kinds of GO annotation indicates that the protein complexes predicted by EPOF are clearly superior in accuracy to those of all other methods. In terms of the biological accuracy of the complexes, EPOF performs better on the weighted protein network than on the unweighted one.
3. Comparison of how well the complexes predicted by EPOF and the other methods match known protein complexes
To evaluate directly how effectively EPOF identifies protein complexes, the inventors matched the complexes identified by EPOF and by the other methods against the known protein complexes in the MIPS database. After complexes containing only one protein are removed, the known-complex data set from the MIPS database contains 216 complexes; the smallest contains 2 proteins, the largest contains 81 proteins, and a complex contains 6.31 proteins on average. The matching degree OS(Pc, Kc) between an identified complex (Pc) and a known complex (Kc) is computed as follows:
$$OS(Pc, Kc) = \frac{i^2}{|V_{Pc}| \times |V_{Kc}|} \tag{10}$$
where $|V_{Pc}|$ and $|V_{Kc}|$ denote the sizes of the identified complex and the known complex, respectively, and $i$ denotes the size of their intersection.
If the matching degree OS(Pc, Kc) of two complexes exceeds a given threshold, the two complexes are said to match. A known complex in the standard complex data set is said to be identified if one or more of the complexes identified by a method match it with a degree OS(Pc, Kc) above the given threshold; if OS(Pc, Kc) = 1, the known complex is said to be completely identified.
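The matching rule around formula (10) can be sketched in a few lines of Python. The names `overlap_score` and `is_identified` are hypothetical, and the use of `>=` with a default threshold of 0.2 is an assumption borrowed from the OS(Pc, Kc) ≥ 0.2 criterion used later for TP; the text here only speaks of exceeding a given threshold.

```python
def overlap_score(pc, kc):
    """Matching degree OS(Pc, Kc) = i^2 / (|V_Pc| * |V_Kc|), formula (10)."""
    pc, kc = set(pc), set(kc)
    if not pc or not kc:
        return 0.0  # assumed convention for empty inputs
    i = len(pc & kc)
    return i * i / (len(pc) * len(kc))


def is_identified(kc, predicted, threshold=0.2):
    """True if at least one predicted complex matches the known complex kc."""
    return any(overlap_score(pc, kc) >= threshold for pc in predicted)


# Hypothetical toy complexes, for illustration only.
known = {"B", "C", "D", "E"}
predicted = [{"A", "B", "C"}, {"X", "Y"}]
print(overlap_score(predicted[0], known))  # 2*2 / (3*4)
print(is_identified(known, predicted))
```

A score of 1 is reached only when the two protein sets are identical, which corresponds to the "completely identified" case.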
In Fig. 4, panels (a) and (b) show, for the unweighted and the weighted protein interaction network respectively, how well the protein complexes identified by EPOF and by the other methods match the known protein complexes in the MIPS database under different matching thresholds. As Fig. 4 shows, at the typical matching threshold OS = 0.2, EPOF has a very clear advantage in matching known protein complexes. In other words, compared with the other control methods, the complexes identified by EPOF match the known complexes in the MIPS database at a higher rate.
Sensitivity (Sn) and specificity (Sp) are two important indicators for evaluating protein complex identification methods. Sensitivity is the proportion of the known protein complexes that a method identifies; specificity is the proportion of the complexes identified by a method that are correct. They are computed as follows:
$$Sn = \frac{TP}{TP + FN} \tag{11}$$
$$Sp = \frac{TP}{TP + FP} \tag{12}$$
where TP (True Positive) is the number of complexes identified by the method whose matching degree with a known protein complex satisfies OS(Pc, Kc) ≥ 0.2, FP (False Positive) equals the total number of identified complexes minus TP, and FN (False Negative) is the number of known protein complexes that are not identified. The related literature combines sensitivity and specificity into a comprehensive evaluation index F, computed as follows:
$$F = \frac{2 \times Sp \times Sn}{Sp + Sn} \tag{13}$$
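The definitions of TP, FP and FN and formulas (11) to (13) can be put together as in the following Python sketch. The function name `evaluate`, the toy complexes, and the choice to return 0 when a denominator is zero are assumptions for illustration; the matching degree is recomputed inline from formula (10) so that the block stands alone.

```python
def overlap_score(pc, kc):
    """Matching degree OS(Pc, Kc) from formula (10)."""
    i = len(set(pc) & set(kc))
    return i * i / (len(pc) * len(kc)) if pc and kc else 0.0


def evaluate(predicted, known, threshold=0.2):
    """Return (Sn, Sp, F) following formulas (11)-(13)."""
    # TP: predicted complexes that match at least one known complex.
    tp = sum(1 for pc in predicted
             if any(overlap_score(pc, kc) >= threshold for kc in known))
    fp = len(predicted) - tp
    # FN: known complexes matched by no predicted complex.
    fn = sum(1 for kc in known
             if not any(overlap_score(pc, kc) >= threshold for pc in predicted))
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * sp * sn / (sp + sn) if sp + sn else 0.0
    return sn, sp, f


# Hypothetical toy data, for illustration only.
predicted = [{"A", "B", "C"}, {"X", "Y"}, {"D", "E"}]
known = [{"A", "B", "C", "D"}, {"D", "E", "F"}, {"M", "N"}]
print(evaluate(predicted, known))
```

In this toy case two of the three predicted complexes match a known complex and one of the three known complexes is missed, so Sn, Sp and F all equal 2/3.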
Table 3. Comparison of EPOF and the other methods in sensitivity, specificity and comprehensive evaluation of predicted protein complexes
Figure BDA0000043600480000132
Figure BDA0000043600480000141
As Table 3 shows, although EPOF is inferior to some methods in sensitivity or specificity taken alone, its performance on the comprehensive evaluation index is clearly better than that of the other methods. In particular, EPOF performs better on the weighted protein network than on the unweighted one.
4. Verification of the importance of key proteins as seed nodes in protein complex identification
To verify the importance of key proteins in protein complex identification, the inventors compared the results obtained with different seed-node selection strategies. Earlier studies found that high-degree protein nodes in the network ensure the integrity of network function, the so-called "centrality-lethality" rule. High-degree protein nodes are therefore often used as seed nodes for protein complex identification. Here, the inventors use degree centrality alone as the seed-node selection strategy, denoted EPOF(DC); it differs from EPOF only in the seed-node selection strategy. Fig. 5 compares the protein complexes identified by EPOF and EPOF(DC) in terms of matching with known protein complexes in the MIPS database, sensitivity, specificity, comprehensive evaluation, functional enrichment and accuracy.
Fig. 5(a) shows that, in matching predicted complexes to known complexes at the typical matching threshold OS = 0.2, the proportion of EPOF-predicted complexes that match known complexes is far higher than for EPOF(DC), by about 25%. At other matching thresholds OS, EPOF also shows an advantage to varying degrees. Fig. 5(b) shows that, on both the unweighted and the weighted protein interaction network, EPOF is clearly better than EPOF(DC) in sensitivity, specificity and comprehensive evaluation. Fig. 5(c) shows that the proportion of complexes identified by EPOF whose biological functional significance P-value falls in the < E-15 and [E-15, E-10] intervals is far higher than for EPOF(DC). Conversely, the proportion of EPOF complexes with a P-value > 0.01 (complexes without biological significance) is far lower than for EPOF(DC). Fig. 5(d) shows that EPOF is clearly more accurate than EPOF(DC). In summary, key proteins play an important role in protein complex identification.

Claims (2)

1. A protein complex identification method based on key proteins and local adaptation, characterized in that it comprises the following steps:
Step 1: build the undirected protein interaction graph: input a set of protein interaction information, filter out the repeated interactions and self-interactions in it, and build the undirected protein interaction graph G; here, protein interaction information means a set of protein-protein associations and their association reliability scores; the two members of each protein-protein association are two proteins that interact directly;
Step 2: search for and sort the key protein nodes: according to an input set of key protein information, search for the key protein nodes in the undirected protein interaction graph G, sort them in descending order of their degree in G, and place them in a candidate seed queue $S_q$; the degree is the number of edges incident to the key protein node in the undirected graph G;
Step 3: identify protein complexes according to the candidate seed queue $S_q$:
If the candidate seed queue $S_q$ is not empty, take the first node A of $S_q$, initialize it as a cluster H, and then perform cluster expansion on H; each time the expansion of a cluster is completed, delete all key protein nodes in that cluster from the candidate seed queue $S_q$ and store the cluster in the result queue C;
When the candidate seed queue $S_q$ is empty, output the result, namely all identified clusters in the result queue C; these identified clusters are the protein complexes to be identified, and the whole identification process ends;
The detailed process of the cluster expansion is:
According to the node fitness function
Figure FDA00003244931900011
compute the node fitness value of every neighbour node of cluster H, where $f_{H+(A)}$ denotes the value of the fitness function of subgraph H when it contains node A, and $f_{H-(A)}$ denotes the value of the fitness function of subgraph H when it does not contain node A; the fitness function of subgraph H is
Figure FDA00003244931900017
in which the two quantities denote the weighted in-degree and the weighted out-degree, respectively; for a weighted network graph $G(N_v, E)$ containing $N_v$ nodes and E edges, and a node v in subgraph H, its weighted in-degree
Figure FDA00003244931900018
is the sum of the weights w(u, v) of all edges that are incident to node v and belong to subgraph H, namely
Figure FDA00003244931900014
and its weighted out-degree
Figure FDA00003244931900015
is the sum of the weights w(u, v) of all edges that are incident to node v and do not belong to subgraph H, namely
Figure FDA00003244931900016
For a weighted network, the edge weight w(u, v) is a numerical measure, based on various biochemical experiments, of the reliability of the protein interaction; for an unweighted network, all edge weights are 1;
A neighbour node of cluster H is a node that interacts directly with a node belonging to cluster H but does not itself belong to cluster H; if the node fitness values of all neighbour nodes of cluster H are negative, then cluster H is a finally identified cluster; it is stored in the result queue C, and all key protein nodes in cluster H are deleted from the candidate seed queue $S_q$;
If not all neighbour nodes of cluster H have negative node fitness values, the following expansion process is carried out:
Step A: select the neighbour node with the highest node fitness value as the current expansion node and add it to cluster H, forming a larger cluster H';
Step B: for cluster H', recompute the fitness values of all nodes inside it;
Step C: if a node with a negative node fitness value appears inside cluster H', delete that node from cluster H' and return to step B; if no node with a negative node fitness value appears inside cluster H', return to the very beginning of the cluster expansion, namely computing, according to the node fitness function, the node fitness values of all neighbour nodes of cluster H.
2. The protein complex identification method based on key proteins and local adaptation according to claim 1, characterized in that, for a weighted protein interaction network, the association reliability score is a numerical measure, based on various biochemical experiments, of the reliability of the protein interaction; for an unweighted protein interaction network, all association reliability scores are 1.
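The claimed procedure can be illustrated with the following Python sketch, offered only as one possible reading of steps 1 to 3 and of steps A to C. Several parts are assumptions and not taken from the claims: the subgraph fitness is assumed to be k_in / (k_in + k_out), because the formula itself survives here only as an image reference; the function names, the tie-breaking by node name, the choice to delete the lowest-fitness node first in step C, and the removal of the seed from the queue even if it dropped out of its own cluster are likewise hypothetical.

```python
from collections import defaultdict


def build_graph(interactions):
    """Step 1: undirected weighted graph without self-loops or repeated edges."""
    graph = defaultdict(dict)
    for u, v, w in interactions:
        if u != v and v not in graph[u]:
            graph[u][v] = w
            graph[v][u] = w
    return graph


def fitness(graph, cluster):
    """ASSUMED subgraph fitness: k_in / (k_in + k_out), summed over cluster nodes."""
    k_in = k_out = 0.0
    for v in sorted(cluster):  # fixed order keeps the float sums reproducible
        for u, w in graph[v].items():
            if u in cluster:
                k_in += w
            else:
                k_out += w
    total = k_in + k_out
    return k_in / total if total else 0.0


def node_fitness(graph, cluster, node):
    """Fitness of the cluster with the node minus fitness without it."""
    return fitness(graph, cluster | {node}) - fitness(graph, cluster - {node})


def expand(graph, seed):
    """Cluster expansion (steps A to C) starting from one seed node."""
    cluster = {seed}
    while True:
        neighbours = {u for v in cluster for u in graph[v]} - cluster
        scored = [(node_fitness(graph, cluster, n), n) for n in neighbours]
        if not scored or max(scored)[0] < 0:
            return cluster  # all neighbours have negative fitness: final cluster
        cluster = cluster | {max(scored)[1]}  # step A
        while True:  # steps B and C
            worst_fit, worst = min((node_fitness(graph, cluster, v), v)
                                   for v in cluster)
            if worst_fit >= 0:
                break
            cluster = cluster - {worst}  # assumed: delete the worst node first


def epof(interactions, key_proteins):
    """Steps 2 and 3: seed queue sorted by descending degree, then expansion."""
    graph = build_graph(interactions)
    queue = sorted((p for p in set(key_proteins) if p in graph),
                   key=lambda p: (-len(graph[p]), p))
    results = []
    while queue:
        cluster = expand(graph, queue[0])
        results.append(cluster)
        # Assumed safeguard: always drop the seed, then every key protein in the cluster.
        queue = [p for p in queue[1:] if p not in cluster]
    return results


# Hypothetical toy network: two triangles joined by the edge C-D,
# plus one self-interaction and one repeated interaction to be filtered.
edges = [("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 1.0), ("C", "D", 1.0),
         ("D", "E", 1.0), ("E", "F", 1.0), ("D", "F", 1.0),
         ("A", "A", 1.0), ("B", "A", 1.0)]
print(epof(edges, ["A", "E"]))
```

On this toy input each seed grows into its own triangle and stops at the bridging edge, since adding the node across the bridge would lower the assumed fitness. Replacing the weights of 1.0 with reliability scores gives the weighted variant described in claim 2.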
CN 201110006179 2011-01-12 2011-01-12 Protein complex identification method based on key protein and local adaptation Expired - Fee Related CN102176223B (en)

Publications (2)

Publication Number Publication Date
CN102176223A CN102176223A (en) 2011-09-07
CN102176223B true CN102176223B (en) 2013-09-11

Family

ID=44519407

Country Status (1)

Country Link
CN (1) CN102176223B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246520A (en) * 2008-03-18 2008-08-20 Central South University Protein complex recognizing method based on range estimation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
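Li Min et al. Protein complex identification algorithm based on maximal clique extension. Journal of Central South University (Science and Technology). 2010, Vol. 41 (No. 2), 560-565. *
Li Min et al. Protein complex identification algorithm based on distance measurement. Journal of Jilin University (Engineering and Technology Edition). 2010, Vol. 40 (No. 5), 1318-1323. *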

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130911