CN109727637B

CN109727637B - Method for identifying key proteins based on mixed frog-leaping algorithm

Info

Publication number: CN109727637B
Application number: CN201811643461.5A
Authority: CN
Inventors: 雷秀娟; 杨晓琴; 赵杰
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2023-09-05
Anticipated expiration: 2038-12-29
Also published as: CN109727637A

Abstract

The invention discloses a method for identifying key proteins based on a mixed frog-leaping algorithm. The method comprises: converting a protein interaction network into an undirected graph; obtaining the subcellular localization information, protein complex participation information and functional annotation information of the proteins; processing the nodes and edges of the protein interaction network; initializing the frog population according to the local average connectivity of the protein nodes; dividing the population into groups according to the fitness values of the frogs; performing memetic evolution, i.e. local search, on the frogs within each group; exchanging information globally among all frogs, i.e. global search; and outputting the key proteins. Simulation results show that the method identifies key proteins accurately and performs better on sensitivity, specificity, F-measure, positive predictive value, negative predictive value and accuracy. Compared with other key protein identification methods, the method combines the optimization characteristics of the mixed frog-leaping algorithm with the topological characteristics of the protein interaction network and the biological characteristics of the proteins, thereby improving the accuracy of key protein identification.

Description

Method for identifying key proteins based on mixed frog-leaping algorithm

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a method for identifying key proteins based on a mixed frog-leaping algorithm (also known as the shuffled frog-leaping algorithm).

Background

Proteins are important components of all cells and tissues of organisms and are the main contributors to vital activities. Different proteins take part in different vital activities in the cell, and proteins are accordingly divided into two main classes: key proteins and non-key proteins. Key proteins are also called lethal proteins: the absence of a key protein prevents the cell from reproducing normally or causes it to die, so that the organism loses function or cannot survive. Identifying key proteins is an important research topic in life science; correct identification not only helps in understanding how organisms operate, but also has important application value for disease diagnosis and drug design.

In biology, key proteins are identified mainly by experimental methods such as single gene knockout, RNA interference and conditional gene knockout. However, these methods are time-consuming, labor-intensive and expensive, and are applicable to only a limited range of species. With the development of high-throughput technology, a large amount of biological data has become available, and the rapid development of computer technology has made the computational identification of key proteins a new direction in this field. At present, computational methods for identifying key proteins fall largely into two categories: methods based on network topology and methods based on biological information fusion.

Numerous studies have shown that whether a protein node is a key protein is related to the topological properties of that node in the protein interaction network. On this basis, a series of methods have been proposed that use node centrality measures to identify key proteins, such as degree centrality (DC), betweenness centrality (BC), closeness centrality (CC), eigenvector centrality (EC), information centrality (IC) and subgraph centrality (SC). With deeper analysis of network topology, further identification methods based on node topological characteristics have been proposed. Wang et al. proposed a centrality measure, NC, that predicts the criticality of a protein by calculating the edge clustering coefficient, taking into account both the characteristics of the node and the relationship between the node and its neighbors. Li et al. proposed the local average connectivity method (LAC), which builds a subgraph from the neighbor nodes of each node and identifies key proteins from the degree of each node in that subgraph. Qi et al. proposed the local interaction density method (LID), which identifies key proteins from the interactions among the neighbor nodes of each node. These topology-based centrality measures depend to a large extent on the reliability of the protein interaction network, whereas the protein interaction data obtained from high-throughput biological experiments contain a large number of false positives, which greatly affects the accuracy of key protein identification.

To overcome the shortcomings of topology-based methods, some researchers have taken the biological significance of proteins into account and proposed identification methods based on biological information fusion. For example, PeC and WDC combine the network topology characteristics of protein nodes with gene expression data; UDoNC integrates the protein interaction network with protein domain information; TEO fuses the functional annotation information and gene expression information of proteins in the protein interaction network; and SON combines subcellular localization information, ortholog information and the topological properties of the protein interaction network. In addition, studies have shown a close relationship between protein complexes and key proteins: Hart et al. demonstrated experimentally that key proteins are typically enriched in certain complexes with specific functions. Accordingly, some identification methods based on protein complexes, such as UC, LIDC and LBCC, have been proposed. Experimental results show that methods integrating biological information into the protein interaction network identify key proteins more accurately than the earlier methods based on network topology alone.

Although computational, network-level prediction of key proteins has made some progress, the identification accuracy of most methods is still low and their robustness is poor. This is mainly due to the incompleteness and unreliability of biological data, the complexity of vital activities and the differences among species. In addition, most methods do not consider differences in connectivity and connection strength among network nodes, use only a few topological or biological features in isolation, and lack a global, overall analysis of the key protein nodes.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention aims to provide a method for identifying key proteins based on a mixed frog-leaping algorithm, which utilizes the optimization characteristics of the mixed frog-leaping algorithm to identify the key proteins from a protein interaction network, so that the accuracy of identifying the key proteins is improved.

In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:

The invention discloses a method for identifying key proteins based on a mixed frog-leaping algorithm, which comprises the following steps:

1) Conversion of protein interaction networks into undirected graphs

Converting a protein interaction network into an undirected graph G = (V, E), where V = {v_i | i = 1, 2, …, n} is the set of nodes v_i, E is the set of edges e, each node v_i represents a protein, and each edge e represents an interaction between proteins;

2) Processing edges and nodes in protein interaction networks

Calculating local average connectivity LAC of the protein nodes, subcellular localization scores SC and protein complex scores PC of the protein nodes, and calculating structural similarity SS and functional similarity FS of edges connecting the two protein nodes;

3) Randomly generated initial frog population

Let F be the size of the frog population and C be the number of candidate key proteins to be identified, i.e. the length of an individual frog. To narrow the search range for key proteins, all protein nodes are sorted in descending order of LAC value and the first 2×C nodes, i.e. those with the largest LAC values, are used to generate the initial population; TopV denotes this set of protein nodes;

4) Global searching process for dividing frog group into groups

Sorting the frog population in descending order of the fitness value Essentiality(f) of the individual frogs, where f = 1, 2, …, F; recording the frog Px with the highest fitness value; and distributing the F frogs into m groups Y_1, Y_2, …, Y_m such that Y_k = [X(j) | X(j) = X(k + m×(j−1)), j = 1, 2, …, n; k = 1, 2, …, m], where X(j) denotes the j-th frog in the sorted frog population and n is the number of frogs in each group;

5) Performing memetic evolution, i.e. local search, within each group: let k and iter denote the group counter and the local evolution counter, which are compared with the total number of groups m and the maximum number of local evolutions maxiter, respectively; initially k = 1 and iter = 1, with maxiter ∈ [50, 100];

6) Mixing the frogs of all groups, re-sorting all individual frogs by their new fitness values and re-dividing them into groups, and recording the new globally optimal frog Px(new); if the difference between the fitness values of Px(new) and Px is not less than 10^-4, returning to step 5); otherwise, proceeding to step 7);

7) Production of key proteins

The proteins in the optimal individual frog are output as the key proteins.
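The grouping rule of step 4) can be illustrated with a minimal Python sketch. It assumes that the fitness values have already been computed and are supplied as a plain list; the function name `partition_frogs` and the toy values are illustrative only and are not part of the method as filed.

```python
def partition_frogs(fitness, m):
    """Sort frog indices by descending fitness and deal them into m groups.

    Group k (1-based) receives the frogs ranked k, k+m, k+2m, ..., which is
    the rule X(k + m*(j-1)) of step 4).
    """
    ranked = sorted(range(len(fitness)), key=lambda f: fitness[f], reverse=True)
    groups = [ranked[k::m] for k in range(m)]
    # ranked[0] is the index of the globally best frog (Px).
    return ranked[0], groups


# Toy fitness values for six frogs, dealt into two groups.
best, groups = partition_frogs([0.2, 0.9, 0.5, 0.7, 0.1, 0.4], m=2)
print(best)    # 1
print(groups)  # [[1, 2, 0], [3, 5, 4]]
```

Dealing the sorted frogs round-robin in this way gives every group a comparable spread of good and poor frogs.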

Preferably, in step 2), the local average connectivity LAC of the protein nodes is obtained by formula (1):

LAC(v_i) = ( Σ_{v_j ∈ N_{v_i}} deg^{C_{v_i}}(v_j) ) / |N_{v_i}|   (1)

where N_{v_i} denotes the neighbor node set of node v_i, C_{v_i} is the subgraph formed by the nodes in N_{v_i}, and deg^{C_{v_i}}(v_j) denotes the number of neighbor nodes of any node v_j of N_{v_i} within the subgraph C_{v_i}.
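As an illustration of the local average connectivity, the following Python sketch reads formula (1) as the average, over the neighbors of v_i, of each neighbor's degree within the subgraph formed by those neighbors. This reading follows the definitions above and the usual definition of LAC, and is taken here as an assumption; the adjacency-dictionary representation and the function name are likewise illustrative.

```python
def local_average_connectivity(adj):
    """Return the LAC of every node.

    adj maps each node to the set of its neighbours in an undirected graph.
    """
    lac = {}
    for v, nbrs in adj.items():
        if not nbrs:
            lac[v] = 0.0  # isolated node: no neighbours to average over
            continue
        # Degree of each neighbour w inside the subgraph induced by nbrs.
        local_deg = sum(len(adj[w] & nbrs) for w in nbrs)
        lac[v] = local_deg / len(nbrs)
    return lac


# Toy graph: triangle A-B-C plus a pendant node D attached to A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
lac = local_average_connectivity(adj)
print(lac["B"], lac["D"])  # 1.0 0.0
```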

Preferably, the subcellular localization score SC of a protein node is derived from formula (2):

where C_l denotes a subcellular compartment, l = 1, 2, …, and Si(C_l) denotes the score of subcellular compartment C_l, obtained from formula (3):

where num(l) denotes the number of key proteins contained in C_l, and tnum denotes the total number of key proteins;

The protein complex score PC of a protein node is calculated according to formula (4):

where F(v_i) denotes the frequency with which node v_i occurs in the known protein complexes, obtained from formula (5), and FM is the maximum of this frequency over all protein nodes;

where N denotes the total number of known protein complexes; if protein node v_i appears in protein complex P_t, then P_t(v_i) = 1, otherwise P_t(v_i) = 0;
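The complex participation count described above can be sketched in Python as follows. The count F(v_i) follows directly from the indicator P_t(v_i). The normalisation PC(v_i) = F(v_i) / FM is an assumption suggested by the role of FM as the maximum frequency; formula (4) itself is not reproduced in this text. Function names and the toy complexes are hypothetical.

```python
def complex_frequency(complexes, proteins):
    """F(v): number of known complexes that contain protein v."""
    return {v: sum(1 for members in complexes if v in members) for v in proteins}


def complex_score(freq):
    """ASSUMPTION: PC(v) = F(v) / FM, where FM is the largest frequency."""
    fm = max(freq.values(), default=0)
    return {v: (f / fm if fm else 0.0) for v, f in freq.items()}


# Toy data: three complexes over five proteins (illustrative names only).
complexes = [{"P1", "P2"}, {"P1", "P3"}, {"P1", "P2", "P4"}]
freq = complex_frequency(complexes, ["P1", "P2", "P3", "P4", "P5"])
pc = complex_score(freq)
print(freq["P1"], pc["P1"], pc["P5"])  # 3 1.0 0.0
```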

The initial weight of each protein node is obtained from equation (6):

InW(v_i) = SC(v_i) × PC(v_i)   (6)

Preferably, in step 2), the structural similarity SS of the edges connecting the two protein nodes is calculated according to formula (7):

where Γ(i) and Γ(j) denote the neighbor node sets of nodes v_i and v_j, respectively, with v_i and v_j themselves included;

The functional similarity of the edges connecting the two protein nodes is calculated according to formula (8):

where g(i) and g(j) denote the sets of GO terms annotating nodes v_i and v_j, respectively;

The weight of the edge connecting the two protein nodes is obtained from formula (9):

We_ij = SS_ij × FS_ij   (9)

The final weight of each protein node is obtained from formula (10):

Preferably, the fitness value Essentiality (f) of the individual frog in step 4) is obtained from formula (11):

Preferably, step 5) is specifically performed as follows:

5-1) Performing local information exchange, i.e. local updating, among the frogs of the k-th group, with k = k + 1;

5-2) In frog group Y_k, s frogs (s < n) are selected to enter the sub-group sub-Y_k. The frogs are selected by the roulette method, i.e. the larger the fitness value of an individual frog in the group, the greater the probability that it is selected. Let Pb and Pw denote the best and the worst frog in the sub-group, respectively, with iter = iter + 1;

5-3) Updating the position of the worst frog Pw according to the locally optimal frog Pb of the sub-group: for each dimension of the worst frog Pw, it is determined whether the component protein appears in the locally optimal frog Pb; if so, the component protein is kept unchanged; otherwise, a component protein of Pb is selected to replace it with a certain probability. That is, the position of the worst frog Pw is updated according to the formula Pnl_1 = update1(Pw, Pb, r_1), where r_1 is the probability that a protein in Pw is replaced by a component protein of Pb, and Pnl_1 is the new position of the worst frog Pw after updating according to the locally optimal frog Pb;

5-4) If step 5-3) improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, the newly generated position Pnl_1 replaces the original position Pw. Otherwise, the globally optimal frog Px is used to update the position of the worst frog: for each dimension of the worst frog Pw, it is determined whether the component protein appears in the globally optimal frog Px; if so, the component protein is kept unchanged; otherwise, a component protein of Px is selected to replace it with a certain probability. That is, the position of the worst frog Pw is updated according to the formula Pnl_2 = update2(Pw, Px, r_2), where r_2 is the probability that a protein in Pw is replaced by a component protein of Px, and Pnl_2 is the new position of the worst frog Pw after updating according to the globally optimal frog Px;

5-5) If step 5-4) improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, the newly generated position Pnl_2 replaces the original position Pw. Otherwise, a frog is generated at a random position in the wetland to replace the worst frog, i.e. the position of the worst frog Pw is updated according to the formula Pnl_3 = update3(Pw, TopV, r_3), where r_3 is the probability that the protein in each dimension of Pw is replaced, and Pnl_3 is the new position of the worst frog Pw after the random update;

Whenever any of the updates in steps 5-3), 5-4) and 5-5) is performed, the best individual Pb and the worst individual Pw of the sub-group are recalculated;

5-6) If iter ≤ maxiter, go to step 5-2);

5-7) If k ≤ m, go to step 5-1); otherwise go to step 6).
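The three-stage fallback of steps 5-3) to 5-5) can be summarised by the control-flow sketch below. It is a minimal Python illustration, not the filed implementation: the fitness function and the three update routines are passed in as callables, the replacement probabilities default to 0.5 purely for illustration, and the toy stand-ins at the bottom (a sum as fitness, a copy as update) are hypothetical.

```python
def improve_worst(pw, pb, px, topv, fitness,
                  update1, update2, update3, r=(0.5, 0.5, 0.5)):
    """Return the new position of the worst frog pw.

    Try the local best pb first, then the global best px, and finally fall
    back to a random update drawn from topv, which is accepted as is.
    """
    old = fitness(pw)
    candidate = update1(pw, pb, r[0])      # step 5-3)
    if fitness(candidate) > old:
        return candidate
    candidate = update2(pw, px, r[1])      # step 5-4)
    if fitness(candidate) > old:
        return candidate
    return update3(pw, topv, r[2])         # step 5-5)


# Toy stand-ins: a frog is a list of numbers, fitness is their sum, and every
# "update" simply copies the source truncated to the length of pw.
copy_from = lambda pw, src, r: list(src)[:len(pw)]
result = improve_worst([1, 2], [1, 1], [5, 6], [9, 9, 9],
                       sum, copy_from, copy_from, copy_from)
print(result)  # [5, 6]
```

In the toy run the local best does not improve on the worst frog, so the global best is used instead.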

Further preferably, in step 5-3), the new position Pnl_1 obtained after updating the position of the worst frog Pw is calculated by the algorithm update1(Pw, Pb, r_1), as follows:

Step 1: find the protein set Pset1 of proteins that appear in Pb but not in Pw;

Step 2: for each component protein v_i ∈ Pw, determine whether it appears in Pb;

Step 3: if v_i ∉ Pb and the random number rand > r_1, randomly select a protein v_j from the set Pset1 to replace v_i, and set Pset1 = Pset1 − {v_j};

Step 4: repeat Steps 2-3 until all proteins in Pw have been examined.
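The four steps of update1 can be sketched in Python as follows. The sketch reads the condition of Step 3 as "v_i does not appear in Pb", in line with the description of step 5-3); that reading, the guard against an empty Pset1 and the optional `rng` parameter (added so that runs can be reproduced) are assumptions of this illustration.

```python
import random


def update1(pw, pb, r1, rng=random):
    """Move the worst frog pw towards the local best pb; return a new list."""
    # Step 1: proteins that appear in pb but not in pw.
    pset1 = [p for p in pb if p not in pw]
    new_pw = list(pw)
    # Steps 2-4: examine every component protein of pw in turn.
    for i, v in enumerate(new_pw):
        # Step 3: replace v only if it is absent from pb and rand > r1.
        if v not in pb and pset1 and rng.random() > r1:
            # Draw a replacement at random and remove it from the pool.
            new_pw[i] = pset1.pop(rng.randrange(len(pset1)))
    return new_pw


pw = ["A", "B", "C", "D"]
pb = ["A", "B", "E", "F"]
moved = update1(pw, pb, r1=0.0, rng=random.Random(0))
kept = update1(pw, pb, r1=1.0, rng=random.Random(0))
print(sorted(moved))  # ['A', 'B', 'E', 'F']
print(kept)           # ['A', 'B', 'C', 'D']
```

Under the same reading, update2 would be the same routine with Px in place of Pb.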

Further preferably, in step 5-4), the new position Pnl_2 obtained after updating the position of the worst frog Pw is calculated by the algorithm update2(Pw, Px, r_2), as follows:

Step 1: find the protein set Pset2 of proteins that appear in Px but not in Pw;

Step 2: for each component protein v_i ∈ Pw, determine whether it appears in Px;

Step 3: if v_i ∉ Px and the random number rand > r_2, randomly select a protein v_j from the set Pset2 to replace v_i, and set Pset2 = Pset2 − {v_j};

Step 4: repeat Steps 2-3 until all proteins in Pw have been examined.

Further preferably, in step 5-5), the new position Pnl_3 obtained after updating the position of the worst frog Pw is calculated by the algorithm update3(Pw, TopV, r_3), as follows:

Step 1: find the protein set Pset3 of proteins that appear in TopV but not in Pw;

Step 2: for each component protein v_i ∈ Pw, determine whether it appears in TopV;

Step 3: if the membership condition of Step 2 is satisfied and the random number rand > r_3, randomly select a protein v_j from the set Pset3 to replace v_i, and set Pset3 = Pset3 − {v_j};

Step 4: repeat Steps 2-3 until all proteins in Pw have been examined.

Compared with the prior art, the invention has the following beneficial effects:

1. When assigning the initial weight to a protein node, the invention uses subcellular localization information and protein complex information: the importance of a protein is measured by its subcellular localization and by its participation in complexes, which improves the identification accuracy of key proteins to a certain extent.

2. When assigning the final weight to a protein node, the invention considers not only the characteristics of the protein itself but also the characteristics of its neighbors and the connectivity between proteins. The connection strength between two proteins is obtained by calculating their topological structural similarity and their functional similarity. By considering network topology and biological information together and fusing multiple characteristics, the invention identifies key proteins more accurately and effectively.

3. The invention identifies key proteins by simulating the way frogs exchange information and leap toward places with more food. A group of candidate key proteins is regarded as one frog, and a local search strategy is executed within each group. After each group has evolved to a certain stage, the whole frog population exchanges information globally. When the algorithm terminates, the group of proteins corresponding to the frog with the highest fitness value is taken as the identified key proteins.

4. The invention can effectively identify key proteins from protein interaction networks, is beneficial to understanding the growth regulation process of cells and the operation mechanism of vital activities, helps people to know the basic requirements of organisms for maintaining the vital activities, can provide important theoretical basis for related researchers from genome and proteome levels, and has extremely important application value in the aspects of diagnosis and treatment of diseases, research and development and preparation of medicines and the like.

Drawings

FIG. 1 is a flow chart of a method of the present invention for identifying key proteins based on a hybrid frog-leaping algorithm;

FIG. 2 is a flow chart of the local search performed on frog group Y_k in step 5) of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention is described in further detail below with reference to the attached drawing figures:

As shown in FIG. 1, the method for identifying key proteins based on the mixed frog-leaping algorithm comprises the following steps:

1) Conversion of protein interaction networks into undirected graphs

2) Processing edges and nodes in protein interaction networks

Calculating the local average connectivity of protein nodes according to formula (1):

LAC(v_i) = ( Σ_{v_j ∈ N_{v_i}} deg^{C_{v_i}}(v_j) ) / |N_{v_i}|   (1)

where N_{v_i} denotes the neighbor node set of node v_i, C_{v_i} is the subgraph formed by the nodes in N_{v_i}, and deg^{C_{v_i}}(v_j) denotes the number of neighbor nodes of any node v_j of N_{v_i} within the subgraph C_{v_i};

The subcellular localization score of a protein node is calculated according to formula (2):

where C_l denotes a subcellular compartment, l = 1, 2, …, and Si(C_l) denotes the score of subcellular compartment C_l, obtained from formula (3):

where num(l) denotes the number of key proteins contained in C_l, and tnum denotes the total number of key proteins;

The protein complex score of a protein node is calculated according to formula (4):

where F(v_i) denotes the frequency with which node v_i occurs in the known protein complexes, obtained from formula (5), and FM is the maximum of this frequency over all protein nodes;

where N denotes the total number of known protein complexes; if protein node v_i appears in protein complex P_t, then P_t(v_i) = 1, otherwise P_t(v_i) = 0;

The initial weight of each protein node is obtained from equation (6):

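InW(v_i) = SC(v_i) × PC(v_i)   (6)

The structural similarity of the edge connecting two protein nodes is calculated according to formula (7):

where Γ(i) and Γ(j) denote the neighbor node sets of nodes v_i and v_j, respectively, with v_i and v_j themselves included;

The functional similarity of the edge connecting two protein nodes is calculated according to formula (8):

where g(i) and g(j) denote the sets of GO terms annotating nodes v_i and v_j, respectively;

The weight of the edge connecting two protein nodes is obtained from formula (9):

We_ij = SS_ij × FS_ij   (9)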

The final weight of each protein node is obtained from formula (10):

3) Randomly generated initial frog population

4) Global searching process for dividing frog group into groups

The frog population is sorted in descending order of the fitness value Essentiality(f) of the individual frogs, where f = 1, 2, …, F, and the frog Px with the highest fitness value is recorded. The F frogs are distributed into m groups Y_1, Y_2, …, Y_m such that Y_k = [X(j) | X(j) = X(k + m×(j−1)), j = 1, 2, …, n; k = 1, 2, …, m], where X(j) denotes the j-th frog in the sorted frog population;

5) Performing memetic evolution, i.e. local search, within each group: let k and iter denote the group counter and the local evolution counter, which are compared with the total number of groups m and the maximum number of local evolutions maxiter, respectively; initially k = 1 and iter = 1, with maxiter ∈ [50, 100]. Referring to FIG. 2, the specific steps are as follows:

5-1) Performing local information exchange, i.e. local updating, among the frogs of the k-th group, with k = k + 1;

5-2) In frog group Y_k, s frogs (s < n) are selected to enter the sub-group sub-Y_k. The frogs are selected by the roulette method, i.e. the larger the fitness value of an individual frog in the group, the greater the probability that it is selected. Let Pb and Pw denote the best and the worst frog in the sub-group, respectively, with iter = iter + 1;

5-3) Updating the position of the worst frog Pw according to the locally optimal frog Pb of the sub-group: for each dimension of the worst frog Pw, it is determined whether the component protein appears in the locally optimal frog Pb; if so, the component protein is kept unchanged; otherwise, a component protein of Pb that is not present in Pw is selected to replace it with a certain probability. That is, the position of the worst frog Pw is updated according to the formula Pnl_1 = update1(Pw, Pb, r_1), where r_1 is the probability that a protein in Pw is replaced by a component protein of Pb, and Pnl_1 is the new position of the worst frog Pw after updating according to the locally optimal frog Pb;

5-4) If step 5-3) improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, the newly generated position Pnl_1 replaces the original position Pw. Otherwise, the globally optimal frog Px is used to update the position of the worst frog: for each dimension of the worst frog Pw, it is determined whether the component protein appears in the globally optimal frog Px; if so, the component protein is kept unchanged; otherwise, a component protein of Px is selected to replace it with a certain probability. That is, the position of the worst frog Pw is updated according to the formula Pnl_2 = update2(Pw, Px, r_2), where r_2 is the probability that a protein in Pw is replaced by a component protein of Px, and Pnl_2 is the new position of the worst frog Pw after updating according to the globally optimal frog Px;

5-5) If step 5-4) improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, the newly generated position Pnl_2 replaces the original position Pw. Otherwise, a frog is generated at a random position in the wetland to replace the worst frog, i.e. the position of the worst frog Pw is updated according to the formula Pnl_3 = update3(Pw, TopV, r_3), where r_3 is the probability that the protein in each dimension of Pw is replaced, and Pnl_3 is the new position of the worst frog Pw after the random update;

Whenever any of the updates in 5-3), 5-4) and 5-5) above is performed, the best individual Pb and the worst individual Pw of the sub-group must be recalculated;

5-6) If iter ≤ maxiter, go to step 5-2);

5-7) If k ≤ m, go to step 5-1); otherwise go to step 6);

6) Mixing all individual frogs, re-sorting and re-grouping them according to their new fitness values, and recording the new globally optimal frog Px(new); if the difference between the fitness values of Px(new) and Px is not less than 10^-4, returning to step 5); otherwise, proceeding to step 7);

7) Production of key proteins

The proteins in the optimal individual frog are output as the key proteins.
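The roulette selection used in step 5-2) to fill the sub-group can be sketched in Python as below. The text only states that frogs with larger fitness values are more likely to be chosen; drawing without replacement, requiring strictly positive fitness values, and the names used here are assumptions of this illustration.

```python
import random


def roulette_select(group, fitness, s, rng=random):
    """Pick s distinct frogs from group, favouring larger fitness values.

    fitness maps each frog to a strictly positive value (assumption).
    """
    candidates = list(group)
    weights = [fitness[f] for f in candidates]
    chosen = []
    for _ in range(s):
        # One spin of the wheel over the frogs that are still available.
        idx = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(idx))
        weights.pop(idx)
    return chosen


fitness = {"f1": 4.0, "f2": 3.0, "f3": 2.0, "f4": 1.0}
sub = roulette_select(["f1", "f2", "f3", "f4"], fitness, s=3, rng=random.Random(1))
pb = max(sub, key=fitness.get)  # best frog of the sub-group
pw = min(sub, key=fitness.get)  # worst frog of the sub-group
print(len(sub), fitness[pb] >= fitness[pw])  # 3 True
```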

The fitness value Essentiality (f) of the individual frog in step 4) of the present invention is obtained from the formula (11):

In step 5-3) of the present invention, the new position Pnl_1 obtained after updating the position of the worst frog Pw is calculated by Algorithm 1, update1(Pw, Pb, r_1), as follows:

Step 1: find the protein set Pset1 of proteins that appear in Pb but not in Pw;

Step 2: for each component protein v_i ∈ Pw, determine whether it appears in Pb;

Step 3: if v_i ∉ Pb and the random number rand > r_1, randomly select a protein v_j from the set Pset1 to replace v_i, and set Pset1 = Pset1 − {v_j};

Step 4: repeat Steps 2-3 until all proteins in Pw have been examined.

In step 5-4) of the present invention, the new position Pnl_2 obtained after updating the position of the worst frog Pw is calculated by Algorithm 2, update2(Pw, Px, r_2), as follows:

Step 1: find the protein set Pset2 of proteins that appear in Px but not in Pw;

Step 2: for each component protein v_i ∈ Pw, determine whether it appears in Px;

Step 3: if v_i ∉ Px and the random number rand > r_2, randomly select a protein v_j from the set Pset2 to replace v_i, and set Pset2 = Pset2 − {v_j};

Step 4: repeat Steps 2-3 until all proteins in Pw have been examined.

In step 5-5) of the present invention, the new position Pnl_3 obtained after updating the position of the worst frog Pw is calculated by Algorithm 3, update3(Pw, TopV, r_3), as follows:

Step 1: find the protein set Pset3 of proteins that appear in TopV but not in Pw;

Step 2: for each component protein v_i ∈ Pw, determine whether it appears in TopV;

Step 3: if the membership condition of Step 2 is satisfied and the random number rand > r_3, randomly select a protein v_j from the set Pset3 to replace v_i, and set Pset3 = Pset3 − {v_j};

Step 4: repeat Steps 2-3 until all proteins in Pw have been examined.

The invention is illustrated in further detail by the following examples:

The following takes a protein network as an example to describe the method for identifying key proteins based on the mixed frog-leaping algorithm; the specific operation is as follows:

This example uses the Saccharomyces cerevisiae dataset from the DIP database (version 2010.10.10) as the simulation dataset. After self-interactions and duplicate interactions are removed, it comprises 5093 proteins and 24743 edges. Subcellular localization data were downloaded from the COMPARTMENTS database (version 20140830) and include 6002 yeast proteins and 238657 subcellular location records. Known protein complex data were obtained by integrating the four datasets CM270, CM425, CYC408 and CYC428, giving a total of 745 protein complexes covering 2167 proteins. The GO data are a reduced version of the GO ontologies. Key protein data were obtained by integrating the four databases MIPS, SGD, DEG and SGDP, and contain 1285 key proteins in total. The experimental platform is the Windows 10 operating system with an Intel Core i5-6600 dual-core 3.31 GHz processor and 8 GB of physical memory; the method is implemented in Matlab R2014a.
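The removal of self-interactions and duplicate interactions mentioned above can be sketched in Python as follows. The example itself was implemented in Matlab; this is only an illustration, and the input format (an iterable of protein-name pairs), the function name and the toy identifiers P1 to P3 are assumptions.

```python
def build_undirected_graph(interactions):
    """Drop self-interactions and duplicates; return (adjacency, edge set)."""
    adj = {}
    edges = set()
    for a, b in interactions:
        if a == b:
            continue  # self-interaction
        edge = frozenset((a, b))
        if edge in edges:
            continue  # duplicate, possibly listed in the reverse order
        edges.add(edge)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj, edges


raw = [("P1", "P2"), ("P2", "P1"), ("P1", "P1"), ("P2", "P3")]
adj, edges = build_undirected_graph(raw)
print(len(adj), len(edges))  # 3 2
```

A protein that occurs only in self-interactions does not appear in the resulting adjacency dictionary.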

The method comprises the following specific steps:

1. Conversion of protein interaction networks into undirected graphs

Converting the protein interaction network comprising 5093 proteins and 24743 interactions into an undirected graph G = (V, E), where V = {v_i | i = 1, 2, …, 5093} is the set of nodes v_i, E is the set of 24743 edges e, each node v_i represents a protein, and each edge e represents an interaction between proteins.

2. Processing edges and nodes in protein interaction networks

Preprocessing of node v_i: for i = 1, 2, …, 5093 and each given i, the local average connectivity of the protein node v_i is calculated according to formula (1):

LAC(v_i) = ( Σ_{v_j ∈ N_{v_i}} deg^{C_{v_i}}(v_j) ) / |N_{v_i}|   (1)

where N_{v_i} denotes the neighbor node set of node v_i, C_{v_i} is the subgraph formed by the nodes in N_{v_i}, and deg^{C_{v_i}}(v_j) denotes the number of neighbor nodes of any node v_j of N_{v_i} within the subgraph C_{v_i}; the subcellular localization score of a protein node is calculated according to formula (2):

where num(l) denotes the number of key proteins contained in C_l, and tnum denotes the total number of key proteins of yeast, tnum = 1285; the protein complex score of a protein node is calculated according to formula (4):

where N denotes the total number of known protein complexes, N = 745; if protein node v_i appears in protein complex P_t, then P_t(v_i) = 1, otherwise P_t(v_i) = 0; the initial weight of each protein node is obtained from formula (6):

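InW(v_i) = SC(v_i) × PC(v_i)   (6)

The structural similarity of the edge connecting two protein nodes is calculated according to formula (7):

where Γ(i) and Γ(j) denote the neighbor node sets of nodes v_i and v_j, respectively, with v_i and v_j themselves included; the functional similarity of the edge connecting two protein nodes is calculated according to formula (8):

where g(i) and g(j) denote the sets of GO terms annotating nodes v_i and v_j, respectively; the weight of the edge connecting two protein nodes is obtained from formula (9):

We_ij = SS_ij × FS_ij   (9)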

The final weight of each protein node is obtained from formula (10):

3. Randomly generated initial frog population

Let F be the size of the frog population, F = 100, and C be the number of candidate key proteins to be identified, i.e. the length of an individual frog. To narrow the search range for key proteins, all protein nodes are sorted in descending order of LAC value and the first 2×C nodes, i.e. those with the largest LAC values, are used to generate the initial population; TopV denotes this set of protein nodes;
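The construction of TopV and of the initial population can be sketched in Python as below. The sketch assumes that each frog is a random sample of C distinct proteins drawn from TopV; the text states only that the initial population is generated randomly from the 2×C nodes with the largest LAC values, so the sampling scheme, the function name and the toy LAC values are illustrative.

```python
import random


def initial_population(lac, c, f, rng=random):
    """Return (TopV, population) for f frogs of length c.

    lac maps each protein to its LAC value; TopV holds the 2*c proteins
    with the largest LAC values.
    """
    ranked = sorted(lac, key=lac.get, reverse=True)
    topv = ranked[:2 * c]
    # ASSUMPTION: every frog is c distinct proteins sampled from TopV.
    population = [rng.sample(topv, c) for _ in range(f)]
    return topv, population


lac = {"P1": 3.0, "P2": 2.5, "P3": 2.0, "P4": 1.0, "P5": 0.5, "P6": 0.0}
topv, population = initial_population(lac, c=2, f=3, rng=random.Random(0))
print(topv)             # ['P1', 'P2', 'P3', 'P4']
print(len(population))  # 3
```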

4. Global search process: dividing the frog population into groups

The frog population is sorted in descending order of the fitness value Essentiality(f) of the frog individuals, f = 1, 2, …, F, where Essentiality(f) is obtained from formula (11), and the frog Px with the highest fitness value is recorded. The F frogs are then assigned to m groups $Y_1, Y_2, \ldots, Y_m$ such that $Y_k = [X(j) \mid X(j) = X(k + m \times (j-1)),\ j = 1, 2, \ldots, n,\ k = 1, 2, \ldots, m]$, where m = 10, n = 10, and X(j) denotes the j-th frog in the sorted frog population; that is, group $Y_k$ receives the frogs ranked $k$, $k+m$, $k+2m$, … in the sorted population.
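The sorting and grouping rule can be sketched in Python as below. The fitness function is passed in as a parameter because formula (11) is not reproduced in this text; the toy frogs (plain integers whose value is their own fitness) and the function names are assumptions made purely for illustration.

```python
def sort_by_fitness(frogs, fitness):
    """Rank frogs from highest to lowest fitness."""
    return sorted(frogs, key=fitness, reverse=True)

def partition_into_groups(sorted_frogs, m):
    """Deal the ranked frogs into m groups in round-robin order.

    The 0-based slice k::m corresponds to X(k + m*(j-1)) with 1-based k and j.
    """
    return [sorted_frogs[k::m] for k in range(m)]

# Toy example: twelve frogs whose fitness is simply their own value
frogs = list(range(1, 13))
ranked = sort_by_fitness(frogs, fitness=lambda frog: frog)
groups = partition_into_groups(ranked, m=3)
best_frog = ranked[0]  # plays the role of Px
```

The round-robin deal spreads strong and weak frogs evenly, so every group starts with a comparable mix of fitness values.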

5. Performing memetic evolution, i.e. local search, within each group: k and iter denote the group counter and the local evolution counter, respectively, and are compared against the total number of groups m and the maximum number of local evolutions maxiter; k = 1, iter = 1, maxiter ∈ [50, 100];

5-1. Carry out local idea exchange, i.e. local updating, among the frogs in the k-th group, k = k + 1;

5-2. From group $Y_k$, select s frogs (s < n) to enter the subgroup sub-$Y_k$. The frogs in the subgroup are selected by the roulette method, i.e. the larger the fitness value of a frog individual in the group, the greater the probability that it is selected. Pb and Pw denote the best frog and the worst frog in the subgroup, respectively; iter = iter + 1;

5-3. Update the position of the worst frog Pw according to the locally best frog Pb in the subgroup. For each dimension of the worst frog Pw, judge whether its component protein appears in the locally best frog Pb; if so, keep the component protein unchanged; otherwise, with a certain probability, replace it with a component protein of Pb that is not present in Pw. That is, the position of the worst frog Pw is updated according to the formula $Pnl_1 = update1(Pw, Pb, r_1)$, where $r_1$ is the probability that a protein in Pw is replaced by a component protein of Pb, and $Pnl_1$ is the new position of the worst frog Pw after it is updated according to the locally best frog Pb. $Pnl_1$ is obtained by Algorithm 1:

Algorithm 1: update1(Pw, Pb, $r_1$)

Step 2: For each component protein $v_i \in Pw$, judge whether it appears in Pb;

Step 4: Repeat Steps 2-3 until all proteins in Pw have been judged.

5-4. If step 5-3 improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, replace the original position Pw with the newly generated position $Pnl_1$. Otherwise, update the position of the worst frog individual using the globally best frog Px: for each dimension of the worst frog individual Pw, judge whether its component protein appears in the globally best frog individual Px; if so, keep the component protein unchanged; otherwise, with a certain probability, replace it with a component protein selected from Px. That is, the position of the worst frog Pw is updated according to the formula $Pnl_2 = update2(Pw, Px, r_2)$, where $r_2$ is the probability that a protein in Pw is replaced by a component protein of Px, and $Pnl_2$ is the new position of the worst frog Pw after it is updated according to the globally best frog Px. $Pnl_2$ is obtained by Algorithm 2:

Algorithm 2: update2(Pw, Px, $r_2$)

Step 2: For each component protein $v_i \in Pw$, judge whether it appears in Px;

Step 4: Repeat Steps 2-3 until all proteins in Pw have been judged.

5-5. If step 5-4 improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, replace the original position Pw with the newly generated position $Pnl_2$. Otherwise, randomly generate a frog at an arbitrary position in the wetland to replace the worst frog. That is, the position of the worst frog Pw is updated according to the formula $Pnl_3 = update3(Pw, TopV, r_3)$, where $r_3$ is the probability that the protein in each dimension of Pw is replaced, and $Pnl_3$ is the new position of the worst frog Pw after the random update. $Pnl_3$ is obtained by Algorithm 3:

Algorithm 3: update3(Pw, TopV, $r_3$)

Step 2: For each component protein $v_i \in Pw$, judge whether it appears in TopV;

Step 4: Repeat Steps 2-3 until all proteins in Pw have been judged.

5-6. If iter <= maxiter, go to step 5-2;

5-7. If k <= m, go to step 5-1; otherwise, go to step 6;
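Steps 5-2 to 5-5 can be sketched in Python as below. This is an illustrative reading of the prose, not the patented Algorithms 1 to 3, whose intermediate steps are not reproduced in this text. Assumptions of the sketch: one helper `update_toward` serves for both update1 and update2, `update_random` stands in for update3, fitness values are non-negative so they can be used as roulette weights, and the toy weight table replaces the fitness of formula (11). All function names are hypothetical.

```python
import random

def update_toward(pw, guide, r, rng):
    """Sketch of update1/update2: pull the worst frog toward a guide frog."""
    new_frog = list(pw)
    guide_set = set(guide)
    for idx, protein in enumerate(pw):
        if protein in guide_set:
            continue  # component shared with the guide: keep it unchanged
        unused = [p for p in guide if p not in new_frog]
        if unused and rng.random() < r:
            new_frog[idx] = rng.choice(unused)
    return new_frog

def update_random(pw, top_v, r, rng):
    """Sketch of update3: replace each dimension with probability r from TopV."""
    new_frog = list(pw)
    for idx in range(len(new_frog)):
        unused = [p for p in top_v if p not in new_frog]
        if unused and rng.random() < r:
            new_frog[idx] = rng.choice(unused)
    return new_frog

def roulette_select(group, fitness, s, rng):
    """Pick s distinct frogs; a higher fitness gives a higher chance."""
    pool, chosen = list(group), []
    for _ in range(min(s, len(pool))):
        weights = [fitness(frog) for frog in pool]
        if sum(weights) > 0:
            pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        else:
            pick = rng.randrange(len(pool))  # fall back to a uniform draw
        chosen.append(pool.pop(pick))
    return chosen

def local_step(group, px, top_v, fitness, s, r1, r2, r3, rng):
    """One pass through steps 5-2 to 5-5 for a single group."""
    sub = roulette_select(group, fitness, s, rng)
    pb, pw = max(sub, key=fitness), min(sub, key=fitness)
    new_frog = update_toward(pw, pb, r1, rng)             # try the local best
    if fitness(new_frog) <= fitness(pw):
        new_frog = update_toward(pw, px, r2, rng)         # try the global best
        if fitness(new_frog) <= fitness(pw):
            new_frog = update_random(pw, top_v, r3, rng)  # random replacement
    updated = list(group)
    updated[group.index(pw)] = new_frog
    return updated

# Toy weights standing in for the fitness of formula (11)
weights = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0, "F": 0.5}
fitness = lambda frog: sum(weights[p] for p in frog)
group = [["A", "B"], ["C", "D"], ["E", "F"]]
new_group = local_step(group, px=["A", "B"], top_v=list(weights),
                       fitness=fitness, s=2, r1=1.0, r2=1.0, r3=0.5,
                       rng=random.Random(0))
```

Candidates are filtered against the frog being built, so a frog never contains the same protein twice. The full local search would call `local_step` up to maxiter times per group, as in steps 5-6 and 5-7.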

6. Mix all the frog individuals, sort and regroup them according to their new fitness values, and record the new globally best frog individual Px(new). If the difference between the fitness values of Px(new) and Px is not less than $10^{-4}$, go to step 5; otherwise, go to step 7;

7. Generating key proteins

The proteins in the best frog individual are output as the key proteins.
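The mixing and stopping rule of step 6 can be sketched in Python as follows. Reading "difference" as the absolute difference of the two fitness values is an assumption of the sketch, as are the function names and the toy score table, which stands in for the fitness of formula (11).

```python
def shuffle_and_rank(groups, fitness):
    """Merge all groups and rank the frogs from highest to lowest fitness."""
    merged = [frog for group in groups for frog in group]
    return sorted(merged, key=fitness, reverse=True)

def keep_searching(px_old, px_new, fitness, tol=1e-4):
    """Return True while the best fitness still changes by at least tol.

    Assumption: the difference is taken as an absolute value.
    """
    return abs(fitness(px_new) - fitness(px_old)) >= tol

# Toy frogs identified by name, with hypothetical fitness scores
scores = {"f1": 0.90, "f2": 0.40, "f3": 0.95, "f4": 0.10}
fitness = scores.get
ranked = shuffle_and_rank([["f1", "f2"], ["f3", "f4"]], fitness)
px_new = ranked[0]
go_on = keep_searching("f1", px_new, fitness)
```

When `keep_searching` returns False the loop ends and the proteins of the best frog are reported, as in step 7.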

To verify the effectiveness of the present invention, the inventors applied the method of Example 1 for identifying key proteins based on the hybrid frog-leaping algorithm to the protein network in the DIP database. The number C of candidate key proteins to be identified was set to 1%, 5%, 10%, 15%, 20% and 25% of the number of all protein nodes in the protein interaction network. The results are shown in Table 1 and Table 2: Table 1 compares the identification accuracy of the method with that of other current key protein identification methods, and Table 2 compares the method with the other methods on each evaluation index.

TABLE 1 comparison of the accuracy of key proteins identified by the invention with other methods

TABLE 2 comparison of the invention and other methods at various evaluation indices

Table 1 shows the identification accuracy obtained when the top 1%, 5%, 10%, 15%, 20% and 25% of the proteins identified by the method of the present invention are taken as candidate key proteins and compared with the key proteins in the standard library, together with the results of 9 other key protein identification methods. As can be seen from Table 1, for every number of candidate key proteins from 1% to 25%, the method of the present invention has the highest identification accuracy among the methods compared.

Table 2 compares the method of the present invention with the other 9 methods on the evaluation indexes of sensitivity, specificity, F-measure, positive predictive value, negative predictive value and accuracy when the number of identified candidate key proteins is 25%. As can be seen from Table 2, the present invention predicts more key proteins than the other methods, and with higher prediction accuracy.
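For reference, the six evaluation indexes named above can be computed from a confusion matrix as in the Python sketch below. The formulas are the standard definitions of these indexes and are assumed to match those used for Table 2, which the text does not spell out; the protein sets are made-up toy data, not results of the method, and all denominators are assumed to be non-zero.

```python
def confusion_counts(predicted, known_key, all_proteins):
    """Count TP, FP, TN, FN for a predicted set against the known key proteins."""
    tp = len(predicted & known_key)
    fp = len(predicted - known_key)
    fn = len(known_key - predicted)
    tn = len(all_proteins - predicted - known_key)
    return tp, fp, tn, fn

def evaluation_indices(tp, fp, tn, fn):
    """Standard definitions; assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)   # positive predictive value
    npv = tn / (tn + fn)   # negative predictive value
    f_measure = 2 * sensitivity * ppv / (sensitivity + ppv)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"SN": sensitivity, "SP": specificity, "F": f_measure,
            "PPV": ppv, "NPV": npv, "ACC": accuracy}

# Toy data with hypothetical protein names
all_proteins = {f"P{i}" for i in range(1, 11)}
known_key = {"P1", "P2", "P3", "P4"}
predicted = {"P1", "P2", "P5"}
counts = confusion_counts(predicted, known_key, all_proteins)
indices = evaluation_indices(*counts)
```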

In summary, the method for identifying key proteins based on the hybrid frog-leaping algorithm comprises: converting a protein interaction network into an undirected graph; obtaining the subcellular localization information, protein complex participation information and functional annotation information corresponding to the proteins; processing the nodes and edges in the protein interaction network; initializing the frog population according to the local average connectivity of the protein nodes; dividing the population into groups according to the fitness values of the frogs; performing memetic evolution, i.e. local search, on the frogs within each group; performing global idea exchange, i.e. global search, among all the frogs; and generating the key proteins. The method can accurately identify key proteins. The simulation results show that it performs better on indexes such as sensitivity, specificity, F-measure, positive predictive value, negative predictive value and accuracy. Compared with other key protein identification methods, the method combines the optimization characteristics of the hybrid frog-leaping algorithm with the topological characteristics of the protein interaction network and the biological characteristics of the proteins, thereby improving the identification accuracy of key proteins.

The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. The method for identifying the key protein based on the mixed frog-leaping algorithm is characterized by comprising the following steps of:

1) Conversion of protein interaction networks into undirected graphs

Convert the protein interaction network into an undirected graph $G = (V, E)$, where $V = \{v_i \mid i = 1, 2, \ldots, n\}$ is the set of nodes $v_i$, $E$ is the set of edges $e$, node $v_i$ represents a protein, and edge $e$ represents an interaction between proteins;

2) Processing edges and nodes in protein interaction networks

$We_{ij} = SS_{ij} \times FS_{ij}$ (9)

where $SS_{ij}$ is the structural similarity of the edge connecting two protein nodes, calculated according to formula (7):

$FS_{ij}$ is the functional similarity of the edge connecting two protein nodes, calculated according to formula (8):

where $g(i)$ and $g(j)$ denote the sets of GO terms annotating nodes $v_i$ and $v_j$, respectively;

The final weight $FnW(v_i)$ of each protein node is obtained from formula (10):

where $InW(v_i)$, the initial weight of each protein node, is obtained from formula (6):

$InW(v_i) = SC(v_i) \times PC(v_i)$ (6);

3) Randomly generating the initial frog population

Let F be the frog population size and C the number of candidate key proteins to be identified, i.e. the length of a frog individual. All protein nodes are sorted in descending order of LAC value and, to narrow the search range for key proteins, the first 2×C nodes with the largest LAC values are taken as the protein node set TopV, from which the initial population is generated;

4) Global search process: dividing the frog population into groups

The fitness value Essentiality(f) of each frog individual is obtained from formula (11):

6) Mix all the frog individuals, sort and regroup them according to their new fitness values, and record the new globally best frog individual Px(new); if the difference between the fitness values of Px(new) and Px is not less than $10^{-4}$, go to step 5); otherwise, go to step 7);

7) Generating key proteins

The proteins in the best frog individual are output as the key proteins.

2. The method for identifying key proteins based on the hybrid frog-leaping algorithm of claim 1, wherein in step 2), the local average connectivity LAC of protein nodes is obtained by the formula (1):

$$LAC(v_i) = \frac{\sum_{v_j \in N_i} \deg^{C_i}(v_j)}{|N_i|} \qquad (1)$$

where $N_i$ denotes the set of neighbor nodes of node $v_i$, $C_i$ is the subgraph formed by the nodes in $N_i$, and $\deg^{C_i}(v_j)$ denotes the number of neighbor nodes that any node $v_j$ in $N_i$ has within subgraph $C_i$.

3. The method for identifying key proteins based on the mixed frog-leaping algorithm according to claim 1, wherein the subcellular localization scores SC of the protein nodes are obtained by formula (2):

where $num(l)$ denotes the number of key proteins contained in $C_l$, and $tnum$ denotes the total number of key proteins;

where $N$ denotes the total number of known protein complexes; if protein node $v_i$ appears in protein complex $P_t$, then $P_t(v_i) = 1$, otherwise $P_t(v_i) = 0$;

The initial weight of each protein node is obtained from formula (6):

$InW(v_i) = SC(v_i) \times PC(v_i)$ (6).

4. The method for identifying key proteins based on the hybrid frog-leaping algorithm of claim 1, wherein step 5) specifically operates as follows:

5-2) From group $Y_k$, select s frogs, s < n, to enter the subgroup sub-$Y_k$; the frogs in the subgroup are selected by the roulette method, i.e. the larger the fitness value of a frog individual in the group, the greater the probability that it is selected; Pb and Pw denote the best frog and the worst frog in the subgroup, respectively; iter = iter + 1;

5-3) Update the position of the worst frog Pw according to the locally best frog Pb in the subgroup: for each dimension of the worst frog individual Pw, judge whether its component protein appears in the locally best frog individual Pb; if so, keep the component protein unchanged; otherwise, with a certain probability, replace it with a component protein selected from Pb; that is, the position of the worst frog Pw is updated according to the formula $Pnl_1 = update1(Pw, Pb, r_1)$, where $r_1$ is the probability that a protein in Pw is replaced by a component protein of Pb, and $Pnl_1$ is the new position of the worst frog Pw after it is updated according to the locally best frog Pb;

5-4) If step 5-3) improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, replace the original position Pw with the newly generated position $Pnl_1$; otherwise, update the position of the worst frog individual using the globally best frog Px: for each dimension of the worst frog individual Pw, judge whether its component protein appears in the globally best frog individual Px; if so, keep the component protein unchanged; otherwise, replace it with a component protein selected from Px; that is, the position of the worst frog Pw is updated according to the formula $Pnl_2 = update2(Pw, Px, r_2)$, where $r_2$ is the probability that a protein in Pw is replaced by a component protein of Px, and $Pnl_2$ is the new position of the worst frog Pw after it is updated according to the globally best frog Px;

5-5) If step 5-4) improves the position of the worst frog, i.e. the fitness value of the worst frog at the new position is higher than at the original position, replace the original position Pw with the newly generated position $Pnl_2$; otherwise, randomly generate a frog at an arbitrary position in the wetland to replace the worst frog; that is, the position of the worst frog Pw is updated according to the formula $Pnl_3 = update3(Pw, TopV, r_3)$, where $r_3$ is the probability that the protein in each dimension of Pw is replaced, and $Pnl_3$ is the new position of the worst frog Pw after the random update;

5-6) If iter <= maxiter, go to step 5-2);

5-7) If k <= m, go to step 5-1); otherwise, go to step 6).

5. The method for identifying key proteins based on a mixed frog-leaping algorithm as claimed in claim 4, wherein in step 5-3), the new position $Pnl_1$ obtained after the position update of the worst frog Pw is calculated by the algorithm update1(Pw, Pb, $r_1$), specifically as follows:

Step 2: For each component protein $v_i \in Pw$, judge whether it appears in Pb;

Step 4: Repeat Steps 2-3 until all proteins in Pw have been judged.

6. The method for identifying key proteins based on a mixed frog-leaping algorithm as claimed in claim 4, wherein in step 5-4), the new position $Pnl_2$ obtained after the position update of the worst frog Pw is calculated by the algorithm update2(Pw, Px, $r_2$), specifically as follows:

Step 2: For each component protein $v_i \in Pw$, judge whether it appears in Px;

Step 4: Repeat Steps 2-3 until all proteins in Pw have been judged.

7. The method for identifying key proteins based on a mixed frog-leaping algorithm as claimed in claim 4, wherein in step 5-5), the new position $Pnl_3$ obtained after the position update of the worst frog Pw is calculated by the algorithm update3(Pw, TopV, $r_3$), specifically as follows:

Step 2: For each component protein $v_i \in Pw$, judge whether it appears in TopV;

Step 4: Repeat Steps 2-3 until all proteins in Pw have been judged.