CN107784196A

CN107784196A - Method for identifying key proteins based on an artificial fish swarm optimization algorithm

Info

Publication number: CN107784196A
Application number: CN201710912037.5A
Authority: CN
Inventors: 雷秀娟; 杨晓琴; 代才; 程适
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2018-03-09
Anticipated expiration: 2037-09-29
Also published as: CN107784196B

Abstract

The invention discloses a method for identifying key proteins based on an artificial fish swarm optimization algorithm. A protein interaction network is transformed into an undirected graph and a purified protein interaction network is constructed; the RNA gene expression values corresponding to the proteins, the GO annotation information, and the degree of each protein within known complexes are obtained; the edges and nodes of the purified protein interaction network are processed; known key proteins are selected as the initial artificial fish; and the artificial fish perform foraging, random, tail-chasing, and swarming behaviors to produce the key proteins. The method can accurately identify key proteins. Simulation results show good performance on indicators such as sensitivity, specificity, positive predictive value, and negative predictive value. Compared with other key protein identification methods, it combines the optimization characteristics of the artificial fish swarm with the topological features of the protein interaction network to identify key proteins, improving identification accuracy.

Description

A Method for Identifying Key Proteins Based on Artificial Fish Swarm Optimization Algorithm

Technical Field

The invention belongs to the field of bioinformatics and specifically relates to a method for identifying key proteins based on an artificial fish swarm optimization algorithm.

Background Art

Key proteins are the products of key genes and are indispensable for an organism to sustain its life activities. The loss of a key protein disrupts normal life activities and can even cause the death of the organism. Predicting and identifying key proteins is therefore important research: on the one hand, it helps in studying cell-related growth regulation; on the other hand, it has far-reaching significance for disease diagnosis and drug design. Initially, key proteins were identified mainly through biological experiments such as single-gene knockout and RNA interference. These experimental techniques are accurate and effective, but costly and inefficient. Identifying key proteins by computational methods has therefore become a research focus in bioinformatics.

At present, there are two main computational approaches to identifying key proteins: node centrality methods based on network topology, and methods that combine the PPI network with biological information data.

The "centrality-lethality" rule proposed by Jeong et al. in 2001 states that the criticality of a protein is closely related to its topological properties in the protein interaction network: the loss of a protein with more neighbor nodes is more likely to affect the topology of the whole network. In short, protein nodes with higher degree tend to be key, and the loss of such proteins is more likely to cause loss of function and lethality. This rule laid the foundation for key protein identification based on network topology. Subsequently, a series of identification methods based on topological centrality were proposed, including Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Eigenvector Centrality (EC), Information Centrality (IC), and Subgraph Centrality (SC). These methods all score and rank every protein node by a given centrality value in the protein interaction network and identify key proteins from the ranking. However, centrality methods depend heavily on the reliability of the protein interaction network. Because the network is obtained from high-throughput biological experiments, it contains a large number of false positives, which substantially reduces the accuracy of key protein identification.
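The score-and-rank idea behind these centrality methods can be illustrated with the simplest of them, degree centrality. The following is a minimal sketch on a made-up four-protein network; the function name and the toy edge list are illustrative assumptions, not part of the patent.

```python
from collections import defaultdict

def degree_centrality_ranking(edges):
    """Rank the nodes of an undirected graph by degree, highest first."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Sort by degree (descending), then by name for a deterministic order.
    return sorted(((node, len(nb)) for node, nb in neighbors.items()),
                  key=lambda item: (-item[1], item[0]))

# Toy interaction list (hypothetical protein names).
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
ranking = degree_centrality_ranking(toy_edges)
```

A top-ranked node such as "A" would be predicted as key under the centrality-lethality rule; a single false-positive edge changes the degrees directly, which is why these methods are sensitive to network reliability.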

To address the shortcomings of centrality methods, researchers have proposed new identification methods to improve accuracy. For example, PeC combines the protein interaction network with gene expression profiles; ION combines the homology characteristics of proteins with the protein interaction network; UDoNC combines protein domains with the protein interaction network; and SCP combines subcellular localization information with the protein interaction network. There are also methods based on prior knowledge, such as CPPK and CEPPK, which take a subset of known key proteins as prior knowledge and judge the criticality of each other protein in the network by how closely it is connected to those known key proteins.

Numerous studies have shown a close connection between protein criticality and protein complexes. Hart et al. found experimentally that criticality is not determined by a single protein but often depends on the function of the protein complex, and their experimental data show that key proteins tend to be concentrated in certain complexes. Consequently, many key protein identification methods based on protein complexes and functional modules have been proposed.

Although researchers have studied key protein identification in depth as bioinformatics has developed, the accuracy of current topology-based methods remains low, and most methods analyze key proteins using a few parameters or features in an isolated or piecemeal way, without an overall, global view of the nodes. In addition, protein interaction data obtained by high-throughput techniques contain many false positives and do not represent the real protein network, so constructing a protein interaction network that more faithfully reflects the organism can further improve identification accuracy.

In summary, the main defects of the above methods are that they do not consider the reliability of the protein interaction network, that they consider only some features and lack a global, overall view, and that their identification accuracy is low.

Summary of the Invention

The purpose of the invention is to overcome the shortcomings and deficiencies of the prior art and to provide a method for identifying key proteins based on an artificial fish swarm optimization algorithm that constructs a purified protein interaction network and identifies key proteins with high accuracy.

To achieve this purpose, the invention adopts the following technical solution, which comprises the following steps:

(1) Transform the protein interaction network into an undirected graph:

Transform the protein interaction network into an undirected graph G = (V, E), where V = {v_i, i = 1, 2, …, n} is the set of nodes v_i and E is the set of edges e; each node v_i represents a protein, and each edge e represents an interaction between proteins;

(2) Construct a purified protein interaction network:

At time point t, if the gene expression value Ep_it of node v_i is greater than the gene expression activity threshold Active_Th(i), node v_i is considered active at time point t; otherwise it is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, they are considered co-expressed at time point t. The edges of the undirected graph that correspond to protein interactions not co-expressed at any time point are deleted, yielding a purified protein interaction network;

(3) Process the edges and nodes of the purified protein interaction network: calculate the edge clustering coefficient ECC, the Pearson correlation coefficient PCC of each edge, the GO functional similarity of each edge, and the degree of each node inside protein complexes;

(4) Select known key proteins to form the initial artificial fish:

Let N be the artificial fish population size and m the number of known key proteins contained in each artificial fish. From the currently known key proteins, m are selected at random to form one artificial fish carrying prior knowledge. Fish(k) denotes the set of known key proteins contained in the k-th initial artificial fish, k = 1, 2, …, N; Cn is the number of candidate key proteins;

(5) Foraging behavior:

Find all neighbor proteins of the proteins in each artificial fish to form the neighbor protein node set Neighbor(k), where the proteins in Neighbor(k) and Neighbor(l) are mutually distinct, k = 1, 2, …, N, l = 1, 2, …, N, k ≠ l. For each node v_i in Neighbor(k), the likelihood of merging it into the artificial fish Fish(k) is determined by score1(i) = fitness1(v_i, Fish(k)). The nodes in Neighbor(k) are sorted in descending order of score1, and the protein node with the highest score1 is added to Fish(k) and also to the set Add(k). The foraging behavior is repeated Tn times, adding Tn protein nodes to the initial artificial fish;

(6) Tail-chasing behavior:

After the foraging behavior, Score2(k) = fitness2(Add(k)) is computed for each artificial fish to determine which fish is in the optimal state. All artificial fish are sorted in descending order of Score2; the fish with the highest Score2 is the optimal artificial fish Fish(p), p ∈ [1, N], and the protein nodes in the corresponding set Add(p) are added to the set Candidate;

(7) Swarming behavior:

Apart from the set Add(p) of the optimal artificial fish Fish(p), each node v_i in the sets Add(k) of the remaining artificial fish Fish(k), k ≠ p, is scored by Score3(i) = fitness3(v_i). All such v_i are sorted in descending order of Score3; with δ as the crowding factor, the top δ protein nodes are added to the set Candidate;

(8) Produce key proteins:

Output the protein nodes in the set Candidate obtained in step (7) as the key proteins.
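The control flow of steps (4) to (8) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the three fitness functions are passed in as callables because formulas (6) to (8) are defined separately; keeping the neighbor sets of different fish distinct is implemented here by excluding any protein already owned by some fish; ties are broken alphabetically; and the function name, the `seed` parameter, and the toy fitness functions in the usage part are all hypothetical.

```python
import random

def afsa_key_proteins(adj, initial_fish, fitness1, fitness2, fitness3,
                      Tn, Cn, seed=0):
    """Sketch of steps (4)-(8); adj maps each protein to its set of neighbours."""
    rng = random.Random(seed)
    fish = [set(f) for f in initial_fish]     # Fish(k)
    if not fish:
        return []
    added = [[] for _ in fish]                # Add(k), in insertion order
    taken = set().union(*fish)                # proteins already owned by a fish

    # (5) Foraging: each fish absorbs its best-scoring neighbour, Tn times.
    for _ in range(Tn):
        for k, members in enumerate(fish):
            neighbor = {u for v in members for u in adj.get(v, ())} - taken
            if not neighbor:
                # Random behaviour: fall back to a random unclaimed protein.
                pool = sorted(set(adj) - taken)
                if not pool:
                    continue
                neighbor = {rng.choice(pool)}
            best = max(sorted(neighbor), key=lambda v: fitness1(v, members))
            members.add(best)
            added[k].append(best)
            taken.add(best)

    # (6) Tail-chasing: the fish whose additions score highest is optimal.
    p = max(range(len(fish)), key=lambda k: fitness2(added[k]))
    candidate = list(added[p])

    # (7) Swarming: top delta = Cn - Tn nodes from the remaining fish.
    rest = [v for k, add in enumerate(added) if k != p for v in add]
    rest.sort(key=fitness3, reverse=True)
    candidate.extend(rest[:max(Cn - Tn, 0)])

    # (8) Output the candidate set as the predicted key proteins.
    return candidate

# Toy network and placeholder fitness functions (all hypothetical).
adj = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"},
    "D": {"B", "E"}, "E": {"D", "F"}, "F": {"E"},
}
result = afsa_key_proteins(
    adj, initial_fish=[{"A"}, {"E"}],
    fitness1=lambda v, group: len(adj[v] & set(group)),
    fitness2=lambda add: sum(len(adj[v]) for v in add),
    fitness3=lambda v: len(adj[v]),
    Tn=1, Cn=2)
```

With δ = Cn - Tn, the optimal fish contributes its Tn added nodes and the other fish together contribute δ more, so the output holds at most Cn candidates.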

Further, the gene expression activity threshold Active_Th(i) is obtained from formula (1):

Active_Th(i) = μ(i) + 3σ(i)(1 - F(i))    (1)

In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and F(i) = 1/(1 + σ²(i)) is a weight function.
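Formula (1) and the co-expression filter of step (2) can be sketched as below. This is an illustrative sketch: the text does not say whether σ(i) is the population or the sample standard deviation, so the population form is assumed here, and the function names and toy expression values are hypothetical.

```python
from statistics import mean, pstdev

def active_threshold(expr):
    """Formula (1): mu + 3*sigma*(1 - F), with F = 1 / (1 + sigma**2)."""
    mu = mean(expr)
    sigma = pstdev(expr)  # population standard deviation (an assumption)
    f = 1.0 / (1.0 + sigma ** 2)
    return mu + 3.0 * sigma * (1.0 - f)

def purify_network(edges, expression):
    """Keep an edge only if both proteins are active at a shared time point."""
    active = {}
    for protein, expr in expression.items():
        th = active_threshold(expr)
        active[protein] = {t for t, value in enumerate(expr) if value > th}
    # Proteins without expression data are never active, so their edges drop.
    return [(u, v) for u, v in edges
            if active.get(u, set()) & active.get(v, set())]

# Toy data: A and B peak at the same time point, C peaks elsewhere,
# and D has no expression profile.
expression = {
    "A": [1.0, 1.0, 1.0, 1.4],
    "B": [2.0, 2.0, 2.0, 2.4],
    "C": [1.4, 1.0, 1.0, 1.0],
}
edges = [("A", "B"), ("A", "C"), ("B", "D")]
purified = purify_network(edges, expression)
```

The weight F(i) approaches 1 when the variance is small, so a weakly fluctuating gene gets a threshold close to its mean, while a strongly fluctuating gene gets a threshold close to μ + 3σ.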

Further, in step (3), the edge clustering coefficient is calculated according to formula (2):

In formula (2), N_i and N_j denote the neighbor node sets of nodes v_i and v_j, respectively;

The Pearson correlation coefficient of an edge is calculated according to formula (3):

In formula (3), Ep_it and Ep_jt denote the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of v_i and v_j, and T is the largest time point index;
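The body of formula (3) does not appear in this text, so the following sketch uses the standard textbook definition of the Pearson correlation coefficient, which is consistent with the symbols defined above (expression values over T time points and their means). Returning 0.0 for a constant profile is an assumption made here to avoid division by zero, and the function name is illustrative.

```python
from math import sqrt

def pearson(x, y):
    """Standard Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    T = len(x)
    mu_x, mu_y = sum(x) / T, sum(y) / T
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    sx = sqrt(sum((a - mu_x) ** 2 for a in x))
    sy = sqrt(sum((b - mu_y) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # constant profile: correlation undefined, treated as 0
    return cov / (sx * sy)

pcc_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
pcc_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The result lies in [-1, 1]; two proteins whose expression rises and falls together score near 1, which is the co-expression signal attached to an edge.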

The GO functional similarity of an edge is calculated according to formula (4):

In formula (4), GO_i and GO_j denote the GO terms annotating node v_i and node v_j, respectively;

The degree of node v_i inside protein complexes is calculated according to formula (5):

In formula (5), V(|C|) denotes the set of nodes contained in protein complexes, C_vi denotes a protein complex containing node v_i, D_in(v_i, C_vi) denotes the degree of node v_i within protein complex C_vi, and v_j is a neighbor node of v_i.
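One plausible reading of these symbol definitions is sketched below: for every known complex that contains v_i, count the neighbors of v_i that are also members of that complex (D_in), and sum the counts over all such complexes. Since the body of formula (5) does not appear in this text, the summation over complexes is an assumption, and the function name and toy data are hypothetical.

```python
def degree_in_complexes(node, adj, complexes):
    """Sum, over complexes containing `node`, of its degree inside each complex."""
    total = 0
    for members in complexes:
        if node in members:
            # D_in(v_i, C): neighbours of v_i that also belong to complex C.
            total += len(adj.get(node, set()) & members)
    return total

# Toy data: A belongs to two complexes, F to none.
adj = {"A": {"B", "C", "D", "F"}, "B": {"A"}, "C": {"A"},
       "D": {"A"}, "E": set(), "F": {"A"}}
complexes = [{"A", "B", "C"}, {"A", "D", "E"}, {"X", "Y"}]
dic_a = degree_in_complexes("A", adj, complexes)
dic_f = degree_in_complexes("F", adj, complexes)
```

Under this reading, a protein outside every known complex scores 0 regardless of its degree in the whole network.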

Further, in step (5), the likelihood fitness1 of adding node v_i in the set Neighbor(k) to the artificial fish Fish(k) is obtained from formula (6):

In formula (6), v_j is a protein node in the artificial fish Fish(k), ECC is the clustering coefficient of the edge between v_i and v_j, PCC is the Pearson correlation coefficient of that edge, and GO_sim is the functional similarity between v_i and v_j.

Further, in step (5), if no suitable protein node can be added to the artificial fish during the foraging behavior, the random behavior is executed: a protein node is selected at random and added to the neighbor protein node set Neighbor(k).

Further, in step (6), the likelihood fitness2 that an artificial fish is in the optimal state is obtained from formula (7):

In formula (7), Add(k) denotes the set of protein nodes added to the k-th artificial fish over Tn rounds of foraging behavior.

Further, in step (7), the score fitness3 of node v_i in the sets Add(k), k ≠ p, is obtained from formula (8):

W(v_i, v_j) = ECC(v_i, v_j) × (PCC(v_i, v_j) + GO_sim(v_i, v_j))    (9)

In formula (8), a and b are coefficients satisfying a + b = 1, Nei(v_i) denotes the set of neighbor nodes of v_i, and DIC(v_i) denotes the degree of v_i inside protein complexes.
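Formula (9) translates directly into code. The body of formula (8) does not appear in this text, so the combination used in `node_score` below (a times the summed edge weights over Nei(v_i), plus b times DIC(v_i)) is only an assumed form suggested by the symbol definitions; the function names, the default a = b = 0.5, and the toy values are hypothetical.

```python
def edge_weight(ecc, pcc, go_sim):
    """Formula (9): W = ECC * (PCC + GO_sim)."""
    return ecc * (pcc + go_sim)

def node_score(node, neighbors, edge_features, dic, a=0.5, b=0.5):
    """ASSUMED form of the node score: a * sum of W over Nei(v_i) + b * DIC(v_i)."""
    if abs(a + b - 1.0) > 1e-9:
        raise ValueError("coefficients must satisfy a + b = 1")
    weighted = sum(edge_weight(*edge_features[frozenset((node, u))])
                   for u in neighbors[node])
    return a * weighted + b * dic[node]

# Toy values: (ECC, PCC, GO_sim) per undirected edge, DIC per node.
neighbors = {"A": {"B", "C"}}
edge_features = {
    frozenset(("A", "B")): (0.5, 0.2, 0.6),
    frozenset(("A", "C")): (1.0, -0.1, 0.3),
}
dic = {"A": 3}
w_ab = edge_weight(0.5, 0.2, 0.6)
score_a = node_score("A", neighbors, edge_features, dic)
```

Because ECC multiplies the sum, an edge that sits in no triangle contributes nothing however well the two proteins correlate, so topology acts as a gate on the expression and GO evidence.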

Further, in step (7), δ = Cn - Tn.

Compared with existing methods, the invention has the following advantages:

1. The invention selects a subset of known key proteins as prior knowledge. Because key proteins tend to be connected to one another, the foraging behavior of the artificial fish searches the neighbor nodes of the proteins that make up each fish to predict key proteins, fully taking into account the topological properties of key proteins in the network.

2. When the artificial fish score protein nodes during the swarming behavior, the invention uses the edge clustering coefficient (ECC), the Pearson correlation coefficient (PCC), and the GO functional similarity (GO_sim), which together capture how tightly two interacting proteins are connected, how similar their gene expression is, and how related their functions are. It also uses the degree of a protein's participation inside complexes (DIC), which reflects the relationship between protein criticality and complexes. Fusing these properties makes key protein identification more precise.

3. The invention simulates the process by which an artificial fish swarm forages or seeks companions to identify key proteins, and it constructs a reliable protein interaction network. It jointly considers the topological properties of the protein interaction network, gene expression values, GO semantic similarity, protein complex information, and prior knowledge, and adds the optimization mechanism of the artificial fish swarm. Using these multiple features makes the key proteins identified by the invention more accurate than those identified by other current methods.

4. The method can accurately identify key nodes. Simulation results show good performance on indicators such as sensitivity, specificity, positive predictive value, and negative predictive value. Compared with other key protein identification methods, it combines the optimization characteristics of the artificial fish swarm with the topological features of the node interaction network to identify key nodes, improving identification accuracy.

5. The invention can effectively identify key proteins from a protein interaction network. This not only helps in understanding cell growth regulation and the mechanisms of life activities, but is also of great theoretical value for precise drug development and for the diagnosis and treatment of disease.
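The four indicators named in advantage 4 have standard definitions in terms of true and false positives and negatives. The sketch below computes them for a predicted set against a reference set of known key proteins; the function name and the ten-protein toy data are illustrative assumptions and do not reproduce any result of the patent.

```python
def classification_metrics(predicted, known_key, all_proteins):
    """Sensitivity, specificity, PPV, and NPV of a key-protein prediction."""
    predicted, known_key = set(predicted), set(known_key)
    all_proteins = set(all_proteins)
    tp = len(predicted & known_key)                 # key, predicted key
    fp = len(predicted - known_key)                 # non-key, predicted key
    fn = len(known_key - predicted)                 # key, missed
    tn = len(all_proteins - predicted - known_key)  # non-key, not predicted

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Toy data: 10 proteins, 4 known key proteins, 3 predictions.
metrics = classification_metrics(
    predicted={"A", "B", "E"},
    known_key={"A", "B", "C", "D"},
    all_proteins=set("ABCDEFGHIJ"),
)
```

Here two of the three predictions are correct (PPV 2/3) and two of the four known key proteins are recovered (sensitivity 1/2).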

Brief Description of the Drawings

Fig. 1 is the flow chart of the method of Embodiment 1 of the invention.

Fig. 2 is a partial schematic diagram of the key proteins obtained in Embodiment 1 within the whole protein interaction network.

Fig. 3 shows the key proteins in the standard library corresponding to Fig. 2.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

As shown in Fig. 1, the method of the invention for identifying key proteins based on an artificial fish swarm optimization algorithm includes the following steps:

(1) Transform the protein interaction network into an undirected graph

(2) Construct a purified protein interaction network

At time point t, if the gene expression value Ep_it of node v_i is greater than the gene expression activity threshold Active_Th(i), node v_i is considered active at time point t; otherwise it is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, they are considered co-expressed at time point t. The edges of the undirected graph that correspond to protein interactions not co-expressed at any time point are deleted, yielding a new protein interaction network, namely the purified protein interaction network;

Ep_it is the gene expression value of node v_i at time point t;

The gene expression activity threshold Active_Th(i) is obtained from formula (1):

Active_Th(i) = μ(i) + 3σ(i)(1 - F(i))    (1)

In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and F(i) = 1/(1 + σ²(i)) is a weight function;

(3) Process the edges and nodes of the purified protein interaction network

The edge clustering coefficient is calculated according to formula (2):

The Pearson correlation coefficient of an edge is calculated according to formula (3):

In formula (3), Ep_it and Ep_jt denote the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of v_i and v_j, and T is the largest time point index;

The GO functional similarity of an edge is calculated according to formula (4):

In formula (4), GO_i and GO_j denote the GO terms annotating node v_i and node v_j, respectively.

Preprocessing of node v_i: the degree of node v_i inside protein complexes is calculated according to formula (5):

In formula (5), V(|C|) denotes the set of protein nodes contained in protein complexes, C_vi denotes a protein complex containing node v_i, D_in(v_i, C_vi) denotes the degree of node v_i within protein complex C_vi, and v_j is a neighbor node of v_i;

(4) Select known key proteins as the initial artificial fish

Let N be the artificial fish population size and m the number of known key proteins contained in each artificial fish. From the standard library (the currently known key proteins), m are selected at random to form one artificial fish carrying prior knowledge. Fish(k) denotes the set of known key proteins contained in the k-th initial artificial fish, k = 1, 2, …, N; Cn is the number of candidate key proteins;

(5) Foraging behavior

The artificial fish searches for food within its visual range, that is, it searches for proteins that interact directly with the proteins in the fish. All neighbor proteins of the proteins in each artificial fish are found to form the neighbor protein node set Neighbor(k), where the nodes in Neighbor(k) and Neighbor(l) are mutually distinct (k = 1, 2, …, N, l = 1, 2, …, N, k ≠ l). For each node v_i in Neighbor(k), the likelihood of merging it into the artificial fish Fish(k) is determined by score1(i) = fitness1(v_i, Fish(k)). The nodes in Neighbor(k) are sorted in descending order of score1, and the node with the highest score1 is added to Fish(k) and also to the set Add(k);

Random behavior: if no suitable protein can be added to the artificial fish during the foraging behavior, the random behavior is executed, and a protein node is selected at random and added to the set Neighbor(k);

The foraging behavior is repeated Tn times, that is, Tn protein nodes are added to the initial artificial fish;

(6) Tail-chasing behavior

(7) Swarming behavior

Apart from the set Add(p) of the optimal artificial fish Fish(p), each node v_i in the sets Add(k) (k ≠ p) of the remaining artificial fish Fish(k) (k ≠ p) is scored by Score3(i) = fitness3(v_i). All such v_i are sorted in descending order of Score3; with δ as the crowding factor, δ = Cn - Tn, the top δ protein nodes are added to the set Candidate;

(8) Produce key proteins

Output the proteins in the set Candidate as the key proteins.

In step (5) of the invention, the likelihood fitness1 of adding node v_i in the set Neighbor(k) to the artificial fish Fish(k) is obtained from formula (6):

In formula (6), v_j is a node in the artificial fish Fish(k); ECC is the clustering coefficient of the edge between v_i and v_j, obtained from formula (2); PCC is the Pearson correlation coefficient of that edge, obtained from formula (3); and GO_sim is the functional similarity between v_i and v_j, obtained from formula (4).

In step (6) of the invention, the likelihood fitness2 that an artificial fish is in the optimal state is obtained from formula (7):

In formula (7), Add(k) denotes the set of protein nodes added to the k-th artificial fish over Tn rounds of foraging behavior, and fitness1(v_i, Fish(k)) is given by formula (6).

In step (7) of the invention, the score fitness3 of a protein node in the sets Add(k) (k ≠ p) is obtained from formula (8):

In formula (8), a and b are coefficients satisfying a + b = 1, Nei(v_i) denotes the set of neighbor nodes of v_i, and DIC(v_i) denotes the degree of v_i inside complexes, obtained from formula (5).

The invention is described in further detail below through a specific embodiment:

Embodiment 1

Taking a protein network as an example, the steps of the method for identifying key proteins based on an artificial fish swarm optimization algorithm are as follows:

This embodiment uses the yeast dataset from the DIP database (version 20160114) as the simulation dataset; the DIP data contain 5028 proteins and 22303 interactions. The gene expression data come from the yeast metabolic expression dataset GSE3431 in the GEO database, which includes 9336 genes with expression values at 36 time points over 3 cycles and covers 95% of the proteins in DIP. The GO data comprise the annotation profile and SGD. Known protein complex information comes from CYC2008, which includes 408 protein complexes covering 1492 proteins. The key protein data were obtained by integrating the MIPS, SGD, DEG, and SGDP databases and contain 1285 key proteins in total; of the 5028 proteins in the network, 1152 are key proteins and the rest are treated as non-key proteins. The experimental platform was the Windows 10 operating system with an Intel Core i5-6600 dual-core 3.31 GHz processor and 8 GB of physical memory, and the method was implemented in Matlab R2014a.

1. Transform the protein interaction network into an undirected graph

The protein interaction network containing 5028 proteins and 22303 interactions is transformed into an undirected graph G = (V, E), where V = {v_i, i = 1, 2, …, 5028} is the set of nodes v_i and E is the set of edges e; each node v_i represents a protein, and each edge e represents an interaction between proteins.

2. Construct the purified protein interaction network

At time point t, node v_i is considered active if its gene expression value Ep_it is greater than the gene expression activity threshold Active_Th(i); otherwise the node is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, they are considered co-expressed at time point t. The gene expression activity threshold Active_Th(i) is given by formula (1):

$$\mathrm{Active\_Th}(i) = \mu(i) + 3\sigma(i)\,(1 - F(i)) \tag{1}$$

In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and F(i) = 1/(1 + σ(i)²) is a weight function. After this processing, the protein interactions that are not co-expressed at any time point are deleted from the original protein interaction network, forming a new protein interaction network with 5028 protein nodes and 9576 edges, i.e., the purified protein interaction network.
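The purification step can be sketched as follows, using the threshold Active_Th(i) = μ(i) + 3σ(i)(1 − F(i)) with F(i) = 1/(1 + σ(i)²). This is a minimal sketch, not the Matlab implementation of the embodiment: the function names and toy expression values are hypothetical, the population standard deviation is an assumption, and dropping edges whose proteins lack expression data is also an assumption.

```python
import statistics


def active_threshold(expression):
    """Active_Th = mu + 3 * sigma * (1 - F), with F = 1 / (1 + sigma^2)."""
    mu = statistics.fmean(expression)
    sigma = statistics.pstdev(expression)  # assumption: population std deviation
    weight = 1.0 / (1.0 + sigma ** 2)
    return mu + 3.0 * sigma * (1.0 - weight)


def active_time_points(expression):
    """Indices of the time points at which the expression exceeds the threshold."""
    threshold = active_threshold(expression)
    return {t for t, value in enumerate(expression) if value > threshold}


def purify(edges, expression_by_protein):
    """Keep only edges whose endpoints are co-expressed at some time point."""
    active = {p: active_time_points(e) for p, e in expression_by_protein.items()}
    # Assumption: a protein without expression data is treated as never active.
    return [(u, v) for u, v in edges
            if active.get(u, set()) & active.get(v, set())]


# Toy data: P1 and P2 peak at the same time point, P3 peaks elsewhere.
expression = {
    "P1": [0.1, 0.1, 0.1, 0.5],
    "P2": [0.2, 0.2, 0.2, 0.6],
    "P3": [0.5, 0.1, 0.1, 0.1],
}
kept = purify([("P1", "P2"), ("P2", "P3")], expression)
```

In the toy data only the edge between `P1` and `P2` survives, because `P3` is never active at the same time point as `P2`.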

3. Process the edges and nodes of the purified protein interaction network

Edge preprocessing: the clustering coefficient ECC of an edge is calculated according to formula (2), where N_i and N_j denote the numbers of neighbor nodes of v_i and v_j, and d_i and d_j are the degrees of v_i and v_j.

The Pearson correlation coefficient PCC of an edge is calculated according to formula (3), where Ep_it and Ep_jt are the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of v_i and v_j, and T is the maximum time point.

The GO functional similarity of an edge is calculated according to formula (4), where GO_i and GO_j denote the GO terms annotating protein v_i and protein v_j, respectively.
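For the Pearson correlation coefficient of an edge, a minimal Python sketch is given below. It uses the textbook definition over the T expression values of the two endpoint proteins, which is assumed to match formula (3) up to the form of its normalization. The function name is hypothetical, and returning 0.0 for a constant expression profile is an assumption of this sketch.

```python
import math


def pearson(x, y):
    """Textbook Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # assumption: a flat profile carries no correlation signal
    return covariance / (spread_x * spread_y)


same_trend = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Two proteins whose expression rises and falls together score close to 1, and proteins with opposite trends score close to −1.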

Node preprocessing: for each i = 1, 2, …, 5028, the participation of node v_i inside protein complexes can be calculated. The degree of node v_i inside protein complexes is calculated according to formula (5), where V(|C|) denotes the set of protein nodes contained in protein complexes, C_vi denotes a protein complex containing protein v_i, D_in(v_i, C_vi) denotes the degree of protein v_i within protein complex C_vi, and v_j is a neighbor node of v_i.

4. Select known key proteins to form the initial artificial fish

Let N be the size of the artificial fish population and m the number of known key proteins contained in each artificial fish. For each artificial fish, 100 known key proteins are randomly selected from the 1152 key proteins in the standard library to form an artificial fish that carries prior knowledge; Fish(k) denotes the set of proteins contained in the k-th initial artificial fish. In this embodiment, N = 100 and m = 100; Cn is the number of candidate key proteins.
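The initialization of the population can be sketched in Python as follows. The function name, its parameters and the placeholder identifiers `K0`, `K1`, … are hypothetical; the defaults N = 100 and m = 100 follow this embodiment, and sampling without replacement within one fish is an assumption.

```python
import random


def init_fish(known_key_proteins, population_size=100, proteins_per_fish=100,
              seed=None):
    """Build N initial fish, each a random set of m known key proteins."""
    rng = random.Random(seed)
    pool = sorted(known_key_proteins)  # sorted so a given seed is reproducible
    if proteins_per_fish > len(pool):
        raise ValueError("not enough known key proteins for one fish")
    # Assumption: proteins are drawn without replacement within a fish,
    # while different fish may share proteins.
    return [set(rng.sample(pool, proteins_per_fish))
            for _ in range(population_size)]


# Placeholder standard library of 1152 key proteins.
standard_library = {f"K{i}" for i in range(1152)}
population = init_fish(standard_library, seed=0)
```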

5. Foraging behavior

An artificial fish searches for food within its visual range, i.e., it searches for proteins that interact directly with the proteins in the fish. For each artificial fish, all neighbor proteins of its proteins are collected into the set Neighbor(k), and the proteins in Neighbor(k) and Neighbor(l) are mutually different (k = 1, 2, …, 100; l = 1, 2, …, 100; k ≠ l). For each protein v_i in Neighbor(k), score1(i) = fitness1(v_i, Fish(k)) determines how likely the protein is to be merged into the artificial fish Fish(k). The nodes in Neighbor(k) are sorted in descending order of score1, and the node with the highest score1 is added to Fish(k) and, at the same time, to the set Add(k). Here score1(i) is the closeness between protein v_i and all proteins in the artificial fish, obtained from formula (6), where v_j is a protein node in Fish(k); ECC is the clustering coefficient of the edge between v_i and v_j, obtained from formula (2); PCC is the Pearson correlation coefficient of that edge, obtained from formula (3); and GO_sim is the functional similarity between v_i and v_j, obtained from formula (4).

If no suitable protein is added to the artificial fish during the foraging behavior, the random behavior is performed: a protein node is selected at random and added to the set Neighbor(k). The foraging behavior (or random behavior) is repeated Tn times, so that Tn protein nodes are added to the initial artificial fish.
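The foraging loop for a single fish can be sketched as below. This is a simplified illustration under stated assumptions: `fitness1` is passed in as a callable and the toy `shared_links` scorer merely counts links into the fish, so it is a hypothetical stand-in for formula (6); "no suitable protein" is interpreted as an empty neighbor set; and the requirement that the neighbor sets of different fish be mutually different is not modeled.

```python
import random


def forage(fish, adjacency, fitness1, steps, rng=None):
    """Grow one fish by `steps` proteins; return (grown fish, added proteins)."""
    rng = rng or random.Random()
    fish = set(fish)
    added = []
    for _ in range(steps):
        # Neighbor(k): proteins adjacent to the fish but not yet part of it.
        neighbors = {n for p in fish for n in adjacency.get(p, ())} - fish
        if not neighbors:
            # Random behavior (assumption: triggered by an empty neighbor set).
            outside = sorted(set(adjacency) - fish)
            if not outside:
                break
            neighbors = {rng.choice(outside)}
        # Add the neighbor with the highest score1; ties resolve by name order.
        best = max(sorted(neighbors), key=lambda v: fitness1(v, fish))
        fish.add(best)
        added.append(best)
    return fish, added


adjacency = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}


def shared_links(v, fish):
    """Hypothetical stand-in for formula (6): number of links from v into the fish."""
    return len(adjacency.get(v, set()) & fish)


grown, added = forage({"A", "B"}, adjacency, shared_links, steps=2)
```

In the toy graph, `C` is added first because it links to both proteins in the fish, and `D` follows in the second round.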

6. Tail-chasing behavior

After the foraging behavior (or random behavior) has been executed, Score2(k) = fitness2(Add(k)) is computed for each artificial fish to determine which fish is in the optimal state. All artificial fish are sorted in descending order of Score2; the fish with the highest Score2 is the optimal artificial fish Fish(p) (p ∈ [1, 100]), and the protein nodes in its set Add(p) are added to the set Candidate. fitness2 is the fitness of each artificial fish after proteins have been added and is obtained from formula (7).

7. Swarming behavior

Apart from the set Add(p) of the optimal artificial fish Fish(p), each protein node v_i in the sets Add(k) (k ≠ p) of the remaining artificial fish Fish(k) (k ≠ p) is scored by Score3(i) = fitness3(v_i). All v_i are sorted in descending order of Score3. With the crowding factor δ = Cn − Tn, the top δ protein nodes are added to the set Candidate. fitness3 is the score of a protein node in Add(k) (k ≠ p) and is obtained from formula (8), where a and b are coefficients (a = 0.8, b = 0.2), Nei(v_i) denotes the set of neighbor nodes of v_i, and DIC(v_i) is the degree of node v_i inside protein complexes, obtained from formula (5).
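The tail-chasing and swarming steps together fill the set Candidate, which can be sketched as follows. The callables `fitness2` and `fitness3` are hypothetical stand-ins for formulas (7) and (8), the toy scores are invented for illustration, and removing proteins that are already in Candidate before ranking is an assumption of this sketch.

```python
def tail_chase_and_swarm(added_sets, fitness2, fitness3, candidate_count, steps):
    """Return (index of the optimal fish, ordered list of candidate proteins)."""
    # Tail chasing: the fish whose added proteins score highest is optimal.
    best_index = max(range(len(added_sets)),
                     key=lambda k: fitness2(added_sets[k]))
    candidates = list(added_sets[best_index])

    # Swarming: rank the proteins added by all other fish and keep the top delta.
    delta = candidate_count - steps  # crowding factor, delta = Cn - Tn
    pool = set()
    for k, added in enumerate(added_sets):
        if k != best_index:
            pool.update(added)
    pool -= set(candidates)  # assumption: avoid duplicate candidates
    ranked = sorted(pool, key=lambda v: (-fitness3(v), v))
    candidates.extend(ranked[:max(delta, 0)])
    return best_index, candidates


# Toy example: three fish, each of which added Tn = 2 proteins.
toy_scores = {"a": 1.0, "b": 1.0, "c": 5.0, "d": 4.0, "e": 3.0, "f": 0.5}
toy_added = [["a", "b"], ["c", "d"], ["e", "f"]]
best, candidate_list = tail_chase_and_swarm(
    toy_added,
    fitness2=lambda added: sum(toy_scores[v] for v in added),
    fitness3=lambda v: toy_scores[v],
    candidate_count=4,
    steps=2,
)
```

With Cn = 4 and Tn = 2, the optimal fish contributes its two proteins and the swarming step contributes the two best-scoring proteins from the other fish.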

8. Produce key proteins

The protein nodes in the set Candidate obtained in step 7 are output as key proteins.

To verify the effectiveness of the invention, the inventors applied the method of Embodiment 1 to the protein network from the DIP database and analyzed the correctly identified key proteins for candidate key protein counts (Cn) of 100, 200, 300, 400, 500 and 600. In this experiment, 100 known key proteins were taken as prior knowledge for each artificial fish. Because the known key proteins used as prior knowledge are selected at random, the experiment was run 50 times and the mean of the 50 results was taken as the final result. The results are shown in Table 1, Figure 2 and Figure 3. Table 1 compares the identification accuracy of the invention with that of other current methods for identifying key proteins. Figure 2 shows the distribution in the network of some of the key proteins identified by the invention, and Figure 3 shows the corresponding part of the standard library.

Table 1. Comparison of the accuracy of the present invention and other methods in identifying key proteins

Table 1 shows the identification accuracy obtained when the top 100, 200, 300, 400, 500 and 600 proteins identified by the invention are taken as candidate key proteins and compared with the key proteins in the standard library, together with the results of other current methods for identifying key proteins. When the top 600 key proteins are identified, the invention shows higher prediction accuracy than the other 8 key protein identification methods. Table 1 shows that the invention identifies key proteins effectively: for every number of candidate key proteins from 100 to 600, the invention has the highest identification accuracy. Figure 2 shows the positions in the protein interaction network of some of the key proteins identified by the invention. In Figure 2, proteins with a dark background are key proteins correctly identified by the invention, proteins with a light background were incorrectly identified as key proteins, and proteins with a white background are non-key proteins. Figure 3 shows the key proteins of the standard library that correspond to Figure 2. Comparing Figure 2 with Figure 3 shows that the proteins wrongly identified by the invention are "YDR283C" and "YPL246C", and the key protein that was missed is "YBR152W". If some known key proteins are used as prior knowledge, the method of the invention can correctly identify most of the key proteins around that prior knowledge.

The method of the invention for identifying key proteins based on the artificial fish swarm optimization algorithm transforms the protein interaction network into an undirected graph, constructs a purified protein interaction network, obtains the ribonucleic acid gene expression values corresponding to the proteins, the GO annotation information and the degree of each protein inside known complexes, processes the edges and nodes of the purified protein interaction network, selects known key proteins as the initial artificial fish, has the artificial fish perform the foraging, random, tail-chasing and swarming behaviors, and produces the key proteins. The method can accurately identify key proteins. Simulation results show good performance on sensitivity, specificity, positive predictive value and negative predictive value. Compared with other key protein identification methods, the method combines the optimization characteristics of the artificial fish swarm with the topological features of the protein interaction network to identify key proteins, which improves identification accuracy.
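The four indicators named above can be computed from a predicted set and the standard library using their standard definitions, as in the following minimal sketch. The function name and the toy protein sets are hypothetical; whether the identification accuracy reported in Table 1 is computed in exactly this way is not stated in the description.

```python
def confusion_metrics(predicted, known_key, all_proteins):
    """Sensitivity, specificity, PPV and NPV of a predicted key protein set."""
    predicted = set(predicted)
    known_key = set(known_key)
    all_proteins = set(all_proteins)
    true_pos = len(predicted & known_key)    # predicted and truly key
    false_pos = len(predicted - known_key)   # predicted but not key
    false_neg = len(known_key - predicted)   # key but not predicted
    true_neg = len(all_proteins - predicted - known_key)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(true_pos, true_pos + false_neg),
        "specificity": ratio(true_neg, true_neg + false_pos),
        "ppv": ratio(true_pos, true_pos + false_pos),
        "npv": ratio(true_neg, true_neg + false_neg),
    }


# Toy example: 10 proteins, 4 of them key, 3 predicted.
universe = {f"p{i}" for i in range(1, 11)}
metrics = confusion_metrics({"p1", "p2", "p5"}, {"p1", "p2", "p3", "p4"}, universe)
```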

The above is a preferred embodiment of the invention. On the basis of this description, those skilled in the art can make various improvements and substitutions without departing from the technical principles of the invention, and such improvements and substitutions shall also fall within the scope of protection of the invention.

Claims

1. A method for identifying key proteins based on an artificial fish swarm optimization algorithm, characterized by comprising the following steps:

(1) Transform the protein interaction network into an undirected graph:

The protein interaction network is transformed into an undirected graph G = (V, E), where V = {v_i, i = 1, 2, …, n} is the set of nodes v_i, E is the set of edges e, a node v_i represents a protein, and an edge e represents an interaction between proteins;

(2) Construct the purified protein interaction network:

At time point t, node v_i is considered active if its gene expression value Ep_it is greater than the gene expression activity threshold Active_Th(i); otherwise the node is considered inactive at time point t. If two different nodes v and u in V are both active at time point t, the nodes v and u are considered co-expressed at time point t. All edges of the undirected graph that correspond to protein interactions not co-expressed at any time point are deleted, constructing a purified protein interaction network;

(3) Process the edges and nodes of the purified protein interaction network: calculate the clustering coefficient ECC of each edge, the Pearson correlation coefficient PCC of each edge, the GO functional similarity of each edge, and the degree of each node inside protein complexes;

(4) Select known key proteins to form the initial artificial fish:

Let N be the size of the artificial fish population and m the number of known key proteins contained in each artificial fish; m known key proteins are randomly selected from the currently known key proteins to form an artificial fish that carries prior knowledge; Fish(k) denotes the set of known key proteins contained in the k-th initial artificial fish, k = 1, 2, …, N; Cn is the number of candidate key proteins;

(5) Foraging behavior:

All neighbor proteins of the proteins in each artificial fish are found, forming the neighbor protein node set Neighbor(k), and the proteins in Neighbor(k) and Neighbor(l) are mutually different, k = 1, 2, …, N, l = 1, 2, …, N, k ≠ l; for each node v_i in Neighbor(k), score1(i) = fitness1(v_i, Fish(k)) determines how likely the node is to be merged into the artificial fish Fish(k); the nodes in Neighbor(k) are sorted in descending order of score1, and the protein node with the highest score1 is added to Fish(k) and, at the same time, to the set Add(k); the foraging behavior is repeated Tn times, so that Tn protein nodes are added to the initial artificial fish;

(6) Tail-chasing behavior:

After the foraging behavior has been executed, Score2(k) = fitness2(Add(k)) is computed for each artificial fish to determine which fish is in the optimal state; all artificial fish are sorted in descending order of Score2, the fish with the highest Score2 is the optimal artificial fish Fish(p), p ∈ [1, N], and the protein nodes in the set Add(p) of the optimal artificial fish Fish(p) are added to the set Candidate;

(7) Swarming behavior:

Apart from the set Add(p) of the optimal artificial fish Fish(p), each node v_i in the sets Add(k) of the remaining artificial fish Fish(k), where k ≠ p, is scored by Score3(i) = fitness3(v_i); all v_i are sorted in descending order of Score3; with δ as the crowding factor, the top δ protein nodes are added to the set Candidate;

(8) Produce key proteins:

The protein nodes in the set Candidate obtained in step (7) are output as key proteins.

2. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that the gene expression activity threshold Active_Th(i) is obtained by formula (1):

$$\mathrm{Active\_Th}(i) = \mu(i) + 3\sigma(i)\,(1 - F(i)) \tag{1}$$

In formula (1), μ(i) is the mean gene expression value of node v_i, σ(i) is the standard deviation of its gene expression values, and F(i) = 1/(1 + σ(i)²) is a weight function.

3. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that, in step (3):

the clustering coefficient of an edge is calculated by formula (2), where N_i and N_j denote the sets of neighbor nodes of nodes v_i and v_j, respectively;

the Pearson correlation coefficient of an edge is calculated by formula (3), where Ep_it and Ep_jt are the gene expression values of nodes v_i and v_j at time point t, μ(i) and μ(j) are the mean gene expression values of nodes v_i and v_j, and T is the maximum time point;

the GO functional similarity of an edge is calculated by formula (4), where GO_i and GO_j denote the GO terms annotating node v_i and node v_j, respectively;

the degree of node v_i inside protein complexes is calculated by formula (5), where V(|C|) denotes the set of nodes contained in protein complexes, C_vi denotes a protein complex containing node v_i, D_in(v_i, C_vi) denotes the degree of node v_i within protein complex C_vi, and v_j is a neighbor node of v_i.

4. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that, in step (5), the likelihood fitness1 that a node v_i in the set Neighbor(k) is added to the artificial fish Fish(k) is obtained by formula (6), where v_j is a protein node in the artificial fish Fish(k), ECC is the clustering coefficient of the edge between node v_i and node v_j, PCC is the Pearson correlation coefficient of the edge between node v_i and node v_j, and GO_sim is the functional similarity between node v_i and node v_j.

5. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that, in step (5), if no suitable protein node is added to the artificial fish during the foraging behavior, the random behavior is performed: a protein node is selected at random and added to the neighbor protein node set Neighbor(k).

6. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that, in step (6), the fitness fitness2 used to determine whether an artificial fish is in the optimal state is obtained by formula (7), where Add(k) denotes the set of protein nodes added to the k-th artificial fish by Tn rounds of foraging behavior.

7. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that, in step (7), the score fitness3 of a node v_i in the set Add(k), k ≠ p, is obtained by formula (8), with

$$W(v_i, v_j) = \mathrm{ECC}(v_i, v_j) \times \bigl(\mathrm{PCC}(v_i, v_j) + \mathrm{GO\_sim}(v_i, v_j)\bigr) \tag{9}$$

In formula (8), a and b are coefficients satisfying a + b = 1, Nei(v_i) denotes the set of neighbor nodes of node v_i, and DIC(v_i) denotes the degree of node v_i inside protein complexes.

8. The method for identifying key proteins based on an artificial fish swarm optimization algorithm according to claim 1, characterized in that, in step (7), δ = Cn − Tn.