CN109376842B

CN109376842B - Functional module mining method based on attribute optimization protein network

Info

Publication number: CN109376842B
Application number: CN201810946353.9A
Authority: CN
Inventors: 张兴义; 刘振杰; 田野; 程凡
Original assignee: Anhui University
Current assignee: Anhui University
Priority date: 2018-08-20
Filing date: 2018-08-20
Publication date: 2022-04-05
Anticipated expiration: 2038-08-20
Also published as: CN109376842A

Abstract

The invention discloses a functional module mining method based on an attribute-optimized protein network. The method comprises the following steps: S1, extracting protein candidate node pairs; S2, initializing the population and the functional module set of each individual in the population from the extracted protein candidate node pairs, and calculating a fitness value of each individual according to the modularity Q_g and the attribute density SA_g; S3, performing crossover and mutation among population individuals to generate an offspring population; S4, letting each offspring individual inherit the functional module set of its parent individual, adjusting the functional modules of the offspring individual according to the differences between its gene values and those of the parent individual, obtaining the functional module set of each individual in the offspring population, and calculating the fitness value of each individual; S5, performing environmental selection according to the fitness values of the parent population and the offspring population to obtain a new population; and S6, repeating steps S3 to S5 until the maximum number of iterations is reached, and outputting the functional module set of each individual in the Pareto-optimal solution set of the population.

Description

Functional module mining method based on attribute optimization protein network

Technical Field

The invention relates to the technical field of functional module identification, in particular to a functional module mining method based on an attribute optimization protein network.

Background

Thousands of proteins in an organism form protein modules with a wide variety of functions at different stages in time and space. Among the cellular functions of biological interest, the protein functional module, as one of the most basic building blocks, plays a very important role in the binding of the respective gene products. How to mine protein functional modules closely related to biological functions from protein interaction data has therefore become an important breakthrough point for uncovering the relationship between protein interactions and biological functions. Existing schemes use only the structure of the protein network, so for incomplete protein networks the detection results may not be accurate enough. A functional module identification method that optimizes the protein network with attribute information can therefore mine better protein module combinations and offer more protein module combinations to choose from.

Disclosure of Invention

Based on the technical problems in the background art, the invention provides a functional module mining method based on an attribute-optimized protein network, which comprises the following steps:

S1, extracting protein candidate node pairs;

S2, initializing the population and the functional module set of each individual in the population from the extracted protein candidate node pairs, and calculating a fitness value of each individual according to the modularity Q_g and the attribute density SA_g;

S3, performing crossover and mutation among population individuals to generate an offspring population;

S4, letting each offspring individual inherit the functional module set of its parent individual, adjusting the functional modules of the offspring individual according to the differences between its gene values and those of the parent individual, obtaining the functional module set of each individual in the offspring population, and calculating the fitness value of each individual;

S5, performing environmental selection according to the fitness values of the parent population and the offspring population to obtain a new population;

S6, repeating steps S3, S4 and S5 until the maximum number of iterations is reached, and outputting the functional module set of each individual in the Pareto-optimal solution set of the population, where the functional module set of each individual is a protein functional module partition set.

Preferably, step S1 specifically includes:

S11, defining the protein network as G = (V, E, A), where V = {v₁, v₂, …, v_i, …, v_n} denotes the set of all protein nodes in the protein network, v_i represents the i-th protein node, and n is the total number of protein nodes;

S12, calculating the attribute similarity S_uv of any two protein nodes u and v, where A_u and A_v respectively represent the attribute sets of protein node u and protein node v;

S13, placing the protein node pairs, according to their attribute similarity, into 100 buckets of width 0.01 covering the range [0,1], and counting the number Buck_i of protein node pairs in each bucket;

S14, sorting the buckets in descending order of Buck_i, where the first bucket corresponds to a value Value₁ in [0,1] and the second bucket corresponds to a value Value₂ in [0,1], and taking the average T of Value₁ and Value₂ as the attribute similarity threshold;

S15, taking out a node pair (u, v) from the protein node pair set; if S_uv is not less than T, adding the node pair (u, v) to the candidate protein node pair set Nodepair; and removing the node pair (u, v) from the protein node pair set;

S16, repeating step S15 for the remaining protein node pairs to obtain the extracted candidate protein node pair set Nodepair = {P₁, P₂, ..., P_k}, where P_r represents the r-th protein node pair.
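Steps S12 to S16 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the text does not preserve the similarity formula, so a Jaccard index over the attribute sets is assumed, and the "value" of a bucket is assumed to be its lower edge. The function names `jaccard` and `extract_candidate_pairs` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Assumed similarity |A_u & A_v| / |A_u | A_v|; the source omits the formula."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_candidate_pairs(attrs, n_buckets=100):
    """attrs maps each protein node to its attribute set; returns (T, Nodepair)."""
    # S12: attribute similarity of every pair of protein nodes.
    sims = {(u, v): jaccard(attrs[u], attrs[v])
            for u, v in combinations(sorted(attrs), 2)}
    # S13: count node pairs per bucket of width 1/n_buckets over [0, 1].
    counts = [0] * n_buckets
    for s in sims.values():
        counts[min(int(s * n_buckets), n_buckets - 1)] += 1
    # S14: take the two fullest buckets; ties go to the lower bucket index.
    first, second = sorted(range(n_buckets), key=lambda i: (-counts[i], i))[:2]
    # Assumption: a bucket's value is its lower edge, so T averages two edges.
    threshold = (first + second) / (2 * n_buckets)
    # S15-S16: keep the pairs whose similarity is not less than T.
    node_pairs = [pair for pair, s in sims.items() if s >= threshold]
    return threshold, node_pairs
```

With four proteins whose attribute sets are {a, b}, {a, b}, {c} and {a}, the fullest buckets are those starting at 0.00 and 0.50, so T is 0.25 and three of the six pairs survive as candidates. A different reading of the bucket value (for example its midpoint) would shift T slightly.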

Preferably, step S2 specifically includes:

S201, defining the maximum number of iterations as maxgen and the initial iteration count as t = 1, where the number of individuals in the population is pop and the population consists of the pop individuals {X₁, X₂, …, X_g, …, X_pop}, X_g representing the g-th individual;

S202, taking out a protein node pair (u, v) from the protein node pair set, generating a random number R between 0 and 1, and calculating the i-th gene coefficient ζ_i = 0.5 + S_uv avg(S); if R ≤ ζ_i, the value of the i-th gene of the individual is 1, otherwise it is 0, where S_uv represents the attribute similarity of the protein node pair (u, v) and avg(S) represents the average attribute similarity of the candidate protein node pairs;

S203, repeating step S202 for the remaining protein node pairs in the protein node pair set until the set is empty, to obtain the code of the individual X = {g₁, g₂, ..., g_i, ..., g_m};

S204, repeating steps S202 and S203 pop times to obtain the initial population code {X₁, X₂, ..., X_POP};

S205, taking an individual from {X₁, X₂, ..., X_POP} and letting i = 1; if the i-th gene value g_i of the individual is 1, establishing a connecting edge between node u and node v in the protein network G, where the i-th gene of the individual corresponds to the i-th candidate protein node pair (u, v) in the candidate protein node pair set;

S206, repeating step S205 until i > m to obtain a new protein network G_n, where m represents the individual code length;

S207, repeating steps S205 and S206 for the remaining individuals of the population {X₁, X₂, ..., X_POP} to obtain the corresponding protein networks G = {G₁, G₂, ..., G_POP};

S208, selecting a network G_i = {V, E, A} from G = {G₁, G₂, ..., G_POP} and calculating the node priority of each node in the network, where n_i represents the number of edges connecting the neighbor nodes of protein node i, and k represents the number of neighbors of protein node i;

S209, selecting from V the protein node v with the maximum node priority, and calculating the similarity between protein node v and each of its neighbor nodes, where N_r represents the neighbor nodes of protein node r; selecting the neighbor node u with the maximum similarity, forming a functional module C_i from u, v and the common neighbors of u and v, removing u, v and the common neighbors of u and v from V, and recalculating the node priority of the nodes remaining in V;

S210, repeating step S209 until V is empty, to obtain the functional module division of the network;

S211, repeating steps S208, S209 and S210 for the remaining networks in G = {G₁, G₂, ..., G_POP} to obtain pop protein functional module partition sets;

S212, calculating the two objective functions of the g-th individual X_g in the initialized parent population:

modularity Q_g,

where l_k denotes the number of connecting edges in the k-th functional module, d_k represents the total degree of the k-th functional module, and L denotes the total number of edges in the g-th protein network G_g;

attribute density SA_g,

where S(i, j) represents the attribute similarity of protein node i and protein node j, and r_k represents the number of protein nodes in the k-th protein module;

S213, executing step S212 on the pop protein functional module partition sets to obtain the modularity and attribute density of the functional module sets of the parent population.
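The decoding of an individual into a network (S205 to S206) and the modularity objective (S212) can be illustrated with the sketch below. The modularity formula itself is not preserved in the text. The code assumes the standard form Q = Σ_k [ l_k / L − (d_k / (2L))² ], which uses exactly the quantities l_k, d_k and L defined above, but this is an assumption. The attribute density SA_g is left out because its formula cannot be recovered from the text. The function names are hypothetical.

```python
def decode_network(base_edges, node_pairs, genes):
    """S205-S206: add edge (u, v) for every candidate pair whose gene is 1."""
    edges = {frozenset(e) for e in base_edges}
    for (u, v), gene in zip(node_pairs, genes):
        if gene == 1:
            edges.add(frozenset((u, v)))
    return edges


def modularity(edges, modules):
    """Assumed standard form: Q = sum_k [ l_k / L - (d_k / (2L))^2 ]."""
    total_edges = len(edges)  # L
    if total_edges == 0:
        return 0.0
    q = 0.0
    for module in modules:
        members = set(module)
        # l_k: edges with both endpoints inside module k.
        l_k = sum(1 for e in edges if e <= members)
        # d_k: sum of the degrees of the nodes in module k.
        d_k = sum(len(e & members) for e in edges)
        q += l_k / total_edges - (d_k / (2 * total_edges)) ** 2
    return q
```

For two triangles joined by a single decoded edge and partitioned into the two triangles, this gives Q = 5/14 (about 0.357). Edges are stored as `frozenset` pairs so that (u, v) and (v, u) count as one undirected edge.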

Preferably, step S3 specifically includes:

S31, letting t = 1, selecting an individual g and an individual j from the population P by binary tournament, and performing crossover and mutation on individual g and individual j to obtain an offspring individual child;

S32, executing step S31 pop times to obtain the offspring population O = {X₁, X₂, ..., X_POP}.
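A minimal sketch of steps S31 and S32 follows. The text names binary tournament selection but does not specify the crossover or mutation operators, so uniform crossover and bit-flip mutation are assumed here purely for illustration. The comparator `better(a, b)`, which receives two population indices and reports whether the first is preferred, and the mutation rate `p_mut` are likewise hypothetical.

```python
import random


def binary_tournament(population, better, rng):
    """Draw two distinct individuals at random and return the preferred one."""
    a, b = rng.sample(range(len(population)), 2)
    return population[a] if better(a, b) else population[b]


def crossover_mutation(parent_g, parent_j, rng, p_mut=0.05):
    """Assumed operators: uniform crossover followed by bit-flip mutation."""
    child = [x if rng.random() < 0.5 else y for x, y in zip(parent_g, parent_j)]
    return [1 - gene if rng.random() < p_mut else gene for gene in child]


def make_offspring(population, better, rng, p_mut=0.05):
    """S32: run S31 once per population slot to build the offspring population O."""
    offspring = []
    for _ in range(len(population)):
        parent_g = binary_tournament(population, better, rng)
        parent_j = binary_tournament(population, better, rng)
        offspring.append(crossover_mutation(parent_g, parent_j, rng, p_mut))
    return offspring
```

Passing a seeded `random.Random` instance keeps a run reproducible. In a multi-objective setting the comparator would typically rank by non-domination level first, which is one possible choice rather than something the text prescribes.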

Preferably, step S4 specifically includes:

S41, taking an individual X_K from the offspring population O = {X₁, X₂, ..., X_POP}, comparing X_K with its corresponding parent individual, finding the protein node pairs in the candidate protein node pair set that correspond to the gene positions whose values have changed, and extracting the protein nodes in those node pairs to obtain a protein node set V_cg;

S42, for the individual X_K with code X_K = {g₁, g₂, ..., g_i, ..., g_m}, if the i-th gene value g_i of the individual is 1, establishing a connecting edge between node u and node v in the protein network G, where the i-th gene of the individual corresponds to the i-th candidate protein node pair (u, v) in the candidate protein node pair set;

S43, repeating step S42 until i > m to obtain a new protein network G_n, where m represents the individual code length;

S44, extracting from the protein network G_n the subgraph composed of the protein nodes in V_cg;

S45, sorting the protein nodes in V_cg in increasing order of their number of neighbors in the subgraph, selecting the first protein node v, and calculating the modularity change of moving v from its current functional module i to any functional module j; adding protein node v to the module corresponding to the maximum modularity change, and removing it from V_cg, where L represents the total number of edges in the k-th protein network of the offspring population, k_v^r represents the number of neighbors of protein node v in the r-th protein functional module, k_v represents the number of neighbors of protein node v, and K^r represents the total degree of the r-th protein functional module;

S46, executing step S45 |V_cg| times to obtain the protein functional module partition set of individual X_K;

S47, executing steps S41, S42, S43, S44, S45 and S46 pop times to obtain the pop protein functional module partition sets of the offspring population;

S48, calculating the two objective functions of the g-th individual X_g in the offspring population:

modularity Q_g,

attribute density SA_g,

where S(i, j) represents the attribute similarity of protein node i and protein node j, and r_k represents the number of protein nodes in the k-th protein module, thereby obtaining the modularity and attribute density corresponding to the pop individuals of the offspring population;

S49, executing step S48 on the pop protein functional module partition sets to obtain the modularity and attribute density of the functional module sets of the offspring population.
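The bookkeeping in steps S41, S44 and S45 (which nodes to revisit, and in what order) can be sketched as below. The modularity-change formula is not preserved in the text, so the sketch stops before the actual reassignment of nodes to modules. The function names and the tie-break by node label are assumptions made for illustration.

```python
def changed_nodes(parent_genes, child_genes, node_pairs):
    """S41: collect the nodes of candidate pairs whose gene differs from the parent's."""
    v_cg = set()
    for parent_gene, child_gene, (u, v) in zip(parent_genes, child_genes, node_pairs):
        if parent_gene != child_gene:
            v_cg.update((u, v))
    return v_cg


def order_by_subgraph_degree(edges, v_cg):
    """S44-S45: sort V_cg by neighbor count inside the subgraph induced by V_cg."""
    degree = {node: 0 for node in v_cg}
    for u, v in edges:
        if u in v_cg and v in v_cg:
            degree[u] += 1
            degree[v] += 1
    # Increasing neighbor count; ties broken by node label (an assumption).
    return sorted(v_cg, key=lambda node: (degree[node], node))
```

Because only the nodes in V_cg are revisited, the offspring reuses most of the parent's partition, which is consistent with the inheritance described in step S4.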

Preferably, step S5 specifically includes:

merging the parent population and the offspring population to obtain a population P_union, and selecting pop individuals from P_union as the new population P according to non-dominated sorting and crowding-distance maximization.
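The environmental selection of step S5 can be sketched as follows, assuming the NSGA-II style procedure that non-dominated sorting and crowding distance usually refer to, with both objectives (modularity and attribute density) maximized. The function names and the representation of each individual as a tuple of objective values are assumptions for illustration.

```python
def dominates(a, b):
    """a dominates b when every objective is >= and at least one is > (maximizing)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def non_dominated_fronts(objs):
    """Return the fronts as sorted lists of indices, best front first."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts


def crowding_distance(objs, front):
    """Per-index crowding distance within one front; boundary points get infinity."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        low, high = objs[ordered[0]][m], objs[ordered[-1]][m]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if high == low:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][m] - objs[prev][m]) / (high - low)
    return dist


def environmental_selection(objs, pop):
    """Pick pop indices from P_union: whole fronts first, then largest crowding."""
    chosen = []
    for front in non_dominated_fronts(objs):
        if len(chosen) + len(front) <= pop:
            chosen.extend(front)
        else:
            dist = crowding_distance(objs, front)
            front.sort(key=lambda i: -dist[i])
            chosen.extend(front[:pop - len(chosen)])
            break
    return chosen
```

Here `objs` would hold one (Q, SA) tuple per individual of P_union, and the returned indices identify the members of the new population P. The first front returned by `non_dominated_fronts` is also the Pareto-optimal set that step S6 outputs at the end of the run.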

Preferably, step S6 specifically includes:

setting t = t + 1 and repeating steps S3, S4 and S5 until t is greater than maxgen, then outputting the functional module set of each individual in the Pareto-optimal solution set of the population, where the functional module set of each individual is a protein functional module partition set.

The invention comprehensively considers both the attribute information specific to the protein nodes and the interactions between proteins. By extracting useful attribute information to continuously optimize the structure of the protein network, and by adjusting the module membership of some protein nodes, combinations of protein functional modules are obtained, which greatly improves the accuracy and effectiveness of functional module mining in the protein network and achieves a better division of the protein network. Secondly, protein node pairs are extracted before evolution, which reduces the individual coding length, so that combinations of protein network functional modules can be obtained quickly and the efficiency of protein mining is greatly improved. Finally, a multi-objective evolutionary algorithm is used to mine functional modules in the protein network, which makes full use of the advantages of multi-objective evolutionary algorithms, offers decision makers multiple choices, and diversifies the mining results.

Drawings

Fig. 1 is a schematic flow chart of a functional module mining method for optimizing a protein network based on attributes according to the present invention.

Detailed Description

Referring to fig. 1, the functional module mining method for optimizing a protein network based on attributes provided by the invention comprises the following steps:

Step S1, extracting protein candidate node pairs, specifically including steps S11 to S16 as set out above.

In this scheme, the individual coding length equals the number of node pairs considered in the protein network. Extracting candidate protein node pairs before evolution therefore reduces the individual coding length, so that combinations of protein network functional modules can be obtained quickly and the efficiency of protein mining is greatly improved.

Step S2, initializing the population and the functional module set of each individual in the population from the extracted protein candidate node pairs, and calculating the fitness value of each individual, specifically including steps S201 to S213 as set out above.

Step S3, performing crossover and mutation among population individuals to generate an offspring population, specifically including steps S31 and S32 as set out above.

Step S4, letting each offspring individual inherit the functional module set of its parent individual, and adjusting the functional modules of the offspring individual according to the differences between its gene values and those of the parent individual, to obtain the functional module set of each individual in the offspring population and calculate the fitness value of each individual, specifically including steps S41 to S49 as set out above. In step S45, L represents the total number of edges in the k-th protein network of the offspring population, k_v^r represents the number of neighbors of protein node v in the r-th protein functional module, k_v represents the number of neighbors of protein node v, and K^r represents the total degree of the r-th protein functional module.

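Step S5, performing environmental selection according to the fitness values of the parent population and the offspring population to obtain a new population, specifically including: merging the parent population and the offspring population to obtain a population P_union, and selecting pop individuals from P_union as the new population P according to non-dominated sorting and crowding-distance maximization.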

Step S6, repeating steps S3, S4 and S5 until the maximum number of iterations is reached, and outputting the functional module set of each individual in the population, where the functional module set of an individual is a protein functional module partition set, specifically including: setting t = t + 1 and repeating steps S3, S4 and S5 until t is greater than maxgen, then outputting the functional module set of each individual in the Pareto-optimal solution set of the population, where the functional module set of each individual is a protein functional module partition set.

This embodiment comprehensively considers both the attribute information specific to the protein nodes and the interactions between proteins. It continuously optimizes the protein network structure by extracting useful attribute information and adjusts the module membership of some protein nodes to obtain combinations of protein functional modules, which greatly improves the accuracy and effectiveness of functional module mining in the protein network and achieves a better division of the protein network. It extracts protein node pairs before evolution, which reduces the individual coding length, so that combinations of protein network functional modules can be obtained quickly and the efficiency of protein mining is greatly improved. Finally, it uses a multi-objective evolutionary algorithm to mine protein modules in the protein network, which makes full use of the advantages of multi-objective evolutionary algorithms, offers decision makers multiple choices, and diversifies the mining results.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A functional module mining method for optimizing a protein network based on attributes is characterized by comprising the following steps:

s1, extracting protein candidate node pairs;

s6, repeatedly executing the steps S3, S4 and S5 until the maximum iteration times are reached, outputting a function module set of each individual in the pareto optimal solution set of the population, wherein the function module set of each individual is a protein function module partition set;

the step S2 specifically includes:

S201, defining the maximum number of iterations as maxgen and the initial iteration count as t = 1, where the number of individuals in the population is pop and the population consists of the pop individuals {X₁, X₂, …, X_g, …, X_pop}, X_g representing the g-th individual;

S204, repeating steps S202 and S203 pop times to obtain the initial population code {X₁, X₂, ..., X_POP};

S210, repeating step S209 until V is empty, to obtain the functional module division of the network;

degree of modularity

where l_k indicates the number of connecting edges in the k-th functional module, d_k represents the total degree of the k-th functional module, and L denotes the total number of edges in the g-th protein network G_g;

density of properties

2. The method for mining functional modules based on attribute-optimized protein networks according to claim 1, wherein the step S1 specifically comprises:

s12, calculating the attribute similarity of any two protein nodes

S15, taking out a node pair (u, v) from the protein node pair set; if S_uv is not less than T, adding the node pair (u, v) to the candidate protein node pair set Nodepair; and removing the node pair (u, v) from the protein node pair set;

3. The method for mining functional modules based on attribute-optimized protein networks according to claim 1, wherein the step S3 specifically comprises:

4. The method for mining functional modules based on attribute-optimized protein networks according to claim 3, wherein the step S4 specifically comprises:

S43, letting i = i + 1 and repeating step S42 until i > m to obtain a new protein network G_n, where m represents the individual code length;

degree of modularity

density of properties

5. The method for mining functional modules based on attribute-optimized protein networks according to claim 4, wherein the step S5 specifically comprises:

6. The method for mining functional modules based on attribute-optimized protein networks according to claim 5, wherein the step S6 specifically comprises:

setting t = t + 1 and repeating steps S3, S4 and S5 until t is greater than maxgen, then outputting the functional module set of each individual in the Pareto-optimal solution set of the population, where the functional module set of each individual is a protein functional module partition set.