CN107798215B

CN107798215B - PPI-based network hierarchy prediction function module and function method

Info

Publication number: CN107798215B
Application number: CN201711153530.XA
Authority: CN
Inventors: 刘维; 马良玉; 陈昕
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2017-11-15
Filing date: 2017-11-15
Publication date: 2021-07-23
Anticipated expiration: 2037-11-15
Also published as: CN107798215A

Abstract

The invention relates to a method for predicting functional modules and interactions based on the hierarchical structure of a PPI network. The technical scheme combines a genetic algorithm with functional module mining and interaction prediction: a PPI network and biological information are input, a hierarchical structure tree T is constructed from the protein interaction network, the likelihood value of the network is calculated, the hierarchical structure tree T is encoded, and a genetic algorithm searches for the hierarchical structure tree T with the maximum likelihood value. The invention overcomes the poor performance of existing methods on sparse, low-density PPI networks and the difficulties caused by randomness. Functional modules are mined and interactions are predicted from the maximum-likelihood hierarchical structure tree T, so that both tasks are achieved together through the likelihood calculation of the network.

Description

PPI-based network hierarchy prediction function module and function method

Technical Field

The invention belongs to the technical field of bioinformatics. It mainly relates to mining functional modules and predicting interactions in a protein interaction network by means of a network hierarchical structure analysis algorithm, and in particular to a method for predicting functional modules and interactions based on the network hierarchical structure of a PPI network.

Background

Protein-protein interaction (PPI) networks play an important role in life activities and have important application value in areas such as the study of living organisms, drug target design, and disease treatment and prediction. Although some progress has been made in mining functional modules in protein interaction networks, the high complexity and randomness of living systems mean that methods that are highly successful in other fields do not always achieve ideal results in PPI network analysis, so the accuracy of the predictions is low.

Before the present invention, most existing methods were based on the density of the protein network: closely connected functional regions of the PPI network are detected by calculating density, the node with the maximum local neighborhood density is selected as an initial functional module, and the module is then expanded outward to form the final functional module. The shortcomings of this approach to functional module mining and interaction prediction are: (1) existing methods can effectively detect high-density functional modules, but perform poorly on sparse, low-density PPI networks. (2) Because of the high complexity and randomness of living systems, mining functional modules by calculating network density does not always give ideal results; and because the interactions and connections between proteins in the PPI network are random, the optimal solution is even harder to obtain.

Disclosure of Invention

The invention aims to overcome the above defects by developing a method for predicting functional modules and interactions based on the PPI network hierarchical structure.

The technical scheme of the invention is as follows:

The method for predicting functional modules and interactions based on the PPI network hierarchical structure is mainly characterized by comprising the following steps:

(1) inputting a PPI network and biological information;

(2) constructing a hierarchical structure tree T according to a protein interaction network;

(3) likelihood value calculation of the protein interaction network: obtaining the likelihood value of the original network G from the combination of the hierarchical structure tree T and the probability values assigned to its internal nodes;

(4) coding the hierarchical structure tree T: the hierarchical structure tree T is encoded by in-order traversal, that is, the left child node is traversed first, then the root node, and finally the right child node;

(5) genetic algorithm for finding the hierarchical structure tree T with the maximum likelihood value: a pair of individuals that have not yet been crossed is selected with a given probability for the crossover operation, and one individual is selected with a given probability for the mutation operation;

(6) functional module mining and interaction prediction: the modularity of each partition is calculated from the maximum-likelihood hierarchical structure tree T, the functional modules are mined, and the interaction probabilities are obtained.

In step (3), the likelihood value of the protein interaction network is calculated as follows: because a hierarchical structure tree T has been constructed from the PPI network in step (2), the interaction probability between protein vertices is easy to obtain from the number of edges in the network whose two endpoints have a given internal node as their nearest common ancestor; this simplifies the calculation, and the likelihood value of the network is then computed.

In step (5), the genetic algorithm searches for the maximum-likelihood hierarchical structure tree T: a pair of individuals that have not yet been crossed is selected with a given probability for the crossover operation, and one individual is selected with a given probability for the mutation operation. The algorithm performs a global search modeled on biological evolution to find the maximum-likelihood hierarchical structure tree T, from which the modularity value of each partition is obtained.

The advantages and effects of the method are as follows. Functional modules are mined and interactions are predicted from the maximum-likelihood hierarchical structure tree T, and both tasks are achieved together through the likelihood calculation of the network. The corresponding biological information is fused in addition to the network topology, which makes the prediction results more accurate and more reliable. The method can also describe the hierarchical structure of the network completely and reflect the internal relationships among the network nodes. Finding the maximum-likelihood hierarchical structure tree T of a network with a genetic algorithm avoids many unnecessary density calculations; the functional module partition is obtained from the hierarchical division of the tree, and the probability that two nodes interact is obtained from their common ancestor in the tree. This improves the efficiency of protein function mining and interaction prediction and broadens the applicability and practicality of the technique in bioinformatics.

The network hierarchical structure analysis of the invention is a modular analysis of the protein interaction network: a hierarchical structure tree T is first constructed from the protein interaction network, the hierarchical structure tree T with the maximum likelihood value is then obtained with a genetic algorithm, and finally the tree is divided by level to mine the functional modules it contains.

Drawings

FIG. 1 is a schematic diagram of the functional module mining and interaction prediction process of the present invention.

FIG. 2 compares the present invention with the MCODE method in identifying the functional module Rpd3S; (a) is the functional module identified by the MCODE algorithm and (b) is the functional module identified by the present invention, where the black circles mark the real functional module.

FIG. 3 compares the prediction performance of the four algorithms; Pr denotes precision, Sn denotes sensitivity, Acc denotes accuracy, and FMM-HS denotes the method of the present invention.

Detailed Description

The technical idea of the invention is as follows:

The invention provides an efficient method for functional module mining and interaction prediction that exploits the ability of network hierarchical structure analysis to mine structure in depth. A hierarchical structure tree T is first constructed from a given protein interaction network; a genetic algorithm then finds the hierarchical structure tree T whose likelihood value is maximal; the tree is then divided by level, and the best partition is selected according to the modularity value. Network hierarchical structure analysis helps in understanding the functions of unknown proteins, is important for explaining the molecular mechanisms of specific functions, and can provide a theoretical basis for applications such as the design of drug targets. A hierarchy-based analysis method is naturally suited to detecting protein functional modules and at the same time enables the prediction of interactions.

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

Step 1: inputting PPI networks and biological information

Step 2: constructing a hierarchical tree T according to a protein interaction network

In a protein interaction network, the connections between proteins can be represented as an undirected graph G(V, E), whose hierarchy can be represented by a hierarchical structure tree T. Here V is the set of N proteins in the undirected graph and E is the set of interactions between them. The N proteins are the leaf nodes of the tree, and N-1 non-leaf nodes connect them into a binary tree. Each non-leaf node r is assigned a probability value P_r, and the probability of an edge between any pair of leaf nodes equals the probability value P_r of their nearest common ancestor r.
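The text does not say how a starting tree is built, so the following is only a minimal sketch under an assumed rule: subtrees are merged at random until a single binary tree over all N proteins remains. The nested-tuple representation and the names `random_hierarchy`, `leaves` and `count_internal` are hypothetical.

```python
import random

def random_hierarchy(vertices, rng=None):
    """Build a random full binary tree whose leaves are the given vertices.

    A leaf is a vertex name (a string); an internal node is a (left, right) tuple.
    Random merging is an assumption, not a rule stated in the text.
    """
    rng = rng or random.Random()
    forest = list(vertices)
    while len(forest) > 1:
        # Merge two randomly chosen subtrees under a new internal node.
        i, j = rng.sample(range(len(forest)), 2)
        merged = (forest[i], forest[j])
        forest = [t for k, t in enumerate(forest) if k not in (i, j)]
        forest.append(merged)
    return forest[0]

def leaves(tree):
    """List the leaf names of a tree from left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def count_internal(tree):
    """Count the non-leaf nodes of a tree."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + count_internal(tree[0]) + count_internal(tree[1])

tree = random_hierarchy(["a", "b", "c", "d", "e"], random.Random(0))
print(tree, count_internal(tree))
```

Whatever merge order is drawn, a tree over N leaves built this way has N-1 internal nodes, matching the description above.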

Step 3: likelihood value calculation for protein interaction network

Consider an undirected network G together with a hierarchical tree T in which the probabilities at the internal vertices are still unknown. For an internal node r, let P_r denote the probability value at that vertex, let E_r denote the number of edges in G whose two endpoints have exactly r as their nearest common ancestor (i.e., the number of edges in G joining a leaf in the left subtree of r to a leaf in the right subtree of r), and let L_r and R_r denote the numbers of leaf nodes in the left and right subtrees of r. For the hierarchical tree T combined with the probability values {P_r} assigned to the internal nodes, the likelihood value of the original network G is:

$$L(T, \{P_r\}) = \prod_{r} P_r^{E_r} (1 - P_r)^{L_r R_r - E_r}$$

To make L(T, {P_r}) reach its maximum, set

$$\frac{\partial \ln L(T, \{P_r\})}{\partial P_r} = 0 \quad (r = 1, 2, \ldots, N-1),$$

i.e.

$$\frac{E_r}{P_r} - \frac{L_r R_r - E_r}{1 - P_r} = 0,$$

which can be solved to obtain

$$P_r = \frac{E_r}{L_r R_r}.$$

This combination of {P_r} values maximizes the likelihood value.
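As an illustration of this step, the sketch below counts E_r, L_r and R_r for every internal node and evaluates the logarithm of the likelihood. It assumes the standard hierarchical random graph form, in which each internal node contributes P_r^{E_r} (1 - P_r)^{L_r R_r - E_r} and the maximizing value is P_r = E_r / (L_r R_r). The nested-tuple tree and the function names are hypothetical.

```python
import math

def leaves(tree):
    """List the leaf names of a nested-tuple tree from left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def log_likelihood(tree, edges):
    """Return (ln L, stats) with stats = [(E_r, L_r, R_r, P_r), ...] in pre-order.

    P_r is set to its maximum-likelihood value E_r / (L_r * R_r), an assumed form.
    """
    edge_set = {frozenset(e) for e in edges}
    stats = []
    total = 0.0

    def visit(node):
        nonlocal total
        if not isinstance(node, tuple):
            return
        left, right = leaves(node[0]), leaves(node[1])
        # E_r: edges joining a leaf on the left of r to a leaf on the right of r.
        e_r = sum(frozenset((u, v)) in edge_set for u in left for v in right)
        pairs = len(left) * len(right)
        p_r = e_r / pairs
        # Nodes with P_r equal to 0 or 1 contribute ln(1) = 0 to the sum.
        if 0 < p_r < 1:
            total += e_r * math.log(p_r) + (pairs - e_r) * math.log(1 - p_r)
        stats.append((e_r, len(left), len(right), p_r))
        visit(node[0])
        visit(node[1])

    visit(tree)
    return total, stats

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
tree = ((("a", "b"), "c"), ("d", "e"))
ll, stats = log_likelihood(tree, edges)
print(ll, stats[0])
```

Working with ln L rather than L avoids numerical underflow when the network has many internal nodes; the ordering of candidate trees is unchanged.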

Step 4: coding hierarchical structure tree T:

In the algorithm each individual is a hierarchical tree, so the hierarchical tree T must first be encoded. The hierarchical tree T (a binary tree) is encoded by in-order traversal. Internal nodes are assigned label values from 1 to n-1: the root node is given 1, and for each internal node with children, the label values of its two children are both greater than its own label value. Leaf nodes are given the values v1, …, vn, corresponding to the network vertices v1, …, vn.
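The encoding can be sketched as follows. The decoder is an inference rather than something the text states: because every internal label is smaller than the labels below it, the smallest label in any segment of the code is taken to be the root of that segment. The `(label, left, right)` tuple layout and the names `encode` and `decode` are hypothetical.

```python
def encode(tree):
    """In-order traversal: left subtree, then the node's label, then right subtree."""
    if not isinstance(tree, tuple):
        return [tree]
    label, left, right = tree
    return encode(left) + [label] + encode(right)

def decode(code):
    """Rebuild the tree from its in-order code.

    Assumes internal labels are distinct integers and that each label is
    smaller than every label in its subtree, so the minimum label is the root.
    """
    if len(code) == 1:
        return code[0]
    # In a full binary tree, leaves and internal labels alternate in the code,
    # so internal labels sit at odd 0-based indices.
    root_pos = min(range(1, len(code), 2), key=lambda i: code[i])
    return (code[root_pos], decode(code[:root_pos]), decode(code[root_pos + 1:]))

code = ["a", 2, "b", 4, "d", 3, "e", 1, "c"]   # the string a2b4d3e1c
tree = decode(code)
print(tree)
print(encode(tree) == code)
```

A code of length 2n-1 therefore holds the n leaves at odd positions (counting from 1) and the n-1 internal labels at even positions.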

Step 5: genetic algorithm for finding the hierarchical structure tree T with the maximum likelihood value:

The genetic algorithm generates an initial population of m individuals according to the coding rule, and then applies crossover, mutation, selection and similar operations to generate a new generation of the population, using the likelihood value as the fitness value on which individual selection is based.
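The overall loop might look like the sketch below. The selection rule (keep the m fittest of parents and offspring), the probabilities `p_cross` and `p_mut`, and the callback signatures are all assumptions made for illustration; the text only states that the likelihood value serves as fitness.

```python
import random

def genetic_search(population, fitness, crossover, mutate,
                   generations=50, p_cross=0.8, p_mut=0.1, rng=None):
    """Generic generational loop; every parameter name here is hypothetical.

    crossover(a, b, rng) returns two children; mutate(a, rng) returns one child.
    """
    rng = rng or random.Random()
    m = len(population)
    for _ in range(generations):
        offspring = []
        shuffled = population[:]
        rng.shuffle(shuffled)
        # Pair individuals so that each takes part in at most one crossover.
        for a, b in zip(shuffled[::2], shuffled[1::2]):
            if rng.random() < p_cross:
                offspring.extend(crossover(a, b, rng))
        for individual in population:
            if rng.random() < p_mut:
                offspring.append(mutate(individual, rng))
        # Assumed selection rule: keep the m fittest of parents and offspring.
        ranked = sorted(population + offspring, key=fitness, reverse=True)
        population = ranked[:m]
    return max(population, key=fitness)
```

In the setting of this document, `fitness` would evaluate the likelihood of the tree encoded by an individual, and `crossover` and `mutate` would be the operations described next. Keeping parents in the selection pool means the best fitness never decreases between generations.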

Crossover operation: let two individuals be S₁ = (s₁, s₂, …, s_{2n-1}) and S₂ = (r₁, r₂, …, r_{2n-1}). Two positions l₁ and l₂ with 1 ≤ l₁ < l₂ ≤ 2n-1 are chosen at random, and the segment (s_{l₁}, …, s_{l₂}) of S₁ is exchanged with the segment (r_{l₁}, …, r_{l₂}) of S₂ to obtain

S′₁ = (s₁, …, s_{l₁-1}, r_{l₁}, …, r_{l₂}, s_{l₂+1}, …, s_{2n-1}),
S′₂ = (r₁, …, r_{l₁-1}, s_{l₁}, …, s_{l₂}, r_{l₂+1}, …, r_{2n-1}).

Let the exchange region of S₁ be W₁ = {s_{l₁}, …, s_{l₂}} and the exchange region of S₂ be W₂ = {r_{l₁}, …, r_{l₂}}.

After the exchange, the elements of the set W₁∪W₂-W₁ are duplicated in S′₁, and the elements of the set W₁∪W₂-W₂ are absent from it. Therefore, outside the exchange region of S′₁, the elements of W₁∪W₂-W₁ must be replaced one by one with the elements of W₁∪W₂-W₂.

Similarly, the elements of the set W₁∪W₂-W₂ are duplicated in S′₂, and the elements of the set W₁∪W₂-W₁ are absent from it. Therefore, outside the exchange region of S′₂, the elements of W₁∪W₂-W₂ must be replaced one by one with the elements of W₁∪W₂-W₁.

For example, let S₁ = a2b4d3e1c, S₂ = d4e1a3c2b, l₁ = 4 and l₂ = 6. Then W₁ = {4, d, 3} and W₂ = {1, a, 3}, and the exchange yields S′₁ = a2b1a3e1c and S′₂ = d4e4d3c2b, with W₁∪W₂-W₁ = {1, a} and W₁∪W₂-W₂ = {4, d}. That is, in S′₁ the elements {1, a} are duplicated and {4, d} are absent, while in S′₂ the elements {4, d} are duplicated and {1, a} are absent. Outside the exchange region of S′₁, a is changed to d and 1 to 4, giving S″₁ = d2b1a3e4c; outside the exchange region of S′₂, d is changed to a and 4 to 1, giving S″₂ = a1e4d3c2b. Both S″₁ and S″₂ are then legal codes.
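The crossover with repair can be sketched as below, using the worked example (S₁ = a2b4d3e1c, S₂ = d4e1a3c2b, l₁ = 4, l₂ = 6) as a check. The text says duplicated elements are replaced "one by one" without fixing the pairing; pairing labels with labels and leaves with leaves in order of appearance is an assumption, as are the names `parse`, `show`, `pair_by_kind` and `crossover`.

```python
def parse(text):
    """Turn a string such as 'a2b4d3e1c' into a code: leaves as str, labels as int."""
    return [int(ch) if ch.isdigit() else ch for ch in text]

def show(code):
    return "".join(str(x) for x in code)

def pair_by_kind(duplicated, missing):
    """Map each duplicated element to a missing element of the same kind."""
    mapping = {}
    for kind in (int, str):
        dup = [x for x in duplicated if isinstance(x, kind)]
        mis = [x for x in missing if isinstance(x, kind)]
        mapping.update(zip(dup, mis))
    return mapping

def crossover(s1, s2, l1, l2):
    """Exchange positions l1..l2 (1-based, inclusive) and repair both children."""
    a, b = l1 - 1, l2                           # convert to a Python slice
    w1, w2 = s1[a:b], s2[a:b]
    only_w1 = [x for x in w1 if x not in w2]    # W1 ∪ W2 - W2
    only_w2 = [x for x in w2 if x not in w1]    # W1 ∪ W2 - W1

    def child(own, received, fix):
        # Apply the repair map to the whole parent; the repaired region is
        # discarded anyway because the received segment overwrites it.
        repaired = [fix.get(x, x) for x in own]
        return repaired[:a] + received + repaired[b:]

    c1 = child(s1, w2, pair_by_kind(only_w2, only_w1))
    c2 = child(s2, w1, pair_by_kind(only_w1, only_w2))
    return c1, c2

c1, c2 = crossover(parse("a2b4d3e1c"), parse("d4e1a3c2b"), 4, 6)
print(show(c1), show(c2))
```

Because both exchange regions cover the same positions, they contain the same number of leaves and of labels, so a same-kind partner always exists for each duplicated element.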

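Mutation operation: let S = (s₁, s₂, …, s_{2n-1}). Two positions l₁ and l₂ are selected with 1 ≤ l₁ < l₂ ≤ 2n-1 such that l₁ - l₂ is even (i.e., l₁ and l₂ have the same parity), and s_{l₁} and s_{l₂} are exchanged to obtain S′, i.e.

S′ = (s₁, …, s_{l₁-1}, s_{l₂}, s_{l₁+1}, …, s_{l₂-1}, s_{l₁}, s_{l₂+1}, …, s_{2n-1}).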
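A minimal sketch of the mutation follows. Drawing the position pair uniformly from all same-parity pairs is an assumption, since the text does not say how the two positions are sampled; the name `mutate` is hypothetical.

```python
import random

def mutate(code, rng=None):
    """Swap two entries whose positions have the same parity.

    Same parity means a leaf is swapped with a leaf, or an internal label
    with an internal label, so the alternating layout of the code is kept.
    """
    rng = rng or random.Random()
    n = len(code)
    # All position pairs (i, j) with i < j and j - i even.
    pairs = [(i, j) for i in range(n) for j in range(i + 2, n, 2)]
    if not pairs:
        raise ValueError("the code needs at least two leaves")
    i, j = rng.choice(pairs)
    child = list(code)
    child[i], child[j] = child[j], child[i]
    return child

code = ["a", 2, "b", 4, "d", 3, "e", 1, "c"]
print(mutate(code, random.Random(1)))
```

The parity condition is what keeps the result a valid code: leaves stay at odd positions and internal labels at even positions.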

Step 6: functional module mining and interaction prediction

Functional modules are mined from the maximum-likelihood hierarchical structure tree T. First, each vertex in the hierarchical structure tree T is labeled with a level number: the root node is at the first level, the children of the root node are at the second level, and so on; in general, the children of a node at level i are at level i+1. Suppose the functional modules are divided at level k, and let level k have n_k internal nodes, corresponding to n_k subtrees. The leaf nodes of each subtree constitute one functional module, so the nodes of the network G are divided into n_k modules. To determine the optimal partition, let k = 2, 3, …, k_max, where k_max is the maximum number of levels; this yields k_max partition schemes. The modularity value of each scheme is calculated, and the scheme with the maximum modularity is the required result.
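The level-wise division can be sketched as below. Two points are assumptions because the text leaves them open: modularity is taken to be Newman's Q (the sum over modules of e_c/m - (d_c/2m)², with e_c the edges inside a module, d_c its total degree and m the number of edges), and a leaf that sits above level k is treated as a module of its own. All function names are hypothetical.

```python
def leaves(tree):
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def depth(tree):
    """Number of levels, counting the root as level 1."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + max(depth(tree[0]), depth(tree[1]))

def modules_at_level(tree, k, level=1):
    """Leaf sets of the subtrees rooted at level k."""
    if not isinstance(tree, tuple):
        return [[tree]]   # assumption: a leaf at or above level k is its own module
    if level == k:
        return [leaves(tree)]
    return (modules_at_level(tree[0], k, level + 1)
            + modules_at_level(tree[1], k, level + 1))

def modularity(modules, edges):
    """Newman modularity Q of a partition (assumed definition)."""
    m = len(edges)
    q = 0.0
    for module in modules:
        members = set(module)
        inside = sum(u in members and v in members for u, v in edges)
        degree = sum((u in members) + (v in members) for u, v in edges)
        q += inside / m - (degree / (2 * m)) ** 2
    return q

def best_partition(tree, edges):
    schemes = [modules_at_level(tree, k) for k in range(2, depth(tree) + 1)]
    return max(schemes, key=lambda scheme: modularity(scheme, edges))

# Two triangles joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
tree = ((("a", "b"), "c"), (("d", "e"), "f"))
print(best_partition(tree, edges))
```

On this toy network the cut at level 2 separates the two triangles and scores higher than the deeper cuts, which split the triangles apart.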

When protein interactions are predicted from the maximum-likelihood hierarchical structure tree T, each internal node r of the tree carries a probability P_r. For each vertex pair v_i, v_j, once their nearest common ancestor r has been found in the tree, P_r is the probability that they interact.
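The lookup described here can be sketched as a walk down from the root until the two proteins fall into different subtrees. The `(P_r, left, right)` tuple layout and the function names are hypothetical, and the probabilities in the example are illustrative values only.

```python
def leaves(node):
    if not isinstance(node, tuple):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def interaction_probability(tree, u, v):
    """Return P_r at the nearest common ancestor r of the leaves u and v."""
    node = tree
    while isinstance(node, tuple):
        p, left, right = node
        left_leaves = leaves(left)
        u_left, v_left = u in left_leaves, v in left_leaves
        if u_left != v_left:
            return p                 # u and v separate here, so this node is r
        node = left if u_left else right
    raise ValueError("u and v must be two distinct leaves of the tree")

# Internal nodes are (P_r, left, right); the values are illustrative only.
tree = (1 / 6, (1.0, (1.0, "a", "b"), "c"), (1.0, "d", "e"))
print(interaction_probability(tree, "a", "d"))
print(interaction_probability(tree, "a", "c"))
```

Ranking protein pairs that are not yet connected in G by this probability is one way such a tree could be used to propose candidate interactions.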

Example:

Each predicted functional module is compared with a reference functional module, and the degree of matching between them is measured by the overlap ratio (OR), calculated as follows:

OR = 2 × O / (A + B)

where O is the number of proteins shared by the predicted functional module and the reference functional module, A is the number of proteins in the predicted functional module, and B is the number of proteins in the reference functional module. The overlap ratio lies between 0 and 1: OR = 0 means that the predicted and reference functional modules have no protein in common, and OR = 1 means that they are identical. A larger overlap ratio indicates a better match between the mined functional module and the reference functional module, and hence a more significant mined module. A reasonable threshold should ensure sufficient similarity between the two modules without being overly strict; here the threshold is set to 0.4, and a prediction whose overlap ratio reaches this threshold is considered successful.
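The overlap ratio is simple to compute on sets of protein names. In the sketch below the function names are hypothetical, and a module counts as matched when OR is greater than or equal to the 0.4 threshold.

```python
def overlap_ratio(predicted, reference):
    """OR = 2 * O / (A + B) for two modules given as collections of protein names."""
    a, b = set(predicted), set(reference)
    return 2 * len(a & b) / (len(a) + len(b))

def is_match(predicted, reference, threshold=0.4):
    """A prediction is counted as successful when OR reaches the threshold."""
    return overlap_ratio(predicted, reference) >= threshold

# The protein names below are placeholders, not data from the document.
predicted = {"p1", "p2", "p3"}
reference = {"p2", "p3", "p4"}
print(overlap_ratio(predicted, reference), is_match(predicted, reference))
```

The function raises a division-by-zero error if both modules are empty, which should not occur for real modules.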

Compared with the MCODE, CFinder and ClusterONE algorithms, the proposed algorithm FMM-HS identifies some functional modules more accurately. For example, among the identified functional modules, only FMM-HS identifies the Rpd3S functional module completely, as shown in FIG. 2(b). The functional module identified by the MCODE algorithm is shown in FIG. 2(a), where the black circle marks the real functional module; the other two methods cannot identify this functional module accurately.

In order to evaluate the effectiveness of the FMM-HS algorithm, three indexes of Precision (Precision), Sensitivity (Sensitivity) and Accuracy (Accuracy) are used as evaluation parameters.
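The formulas for the three indexes are not reproduced in this text. A sketch under the assumption that they follow the standard confusion-matrix definitions (this is an assumption, and the accuracy formula in particular may differ in the original) is:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$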

Here TP is the number of identified functional modules whose overlap ratio with a reference functional module is greater than or equal to 0.4; FP is the number of identified modules that are incorrectly predicted as functional modules, equal to the total number of identified functional modules minus TP; FN is the number of functional modules incorrectly predicted as non-functional modules; and TN is the number of non-functional modules correctly predicted as such. FIG. 3 shows the prediction performance of each algorithm, where Pr denotes precision, Sn denotes sensitivity, and Acc denotes accuracy. FIG. 3 shows that the proposed algorithm FMM-HS performs better than the other three methods, MCODE, CFinder and ClusterONE. Four yeast interaction network data sets were selected: Gavin, Krogan core, Collins and BioGRID. The results are shown in Table 1:

Table 1: Accuracy of the four methods on the 4 data sets

"N/A" for the CFinder algorithm indicates that no result was obtained after 24 hours of running on the BioGRID dataset. The proposed method can show advantages on different data sets.

Claims

1. A method for predicting a functional module and an effect based on a PPI network hierarchical structure is characterized by comprising the following steps:

(1) inputting a PPI network and biological information;

(5) genetic algorithm for finding the hierarchical structure tree of maximum likelihood values T: selecting a pair of individuals which are not crossed according to the probability to carry out cross operation, and selecting one individual according to the probability to carry out mutation operation;

2. The method of claim 1, wherein said step (5) of finding a genetic algorithm for a hierarchical tree of maximum likelihood values T comprises: selecting a pair of individuals which are not crossed according to the probability to carry out cross operation, and selecting one individual according to the probability to carry out mutation operation; and simultaneously, global search is carried out, the biological evolution is used as a prototype, and the convergence of the prototype is utilized to search the maximum likelihood value hierarchical structure tree T so as to obtain the modularity value of each module.