CN108804870B

CN108804870B - Markov random walk-based key protein identification method

Info

Publication number: CN108804870B
Application number: CN201810499870.6A
Authority: CN
Inventors: 刘维; 马良玉; 陈昕
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2018-05-23
Filing date: 2018-05-23
Publication date: 2021-11-19
Anticipated expiration: 2038-05-23
Also published as: CN108804870A

Abstract

The invention provides a key protein identification method based on Markov random walk and belongs to the technical field of bioinformatics. The method comprises the following steps: using the idea of Markov random walk, each vertex is assigned a score representing its degree of importance, and the scores of all vertices form a vector of n components; an initial value of the score vector is given, and the scores walk randomly through the network with a certain probability and are modified as they propagate; finally, the scores are sorted from large to small and the k proteins corresponding to the largest scores are output as the final result. By integrating biological attributes with topological characteristics, the invention improves both the accuracy and the efficiency of key protein identification.

Description

Markov random walk-based key protein identification method

Technical Field

The invention belongs to the technical field of bioinformatics. It mainly relates to identifying key proteins in a protein-protein interaction (PPI) network by means of a Markov random walk algorithm, and in particular to a method that identifies key proteins from both the network topology information and the biological attributes of the proteins in a PPI network.

Background

Proteins are indispensable to life activities and participate in almost every stage of them. Key proteins play an essential role in these processes, and an organism cannot survive when they are absent. Identifying key proteins in PPI networks therefore helps in understanding how cell growth is regulated and supports research on the mechanisms of biological evolution. In the biomedical field, key protein identification is also of great importance for disease treatment and drug target design.

Before the present invention, key proteins were initially identified from the topological features of the network, for example by degree centrality (DC), betweenness centrality (BC) and local average connectivity (LAC). Li et al. fused PPI and gene expression data and proposed the centrality measure PeC, and Zhang et al. fused PPI network topological features with gene co-expression information and proposed the CoEWC method. However, these methods have the following disadvantages in identifying key proteins: (1) they consider only the topological characteristics of the network and ignore the inherent biological characteristics of the proteins; (2) PPI networks obtained from biological experiments are noisy, so the protein interaction data contain false positives.

Disclosure of Invention

The invention aims to overcome the above defects by providing a key protein identification method based on Markov random walk. Using the idea of Markov random walk, the method assigns each vertex a score representing its degree of importance; the scores of all vertices form a vector of n components. An initial value of the score vector is given, and the scores walk randomly through the network with a certain probability and are modified as they propagate. Finally, the scores are sorted from large to small and the k proteins corresponding to the largest scores are output as the final result.

The key protein identification method based on Markov random walk is mainly technically characterized by comprising the following steps of:

(1) inputting a PPI network and biological information;

(2) calculating the weight q between the protein vertexes according to the attribute values and the edge weights of the protein vertexes, and constructing a weight matrix;

(3) normalizing all the attribute values to construct an attribute matrix;

(4) constructing a transition matrix according to the interaction relations between the protein vertices;

(5) iterating according to the PageRank algorithm to obtain a score vector r, with the return probability P₀ determined by the attributes of the vertices;

(6) obtaining an objective function, optimizing it, and iteratively updating r and q from their initial values using a gradient descent formula;

(7) sorting the values of the score vector r^(t)＝(r₁,r₂,…,r_n) obtained after iteration from large to small; the proteins corresponding to the k largest values are the key proteins.

In step (2), the weights between proteins in the PPI network are obtained from the common neighbor similarity, gene expression similarity and GO semantic similarity between proteins, and a weight matrix is constructed from them.

In step (3), all attribute values are brought into the range (0,1) by Z-Score or another normalization method, and the attribute vectors of all vertices form the attribute matrix.

The advantage and effect of the method is that it considers not only the topological characteristics of the protein interaction network but also the biological attributes of the proteins, which mitigates the negative influence of noisy data. Fusing biological attributes with topological characteristics improves both the accuracy and the efficiency of key protein identification, and extends the application range and practicability of the technique in the field of bioinformatics.

Drawings

FIG. 1 is a schematic flow chart of the Markov random walk-based key protein identification method of the present invention.

FIG. 2a is a graph comparing the number of key proteins among the top 100 proteins identified by the present invention and by other methods;

FIG. 2b is a graph comparing the number of key proteins among the top 200 proteins identified by the present invention and by other methods;

FIG. 2c is a graph comparing the number of key proteins among the top 300 proteins identified by the present invention and by other methods;

FIG. 2d is a graph comparing the number of key proteins among the top 400 proteins identified by the present invention and by other methods;

FIG. 2e is a graph comparing the number of key proteins among the top 500 proteins identified by the present invention and by other methods;

FIG. 2f is a graph comparing the number of key proteins among the top 600 proteins identified by the present invention and by other methods;

FIG. 3 is a graph comparing the statistical indices of the present invention with those of other methods.

Detailed Description

The technical idea of the invention is as follows: biological attributes and topological characteristics are combined and, using the idea of Markov random walk, each vertex is assigned a score representing its degree of importance. The scores of all vertices form a vector of n components; an initial value of the score vector is given, and the scores walk randomly through the network with a certain probability and are modified as they propagate. First, the weights between proteins are obtained from the common neighbor similarity, gene expression similarity and GO semantic similarity, giving a weight matrix, and an attribute matrix is formed from the attribute vectors of all vertices. Second, the transition probabilities between pairs of vertices are obtained from the protein interaction relations, giving a transition matrix. Finally, an objective function is obtained and optimized, and the key proteins are identified. Fusing biological attributes with topological characteristics helps in understanding the functions of unknown proteins, is significant for explaining the molecular mechanisms of specific functions, and can provide an important theoretical basis for drug target design and similar applications. The key protein identification method based on Markov random walk is therefore well suited to the detection of key proteins.

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 1, the method for identifying key proteins based on Markov random walk comprises the following steps:

Step 1: inputting a PPI network and biological information;

Step 2: calculating the weight q between the protein vertexes according to the attribute values and the edge weights of the protein vertexes, and constructing a weight matrix;

Common neighbor similarity (NTE): the topological features of the protein interaction network play an irreplaceable role in key protein identification. According to the "centrality-lethality" rule, key proteins are more likely to appear in clusters in the network, so common neighbor similarity (NTE) is used as one of the indicators of protein criticality. In the undirected graph G(V, E), the common neighbor similarity between proteins u and v is expressed as:

$$\mathrm{NTE}(u,v) = |C_u \cap C_v| + 1 \qquad (1)$$

where $C_u$ (or $C_v$) denotes the set of neighbors of node u (or v) in the PPI network, and $|C_u \cap C_v|$ is the number of common neighbors of nodes u and v, that is, the number of triangles to which the edge (u, v) belongs. Adding 1 makes every result greater than 0, which avoids problems during normalization.
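Formula (1) can be illustrated with a short sketch. The adjacency dictionary `ppi` and the function name `nte` are hypothetical and chosen only for this illustration; the toy network is not taken from the experiments.

```python
def nte(adjacency, u, v):
    """Common neighbor similarity of formula (1): |C_u ∩ C_v| + 1."""
    return len(adjacency[u] & adjacency[v]) + 1


# Hypothetical toy network: a triangle A-B-C plus a pendant edge C-D.
ppi = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}

nte_ab = nte(ppi, "A", "B")  # A and B share the neighbor C
nte_cd = nte(ppi, "C", "D")  # C and D share no neighbor
```

Storing each neighbor set as a Python `set` keeps the intersection in formula (1) a single operation.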

Gene expression similarity (GES): gene expression data are easy to obtain and widely used in key protein identification, and co-expressed genes are more likely to correspond to key proteins, so gene expression similarity (GES) is used as an index for measuring key proteins. The gene expression similarity between proteins u and v is calculated as follows:

where s is the number of samples in the gene expression data, U and V are the genes encoding the corresponding proteins u and v, $U_i$ and $V_i$ are the expression levels of genes U and V in sample i, $\bar{U}$ and $\bar{V}$ are the mean expression levels of genes U and V, and σ(U) and σ(V) are the standard deviations of the expression levels of genes U and V, respectively.
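The quantities defined above (s samples, per-sample expression levels, means and standard deviations) are those of a Pearson correlation coefficient. The sketch below assumes that GES is this coefficient and uses the sample (s − 1) normalization; neither point is confirmed by the text, and the function name `ges` is hypothetical.

```python
import math


def ges(expr_u, expr_v):
    """Pearson correlation of two expression profiles (assumed form of GES)."""
    s = len(expr_u)
    mean_u = sum(expr_u) / s
    mean_v = sum(expr_v) / s
    # Sample standard deviations, sigma(U) and sigma(V).
    sd_u = math.sqrt(sum((x - mean_u) ** 2 for x in expr_u) / (s - 1))
    sd_v = math.sqrt(sum((y - mean_v) ** 2 for y in expr_v) / (s - 1))
    # Average product of the standardized expression levels.
    return sum(
        ((x - mean_u) / sd_u) * ((y - mean_v) / sd_v)
        for x, y in zip(expr_u, expr_v)
    ) / (s - 1)
```

Profiles with zero variance would cause a division by zero here; a practical implementation would need to decide how to score such genes.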

GO semantic similarity (GOS): the Gene Ontology (GO) provides a standardized, accurate set of terms describing the molecular functions, biological processes and cellular components of genes and gene products. Measuring the semantic similarity of GO terms is an important aspect of GO applications: GO semantic similarity reveals the functional similarity of genes on the basis of their biological features, and two linked key proteins are more likely to participate in the same biological process. Many GO semantic similarity measures have been proposed in recent years; here the Lin method is used, which has two characteristics: first, it normalizes by the sum of the information contents of the two concepts being compared; second, it assumes that the two concepts being compared are independent. The GO semantic similarity between proteins u and v is defined by the following formula:

where genes U and V encode the interacting proteins u and v, $c_1$ and $c_2$ are GO terms of genes U and V respectively, $S(c_1, c_2)$ is the set of nearest common ancestor nodes of $c_1$ and $c_2$, $P(c)$ is the instance probability of a term c, and $P_{ms}$ is the probability that their nearest common ancestor occurs.
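Lin's measure is commonly written as 2·ln P_ms / (ln P(c₁) + ln P(c₂)). The sketch below assumes this is the form intended here; the function name `lin_similarity` is hypothetical, and how term-level similarities are aggregated when a gene carries several GO terms is not shown because the text does not specify it.

```python
import math


def lin_similarity(p_c1, p_c2, p_ms):
    """Lin similarity of two GO terms from their instance probabilities.

    p_c1, p_c2: instance probabilities P(c1), P(c2), assumed to lie in (0, 1).
    p_ms: probability of the nearest common ancestor, assumed to lie in (0, 1].
    """
    return 2.0 * math.log(p_ms) / (math.log(p_c1) + math.log(p_c2))
```

Under this form the similarity is 1 when both terms coincide with their common ancestor and falls to 0 when the only common ancestor is the root (probability 1).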

The weight $w_{ij}$ between proteins in the PPI network is obtained from the three similarities above; the specific calculation formula is:

$$w_{ij} = a_1\,\mathrm{NTE}(i,j) + a_2\,\mathrm{GES}(i,j) + a_3\,\mathrm{GOS}(i,j) \qquad (5)$$

where the parameters $a_1$, $a_2$, $a_3$ lie in the range (0,1) and sum to 1.

The matrix $W = [w_{ij}]$ is the weight matrix of the PPI network, where $w_{ij}$ is the weight of the edge $(v_i, v_j)$:
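Formula (5) and the construction of W can be sketched as follows. The function names, the default coefficients and the choice of 0 for vertex pairs without an edge are assumptions made for this illustration only.

```python
def edge_weight(nte, ges, gos, a=(0.4, 0.3, 0.3)):
    """Formula (5): weighted sum of the three similarities."""
    if not all(0.0 < x < 1.0 for x in a) or abs(sum(a) - 1.0) > 1e-9:
        raise ValueError("a1, a2, a3 must lie in (0, 1) and sum to 1")
    a1, a2, a3 = a
    return a1 * nte + a2 * ges + a3 * gos


def weight_matrix(nodes, sims, a=(0.4, 0.3, 0.3)):
    """Build a symmetric matrix W = [w_ij] from per-edge similarity triples.

    sims maps an edge (u, v) to its (NTE, GES, GOS) triple.
    Pairs without an edge keep weight 0 (an assumption of this sketch).
    """
    index = {v: i for i, v in enumerate(nodes)}
    W = [[0.0] * len(nodes) for _ in nodes]
    for (u, v), triple in sims.items():
        w = edge_weight(*triple, a=a)
        W[index[u]][index[v]] = w
        W[index[v]][index[u]] = w
    return W
```

Validating the coefficients inside `edge_weight` keeps the constraint on $a_1$, $a_2$, $a_3$ next to the formula that depends on it.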

Step 3: normalizing all the attribute values to construct an attribute matrix;

All attribute values are normalized (they can all be brought into the range (0,1) by Z-Score or another normalization method), and the attribute vectors of all vertices form the attribute matrix $B = [b_{ij}]_{n \times m}$.
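One possible normalization is column-wise min-max scaling, sketched below. The text names only Z-Score or "normalization", so min-max scaling is an assumption; note that it maps the extreme values of each attribute to exactly 0 and 1. The function name `min_max_columns` is hypothetical.

```python
def min_max_columns(rows):
    """Scale each attribute (column) of an n x m table to [0, 1].

    rows: list of n attribute vectors, one per vertex.
    A constant column is mapped to 0.5 to avoid dividing by zero.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.5
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]
```

The returned list of rows plays the role of the attribute matrix B, with one row per vertex and one column per attribute.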

Step 4: constructing a transition matrix according to the interaction relations between the protein vertices;

Given a constant k < n, the k most important proteins in G, that is, the Top-k, are found and called key proteins. Following the idea of Markov random walk, each vertex $v_i$ is assigned a score $r_i$ representing its degree of importance. The scores of all vertices form a score vector $r = (r_1, r_2, \ldots, r_n)^T$, which is an n × 1 column vector. An initial value of r is given, and the scores walk through the network with a certain probability and are modified as they propagate. The probability of passing from $v_i$ to $v_j$ is defined as:

Thus, the transition probabilities between all vertex pairs form an n × n transition matrix $P = [p_{ij}]$.
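A common way to obtain transition probabilities from edge weights is to normalize each row of the weight matrix, $p_{ij} = w_{ij} / \sum_k w_{ik}$. The sketch below assumes this form, which the text does not confirm; the function name `transition_matrix` is hypothetical.

```python
def transition_matrix(W):
    """Row-normalize a weight matrix so that each non-empty row sums to 1.

    Rows of isolated vertices (total weight 0) are left as all zeros.
    """
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total if total > 0 else 0.0 for w in row])
    return P
```

With this choice a score leaving $v_i$ is split among its neighbors in proportion to the edge weights.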

Step 5: iterating according to the PageRank algorithm to obtain a score vector r, with the return probability $P_0$ determined by the attributes of the vertices;

In the conventional random-walk-based PageRank algorithm, the score vector r is updated by the following iteration:

$$r^{(t+1)} = \alpha P^{T} r^{(t)} + (1-\alpha) P_0 \qquad (8)$$

where α ∈ (0,1) is a constant and $P_0 \in (0,1)$ is a constant giving the probability that the walking particle returns to its starting point. In the proposed algorithm, the return probability $P_0$ is instead determined by the vertex attributes $b_i$:

$$P_0 = B \cdot q \qquad (9)$$

Here $q = (q_1, q_2, \ldots, q_m)^T$ is an m × 1 column vector and $q_j$ is the weight of the j-th attribute, so the iteration becomes:

$$r^{(t+1)} = \alpha P^{T} r^{(t)} + (1-\alpha) P_0 = \alpha P^{T} r^{(t)} + (1-\alpha) B \cdot q^{(t)} \qquad (10)$$

The objective function J(r, q) is taken to be the squared error between formula (10) and $r^{(t+1)}$:

r and q are then solved so that J(r, q) is minimized, that is, the following optimization problem is solved:

$$\min_{r,\,q} J(r, q) \quad \text{s.t.}\ r > 0,\ q > 0$$

The constraints r > 0 and q > 0 mean that all entries of r and q are positive.
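The update $r^{(t+1)} = \alpha P^{T} r^{(t)} + (1-\alpha) B \cdot q$ can be sketched in plain Python as follows. The helper names, the value α = 0.85, the uniform initial vector and the stopping tolerance are assumptions of this sketch, and q is held fixed here for simplicity.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def walk_step(P, r, B, q, alpha=0.85):
    """One update: alpha * P^T r + (1 - alpha) * B q."""
    pt_r = matvec(transpose(P), r)
    bq = matvec(B, q)
    return [alpha * a + (1 - alpha) * b for a, b in zip(pt_r, bq)]


def iterate_scores(P, B, q, alpha=0.85, tol=1e-10, max_iter=1000):
    """Repeat walk_step from a uniform start until the scores stop changing."""
    n = len(P)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = walk_step(P, r, B, q, alpha)
        if max(abs(a - b) for a, b in zip(r_next, r)) < tol:
            return r_next
        r = r_next
    return r
```

Because α < 1 and P is row-stochastic, each step contracts the difference between successive score vectors, so the loop settles on a fixed point of the update.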

Step 6: obtaining an objective function, optimizing it, and iteratively updating r and q from their initial values using a gradient descent formula;

After the objective function is obtained, it is optimized. First, the partial derivatives of J with respect to r and q are calculated:

from the formula (11):

from equation (13):

From the above gradients, the initial values $r^{(0)}$ and $q^{(0)}$ are iteratively updated using the gradient descent formula:

where ρ is the total number of iterations.
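The gradient formulas themselves are not reproduced in this text. The sketch below therefore assumes one plausible objective, $J(r,q) = \lVert \alpha P^{T} r + (1-\alpha) B q - r \rVert^2$, whose gradients are $2(\alpha P - I)e$ with respect to r and $2(1-\alpha)B^{T}e$ with respect to q, where e is the residual. The step size `lr`, the clamping to a small positive value to respect r > 0 and q > 0, and all function names are hypothetical.

```python
def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def residual(P, B, r, q, alpha):
    """e = alpha * P^T r + (1 - alpha) * B q - r."""
    pt_r = matvec(transpose(P), r)
    bq = matvec(B, q)
    return [alpha * a + (1 - alpha) * b - ri for a, b, ri in zip(pt_r, bq, r)]


def objective(P, B, r, q, alpha):
    """Assumed objective: squared norm of the residual."""
    return sum(e * e for e in residual(P, B, r, q, alpha))


def gradient_step(P, B, r, q, alpha=0.85, lr=0.1, eps=1e-12):
    """One projected gradient descent update of r and q."""
    e = residual(P, B, r, q, alpha)
    p_e = matvec(P, e)
    grad_r = [2.0 * (alpha * pe - ei) for pe, ei in zip(p_e, e)]
    grad_q = [2.0 * (1 - alpha) * g for g in matvec(transpose(B), e)]
    # Clamp to eps so that every entry stays strictly positive.
    r_new = [max(ri - lr * g, eps) for ri, g in zip(r, grad_r)]
    q_new = [max(qj - lr * g, eps) for qj, g in zip(q, grad_q)]
    return r_new, q_new
```

Calling `gradient_step` repeatedly for the chosen number of iterations and then ranking the entries of r gives the input for step 7; the step size would need tuning on real data.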

Step 7: sorting the values of the score vector $r^{(t)} = (r_1, r_2, \ldots, r_n)$ obtained after iteration from large to small; the proteins corresponding to the k largest values are the key proteins.
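Step 7 amounts to a descending sort followed by truncation, as in the sketch below. The function name `top_k_proteins` and the example identifiers are hypothetical.

```python
def top_k_proteins(names, scores, k):
    """Return the k protein names with the largest scores, best first."""
    ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical example with four proteins.
key_proteins = top_k_proteins(["A", "B", "C", "D"], [0.1, 0.4, 0.3, 0.2], 2)
```

Python's `sorted` is stable, so proteins with equal scores keep their input order.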

Example:

To verify the performance of the proposed algorithm, EPM, the number of key proteins it identifies was compared with that of five other methods (DC, BC, LAC, PeC and CoEWC). For each method, the top 100, top 200, top 300, top 400, top 500 and top 600 ranked proteins were taken as candidate sets, and the intersection of each candidate set with the standard key protein set was computed to obtain the number of true key proteins in the candidate set.
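The counting procedure described above can be sketched as follows. The function name `true_key_counts` and the toy identifiers are hypothetical; the cutoffs default to those used in the comparison.

```python
def true_key_counts(ranking, gold, cutoffs=(100, 200, 300, 400, 500, 600)):
    """Count the true key proteins among the top-n entries of a ranking.

    ranking: protein names ordered from highest to lowest score.
    gold: set of known key proteins (the standard key protein set).
    """
    return {n: len(set(ranking[:n]) & gold) for n in cutoffs}


# Hypothetical toy data with small cutoffs.
counts = true_key_counts(
    ["p1", "p2", "p3", "p4", "p5", "p6"],
    {"p1", "p3", "p6"},
    cutoffs=(2, 4, 6),
)
```

Running the same function on the ranking produced by each method gives directly comparable counts per cutoff.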

As can be seen from FIGS. 2a to 2f, in the yeast PPI network the proposed EPM algorithm identifies key proteins more effectively than the other methods: for each of the top 100, top 200, top 300, top 400, top 500 and top 600 candidate sets, the number of key proteins identified by EPM is significantly higher than that identified by the other methods. Compared with the PeC method, the accuracy of EPM is improved by 16.4%, 18.8%, 19.5%, 19.4%, 20.5% and 22.6% for the top 100, top 200, top 300, top 400, top 500 and top 600 proteins, respectively.

To further demonstrate the advantage of EPM in predicting key proteins, EPM was compared with the other methods on a smaller set (the top 200 proteins). For each method, the proteins that overlap with those of EPM among the 200 were found, and the remaining proteins were analyzed for criticality, as shown in Table 1.

TABLE 1 comparative analysis of key protein amounts

Table 1 compares the numbers of key and non-key proteins identified among the top 200 proteins by EPM and by the 5 other methods. Here $M_i$ denotes one of the 5 other centrality methods used for comparison, $|EPM \cap M_i|$ is the overlap between EPM and the key proteins identified by the other method, $|M_i - EPM|$ is the number of key proteins identified by $M_i$ but not by EPM, and similarly $|EPM - M_i|$ is the number of key proteins identified by EPM but not by $M_i$. It is clear from the table that the number of key proteins identified by EPM but not by the other methods is significantly greater than the number identified by the other methods but not by EPM, while the number of non-key proteins identified by EPM is significantly smaller than for the other methods. These results show that, by considering both topology and additional biological information, the EPM algorithm effectively improves key protein prediction.
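The set comparison behind Table 1 can be sketched with Python sets. The reading used here (overlap counted over all proteins, key and non-key proteins counted within each difference set) is one interpretation of the description, and the function name and toy identifiers are hypothetical.

```python
def overlap_analysis(epm_top, other_top, gold):
    """Compare two top-n candidate lists against a gold set of key proteins."""
    epm, other = set(epm_top), set(other_top)
    epm_only, other_only = epm - other, other - epm
    return {
        "overlap": len(epm & other),
        "epm_only_key": len(epm_only & gold),
        "other_only_key": len(other_only & gold),
        "epm_only_nonkey": len(epm_only - gold),
        "other_only_nonkey": len(other_only - gold),
    }


# Hypothetical toy data: two top-4 lists and three known key proteins.
result = overlap_analysis(
    ["a", "b", "c", "d"],
    ["c", "d", "e", "f"],
    {"a", "b", "e"},
)
```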

To further evaluate the performance of EPM in key protein prediction against the five other centrality methods, six statistical evaluation indices were used: sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F) and accuracy (ACC). They are defined as follows:

$$SN = \frac{TP}{TP + FN}$$

SN represents the proportion of key proteins that are correctly predicted.

$$SP = \frac{TN}{TN + FP}$$

SP represents the proportion of non-key proteins that are correctly excluded.

$$PPV = \frac{TP}{TP + FP}$$

PPV represents the proportion of proteins predicted to be key that are truly key proteins.

$$NPV = \frac{TN}{TN + FN}$$

NPV represents the proportion of excluded proteins that are truly non-key proteins.

$$F = \frac{2 \cdot SN \cdot PPV}{SN + PPV}$$

F is the harmonic mean of sensitivity and positive predictive value.

$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$

ACC represents the proportion of correct results among all identification results.

Here TP (true positives) is the number of key proteins correctly identified as key proteins; FP (false positives) is the number of non-key proteins misidentified as key proteins; TN (true negatives) is the number of non-key proteins correctly identified as non-key proteins; and FN (false negatives) is the number of key proteins incorrectly identified as non-key proteins. The larger the values of these six indices, the better the identification performance of the algorithm.

As can be seen from FIG. 3, each of the six indices of EPM is significantly higher than that of any of the other five centrality methods. The accuracy of EPM is significantly higher than that of the topology-based DC, BC and LAC methods, and EPM still achieves higher accuracy than the PeC method, which incorporates gene expression data.
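The six indices follow directly from the four counts TP, FP, TN and FN, using their standard definitions. The function name `evaluation_indices` and the example counts are hypothetical and are not results from the experiments.

```python
def evaluation_indices(tp, fp, tn, fn):
    """Compute SN, SP, PPV, NPV, F and ACC from the confusion counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    f = 2 * sn * ppv / (sn + ppv)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv, "F": f, "ACC": acc}


# Hypothetical counts for illustration only.
indices = evaluation_indices(tp=40, fp=10, tn=30, fn=20)
```

The function assumes every denominator is non-zero, which holds whenever both the candidate set and its complement contain key and non-key proteins.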

Claims

1. The key protein identification method based on Markov random walk is characterized by comprising the following steps:

(1) inputting a PPI network, biological information and the number k of key proteins to be obtained;

(2) calculating the weight q between the protein vertexes according to the attribute values and the edge weights of the protein vertexes, and constructing a weight matrix; according to the PPI network, the weight between protein vertexes is obtained through the similarity of common neighbors, the similarity of gene expression and the similarity of GO semantics;

the common neighbor similarity is expressed as:

$$\mathrm{NTE}(u,v) = |C_u \cap C_v| + 1 \qquad (1)$$

wherein $C_u$ denotes the set of neighbors of node u in the PPI network and $C_v$ denotes the set of neighbors of node v in the PPI network; $|C_u \cap C_v|$ denotes the number of common neighbors of nodes u and v, that is, the number of triangles to which the edge belongs;

the formula for calculating the similarity of gene expression between proteins u and v is as follows:

wherein s is the number of samples in the gene expression data, U and V are the genes encoding the corresponding proteins u and v, $U_i$ and $V_i$ respectively denote the expression levels of genes U and V in sample i, $\bar{U}$ and $\bar{V}$ are the mean expression levels of genes U and V, and σ(U) and σ(V) respectively denote the standard deviations of the expression levels of genes U and V;

calculating the semantic similarity of GO by adopting a Lin method:

wherein genes U and V encode the interacting proteins u and v, $c_1$ and $c_2$ are GO terms of genes U and V respectively, $S(c_1, c_2)$ is the set of nearest common ancestor nodes of $c_1$ and $c_2$, $P(c)$ is the instance probability of a term c, and $P_{ms}$ is the probability that the nearest common ancestor of $c_1$ and $c_2$ occurs;

the weight $w_{ij}$ of the biological similarity between proteins $v_i$ and $v_j$ in the PPI network is calculated by the following formula:

$$w_{ij} = a_1\,\mathrm{NTE}(i,j) + a_2\,\mathrm{GES}(i,j) + a_3\,\mathrm{GOS}(i,j) \qquad (5)$$

wherein the parameters $a_1$, $a_2$, $a_3$ lie in the range (0,1) and the sum of $a_1$, $a_2$, $a_3$ is 1;

(3) normalizing all the attribute values to construct an attribute matrix;

(4) constructing a transition matrix according to the interaction relations between the protein vertices, by the following method:

a constant k < n is given, and the k proteins of greatest importance in the PPI network, namely the Top-k, are found and called key proteins; using the idea of Markov random walk, each vertex $v_i$ is assigned a score $r_i^{(0)}$ representing its degree of importance, and the scores of all vertices form a score vector r, which is an n × 1 column vector; an initial value of r is given, and the scores walk through the network with a certain probability and are modified as they propagate; the probability of passing from $v_i$ to $v_j$ is defined as:

thus, the transition probabilities between all vertex pairs form an n × n transition matrix $P = [p_{ij}]$;

(5) iterating according to the PageRank algorithm to obtain a score vector r, with the return probability $P_0$ determined by the attributes of the vertices, by the following method:

$$r^{(t+1)} = \alpha P^{T} r^{(t)} + (1-\alpha) P_0 \qquad (8)$$

wherein α ∈ (0,1) is a constant and $P_0 \in (0,1)$ is a constant, being the probability that the walking particle returns to its starting point; the return probability $P_0$ is determined by the attributes $b_i$ of the vertices:

$$P_0 = B \cdot q \qquad (9)$$

here $q = (q_1, q_2, \ldots, q_m)^T$ is an m × 1 column vector and $q_j$ is the weight of the j-th attribute, so that:

$$r^{(t+1)} = \alpha P^{T} r^{(t)} + (1-\alpha) P_0 = \alpha P^{T} r^{(t)} + (1-\alpha) B \cdot q^{(t)} \qquad (10)$$

the objective function J(r, q) is taken to be the squared error between formula (10) and $r^{(t+1)}$:

r and q are solved so that J(r, q) is minimized, that is, the following optimization problem is solved:

$$\min_{r,\,q} J(r, q) \quad \text{s.t.}\ r > 0,\ q > 0$$

the constraints r > 0 and q > 0 mean that all scores in r and q are positive numbers;

(6) obtaining an objective function, optimizing it, and iteratively updating r and q from their initial values using a gradient descent formula;

in step (6), the method of obtaining and optimizing the objective function and iteratively updating r and q from their initial values using the gradient descent formula is as follows:

after the objective function is obtained, it is optimized: based on the initial values $r^{(0)}$ and $q^{(0)}$ of r and q, the partial derivatives of J(r, q) with respect to r and q are first calculated:

from the formula (11):

from equation (13):

from the above gradients, the initial values $r^{(0)}$ and $q^{(0)}$ are iteratively updated using the gradient descent formula:

wherein rho is the total number of iterations;

(7) sorting the values of the score vector $r^{(t)} = (r_1, r_2, \ldots, r_n)$ obtained after iteration from large to small; the proteins corresponding to the k largest values are the key proteins.

2. The Markov random walk-based key protein identification method according to claim 1, wherein the step (3) normalizes all attribute values, and the method for constructing the attribute matrix is as follows: all the attribute values are brought into the range of (0,1) through a Z-Score or normalization method, and all the vertex attribute vectors form an attribute matrix.