CN108804870A

CN108804870A - Key protein identification method based on Markov random walk

Info

Publication number: CN108804870A
Application number: CN201810499870.6A
Authority: CN
Inventors: 刘维; 马良玉; 陈昕
Original assignee: Yangzhou University
Current assignee: Yangzhou University
Priority date: 2018-05-23
Filing date: 2018-05-23
Publication date: 2018-11-13
Anticipated expiration: 2038-05-23
Also published as: CN108804870B

Abstract

The object of the present invention is to provide a key protein identification method based on Markov random walk, which belongs to the technical field of bioinformatics. The method uses the idea of a Markov random walk: each vertex is assigned a score indicating its importance, and the scores of all vertices form an n-row column vector. Given initial values, the scores are propagated through the network by a random walk with certain probabilities and are corrected as they propagate. Finally, the scores are sorted in descending order, and the k proteins with the highest scores are output as the final result. By fusing biological attributes with topological properties, the invention improves the accuracy of key protein identification and also improves the efficiency of prediction.

Description

Key protein identification method based on Markov random walk

Technical field

The invention belongs to the technical field of bioinformatics. It mainly concerns a technique for identifying key proteins in a protein-protein interaction (PPI) network by means of a Markov random walk algorithm, and in particular a method that identifies key proteins from both the network topology information and the biological attributes of the proteins in the PPI network.

Background art

Proteins are indispensable to life activities and take part in almost every stage of them. Key proteins play an irreplaceable role in this process, and the absence of a key protein may leave an organism unable to survive. Identifying the key proteins in a PPI network therefore not only helps in understanding how cell growth is regulated, but can also support research into the mechanisms of biological evolution. In the biomedical field, key protein identification is also of great significance for disease treatment and for the design of drug target cells.

Before the present invention, key proteins were at first identified from the topological features of the network, for example by degree centrality (DC), betweenness centrality (BC) and local average connectivity (LAC). Li et al. proposed the centrality measure PeC, which fuses PPI data with gene expression data, and Zhang et al. proposed the CoEWC method, which fuses PPI network topology features with gene co-expression information. The shortcomings of these methods in identifying key proteins are: (1) they consider only the topological features of the network itself and ignore the intrinsic biological characteristics of the proteins; (2) PPI networks obtained from biological experiments are noisy, so the protein interaction data contain false positives.

Summary of the invention

The purpose of the present invention is to overcome the above drawbacks by developing a key protein identification method based on Markov random walk. The method uses the idea of a Markov random walk: each vertex is assigned a score indicating its importance, and the scores of all vertices form an n-row column vector. Given initial values, the scores are propagated through the network by a random walk with certain probabilities and are corrected as they propagate. Finally, the scores are sorted in descending order, and the k proteins with the highest scores are output as the final result.

The key protein identification method based on Markov random walk is mainly characterized by the following steps:

(1) input the PPI network and biological information;

(2) calculate the weight q between protein vertices from the attribute values of the protein vertices and the edge weights, and build the weight matrix;

(3) normalize all attribute values and build the attribute matrix;

(4) build the transfer matrix from the interaction relationships between protein vertices;

(5) obtain the score vector r iteratively according to the PageRank algorithm, with the return probability $P_0$ determined by the vertex attributes;

(6) obtain the objective function and optimize it, iteratively updating the initial values r and q with the gradient descent formula;

(7) sort the values of the iterated vector $r^{(t)} = (r_1, r_2, \dots, r_n)$ in descending order; the proteins with the k largest values are the key proteins.

In step (2), the weight q between protein vertices is calculated from the attribute values of the protein vertices and the edge weights, and the weight matrix is built as follows: from the PPI network input in step (1), the weight between two proteins is obtained from their common-neighbor similarity, gene expression similarity and GO semantic similarity.

In step (3), all attribute values are normalized and the attribute matrix is built as follows: the attribute values are brought into the range (0,1) by Z-score or another normalization method, and the attribute vectors of all vertices form an attribute matrix.

The advantage and effect of the invention are that the method considers not only the topological features of the protein-protein interaction network but also the biological attributes of the proteins, and thereby overcomes the negative effects of noisy data. Fusing biological attributes with topological properties improves the accuracy of key protein identification and also improves the efficiency of prediction. This extends the range of application and the practicality of the technique in the field of bioinformatics.

Brief description of the drawings

Fig. 1 is a flow diagram of the key protein identification method based on Markov random walk according to the invention.

Fig. 2 compares the number of key proteins identified by the invention with the number identified by other methods.

Fig. 2a compares the number of key proteins among the top 100 proteins;

Fig. 2b compares the number of key proteins among the top 200 proteins;

Fig. 2c compares the number of key proteins among the top 300 proteins;

Fig. 2d compares the number of key proteins among the top 400 proteins;

Fig. 2e compares the number of key proteins among the top 500 proteins;

Fig. 2f compares the number of key proteins among the top 600 proteins;

Fig. 3 compares the statistical index results of the invention and the other methods.

Detailed description of the embodiments

The technical idea of the invention is as follows. Biological attributes and topological properties are combined. Using the idea of a Markov random walk, each vertex is assigned a score indicating its importance, and the scores of all vertices form an n-row column vector; given initial values, the scores are propagated through the network by a random walk with certain probabilities and are corrected as they propagate. First, the weights between proteins are obtained from the common-neighbor similarity, gene expression similarity and GO semantic similarity, which gives the weight matrix, and the attribute vectors of all vertices form an attribute matrix. Second, the transition probabilities between vertices are obtained from the protein interaction relationships, which gives the transfer matrix. Finally, the objective function is obtained and optimized, and the key proteins are identified. Fusing biological attributes with topological properties helps in understanding the functions of unknown proteins, is significant for explaining the molecular mechanisms of specific functions, and can provide an important theoretical basis for applications such as the design of drug target cells. The key protein identification method based on Markov random walk is therefore naturally suited to the detection of key proteins.

The following describes the present invention in detail with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the key protein identification method based on Markov random walk comprises the following steps:

Step 1: Input the PPI network and biological information

Step 2: Calculate the weight q between protein vertices from the attribute values of the protein vertices and the edge weights, and build the weight matrix

Common-neighbor similarity (NTE): The topological features of the protein-protein interaction network have an irreplaceable role in key protein identification. According to the "centrality-lethality" rule, key proteins are more likely to occur in clusters in the network. We therefore use the common-neighbor similarity (NTE) as one of the indices for measuring how critical a protein is. In an undirected graph G(V, E), the common-neighbor similarity between proteins u and v is expressed as:

$NTE(u,v) = |C_u \cap C_v| + 1$    (1)

where $C_u$ (or $C_v$) is the set of neighbors of node u (or v) in the PPI network, and $|C_u \cap C_v|$ is the number of common neighbors of nodes u and v, i.e. the number of triangles to which the edge (u, v) belongs. Adding 1 makes every result greater than 0, which avoids problems in the standardization process.
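Formula (1) can be sketched in a few lines of Python. The adjacency map `adj`, the protein names and the helper name `nte` are illustrative assumptions, not part of the patent.

```python
def nte(adj, u, v):
    """Common-neighbor similarity NTE(u, v) = |C_u ∩ C_v| + 1 (formula (1))."""
    return len(adj[u] & adj[v]) + 1


# Toy undirected PPI graph: each protein maps to its set of neighbors
# (hypothetical data, for illustration only).
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

nte_ab = nte(adj, "A", "B")  # A and B share one neighbor (C)
nte_ac = nte(adj, "A", "C")  # A and C share two neighbors (B and D)
```

Storing neighbors as sets keeps the intersection cheap, and the "+ 1" guarantees a strictly positive value even for an edge that lies in no triangle.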

Gene expression similarity (GES): Gene expression data are relatively easy to obtain and are widely used in key protein identification, and co-expressed genes are more likely to correspond to key proteins. We therefore use the gene expression similarity (GES) as an index for measuring key proteins. The formula we use to calculate the gene expression similarity between proteins u and v is as follows:

where s is the number of samples in the gene expression data, U and V are the genes encoding proteins u and v, $U_i$ and $V_i$ are the expression levels of genes U and V in sample i, $\bar{U}$ and $\bar{V}$ are the mean expression levels of genes U and V, and $\sigma(U)$ and $\sigma(V)$ are the standard deviations of the expression levels of genes U and V.
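The quantities defined above (sample count, per-sample expression levels, means and standard deviations) are those of the Pearson correlation coefficient. The sketch below assumes that GES is the Pearson correlation with the sample standard deviation (divisor s − 1); this is an assumption about the formula, and the function name `ges` and the handling of constant profiles are illustrative choices.

```python
import math


def ges(expr_u, expr_v):
    """Pearson correlation of two expression profiles (assumed form of GES)."""
    s = len(expr_u)
    if s != len(expr_v) or s < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_u = sum(expr_u) / s
    mean_v = sum(expr_v) / s
    # Sample standard deviations (divisor s - 1).
    sd_u = math.sqrt(sum((x - mean_u) ** 2 for x in expr_u) / (s - 1))
    sd_v = math.sqrt(sum((y - mean_v) ** 2 for y in expr_v) / (s - 1))
    if sd_u == 0 or sd_v == 0:
        # A constant profile has no defined correlation; 0.0 is an assumption.
        return 0.0
    total = sum(((x - mean_u) / sd_u) * ((y - mean_v) / sd_v)
                for x, y in zip(expr_u, expr_v))
    return total / (s - 1)
```

Under this assumption the value lies in [-1, 1]; any rescaling before it enters the edge weight is not specified in the text.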

GO semantic similarity (GOS): Gene Ontology (GO) provides a standardized and precise vocabulary for information about the molecular function, biological process and cellular component of genes (gene products). Measuring the semantic similarity of GO terms is an important aspect of GO applications. GO semantic similarity reveals the functional similarity of genes on the basis of their biological properties, and two connected key proteins are more likely to take part in the same biological process. In recent years many scholars have proposed measures of GO semantic similarity. We calculate GO semantic similarity with the method of Lin, which is characterized as follows: first, it is normalized by the sum of the information content of the two concepts being compared; second, it assumes that the two concepts being compared are independent. The GO semantic similarity between proteins u and v is defined by the following formula:

where genes U and V encode the interacting proteins u and v, $c_1$ and $c_2$ are GO terms of genes U and V respectively, $S(c_1, c_2)$ is the set of nearest common ancestor nodes of $c_1$ and $c_2$, $P(c)$ is the probability of an instance of term c, and $P_{ms}$ is the probability of occurrence of their nearest common ancestor.
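As a sketch of the term-level measure, Lin's similarity is commonly written as 2·ln P_ms / (ln P(c₁) + ln P(c₂)). The code below implements that standard form for a single pair of GO terms. How the patent aggregates term pairs into a protein-level GOS value is not stated, so no aggregation is shown; the function name `lin_similarity` is an assumption.

```python
import math


def lin_similarity(p_c1, p_c2, p_ms):
    """Lin similarity of two terms from their probabilities and the
    probability p_ms of their nearest common ancestor."""
    for p in (p_c1, p_c2, p_ms):
        if not 0 < p <= 1:
            raise ValueError("probabilities must lie in (0, 1]")
    denom = math.log(p_c1) + math.log(p_c2)
    if denom == 0:
        # Both terms have probability 1 (the root): treat as identical.
        return 1.0
    return 2 * math.log(p_ms) / denom
```

The value is 1 when the two terms coincide with their common ancestor and falls toward 0 as the ancestor becomes more general (P_ms approaches 1).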

In the PPI network, the weight $w_{ij}$ between two proteins can be obtained from the similarities between them. The formula is as follows:

$w_{ij} = a_1\,NTE(i,j) + a_2\,GES(i,j) + a_3\,GOS(i,j)$    (5)

where the parameters $a_1$, $a_2$, $a_3$ lie in the range (0,1) and sum to 1.
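Formula (5) is a convex combination of the three similarities. A minimal sketch follows; the default coefficients 0.4, 0.3 and 0.3 are hypothetical placeholders, since the text fixes only their range and their sum.

```python
def edge_weight(nte, ges, gos, a1=0.4, a2=0.3, a3=0.3):
    """w_ij = a1*NTE + a2*GES + a3*GOS (formula (5)).

    The default coefficients are illustrative; the text only requires
    each to lie in (0, 1) and the three to sum to 1.
    """
    coeffs = (a1, a2, a3)
    if any(not 0 < a < 1 for a in coeffs) or abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("a1, a2, a3 must lie in (0, 1) and sum to 1")
    return a1 * nte + a2 * ges + a3 * gos


w = edge_weight(2, 0.5, 0.5)
```

Validating the coefficients up front makes a mis-specified weighting fail loudly instead of silently rescaling every edge.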

The matrix $W = [w_{ij}]$ is the weight matrix of the PPI network, where $w_{ij}$ is the weight on the edge $(v_i, v_j)$:

Step 3: Normalize all attribute values and build the attribute matrix

All attribute values are standardized (Z-score or another normalization method can be used to bring the attribute values into the range (0,1)), and the attribute vectors of all vertices form an attribute matrix $B = [b_{ij}]_{n \times m}$.
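One way to build B is column-wise min-max scaling, sketched below. This is an illustrative choice: min-max scaling maps each attribute to the closed interval [0, 1], whereas a plain Z-score does not bound values at all, so an implementation aiming at the open range (0,1) would need an extra squashing or offset step. The function name and the treatment of constant columns are assumptions.

```python
def min_max_normalize(rows):
    """Scale each attribute (column) of an n x m table to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(x - low) / (high - low) if high > low else 0.0  # constant column -> 0.0
         for x, low, high in zip(row, lows, highs)]
        for row in rows
    ]


# Three proteins with two raw attributes each (hypothetical values).
B = min_max_normalize([[1, 10], [2, 20], [3, 30]])
```

Each row of `B` is then the attribute vector of one vertex.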

Step 4: Build the transfer matrix from the interaction relationships between protein vertices

Given a constant k < n, the k most important proteins in G, i.e. the Top-k, are found and are called key proteins. Using the idea of a Markov random walk, we assign each vertex $v_i$ a score $r_i^{(0)}$ indicating its importance. The scores of all vertices form a score vector $r^{(0)} = (r_1^{(0)}, r_2^{(0)}, \dots, r_n^{(0)})^T$, an n × 1 column vector. Given the initial value of r, the scores are propagated through the network by a walk with certain probabilities and are corrected as they propagate. The probability of propagating from $v_i$ to $v_j$ is defined as:

In this way, the transition probabilities between all pairs of vertices form the n × n transfer matrix $P = [p_{ij}]$.
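A common way to turn edge weights into transition probabilities is to divide each weight by the total weight leaving the source vertex. The sketch below assumes that form, $p_{ij} = w_{ij} / \sum_k w_{ik}$; the text does not show the patent's own definition, so this is only an illustration.

```python
def transition_matrix(W):
    """Row-normalize a weight matrix so that each row sums to 1
    (assumed form p_ij = w_ij / sum_k w_ik)."""
    P = []
    for row in W:
        total = sum(row)
        # An isolated vertex (zero row) keeps an all-zero row.
        P.append([w / total if total > 0 else 0.0 for w in row])
    return P


# Toy symmetric weight matrix for three proteins (hypothetical values).
W = [
    [0.0, 2.0, 2.0],
    [2.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
]
P = transition_matrix(W)
```

With this normalization P is row-stochastic, which is why the score update in Step 5 multiplies by the transpose of P.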

Step 5: Obtain the score vector r iteratively according to the PageRank algorithm, with the return probability $P_0$ determined by the vertex attributes

In the traditional random-walk-based PageRank algorithm, the score vector r is updated by the following iteration:

$r^{(k+1)} = \alpha P^T r^{(k)} + (1-\alpha) P_0$    (8)

where $\alpha \in (0,1)$ is a constant and $P_0 \in (0,1)$ is a constant, the probability that the walking particle returns to its original starting point. In the proposed algorithm, we use the attributes $b_i$ of the vertices to determine the return probability $P_0$. Let

$P_0 = B \cdot q$    (9)

Here $q = (q_1, q_2, \dots, q_m)^T$ is an m × 1 column vector and $q_j$ is the weight of the j-th attribute, so the formula becomes:

$r^{(k+1)} = \alpha P^T r^{(k)} + (1-\alpha) P_0 = \alpha P^T r^{(k)} + (1-\alpha) B \cdot q^{(k)}$    (10)

Let the objective function be the squared error between the right-hand side of formula (10) and $r^{(k+1)}$:

$J(r,q) = \| \alpha P^T r + (1-\alpha) B q - r \|^2$    (11)

We need to find r and q such that J(r, q) is minimized, that is, to solve the following optimization problem:

$\min_{r,q} J(r,q) \quad \text{s.t. } r > 0,\ q > 0$    (12)

The constraints r > 0 and q > 0 mean that all score values in r and q are positive.
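One application of the update in formula (10) can be sketched with plain lists. The default `alpha=0.85` is the value customarily used with PageRank and is an assumption here, as are the function name and the toy inputs.

```python
def rank_step(P, B, r, q, alpha=0.85):
    """One update r_next = alpha * P^T r + (1 - alpha) * B q (formula (10))."""
    n, m = len(r), len(q)
    # (P^T r)_j = sum_i P[i][j] * r[i]: score flowing into vertex j.
    walk = [sum(P[i][j] * r[i] for i in range(n)) for j in range(n)]
    # (B q)_j = sum_k B[j][k] * q[k]: attribute-based return term.
    restart = [sum(B[j][k] * q[k] for k in range(m)) for j in range(n)]
    return [alpha * walk[j] + (1 - alpha) * restart[j] for j in range(n)]


# Two proteins that point at each other, one attribute (hypothetical data).
r_next = rank_step(P=[[0.0, 1.0], [1.0, 0.0]], B=[[1.0], [0.0]],
                   r=[0.5, 0.5], q=[1.0], alpha=0.5)
```

Here both vertices pass their score to each other, and the attribute term lifts only the first vertex, giving `[0.75, 0.25]`.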

Step 6: Obtain the objective function and optimize it, iteratively updating the initial values r and q with the gradient descent formula

After the objective function is obtained, it is optimized. First, the partial derivatives of J with respect to r and q are found. From formula (11):

$J(r,q) = (\alpha P^T r + (1-\alpha) B q - r)^T (\alpha P^T r + (1-\alpha) B q - r)$
$= \alpha^2 r^T P P^T r + 2\alpha(1-\alpha) r^T P B q - 2\alpha r^T P r + (1-\alpha)^2 q^T B^T B q - 2(1-\alpha) r^T B q + r^T r$    (13)

From formula (13) it follows that:
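As a worked sketch, differentiating the expansion in formula (13) term by term gives the gradients below. This is a re-derivation under the assumption that J has exactly the form of formula (13); it is not a transcription of the patent's own gradient formulas.

```latex
\begin{aligned}
\frac{\partial J}{\partial r}
  &= 2\alpha^{2} P P^{T} r + 2\alpha(1-\alpha)\, P B q
     - 2\alpha\,(P + P^{T})\, r - 2(1-\alpha)\, B q + 2r \\
  &= 2\,(\alpha P - I)\,\bigl(\alpha P^{T} r + (1-\alpha) B q - r\bigr), \\[4pt]
\frac{\partial J}{\partial q}
  &= 2\alpha(1-\alpha)\, B^{T} P^{T} r + 2(1-\alpha)^{2} B^{T} B q
     - 2(1-\alpha)\, B^{T} r \\
  &= 2(1-\alpha)\, B^{T}\bigl(\alpha P^{T} r + (1-\alpha) B q - r\bigr).
\end{aligned}
```

Both gradients are a linear map applied to the same residual $\alpha P^T r + (1-\alpha) B q - r$, so they vanish together exactly when formula (10) holds as a fixed point.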

According to the above gradients, the initial values $r^{(0)}$ and $q^{(0)}$ are updated iteratively with the gradient descent formula:

Here, ρ is the total number of iterations.

Step 7: Sort the values of the iterated vector $r^{(t)} = (r_1, r_2, \dots, r_n)$ in descending order; the proteins with the k largest values are the key proteins.
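Steps 6 and 7 can be sketched as projected gradient descent on $J(r,q) = \|\alpha P^T r + (1-\alpha) B q - r\|^2$ followed by a Top-k selection. The step size `step`, the iteration count `n_iter`, the clipping at `floor` to keep r and q positive, and all toy data are assumptions made for illustration; the patent's exact update rule is not reproduced here.

```python
def mat_vec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def residual(P, B, r, q, alpha):
    """e = alpha * P^T r + (1 - alpha) * B q - r."""
    ptr = mat_vec(transpose(P), r)
    bq = mat_vec(B, q)
    return [alpha * a + (1 - alpha) * b - c for a, b, c in zip(ptr, bq, r)]


def objective(P, B, r, q, alpha):
    return sum(e * e for e in residual(P, B, r, q, alpha))


def descend(P, B, r, q, alpha=0.85, step=0.05, n_iter=200, floor=1e-9):
    """Gradient descent on J, clipping at `floor` to keep r > 0 and q > 0."""
    for _ in range(n_iter):
        e = residual(P, B, r, q, alpha)
        pe = mat_vec(P, e)
        grad_r = [2 * (alpha * a - b) for a, b in zip(pe, e)]  # 2 (alpha P - I) e
        grad_q = [2 * (1 - alpha) * g for g in mat_vec(transpose(B), e)]
        r = [max(ri - step * g, floor) for ri, g in zip(r, grad_r)]
        q = [max(qi - step * g, floor) for qi, g in zip(q, grad_q)]
    return r, q


def top_k(names, r, k):
    """Names of the k vertices with the largest scores, in descending order."""
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    return [names[i] for i in order[:k]]


# Toy example: three proteins, two attributes (hypothetical values).
names = ["p1", "p2", "p3"]
P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
B = [[0.9, 0.8], [0.2, 0.1], [0.4, 0.3]]
r0, q0 = [1 / 3] * 3, [0.5, 0.5]

j_before = objective(P, B, r0, q0, 0.85)
r_opt, q_opt = descend(P, B, r0, q0)
j_after = objective(P, B, r_opt, q_opt, 0.85)
key_proteins = top_k(names, r_opt, k=2)
```

The step size must be small relative to the scale of P and B for the objective to decrease monotonically; 0.05 suffices for this toy input but is not a general recommendation.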

Embodiment:

To verify the performance of the proposed algorithm, EPM, we compared the number of key proteins it identifies with that of five other methods (DC, BC, LAC, PeC and CoEWC). For each method, the top 100, 200, 300, 400, 500 and 600 ranked proteins were taken as candidate sets, and each candidate set was intersected with the standard set of key proteins to obtain the number of true key proteins in it. The experimental results are shown in Fig. 2.

As Figs. 2a to 2f show, on the yeast PPI network the proposed EPM algorithm identifies key proteins better than the other methods: for candidate sets of the top 100, 200, 300, 400, 500 and 600 proteins, the number of key proteins identified by the proposed algorithm is clearly higher than that of the other methods. Compared with the PeC method, the accuracy of EPM is higher by 16.4%, 18.8%, 19.5%, 19.4%, 20.5% and 22.6% for the top 100, 200, 300, 400, 500 and 600 proteins respectively.

To further show the advantage of EPM in predicting key proteins, we analysed the differences between EPM and the other methods on a smaller data set (the top 200 proteins). Among these 200 proteins we found those that overlap with EPM and analysed the remaining proteins in detail, as shown in Table 1.

Table 1. Comparative analysis of the number of key proteins

Table 1 compares the numbers of key and non-key proteins identified by EPM and the other five methods in the top-200 data set. $M_i$ denotes one of the five other centrality methods used for comparison; $|EPM \cap M_i|$ is the number of key proteins identified by both EPM and $M_i$; $|M_i - EPM|$ is the number of key proteins identified by $M_i$ but not by EPM; and similarly $|EPM - M_i|$ is the number of key proteins identified by EPM but not by $M_i$. The table shows clearly that the number of key proteins identified by EPM but not by another method is significantly larger than the number identified by that method but not by EPM, and that the proteins identified by EPM include significantly fewer non-key proteins than those identified by the other methods. These results show that, by considering topology together with more biological information, the EPM algorithm can effectively improve key protein prediction.

To further evaluate the performance of EPM in key protein prediction, we compared it with the other five centrality methods using statistical performance measures. Six evaluation indices are used: sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F) and accuracy (ACC). They are defined as follows:

SN is the proportion of key proteins that are correctly predicted.

SP is the proportion of non-key proteins that are correctly excluded.

PPV is the proportion of the proteins identified as key that really are key proteins.

NPV is the proportion of the excluded proteins that really are non-key proteins.

F is the harmonic mean of sensitivity and positive predictive value.

ACC is the proportion of correct results among all identification results.

Here TP (true positives) is the number of key proteins correctly identified as key proteins; FP (false positives) is the number of non-key proteins wrongly identified as key proteins by the algorithm; TN (true negatives) is the number of non-key proteins identified as non-key proteins; and FN (false negatives) is the number of key proteins wrongly identified as non-key proteins. The larger the values of these six indices, the better the identification performance of the algorithm. The statistical indices calculated for each method are shown in Fig. 3.
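The six indices described above correspond to the standard confusion-matrix formulas, sketched below. The function name `evaluate` and the example counts are hypothetical and are not results from the patent.

```python
def evaluate(tp, fp, tn, fn):
    """Standard confusion-matrix statistics for a key protein prediction."""
    sn = tp / (tp + fn)                      # sensitivity
    sp = tn / (tn + fp)                      # specificity
    ppv = tp / (tp + fp)                     # positive predictive value
    npv = tn / (tn + fn)                     # negative predictive value
    f = 2 * sn * ppv / (sn + ppv)            # harmonic mean of SN and PPV
    acc = (tp + tn) / (tp + fp + tn + fn)    # accuracy
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv, "F": f, "ACC": acc}


# Made-up counts for illustration only.
stats = evaluate(tp=40, fp=10, tn=45, fn=5)
```

The sketch assumes every denominator is non-zero; a production version would need to decide how to report an index when, for example, no protein is predicted as key.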

As Fig. 3 shows, all six indices of EPM are clearly higher than those of any of the other five centrality methods. Compared with the DC, BC and LAC methods, which are based on network topology, the accuracy of EPM is significantly higher; and compared with the PeC method, which incorporates gene expression data, the proposed algorithm still achieves higher accuracy.

Claims

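1. A key protein identification method based on Markov random walk, characterized in that the identification method comprises the following steps:

(1) input the PPI network and biological information;

(2) calculate the weight q between protein vertices from the attribute values of the protein vertices and the edge weights, and build the weight matrix;

(3) normalize all attribute values and build the attribute matrix;

(4) build the transfer matrix from the interaction relationships between protein vertices;

(5) obtain the score vector r iteratively according to the PageRank algorithm, with the return probability $P_0$ determined by the vertex attributes;

(6) obtain the objective function and optimize it, iteratively updating the initial values r and q with the gradient descent formula;

(7) sort the values of the iterated vector $r^{(t)} = (r_1, r_2, \dots, r_n)$ in descending order; the proteins with the k largest values are the key proteins.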

2. The key protein identification method based on Markov random walk according to claim 1, characterized in that in step (2) the weight q between protein vertices is calculated from the attribute values of the protein vertices and the edge weights, and the weight matrix is built as follows: from the PPI network input in step (1), the weight between two proteins is obtained from their common-neighbor similarity, gene expression similarity and GO semantic similarity;

The common-neighbor similarity is expressed as:

$NTE(u,v) = |C_u \cap C_v| + 1$    (1)

where $C_u$ is the set of neighbors of node u in the PPI network and $C_v$ is the set of neighbors of node v in the PPI network; $|C_u \cap C_v|$ is the number of common neighbors of nodes u and v, i.e. the number of triangles to which the edge belongs;

The formula for calculating the gene expression similarity between proteins u and v is as follows:

where s is the number of samples in the gene expression data, U and V are the genes encoding proteins u and v, $U_i$ and $V_i$ are the expression levels of genes U and V in sample i, $\bar{U}$ and $\bar{V}$ are the mean expression levels of genes U and V, and $\sigma(U)$ and $\sigma(V)$ are the standard deviations of the expression levels of genes U and V;

The GO semantic similarity is calculated with the method of Lin:

where genes U and V encode the interacting proteins u and v, $c_1$ and $c_2$ are GO terms of genes U and V respectively, $S(c_1, c_2)$ is the set of nearest common ancestor nodes of $c_1$ and $c_2$, $P(c)$ is the probability of an instance of term c, and $P_{ms}$ is the probability of occurrence of the nearest common ancestor of $c_1$ and $c_2$;

In the PPI network, the weight $w_{ij}$ between two proteins is obtained from the similarities between them. The formula is as follows:

$w_{ij} = a_1\,NTE(i,j) + a_2\,GES(i,j) + a_3\,GOS(i,j)$    (5)

where the parameters $a_1$, $a_2$, $a_3$ lie in the range (0,1) and the sum of $a_1$, $a_2$ and $a_3$ is 1.

3. The key protein identification method based on Markov random walk according to claim 1, characterized in that in step (3) all attribute values are normalized and the attribute matrix is built as follows: the attribute values are brought into the range (0,1) by Z-score or another normalization method, and the attribute vectors of all vertices form an attribute matrix.

4. The key protein identification method based on Markov random walk according to claim 1, characterized in that in step (4) the transfer matrix is built from the interaction relationships between protein vertices as follows:

Given a constant k < n, the k most important proteins in G, i.e. the Top-k, are found and are called key proteins; using the idea of a Markov random walk, each vertex $v_i$ is assigned a score $r_i^{(0)}$ indicating its importance, and the scores of all vertices form a score vector $r^{(0)} = (r_1^{(0)}, r_2^{(0)}, \dots, r_n^{(0)})^T$, an n × 1 column vector; given the initial value of r, the scores are propagated through the network by a walk with certain probabilities and are corrected as they propagate; the probability of propagating from $v_i$ to $v_j$ is defined as:

5. The key protein identification method based on Markov random walk according to claim 1, characterized in that in step (5) the score vector r is obtained iteratively according to the PageRank algorithm and the return probability $P_0$ is determined by the vertex attributes as follows:

$r^{(k+1)} = \alpha P^T r^{(k)} + (1-\alpha) P_0$    (8)

where $\alpha \in (0,1)$ is a constant and $P_0 \in (0,1)$ is a constant, the probability that the walking particle returns to its original starting point; the attributes $b_i$ of the vertices are used to determine the return probability $P_0$; let $P_0 = B \cdot q$ (9), where $q = (q_1, q_2, \dots, q_m)^T$ is an m × 1 column vector and $q_j$ is the weight of the j-th attribute, so that:

$r^{(k+1)} = \alpha P^T r^{(k)} + (1-\alpha) P_0 = \alpha P^T r^{(k)} + (1-\alpha) B \cdot q^{(k)}$    (10)

Let the objective function be the squared error between the right-hand side of formula (10) and $r^{(k+1)}$:

$J(r,q) = \| \alpha P^T r + (1-\alpha) B q - r \|^2$    (11)

Find r and q such that J(r, q) is minimized, that is, solve the following optimization problem:

$\min_{r,q} J(r,q) \quad \text{s.t. } r > 0,\ q > 0$    (12)

6. The key protein identification method based on Markov random walk according to claim 1, characterized in that in step (6) the objective function is obtained and optimized, and the initial values r and q are updated iteratively with the gradient descent formula, as follows:

After the objective function is obtained, it is optimized:

First, the partial derivatives of J with respect to r and q are found. From formula (11):

$J(r,q) = (\alpha P^T r + (1-\alpha) B q - r)^T (\alpha P^T r + (1-\alpha) B q - r)$
$= \alpha^2 r^T P P^T r + 2\alpha(1-\alpha) r^T P B q - 2\alpha r^T P r + (1-\alpha)^2 q^T B^T B q - 2(1-\alpha) r^T B q + r^T r$    (13)

From formula (13) it follows that:

According to the above gradients, the initial values $r^{(0)}$ and $q^{(0)}$ are updated iteratively with the gradient descent formula:

Here, ρ is the total number of iterations.