CN114242168A

CN114242168A - Method for identifying biologically essential protein

Info

Publication number: CN114242168A
Application number: CN202111548530.6A
Authority: CN
Inventors: 邹赛; 肖蕾; 贾伟; 谢明山; 王雷
Original assignee: Guizhou University; Chongqing College of Electronic Engineering
Current assignee: Guizhou University; Chongqing College of Electronic Engineering
Priority date: 2021-12-17
Filing date: 2021-12-17
Publication date: 2022-03-25
Anticipated expiration: 2041-12-17
Also published as: CN114242168B

Abstract

The invention provides a method that uses a deep neural network to identify essential proteins from multidimensional biological attribute information and PPI topological characteristics. To address the problem of incomplete gene expression data, missing data are supplemented so as to improve robustness. The inputs of the deep neural network are reduced to a PPI network topology feature (node degree), a Pearson correlation coefficient and a homologous correlation coefficient, which shortens the convergence time of the network. Finally, the deep neural network searches for the optimal association among the three features (node degree, Pearson correlation coefficient and homologous correlation coefficient), thereby improving the identification accuracy of essential proteins.

Description

Method for identifying biologically essential protein

Technical Field

The invention belongs to the field of life science, and particularly relates to a method for identifying biologically essential protein.

Background

Proteins are essential to the activities of an organism. In life processes, proteins are closely related to one another and jointly carry out a series of physiological activities, forming protein-protein interactions (PPI). A protein network contains some essential proteins whose removal by mutation causes the organism to lose the associated functions and fail to function properly. Therefore, predicting key proteins from the PPI network provides a theoretical basis for exploring pathogenic genes and developing drug targets.

In early work, essential proteins were identified mainly through biological experiments. Although such experimental techniques are highly accurate, they are time-consuming and expensive. With the development of information technology, predicting essential proteins from protein complexes and topological properties has become a new trend. Topology describes the manner, form and geometry in which endpoints are interconnected, using the two most basic geometric elements, points and lines. Research on social networks has found that the more strongly a node is associated with other nodes, the more important it is, and the more similar two nodes are, the more similar their importance. Based on this assumption, scholars have proposed many classical algorithms based on PPI network topology, such as Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC) and Information Centrality (IC), and have derived many new algorithms that identify key proteins from mixed topological features. For example, Wang et al. proposed the sum of edge clustering coefficients (SoECC) method, which fuses nodes and edges to evaluate the criticality of proteins; experimental results show that its identification accuracy is superior to that of the other six centrality methods. Schui et al. proposed a detection method that fuses the global and local features of nodes.

The above algorithms evaluate the criticality of a protein only from the topological features of the PPI network and neglect the biological significance of proteins as carriers of life activities. In addition, their identification accuracy is generally low because PPI networks contain large amounts of false-positive and false-negative data and lack protein activity information. To address this problem, many studies have combined multivariate biological information with topological features for key protein identification. For example, Tang et al. and Li et al. learned features from gene expression profiles and fused them with network topological features to identify key proteins. Functional features mainly relate to subcellular location, molecular function and the like. Zeng et al. fused PPI networks, subcellular locations and gene expression profiles to identify key proteins. Although these methods combine the biological information and topological characteristics of proteins, the number of protein-related features is large, and how to use these features effectively for protein identification remains a problem to be solved.

Deep learning (DL) relies on the modeling capability of deep neural networks: it can not only obtain multi-level features automatically from raw data but also model the nonlinear relationships between features. Since its introduction, deep learning has achieved breakthroughs in tasks such as image processing and natural language understanding and has been widely used in bioinformatics. Deep learning supports the learning of sequence data well, and it can capture topological features from a network model and map network nodes to low-dimensional dense vectors, which directly support applications such as classification, clustering and association reasoning. Zeng et al. explored and verified the advantages of a deep learning framework based on convolutional neural networks and long short-term memory networks for key protein recognition.

Disclosure of Invention

The invention aims to provide a method for identifying biologically essential proteins that overcomes two defects of the prior art: the identification of key proteins is time-consuming, and verifying the correctness of the model is complex and cumbersome.

The invention provides a method for identifying biologically essential proteins, comprising the steps of,

S1, downloading data;

S2, supplementing missing data;

S3, constructing a PPI network topological structure, a Pearson correlation coefficient and a homologous correlation coefficient;

S4, training the deep neural network.

Further, the step S1 includes,

obtaining yeast protein information from the data set and rejecting self-interactions and repeat interactions;

downloading information of homologous proteins from a database, and downloading gene expression data of the yeast from a data set;

and downloading a data set containing the essential genes of the saccharomyces cerevisiae from the database as a benchmark set.

Further, the step S2 includes,

for a given gene u, its gene expression at different times is represented by the vector Exp(u) = {Exp(u,1), Exp(u,2), …}, where Exp(u,i) is the average expression level of gene u at time i, and the protein degree information d_u and the homology information Ort_u of gene u are given by g_θ(u);

wherein

g_θ(u)＝[d_u, Ort_u]; let

be the actual value of the protein corresponding to gene u, where θ_i represents the parameter of gene u at the i-th time; when

takes its minimum value, the linear fit is best, that is, the regression model lies exactly on the boundary of the gene expression data;

the following steps are used:

let

taking the derivative with respect to θ yields:

obtaining:

the missing data are represented as:

wherein N(μ, σ²) represents a Gaussian perturbation.

Further, the step S3 includes,

S31, calculating the degree of the PPI node,

in the PPI network, let V represent the node set of the PPI network and E the edge set of the PPI network, so that an undirected graph G = (V, E) based on the PPI network is obtained,

let graph G = (V, E), μ ∈ V(G), e_(μ,ν) ∈ E(G); the degree d(μ) of the node μ is

d(μ) = Σ_(ν∈Γ(μ)) Num(e_(μ,ν))

where Γ(μ) represents the set of neighbor nodes of the node μ, e_(μ,ν) represents the edge between the node μ and the node ν, and Num() represents a counting function,

after normalization, the strength Sd_u of the node is

S32, calculating the gene expression correlation

for genes u and v, the PCC between them is calculated as follows:

wherein

is the average expression level of gene u over all time points,

is the average expression level of gene v over all time points, σ(u) is the standard deviation of the expression level of gene u over time, σ(v) is the standard deviation of the expression level of gene v over time, cov() represents the covariance function, T represents the total number of time points of gene u, and t represents a specific time point of gene u; if PCC_(u,v) is positive, genes u and v are positively correlated, and if PCC_(u,v) is negative, genes u and v are negatively correlated,

represents the average expression level of the gene u,

represents the average expression level of the gene v, and the average gene strength Gen_u of the gene u over all nodes is calculated.

Gen_v represents the average gene strength of the gene v over all nodes, and n represents the number of nodes;

S33, calculating the homology correlation

The semantic similarity defined by the gene ontology (GO) aims to provide a functional relationship among different biological processes, molecular functions or cell components. The shortest path connecting two terms or annotations is searched, and the sum of the weights on the shortest path is used to measure the semantic similarity on GO, wherein the distance between μ and ν is

where τ is their lowest common ancestor, root is their oldest ancestor, and dis() represents a distance function,

the average homology strength Ort_u of the node u among all nodes is calculated,

Ort_v represents the homology strength of the node v.

Further, the step S4 includes,

let X denote the processed protein data and Y the essential protein data; given a training set D = {(x, y)}, x ∈ X, y ∈ Y, we obtain:

y＝f(Σωx-θ)

where f() is the activation function, ω is the weight and θ is the threshold; each sample in the training set D is described by 3 input attributes, x = [Sd, Gen, Org]^T, and the output is a 2-dimensional real-valued vector y = [0/1, 0/1]; the number of hidden layers is defined as L and the number of nodes of each hidden layer as h; letting y' be the predicted value of y, we obtain:

from the input layer to the first hidden layer, the predicted value y'_(1,j) of the j-th node L_(1,j) is

where θ_j denotes the threshold of the j-th node of the first hidden layer, h denotes the number of nodes of the hidden layer, ω_(i,j) represents the weight between the i-th and j-th nodes, and ω_(i,j,1) represents the weight between the i-th and j-th nodes at the first hidden layer;

from hidden layer c to hidden layer d, the predicted value y'_(d,j) of the j-th node L_(d,j) is

from the hidden layer to the output layer, the predicted value Y'_j of the j-th node is

where y'_(L,i) denotes the predicted value of the i-th node of the L-th hidden layer,

the mean square error is used as the loss function Mse,

wherein size() is the length of the training data of the data set D; let Δω be the update of the weight,

ω←ω+Δω

given the learning rate η, the parameters are adjusted in the direction of the negative gradient of the target, as follows

where y represents the actual value, y' represents the predicted value, and y'_L represents the predicted value of the L-th hidden layer; the number L of hidden layers, the number h of nodes of each hidden layer and the threshold θ are thereby obtained.

The beneficial effects of the invention are as follows: first, a deep learning method suited to the characteristics of proteins is introduced, which improves the accuracy of key protein identification; second, the biological information fed to the deep learning model is reduced to degree centrality, gene expression and biological homology, which reduces the training time and the complexity of the training model.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The invention aims to provide a method for identifying biologically essential proteins, which comprises the following steps,

S1, data download:

The present invention obtains yeast protein information from the DIP and GAVIN data sets, respectively. After self-interactions and repeated interactions are removed, the DIP data set yields 5093 proteins, 24743 interactions and 1167 essential proteins. The GAVIN data set provides 1855 proteins, 7669 interactions and 714 essential proteins.

In addition, information on homologous proteins is downloaded from the InParanoid database (Version 7), which contains pairwise comparisons of 100 whole genomes. The yeast gene expression data are downloaded from the data set provided by Tu BP.

Finally, a data set containing 1285 Saccharomyces cerevisiae essential genes is downloaded from the four databases MIPS, SGDP, DEG and SGD as the reference set.
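As a non-limiting illustration, the removal of self-interactions and repeated interactions in step S1 can be sketched in Python as follows. The function name `clean_interactions` and the protein identifiers are hypothetical; real identifiers would come from the DIP or GAVIN data sets, whose file formats are not described here.

```python
def clean_interactions(pairs):
    """Drop self-interactions and repeated (order-insensitive) interactions."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:
            continue  # self-interaction
        key = frozenset((a, b))  # undirected: (a, b) and (b, a) are the same edge
        if key in seen:
            continue  # repeated interaction
        seen.add(key)
        cleaned.append((a, b))
    return cleaned


# Hypothetical toy input standing in for a downloaded interaction list.
raw = [("P1", "P2"), ("P2", "P1"), ("P3", "P3"), ("P2", "P3"), ("P1", "P2")]
edges = clean_interactions(raw)
proteins = sorted({p for edge in edges for p in edge})
```

The protein and interaction counts reported for each data set correspond to `len(proteins)` and `len(edges)` after this kind of cleaning.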

S2, missing data supplementation

Since the gene expression data cover only 95% of the proteins in the PPI networks built from the DIP and GAVIN databases, the following steps are used to increase the robustness of the algorithm. For a given gene u, its gene expression at different times is represented by the vector

Exp(u) = {Exp(u,1), Exp(u,2), …}, where Exp(u,i) is the average expression level of gene u at time i, and the protein degree information d_u and the homology information Ort_u of gene u are given by g_θ(u);

g_θ(u)＝θ×Exp(u)

＝θ₀+θ₁×Exp(u,1)+…,

wherein

g_θ(u)＝[d_u, Ort_u]; let

the following can be used:

let

Taking the derivative with respect to θ yields:

thus, it is possible to obtain:

the missing data may be expressed as:

wherein N(μ, σ²) represents a Gaussian perturbation.
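A minimal sketch of one way step S2 could be carried out is given below. It rests on assumptions that are not confirmed by the text: the linear model is fitted by ordinary least squares (the usual result of setting the derivative of a squared-error loss with respect to θ to zero), the missing expression profile of a gene is estimated from its known features [d_u, Ort_u], and the Gaussian perturbation N(μ, σ²) is added to the fitted value. The function names and toy numbers are hypothetical.

```python
import numpy as np


def fit_least_squares(features, targets):
    """Solve min_theta ||A @ theta - targets||^2, where A = [1, features]."""
    A = np.column_stack([np.ones(len(features)), features])
    theta, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return theta


def impute_expression(features, expr, mu=0.0, sigma=0.05, seed=0):
    """Fill rows of `expr` that are entirely NaN with a linear fit plus noise.

    Assumption: genes without expression data are predicted from the
    genes that have it, then perturbed by N(mu, sigma^2).
    """
    rng = np.random.default_rng(seed)
    missing = np.isnan(expr).all(axis=1)
    n_missing = int(missing.sum())
    theta = fit_least_squares(features[~missing], expr[~missing])
    A_missing = np.column_stack([np.ones(n_missing), features[missing]])
    noise = rng.normal(mu, sigma, size=(n_missing, expr.shape[1]))
    filled = expr.copy()
    filled[missing] = A_missing @ theta + noise
    return filled


# Hypothetical toy data: columns of `features` are [d_u, Ort_u];
# the last gene has no expression profile.
features = np.array([[1.0, 0.2], [2.0, 0.1], [3.0, 0.4], [4.0, 0.3], [2.5, 0.25]])
expr = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
    [4.0, 5.0, 6.0],
    [np.nan, np.nan, np.nan],
])
filled = impute_expression(features, expr)
```

Observed rows are left unchanged; only the rows with no expression data are filled.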

S3, constructing the PPI network topological structure, the Pearson correlation coefficient and the homologous correlation coefficient as the inputs of the deep neural network

A PPI network is constructed from the yeast protein interactions in the GAVIN and DIP databases, and the degree of each node is calculated; the gene influence of each node is calculated from the gene expression database; and the homology influence of each node is calculated from the homology database. Then, with the gene expression profile, the protein interaction network and the subcellular location information as input features and the information in the key protein library as output features, the specific method is as follows:

S31, calculating the degree of a PPI node

In the PPI network, let V represent the node (protein) set and E the edge (protein-protein interaction) set of the PPI network; an undirected graph G = (V, E) based on the PPI network is then obtained. The degree d(μ) of a node μ is

d(μ) = Σ_(ν∈Γ(μ)) Num(e_(μ,ν))

where Γ(μ) represents the set of neighbor nodes of the node μ and Num() is a counting function: if the edge e_(μ,ν) to a neighbor node exists, its value is 1; otherwise it is 0.

The degree is then normalized to give the strength Sd_u of the node.
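The degree computation of step S31 can be illustrated with the short Python sketch below. The degree is simply the number of neighbor nodes. The normalization shown (division by the maximum degree in the network) is an assumption chosen for illustration, since the normalization formula is not reproduced here; the function names and toy edges are hypothetical.

```python
from collections import defaultdict


def node_degrees(edges):
    """Degree of each node in an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(adjacent) for node, adjacent in neighbors.items()}


def normalized_strength(degrees):
    """Assumed normalization: divide each degree by the maximum degree."""
    max_degree = max(degrees.values())
    return {node: d / max_degree for node, d in degrees.items()}


# Hypothetical toy PPI network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
deg = node_degrees(edges)
sd = normalized_strength(deg)
```

With this choice the most connected protein has strength 1 and all other strengths lie in (0, 1], which keeps the feature on a scale comparable to the correlation-based features.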

S32, calculating the gene expression correlation

PCC is the Pearson correlation coefficient, which measures the linear correlation between two variables and takes values in [-1, 1]. The invention introduces PCC, which is widely applied in the natural sciences, to characterize the similarity of gene co-expression. For genes u and v, the PCC between them can be calculated as follows:

wherein

σ(u) is the standard deviation of the expression level of gene u over time. If PCC_(u,v) is positive, gene u is positively correlated with gene v; if PCC_(u,v) is negative, genes u and v are negatively correlated.

The average gene strength of gene u over all nodes is calculated by equation (9).
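The calculation of step S32 can be sketched in Python as follows. The function `pcc` implements the standard Pearson correlation coefficient (covariance divided by the product of the standard deviations). The form of the average gene strength is an assumption: it is taken here as the mean PCC between gene u and every other gene, because equation (9) is not reproduced in this text. All names and expression values are hypothetical.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def average_gene_strength(expr):
    """Assumed form of Gen_u: mean PCC of gene u with every other gene."""
    genes = list(expr)
    return {
        u: sum(pcc(expr[u], expr[v]) for v in genes if v != u) / (len(genes) - 1)
        for u in genes
    }


# Hypothetical expression profiles over four time points.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
gen = average_gene_strength(expr)
```

In the toy data, g1 and g2 rise together (PCC of 1) while g3 falls (PCC of -1 with both), so g3 receives the lowest average strength.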

S33, calculating the homology correlation

Semantic similarity defined by Gene Ontology (GO) aims at providing functional relationships between different biological processes, molecular functions or cellular components. The invention searches for the shortest path connecting two terms or annotations and uses the sum of the weights on that path to measure the semantic similarity on GO. Based on the Tversky ratio model of similarity, the distance between μ and ν is

where τ is their lowest common ancestor and root is their oldest ancestor.

The average homology strength of the node u over all nodes is calculated by formula (11).
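The shortest-path search described in step S33 can be illustrated by the Python sketch below, which sums the edge weights along the shortest path between two terms using Dijkstra's algorithm. It shows only the path-distance part; the Tversky-based distance formula and formula (11) are not reproduced here and are not implemented. The toy ontology, its term names and its edge weights are hypothetical.

```python
import heapq


def undirected(edges):
    """Build an adjacency map {term: {neighbor: weight}} from (a, b, w) triples."""
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, {})[b] = w
        graph.setdefault(b, {})[a] = w
    return graph


def shortest_distance(graph, source, target):
    """Dijkstra's algorithm: smallest sum of edge weights between two terms."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, term = heapq.heappop(queue)
        if term == target:
            return dist
        if dist > best.get(term, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph.get(term, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")  # no path between the two terms


# Hypothetical toy ontology: root has children t1 and t4; t1 has children t2 and t3.
go = undirected([
    ("root", "t1", 1.0),
    ("t1", "t2", 0.5),
    ("t1", "t3", 0.5),
    ("root", "t4", 1.0),
])
d_siblings = shortest_distance(go, "t2", "t3")  # path passes through t1
d_far = shortest_distance(go, "t2", "t4")       # path passes through root
```

In a tree-shaped ontology the shortest path between two terms passes through their lowest common ancestor, which is why sibling terms are closer than terms whose only shared ancestor is the root.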

S4, training the IYEP deep neural network model

X represents the processed protein data, and Y is the essential protein data. Given a training set D = {(x, y)}, x ∈ X, y ∈ Y, one obtains:

y＝f(∑ωx-θ) (12)

where f() is the activation function (the tanh function is used in the present invention), ω is the weight and θ is the threshold. Each sample in the training set D is described by 3 input attributes, x = [Sd, Gen, Org]^T, and the output is a 2-dimensional real-valued vector y = [0/1, 0/1]. The number of hidden layers is defined as L, and the number of nodes of each hidden layer as h. As can be seen from FIG. 1, the training model of IYEP consists of three parts: from the input layer X to the first hidden layer, between the hidden layers, and from the last hidden layer to the output layer Y. Let y' be the predicted value of y; combining with equation (12) gives:

where θ_j represents the threshold of the j-th node of the first hidden layer.

From the hidden layer to the output layer, the predicted value Y'_j of the j-th node is

In deep neural network model training, the aim is to find a model with the least error, and the mean square error is used as a loss function Mse.

where size() is the length of the training data of the data set D. Let Δω be the update of the weight, i.e.

ω←ω+Δω (17)

Based on a gradient descent strategy, given a learning rate η, the parameters are adjusted in the direction of the negative gradient of the target, which gives the following equation

In a manner similar to formula (18), the number L of hidden layers, the number h of nodes of each hidden layer and the threshold θ can be obtained; substituting these parameters into the IYEP training model yields the decision model of the deep neural network.
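The training procedure of step S4 can be sketched with NumPy as follows. Each layer computes y = f(Σωx − θ) with f = tanh, the loss is the mean square error, and each weight is updated as ω ← ω + Δω, where Δω is assumed to be −η times the gradient of the loss (plain gradient descent). This is a minimal sketch, not the IYEP model itself: the layer sizes, learning rate, iteration count, initialization and toy data are all hypothetical.

```python
import numpy as np


def init_params(sizes, seed=0):
    """One (weight matrix, threshold vector) pair per layer transition."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.5, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, x):
    """Activations of every layer, each computed as tanh(a @ w - theta)."""
    activations = [x]
    for w, theta in params:
        activations.append(np.tanh(activations[-1] @ w - theta))
    return activations


def mse(pred, y):
    return float(np.mean((pred - y) ** 2))


def train_step(params, x, y, eta):
    """One update w <- w + dw, with dw = -eta * dMse/dw (assumed form)."""
    acts = forward(params, x)
    # Gradient of Mse w.r.t. the output pre-activation; tanh'(z) = 1 - tanh(z)^2.
    delta = 2.0 * (acts[-1] - y) / y.size * (1.0 - acts[-1] ** 2)
    updated = []
    for (w, theta), a_in in zip(reversed(params), reversed(acts[:-1])):
        grad_w = a_in.T @ delta
        grad_theta = -delta.sum(axis=0)  # z = a @ w - theta, so dz/dtheta = -1
        # Propagate the error to the previous layer (the value computed
        # for the input layer is never used).
        delta = (delta @ w.T) * (1.0 - a_in ** 2)
        updated.append((w - eta * grad_w, theta - eta * grad_theta))
    return updated[::-1]


# Hypothetical toy samples: inputs are [Sd, Gen, Org], outputs are 2-dimensional 0/1 vectors.
x = np.array([[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.1, 0.2, 0.1], [0.2, 0.1, 0.3]])
y = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

params = init_params([3, 4, 4, 2])  # L = 2 hidden layers, h = 4 nodes each
loss_before = mse(forward(params, x)[-1], y)
for _ in range(2000):
    params = train_step(params, x, y, eta=0.1)
loss_after = mse(forward(params, x)[-1], y)
```

In practice L, h and η would be selected by comparing the loss on held-out data across candidate settings; the sketch fixes them only to keep the example short.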

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for identifying a biologically essential protein comprising the steps of,

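S1, downloading data;

S2, supplementing missing data;

S3, constructing a PPI network topological structure, a Pearson correlation coefficient and a homologous correlation coefficient;

S4, training the deep neural network.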

2. The method for identifying a biologically essential protein according to claim 1, wherein

the step S1 includes the steps of,

3. The method for identifying a biologically essential protein according to claim 1, wherein

the step S2 includes the steps of,

for a given gene u, its gene expression at different times is represented by a vector

wherein

g_θ(u)＝[d_u, Ort_u]; let

the following steps are used:

let

taking the derivative with respect to θ yields:

obtaining:

the missing data are represented as:

wherein N(μ, σ²) represents a Gaussian perturbation.

4. The method for identifying a biologically essential protein according to claim 1, wherein

the step S3 includes the steps of,

S31, calculating the degree of the PPI node,

after normalization, the strength Sd_u of the node is

S32, calculating the gene expression correlation

for genes u and v, the PCC between them is calculated as follows:

wherein

is the average expression level of gene u over all time points,

represents the average expression level of the gene u,

S33, calculating the homology correlation

the average homology strength Ort_u of the node u among all nodes is calculated,

Ort_v represents the homology strength of the node v.

5. The method for identifying a biologically essential protein according to claim 1, wherein

the step S4 includes the steps of,

y＝f(∑ωx-θ)

where θ_j denotes the threshold of the j-th node of the first hidden layer, h represents the number of nodes of the hidden layer, ω_(i,j) represents the weight between the i-th and j-th nodes, and ω_(i,j,1) represents the weight between the i-th and j-th nodes at the first hidden layer;

from the hidden layer to the output layer, the predicted value Y'_j of the j-th node is

the mean square error is used as the loss function Mse,

ω←ω+Δω