CN114242168A - Method for identifying biologically essential protein - Google Patents


Info

Publication number
CN114242168A
Authority
CN
China
Prior art keywords
gene
node
protein
nodes
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111548530.6A
Other languages
Chinese (zh)
Other versions
CN114242168B (en)
Inventor
邹赛
肖蕾
贾伟
谢明山
王雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Chongqing College of Electronic Engineering
Original Assignee
Guizhou University
Chongqing College of Electronic Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University and Chongqing College of Electronic Engineering
Priority to CN202111548530.6A
Publication of CN114242168A
Application granted
Publication of CN114242168B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30: Detection of binding sites or motifs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method that uses a deep neural network to identify essential proteins from multidimensional biological attribute information and PPI topological characteristics. To address incomplete gene expression data, missing data are supplemented, which improves robustness. The convergence of the deep neural network is accelerated by separately constructing the PPI network topology, the Pearson correlation coefficient, and the homologous correlation coefficient. Finally, the deep neural network searches for the optimal association among three features (node degree, Pearson correlation coefficient, and homologous correlation coefficient), thereby improving the accuracy of essential protein identification.

Description

Method for identifying biologically essential protein
Technical Field
The invention belongs to the field of life science, and particularly relates to a method for identifying biologically essential protein.
Background
Proteins are essential to the activities of an organism. In life processes, proteins are closely related to one another and jointly carry out a series of physiological activities, forming protein-protein interactions (PPIs). A protein network contains some essential proteins whose removal or mutation causes the organism to lose the associated functions, so that it can no longer function properly. Therefore, predicting key proteins based on the PPI network provides a theoretical basis for the exploration of pathogenic genes and the development of drug targets.
In early work, important proteins were identified mainly through biological experiments. Although experimental techniques are highly accurate, such experiments are time-consuming and expensive. With the development of information technology, predicting essential proteins based on protein complexes and topological properties has become a new trend. Topology describes the manner, form, and geometry in which endpoints are interconnected, using the two most basic geometric elements, points and lines. Research on social networks has found that the more strongly a node is associated with other nodes, the more important it is, and that the more similar two nodes are, the more similar their importance. Based on this assumption, scholars have proposed many classical algorithms based on PPI network topology characteristics, such as Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC), and Information Centrality (IC), and have also derived many new algorithms that identify key proteins from mixed topological characteristics. For example, Wang et al. proposed an edge clustering coefficient (SoECC) method that fuses nodes and edges to evaluate the criticality of proteins; the reported experimental results show that the identification accuracy of this method is superior to that of six other centrality methods. Schui et al. proposed a detection method that fuses the global and local features of nodes.
The above algorithms evaluate the criticality of a protein based only on the topological characteristics of the PPI network and neglect the biological significance of the protein as a carrier of life activities. In addition, their identification accuracy is generally low because of problems such as the large amount of false positive and false negative data in the PPI network and issues involving protein activity information. To address this problem, many studies have combined multiple kinds of biological information with topological characteristics for key protein identification. For example, Tang et al. and Li et al. learned features from gene expression profiles and fused them with network topological features to identify key proteins. Functional characteristics mainly relate to subcellular location, molecular function, and the like. Zeng et al. fused PPI networks, subcellular locations, and gene expression profiles to identify key proteins. Although these methods combine the biological information and topological characteristics of proteins, the number of protein-related features is large, and how to use these features effectively for protein identification remains a problem to be solved.
Deep Learning (DL) relies on the modeling capability of deep neural networks: it can automatically obtain multi-level features from raw data and can also model the nonlinear relationships among features. Since its introduction, deep learning has achieved breakthroughs in tasks such as image processing and natural language understanding and has been widely used in bioinformatics. Deep learning supports learning from sequence data well, and it can capture topological features from a network model and map network nodes to low-dimensional dense vectors, directly supporting applications such as classification, clustering, and association reasoning. Zeng et al. explored and verified the advantages of a deep learning framework for key protein recognition based on convolutional neural networks and long short-term memory networks.
Disclosure of Invention
The invention aims to provide a method for identifying biologically essential proteins that overcomes the defects of the prior art, in which identifying key proteins is time-consuming and verifying the correctness of the model is complex and cumbersome.
The invention provides a method for identifying biologically essential proteins, comprising the steps of:
S1: downloading data;
S2: supplementing missing data;
S3: constructing a PPI network topology, a Pearson correlation coefficient, and a homologous correlation coefficient;
S4: training a deep neural network.
Further, the step S1 includes,
obtaining yeast protein information from the data set and removing self-interactions and repeated interactions;
downloading homologous protein information from a database, and downloading yeast gene expression data from a data set;
and downloading a data set containing the essential genes of Saccharomyces cerevisiae from the database as a benchmark set.
Further, the step S2 includes,
for a given gene u, its gene expression at different times is expressed by the vector Exp(u) = {Exp(u,1), Exp(u,2), …}, where Exp(u, i) is the average expression level of gene u at time i; let the degree information $d_u$ of the protein of gene u and its homology information $Ort_u$ be $g_\theta(u)$;
$g_\theta(u) = \theta \times Exp(u) = \theta_0 + \theta_1 \times Exp(u,1) + \dots$
Wherein,
Figure BDA0003416504420000032
$g_\theta(u) = [d_u, Ort_u]$. Let
Figure BDA0003416504420000033
be the actual value of the protein corresponding to gene u, where $\theta_i$ represents the value of gene u at the i-th time. When
Figure BDA0003416504420000034
takes its minimum value, the linear fit is best, that is, the regression model lies exactly on the boundary of gene expression;
the following is used:
Figure BDA0003416504420000041
let
Figure BDA0003416504420000042
Then, taking the derivative with respect to θ yields:
Figure BDA0003416504420000043
which gives:
Figure BDA0003416504420000044
the missing data are expressed as:
Figure BDA0003416504420000045
where $N(\mu, \sigma^2)$ represents a Gaussian perturbation.
Further, the step S3 includes,
S31: calculating the degree of each PPI node,
in the PPI network, let V represent the node set of the PPI network and E represent the edge set of the PPI network, giving an undirected graph $G = (V, E)$ based on the PPI network,
for the graph $G = (V, E)$ with $\mu \in V(G)$ and $e_{\mu,\nu} \in E(G)$, the degree $d(\mu)$ of node $\mu$ is
Figure BDA0003416504420000046
where $\Gamma(\mu)$ represents the set of neighbor nodes of node $\mu$, $e_{\mu,\nu}$ represents the edge between node $\mu$ and node $\nu$, and Num() represents a counting function,
after normalization, the strength $Sd_u$ of the node is
Figure BDA0003416504420000051
S32: calculating the gene expression correlation
For genes u and v, the PCC between them is calculated as follows:
Figure BDA0003416504420000052
where
Figure BDA0003416504420000053
is the average expression level of gene u over all time points,
Figure BDA0003416504420000054
is the average expression level of gene v over all time points, σ(u) is the standard deviation of the expression level of gene u over all time points, σ(v) is the standard deviation of the expression level of gene v over all time points, cov() denotes the covariance function, T denotes the total number of time points for gene u, and t denotes a specific time point of gene u; if $PCC_{(u,v)}$ is positive, genes u and v are positively correlated, and if $PCC_{(u,v)}$ is negative, genes u and v are negatively correlated,
Figure BDA0003416504420000055
represents the average expression level of gene u,
Figure BDA0003416504420000056
represents the average expression level of gene v, and the average gene strength $Gen_u$ of gene u over all nodes is calculated as
Figure BDA0003416504420000057
where $Gen_v$ represents the average gene strength of gene v over all nodes, and n represents the number of nodes;
S33: calculating the homology correlation
The semantic similarity defined by the Gene Ontology (GO) aims to provide functional relationships among different biological processes, molecular functions, or cellular components. The shortest path connecting two terms or annotations is searched, and the sum of the weights on the shortest path is used to measure semantic similarity on GO, where the distance between $\mu$ and $\nu$ is
Figure BDA0003416504420000058
where τ is their lowest common ancestor, root is their oldest ancestor, and dis() represents a distance function,
the average homology strength $Ort_u$ of node u over all nodes is calculated as
Figure BDA0003416504420000061
where $Ort_v$ represents the homology strength of node v.
Further, the step S4 includes,
let X denote the processed protein data and Y the essential protein data; given a training set $D = \{(x, y)\}$, $x \in X$, $y \in Y$, one obtains:
$y = f(\sum \omega x - \theta)$
where f() is the activation function, ω is the weight, and θ is the threshold; each input in the training set D is described by 3 attributes, $x = [Sd, Gen, Ort]^T$, and the output is a 2-dimensional real-valued vector $y = [0/1, 0/1]$; defining the number of hidden layers as L and the number of nodes in each hidden layer as h, and letting y' be the predicted value of y, one obtains:
from the input layer to the first hidden layer, the predicted value $y'_{1,j}$ of the j-th node $L_{1,j}$ is
Figure BDA0003416504420000062
where $\theta_j$ denotes the threshold of the j-th node of the first hidden layer, h denotes the number of nodes in the hidden layer, $\omega_{i,j}$ represents the weight between the i-th and j-th nodes, and $\omega_{i,j,1}$ represents the weight between the i-th and j-th nodes of the first hidden layer;
from hidden layer c to hidden layer d, the predicted value $y'_{d,j}$ of the j-th node $L_{d,j}$ is
Figure BDA0003416504420000063
from the hidden layer to the output layer, the predicted value $Y'_j$ of the j-th node is
Figure BDA0003416504420000071
where $y'_{L,i}$ denotes the predicted value of the i-th node of the L-th hidden layer,
the mean square error is used as the loss function Mse,
Figure BDA0003416504420000072
where size() is the length of the training data of the data set D; let Δω be the update to the weight,
$\omega \leftarrow \omega + \Delta\omega$
given the learning rate η, the parameters are adjusted in the direction of the negative gradient of the objective, as follows
Figure BDA0003416504420000073
where y represents the actual value, y' represents the predicted value, and $y'_L$ represents the predicted value of the L-th hidden layer; the number L of hidden layers, the number h of nodes per hidden layer, and the threshold θ are thereby obtained.
The beneficial effects of the invention are as follows. First, a deep learning method tailored to the characteristics of proteins is introduced, which improves the accuracy of key protein identification. Second, degree centrality, gene expression, and biological homology are used to reduce the amount of biological information fed into the deep learning process for identifying key proteins, which reduces both the training time and the complexity of the training model.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention aims to provide a method for identifying biologically essential proteins, which comprises the following steps,
S1: data download
The invention obtains yeast protein information from the DIP and GAVIN data sets, respectively. After removing self-interactions and repeated interactions, the DIP data set yields 5093 proteins, 24743 interactions, and 1167 essential proteins. The GAVIN data set provides 1855 proteins, 7669 interactions, and 714 essential proteins.
In addition, information on homologous proteins is downloaded from the InParanoid database (Version 7), which contains pairwise comparisons among 100 whole genomes. The yeast gene expression data are downloaded from the data set provided by Tu BP.
Finally, the invention downloads a data set containing 1285 Saccharomyces cerevisiae essential genes from four databases (MIPS, SGDP, DEG, and SGD) as a reference set.
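The cleaning described in S1 (removing self-interactions and repeated interactions) can be sketched as follows. This is only an illustrative sketch: the function name `clean_interactions` and the protein identifiers `P1` to `P3` are hypothetical, and the patent does not state how the DIP or GAVIN files are parsed.

```python
def clean_interactions(pairs):
    """Drop self-interactions and repeated (order-insensitive) pairs."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:
            continue  # self-interaction: a protein paired with itself
        key = tuple(sorted((a, b)))  # undirected edge: (a, b) equals (b, a)
        if key in seen:
            continue  # repeated interaction
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Hypothetical toy input, not taken from DIP or GAVIN.
raw_pairs = [("P1", "P2"), ("P2", "P1"), ("P1", "P1"), ("P3", "P1")]
edges = clean_interactions(raw_pairs)
proteins = {p for edge in edges for p in edge}
```

Treating each interaction as an unordered pair matches the undirected PPI graph used later in S31.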
S2: missing data supplementation
Because only 95% of the proteins in the PPI networks built from the DIP and GAVIN databases are covered by the gene expression library, the following processing is used to increase the robustness of the algorithm. For a given gene u, its gene expression at different times is expressed by the vector
Exp(u) = {Exp(u,1), Exp(u,2), …}, where Exp(u, i) is the average expression level of gene u at time i; let the degree information $d_u$ of the protein of gene u and its homology information $Ort_u$ be $g_\theta(u)$;
$g_\theta(u) = \theta \times Exp(u) = \theta_0 + \theta_1 \times Exp(u,1) + \dots$
Wherein,
Figure BDA0003416504420000091
$g_\theta(u) = [d_u, Ort_u]$. Let
Figure BDA0003416504420000092
be the actual value of the protein corresponding to gene u, where $\theta_i$ represents the value of gene u at the i-th time. When
Figure BDA0003416504420000093
takes its minimum value, the linear fit is best, that is, the regression model lies exactly on the boundary of gene expression;
the following can be used:
Figure BDA0003416504420000094
let
Figure BDA0003416504420000095
Then, taking the derivative with respect to θ yields:
Figure BDA0003416504420000096
thus, it is possible to obtain:
Figure BDA0003416504420000097
the missing data may be expressed as:
Figure BDA0003416504420000098
where $N(\mu, \sigma^2)$ represents a Gaussian perturbation.
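Because the equations of S2 survive only as image placeholders, the following is a hedged sketch of the general technique the text describes: fit a linear model by least squares (setting the derivative of the squared error to zero), then fill a missing value with the model prediction plus a Gaussian perturbation $N(\mu, \sigma^2)$. The direction of the regression (predicting a value from two known features), the function names, and the toy data are all assumptions for illustration, not the patent's exact formulation.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return theta minimizing ||X_b @ theta - y||^2, with an intercept theta_0."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend the intercept column
    # lstsq solves the same problem as the normal equation obtained by
    # setting the derivative of the squared error to zero.
    theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
    return theta


def impute_missing(theta, features, rng, mu=0.0, sigma=0.1):
    """Model prediction plus a Gaussian perturbation N(mu, sigma^2)."""
    x_b = np.concatenate(([1.0], features))
    return float(x_b @ theta + rng.normal(mu, sigma))


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2.
X_known = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_known = 1.0 + 2.0 * X_known[:, 0] + 3.0 * X_known[:, 1]
theta = fit_least_squares(X_known, y_known)

rng = np.random.default_rng(0)
# sigma=0.0 switches the perturbation off so the estimate is deterministic.
estimate = impute_missing(theta, np.array([2.0, 2.0]), rng, sigma=0.0)
```

With a nonzero `sigma`, repeated calls give slightly different fills, which is the role the text assigns to the Gaussian perturbation.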
S3: constructing the PPI network topology, the Pearson correlation coefficient, and the homologous correlation coefficient as the input part of the deep neural network
A PPI network structure is constructed from the yeast protein associations in the GAVIN and DIP databases, and the degree of each node is calculated; the gene influence of each node is calculated from the gene expression database; and the homology influence of each node is calculated from the homology database. The gene expression profile, the protein interaction network, and the subcellular location information are then taken as input features, and the information in the key protein library is taken as the output feature. The specific method is as follows:
S31: calculating the degree of each PPI node
In the PPI network, let V represent the node (protein) set of the PPI network and E represent the edge (protein-protein interaction) set of the PPI network; an undirected graph $G = (V, E)$ based on the PPI network is then obtained.
For the graph $G = (V, E)$ with $\mu \in V(G)$ and $e_{\mu,\nu} \in E(G)$, the degree $d(\mu)$ of node $\mu$ is
Figure BDA0003416504420000101
where $\Gamma(\mu)$ represents the set of neighbor nodes of node $\mu$, and Num() is an indicator function: if the edge $e_{\mu,\nu}$ to a neighbor node exists, its value is 1; otherwise it is 0.
After normalization, the strength of the node is
Figure BDA0003416504420000102
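The degree computation of S31 can be sketched as below: each node's degree is the number of its distinct neighbors, that is, the sum of Num() over its neighbor set. The normalization formula itself is not recoverable from the text, so dividing by the maximum degree is purely an assumption made for this sketch; the function names and node labels are hypothetical as well.

```python
from collections import defaultdict


def node_degrees(edges):
    """Degree of each node in an undirected graph given as (u, v) pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # Each existing edge to a neighbor contributes 1, as with Num().
    return {node: len(nbrs) for node, nbrs in neighbors.items()}


def normalized_strength(degrees):
    """Assumed normalization: divide every degree by the maximum degree."""
    max_degree = max(degrees.values())
    return {node: d / max_degree for node, d in degrees.items()}


# Hypothetical toy PPI network.
ppi_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
degree = node_degrees(ppi_edges)
strength = normalized_strength(degree)
```

Any other scaling into a fixed range, such as dividing by n - 1, would serve the same purpose of making the feature comparable across networks of different size.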
S32: calculating the gene expression correlation
PCC is the Pearson correlation coefficient, which measures the linear correlation between two variables and takes values in [-1, 1]. The invention introduces PCC, which is widely applied in the natural sciences, to characterize the similarity of gene co-expression. For genes u and v, the PCC between them can be calculated as follows:
Figure BDA0003416504420000103
where
Figure BDA0003416504420000111
σ(u) is the standard deviation of the expression level of gene u over all time points. If $PCC_{(u,v)}$ is positive, genes u and v are positively correlated; if $PCC_{(u,v)}$ is negative, genes u and v are negatively correlated.
The average gene strength of gene u over all nodes is calculated by equation (9):
Figure BDA0003416504420000112
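A minimal sketch of S32 follows. The `pcc` function is the standard Pearson correlation coefficient (covariance divided by the product of the standard deviations). Equation (9) is not recoverable from the text, so averaging the PCC of gene u with every other gene is an assumption of this sketch, and the gene names and expression values are hypothetical.

```python
import numpy as np


def pcc(exp_u, exp_v):
    """Pearson correlation coefficient of two expression time series."""
    u = np.asarray(exp_u, dtype=float)
    v = np.asarray(exp_v, dtype=float)
    du = u - u.mean()  # deviations from the mean expression of u
    dv = v - v.mean()  # deviations from the mean expression of v
    return float((du @ dv) / np.sqrt((du @ du) * (dv @ dv)))


def average_gene_strength(expression):
    """Assumed form of Gen_u: mean PCC between gene u and every other gene."""
    genes = list(expression)
    return {
        u: sum(pcc(expression[u], expression[v]) for v in genes if v != u)
        / (len(genes) - 1)
        for u in genes
    }


# Hypothetical expression profiles over four time points.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
gen = average_gene_strength(expression)
```

Here `g1` and `g2` rise together (PCC of 1) while `g3` falls (PCC of -1 with both), so `g3` ends up with the lowest average strength.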
S33: calculating the homology correlation
The semantic similarity defined by the Gene Ontology (GO) aims to provide functional relationships among different biological processes, molecular functions, or cellular components. The invention searches for the shortest path connecting two terms or annotations and uses the sum of the weights on the shortest path to measure semantic similarity on GO. Based on Tversky's ratio model of similarity, the distance between $\mu$ and $\nu$ is
Figure BDA0003416504420000113
where τ is their lowest common ancestor and root is their oldest ancestor.
The average homology strength of node u over all nodes is calculated by equation (11):
Figure BDA0003416504420000114
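The distance formula of S33 is not recoverable from the text, so the sketch below only computes its ingredients: the lowest common ancestor τ of two terms and path lengths such as dis(u, τ) and dis(τ, root). It makes strong simplifying assumptions: the ontology is a toy tree with one parent per term (the real GO is a directed acyclic graph in which a term can have several parents), every edge has weight 1, and all term names are hypothetical.

```python
def path_to_root(term, parent):
    """List of terms from `term` up to the root, following parent links."""
    path = [term]
    while term in parent:
        term = parent[term]
        path.append(term)
    return path


def lowest_common_ancestor(a, b, parent):
    """First term on b's path to the root that is also an ancestor of a."""
    ancestors_a = set(path_to_root(a, parent))
    for term in path_to_root(b, parent):
        if term in ancestors_a:
            return term
    raise ValueError("terms do not share a root")


def dis(term, ancestor, parent):
    """Number of unit-weight edges from `term` up to `ancestor`."""
    return path_to_root(term, parent).index(ancestor)


# Hypothetical toy ontology: child -> parent.
parent = {"t1": "root", "t2": "root", "t3": "t1", "t4": "t1", "t5": "t2"}
tau = lowest_common_ancestor("t3", "t4", parent)
```

In this toy tree, `t3` and `t4` share the ancestor `t1`, so both are one edge away from τ and τ is one edge away from the root; a ratio-style distance would combine these three path lengths.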
S4: IYEP deep neural network model training
X represents the processed protein data, and Y is the essential protein data. Given a training set $D = \{(x, y)\}$, $x \in X$, $y \in Y$, one obtains:
$y = f(\sum \omega x - \theta)$ (12)
where f() is the activation function (the tanh function is used in this patent), ω is the weight, and θ is the threshold. Each input in the training set D is described by 3 attributes, $x = [Sd, Gen, Ort]^T$, and the output is a 2-dimensional real-valued vector $y = [0/1, 0/1]$. The number of hidden layers is defined as L, and the number of nodes in each hidden layer is defined as h. As can be seen from FIG. 1, the IYEP training model consists of three parts: from the input layer X to the first hidden layer, from hidden layer to hidden layer, and from the last hidden layer to the output layer Y. Let y' be the predicted value of y; combining equation (12) gives:
from the input layer to the first hidden layer, the predicted value $y'_{1,j}$ of the j-th node $L_{1,j}$ is
Figure BDA0003416504420000121
where $\theta_j$ represents the threshold of the j-th node of the first hidden layer.
From hidden layer c to hidden layer d, the predicted value $y'_{d,j}$ of the j-th node $L_{d,j}$ is
Figure BDA0003416504420000122
From the hidden layer to the output layer, the predicted value $Y'_j$ of the j-th node is
Figure BDA0003416504420000123
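The layer-by-layer prediction can be sketched as a forward pass that applies equation (12), $y = f(\sum \omega x - \theta)$ with f = tanh, at every layer. The weights and thresholds below are hypothetical toy values with one hidden layer of h = 3 nodes, and applying tanh at the output layer as well is an assumption of this sketch.

```python
import numpy as np


def forward(x, weights, thresholds):
    """Apply y = tanh(W @ y_prev - theta) through each layer in turn."""
    activation = np.asarray(x, dtype=float)
    for W, theta in zip(weights, thresholds):
        activation = np.tanh(W @ activation - theta)
    return activation


# Hypothetical parameters: 3 inputs [Sd, Gen, Ort] -> 3 hidden nodes -> 2 outputs.
weights = [
    np.eye(3),
    np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
]
thresholds = [np.zeros(3), np.zeros(2)]

prediction = forward([1.0, 0.0, 0.0], weights, thresholds)
```

Adding more hidden layers only means appending further `(W, theta)` pairs with matching shapes to the two lists.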
In deep neural network model training, the aim is to find the model with the smallest error, and the mean square error is used as the loss function Mse.
Figure BDA0003416504420000131
where size() is the length of the training data of data set D. Let Δω be the update to the weight, i.e.
$\omega \leftarrow \omega + \Delta\omega$ (17)
Based on a gradient descent strategy, given a learning rate η, the parameters are adjusted in the direction of the negative gradient of the objective, which gives the following equation
Figure BDA0003416504420000132
In a manner similar to equation (18), the number L of hidden layers, the number h of nodes per hidden layer, and the threshold θ can be obtained. Substituting these parameters into the IYEP training model yields the decision model of the deep neural network.
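The update rule $\omega \leftarrow \omega + \Delta\omega$ with a step along the negative gradient of Mse can be sketched for a single tanh unit as follows. This is a deliberately reduced illustration, not the multi-layer IYEP model: the data, the learning rate, and the step count are hypothetical, and the gradient is derived by hand for $y' = \tanh(\omega \cdot x - \theta)$.

```python
import numpy as np


def mse(pred, target):
    """Mean square error over the training samples."""
    return float(np.mean((pred - target) ** 2))


def train_step(w, theta, X, y, eta):
    """One gradient descent step for y' = tanh(X @ w - theta) under Mse."""
    pred = np.tanh(X @ w - theta)
    # d(Mse)/d(pre-activation) for each sample, using tanh' = 1 - tanh^2.
    delta = 2.0 * (pred - y) * (1.0 - pred ** 2) / len(y)
    grad_w = X.T @ delta
    grad_theta = -delta.sum()  # theta enters the pre-activation with a minus sign
    return w - eta * grad_w, theta - eta * grad_theta


# Hypothetical toy samples with three features [Sd, Gen, Ort] and 0/1 labels.
X_train = np.array([[0.9, 0.8, 0.7], [0.1, 0.2, 0.1],
                    [0.8, 0.9, 0.9], [0.2, 0.1, 0.3]])
y_train = np.array([1.0, 0.0, 1.0, 0.0])

w, theta = np.zeros(3), 0.0
loss_before = mse(np.tanh(X_train @ w - theta), y_train)
for _ in range(500):
    w, theta = train_step(w, theta, X_train, y_train, eta=0.1)
loss_after = mse(np.tanh(X_train @ w - theta), y_train)
```

In a multi-layer network the same rule is applied to every weight and threshold, with the gradients obtained by backpropagating the error through the hidden layers.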
The invention provides a method that uses a deep neural network to identify essential proteins from multidimensional biological attribute information and PPI topological characteristics. To address incomplete gene expression data, missing data are supplemented, which improves robustness. The convergence of the deep neural network is accelerated by separately constructing the PPI network topology, the Pearson correlation coefficient, and the homologous correlation coefficient. Finally, the deep neural network searches for the optimal association among three features (node degree, Pearson correlation coefficient, and homologous correlation coefficient), thereby improving the accuracy of essential protein identification.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A method for identifying a biologically essential protein, comprising the steps of:
S1: downloading data;
S2: supplementing missing data;
S3: constructing a PPI network topology, a Pearson correlation coefficient, and a homologous correlation coefficient;
S4: training a deep neural network.
2. The method for identifying a biologically essential protein according to claim 1, wherein
step S1 comprises:
obtaining yeast protein information from the data set and removing self-interactions and repeated interactions;
downloading homologous protein information from a database, and downloading yeast gene expression data from a data set;
and downloading a data set containing the essential genes of Saccharomyces cerevisiae from the database as a benchmark set.
3. The method for identifying a biologically essential protein according to claim 1, wherein
step S2 comprises:
for a given gene u, its gene expression at different times is expressed by the vector
Exp(u) = {Exp(u,1), Exp(u,2), …}, where Exp(u, i) is the average expression level of gene u at time i; let the degree information $d_u$ of the protein of gene u and its homology information $Ort_u$ be $g_\theta(u)$;
$g_\theta(u) = \theta \times Exp(u) = \theta_0 + \theta_1 \times Exp(u,1) + \dots$
Wherein,
Figure FDA0003416504410000012
$g_\theta(u) = [d_u, Ort_u]$. Let
Figure FDA0003416504410000013
be the actual value of the protein corresponding to gene u, where $\theta_i$ represents the value of gene u at the i-th time. When
Figure FDA0003416504410000014
takes its minimum value, the linear fit is best, that is, the regression model lies exactly on the boundary of gene expression;
the following is used:
Figure FDA0003416504410000015
let
Figure FDA0003416504410000021
Then, taking the derivative with respect to θ yields:
Figure FDA0003416504410000022
which gives:
Figure FDA0003416504410000023
the missing data are expressed as:
Figure FDA0003416504410000024
where $N(\mu, \sigma^2)$ represents a Gaussian perturbation.
4. The method for identifying a biologically essential protein according to claim 1, wherein
step S3 comprises:
S31: calculating the degree of each PPI node,
in the PPI network, let V represent the node set of the PPI network and E represent the edge set of the PPI network, giving an undirected graph $G = (V, E)$ based on the PPI network,
for the graph $G = (V, E)$ with $\mu \in V(G)$ and $e_{\mu,\nu} \in E(G)$, the degree $d(\mu)$ of node $\mu$ is
Figure FDA0003416504410000025
where $\Gamma(\mu)$ represents the set of neighbor nodes of node $\mu$, $e_{\mu,\nu}$ represents the edge between node $\mu$ and node $\nu$, and Num() represents a counting function,
after normalization, the strength $Sd_u$ of the node is
Figure FDA0003416504410000026
S32: calculating the gene expression correlation
For genes u and v, the PCC between them is calculated as follows:
Figure FDA0003416504410000031
where
Figure FDA0003416504410000032
is the average expression level of gene u over all time points,
Figure FDA0003416504410000033
is the average expression level of gene v over all time points, σ(u) is the standard deviation of the expression level of gene u over all time points, σ(v) is the standard deviation of the expression level of gene v over all time points, cov() denotes the covariance function, T denotes the total number of time points for gene u, and t denotes a specific time point of gene u; if $PCC_{(u,v)}$ is positive, genes u and v are positively correlated, and if $PCC_{(u,v)}$ is negative, genes u and v are negatively correlated,
Figure FDA0003416504410000034
represents the average expression level of gene u,
Figure FDA0003416504410000035
represents the average expression level of gene v, and the average gene strength $Gen_u$ of gene u over all nodes is calculated as
Figure FDA0003416504410000036
where $Gen_v$ represents the average gene strength of gene v over all nodes, and n represents the number of nodes;
S33: calculating the homology correlation
The semantic similarity defined by the Gene Ontology (GO) aims to provide functional relationships among different biological processes, molecular functions, or cellular components. The shortest path connecting two terms or annotations is searched, and the sum of the weights on the shortest path is used to measure semantic similarity on GO, where the distance between $\mu$ and $\nu$ is
Figure FDA0003416504410000037
where τ is their lowest common ancestor, root is their oldest ancestor, and dis() represents a distance function,
the average homology strength $Ort_u$ of node u over all nodes is calculated as
Figure FDA0003416504410000041
where $Ort_v$ represents the homology strength of node v.
5. The method for identifying a biologically essential protein according to claim 1, wherein
step S4 comprises:
let X denote the processed protein data and Y the essential protein data; given a training set $D = \{(x, y)\}$, $x \in X$, $y \in Y$, one obtains:
$y = f(\sum \omega x - \theta)$
where f() is the activation function, ω is the weight, and θ is the threshold; each input in the training set D is described by 3 attributes, $x = [Sd, Gen, Ort]^T$, and the output is a 2-dimensional real-valued vector $y = [0/1, 0/1]$; defining the number of hidden layers as L and the number of nodes in each hidden layer as h, and letting y' be the predicted value of y, one obtains:
from the input layer to the first hidden layer, the predicted value $y'_{1,j}$ of the j-th node $L_{1,j}$ is
Figure FDA0003416504410000042
where $\theta_j$ denotes the threshold of the j-th node of the first hidden layer, h denotes the number of nodes in the hidden layer, $\omega_{i,j}$ represents the weight between the i-th and j-th nodes, and $\omega_{i,j,1}$ represents the weight between the i-th and j-th nodes of the first hidden layer;
from hidden layer c to hidden layer d, the predicted value $y'_{d,j}$ of the j-th node $L_{d,j}$ is
Figure FDA0003416504410000051
from the hidden layer to the output layer, the predicted value $Y'_j$ of the j-th node is
Figure FDA0003416504410000052
where $y'_{L,i}$ denotes the predicted value of the i-th node of the L-th hidden layer,
the mean square error is used as the loss function Mse,
Figure FDA0003416504410000053
where size() is the length of the training data of the data set D; let Δω be the update to the weight,
$\omega \leftarrow \omega + \Delta\omega$
given the learning rate η, the parameters are adjusted in the direction of the negative gradient of the objective, as follows
Figure FDA0003416504410000054
where y represents the actual value, y' represents the predicted value, and $y'_L$ represents the predicted value of the L-th hidden layer; the number L of hidden layers, the number h of nodes per hidden layer, and the threshold θ are thereby obtained.
CN202111548530.6A 2021-12-17 2021-12-17 Method for identifying biological essential protein Active CN114242168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111548530.6A CN114242168B (en) 2021-12-17 2021-12-17 Method for identifying biological essential protein

Publications (2)

Publication Number Publication Date
CN114242168A (en) 2022-03-25
CN114242168B (en) 2024-06-14

Family

ID=80757760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111548530.6A Active CN114242168B (en) 2021-12-17 2021-12-17 Method for identifying biological essential protein

Country Status (1)

Country Link
CN (1) CN114242168B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631808A (en) * 2022-10-25 2023-01-20 贵州大学 Molecular target rapid prediction and correlation mechanism analysis method
CN115935249A (en) * 2022-10-28 2023-04-07 华北理工大学 Heartbeat abnormity monitoring method, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130253894A1 (en) * 2012-03-07 2013-09-26 The Trustees Of Columbia University In The City Of New York Systems And Methods For Predicting Protein-Protein Interactions
CN109801674A (en) * 2019-01-30 2019-05-24 长沙学院 A kind of key protein matter recognition methods based on the fusion of isomery bio-networks
CN110070909A (en) * 2019-03-21 2019-07-30 中南大学 A kind of protein function prediction technique of the fusion multiple features based on deep learning
US20190304568A1 (en) * 2018-03-30 2019-10-03 Board Of Trustees Of Michigan State University System and methods for machine learning for drug design and discovery

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tang Jiaqi; Wu Li: "Protein function prediction method based on PPI network and machine learning", Journal of Computer Applications (计算机应用), no. 03, 10 March 2018 (2018-03-10) *

Also Published As

Publication number Publication date
CN114242168B (en) 2024-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant