CN114242168A - Method for identifying biologically essential protein - Google Patents
- Publication number
- CN114242168A (application CN202111548530.6A)
- Authority
- CN
- China
- Prior art keywords
- gene
- node
- protein
- nodes
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/30—Detection of binding sites or motifs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention provides a method for identifying essential proteins that uses a deep neural network to relate multidimensional biological attribute information to PPI topological characteristics. To address incomplete gene expression data, absent data are supplemented to improve robustness. A PPI network topology, a Pearson correlation coefficient and a homologous correlation coefficient are constructed as inputs so as to shorten the convergence time of the deep neural network. Finally, the deep neural network searches for the optimal association among the three features (node degree, Pearson correlation coefficient and homologous correlation coefficient), thereby improving the identification precision of essential proteins.
Description
Technical Field
The invention belongs to the field of life science, and particularly relates to a method for identifying biologically essential proteins.
Background
Proteins are essential to the activities of an organism. In life processes, proteins interact closely with one another and jointly carry out a series of physiological activities, forming protein-protein interactions (PPIs). Some proteins in a protein network are essential: when they are removed by mutation, the organism loses the associated functions and cannot function properly. Predicting key proteins from PPI networks therefore provides a theoretical basis for exploring pathogenic genes and developing drug targets.
Early on, important proteins were identified mainly through biological experiments. Although such experimental techniques are highly accurate, they are time-consuming and expensive. With the development of information technology, predicting essential proteins from protein complexes and topological properties has become a new trend. Topology describes the way, form and geometry in which endpoints are interconnected, using the two most basic geometric elements, points and lines. Research on social networks has found that the more strongly a node is associated with other nodes, the more important it is, and the more similar two nodes are, the more similar their importance. Based on this assumption, scholars have proposed many classical algorithms based on PPI network topology, such as Degree Centrality (DC), Betweenness Centrality (BC), Closeness Centrality (CC), Subgraph Centrality (SC), Eigenvector Centrality (EC) and Information Centrality (IC). Many new algorithms that identify key proteins from mixed topological features have also been derived. For example, Wang et al. proposed the sum of edge clustering coefficients (SoECC) method, which fuses nodes and edges to evaluate the criticality of proteins; their experimental results show that its identification accuracy is superior to that of the six centrality methods above. Schui et al. proposed a detection method that fuses the global and local features of nodes.
The above algorithms evaluate the criticality of proteins only from the topological characteristics of the PPI network and neglect the biological significance of proteins as carriers of life activities. In addition, their identification accuracy is generally low because of problems such as the large amount of false-positive and false-negative data in PPI networks and the activity information of proteins. To address this, many studies have combined multiple kinds of biological information with topological features for key protein identification. For example, Tang et al. and Li et al. learned features from gene expression profiles and fused them with network topological features to identify key proteins. Functional features mainly concern subcellular location, molecular function and the like. Zeng et al. fused PPI networks, subcellular locations and gene expression profiles to identify key proteins. Although these methods combine the biological information and topological characteristics of proteins, the number of protein-related features is large, and how to use these features effectively for protein identification remains a problem to be solved.
Deep learning (DL) relies on the modeling capability of deep neural networks: it can automatically obtain multi-level features from raw data and model the nonlinear relations between features. Since its introduction, deep learning has achieved breakthroughs in tasks such as image processing and natural language understanding and has been widely used in bioinformatics. It supports learning from sequence data well, and it can capture topological features from a network model and map network nodes to low-dimensional dense vectors that directly support applications such as classification, clustering and association reasoning. Zeng et al. explored and verified the advantages of a deep learning framework based on convolutional neural networks and long short-term memory networks for key protein recognition.
Disclosure of Invention
The invention aims to provide a method for identifying biologically essential proteins that overcomes the defects of the prior art, in which identifying key proteins is time-consuming and verifying model correctness is complex and cumbersome.
The invention provides a method for identifying biologically essential proteins, comprising the following steps:
S1: downloading data;
S2: supplementing absent data;
S3: constructing a PPI network topology, a Pearson correlation coefficient and a homologous correlation coefficient;
S4: training the deep neural network.
Further, the step S1 includes,
obtaining yeast protein information from the data set and rejecting self-interactions and repeat interactions;
downloading information of homologous proteins from a database, and downloading gene expression data of the yeast from a data set;
and downloading a data set containing the essential genes of the saccharomyces cerevisiae from the database as a benchmark set.
Further, the step S2 includes,
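for a given gene u, its gene expression at different times is expressed by the vector $Exp(u) = \{Exp(u,1), Exp(u,2), \ldots\}$, where $Exp(u,i)$ is the average expression level of gene u at time i, and $g_\theta(u)$ denotes the protein degree information $d_u$ and the homology information $Ort_u$ of gene u:
$$g_\theta(u) = \theta \times Exp(u) = \theta_0 + \theta_1 \times Exp(u,1) + \cdots$$
wherein $g_\theta(u) = [d_u \; Ort_u]$; the actual value of the protein corresponding to gene u is taken as the regression target, and $\theta_i$ is the coefficient of the expression of gene u at the i-th time; when the fitting error between $g_\theta(u)$ and the actual value takes its minimum value, the linear fit is best, that is, the regression model lies exactly on the boundary of gene expression;
the following is used:
[equation not reproduced in the source text]
which gives:
[equation not reproduced in the source text]
the absent data are represented as:
[equation not reproduced in the source text]
where $N(\mu, \sigma^2)$ represents a Gaussian perturbation.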
Further, the step S3 includes,
S31: calculating the degree of each PPI node,
in the PPI network, let V represent the node set and E the edge set of the PPI network, which gives an undirected graph $G = (V, E)$ based on the PPI network,
let $G = (V, E)$, $\mu \in V(G)$, $e_{\mu,\nu} \in E(G)$; the degree $d(\mu)$ of the node μ is
$$d(\mu) = \sum_{\nu \in \Gamma(\mu)} Num(e_{\mu,\nu})$$
where $\Gamma(\mu)$ represents the set of neighbor nodes of the node μ, $e_{\mu,\nu}$ represents the edge between node μ and node ν, and Num() represents a counting function,
after normalization, the strength $Sd_u$ of the node is
[equation not reproduced in the source text]
S32: calculating the correlation of gene expression,
for genes u and v, the PCC between them is calculated as follows:
$$PCC(u,v) = \frac{cov(Exp(u), Exp(v))}{\sigma(u)\,\sigma(v)}$$
where $Exp(u,t)$ and $Exp(v,t)$ are the expression levels of genes u and v at time t, $\overline{Exp}(u)$ and $\overline{Exp}(v)$ are the average expression levels of genes u and v, $\sigma(u)$ and $\sigma(v)$ are the standard deviations of the expression levels of genes u and v over all times, cov() represents the covariance function taken about these average expression levels, T represents the total number of time points and t a specific time point; if $PCC(u,v)$ is positive, genes u and v are positively correlated, and if $PCC(u,v)$ is negative, genes u and v are negatively correlated; the average gene strength $Gen_u$ of gene u over all nodes is then calculated:
[equation not reproduced in the source text]
where $Gen_v$ represents the average gene strength of gene v over all nodes, and n represents the number of nodes;
S33: calculating the homology correlation,
the semantic similarity defined by the Gene Ontology (GO) aims to provide a functional relationship among different biological processes, molecular functions or cellular components; the shortest path connecting two terms or annotations is searched, and the sum of the weights on the shortest path is used to measure the semantic similarity on GO, wherein the distance between μ and ν is
[equation not reproduced in the source text]
where τ is their lowest common ancestor, root is their oldest ancestor, and dis() represents a distance function,
the average homologous strength $Ort_u$ of node u over all nodes is calculated,
[equation not reproduced in the source text]
where $Ort_v$ represents the homologous strength of node v.
Further, the step S4 includes,
let X denote the processed protein data and Y the essential protein data; given a training set $D = \{(x, y)\}$, $x \in X$, $y \in Y$, one obtains:
y = f(∑ωx - θ)
where f() is the activation function, ω is the weight and θ is the threshold; each input in the training set D is described by 3 attributes, $x = [Sd, Gen, Ort]^T$, and the output is a 2-dimensional real-valued vector $y = [0/1 \;\; 0/1]$; the number of hidden layers is defined as L and the number of nodes in each hidden layer as h; letting y' be the predicted value of y gives:
the predicted value $y'_{1,j}$ of the j-th node $L_{1,j}$ of hidden layer 1, computed from the input layer, is
[equation not reproduced in the source text]
where $\theta_j$ denotes the threshold of the j-th node of hidden layer 1, h denotes the number of nodes of the hidden layer, $\omega_{i,j}$ represents the weight between the i-th and j-th nodes, and $\omega_{i,j,1}$ represents the weight between the i-th and j-th nodes of hidden layer 1;
the predicted value $y'_{d,j}$ of the j-th node $L_{d,j}$, computed from hidden layer c to hidden layer d, is
[equation not reproduced in the source text]
the predicted value $Y'_j$ of the j-th node of the output layer, computed from the hidden layers, is
[equation not reproduced in the source text]
where $y'_{L,i}$ indicates the predicted value of the i-th node of the L-th hidden layer,
the mean square error is used as the loss function Mse,
$$Mse = \frac{1}{size(D)} \sum_{(x,y) \in D} (y' - y)^2$$
where size() is the length of the training data of the data set D; let Δω be the update of the weight,
ω←ω+Δω
given the learning rate η, the parameters are adjusted in the direction of the negative gradient of the target, as follows
$$\Delta\omega = -\eta \frac{\partial Mse}{\partial \omega}$$
where y represents an actual value, y' represents a predicted value, and $y'_L$ represents the predicted value of the L-th hidden layer, so as to obtain the number L of hidden layers, the number h of hidden-layer nodes and the threshold θ.
The beneficial effects of the method are as follows. First, a deep learning method is introduced to suit the characteristics of proteins, which improves the accuracy of key protein identification. Second, degree centrality, gene expression and biological homology are used to reduce the biological information input during deep-learning-based identification of key proteins, which reduces training time and the complexity of the training model.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention aims to provide a method for identifying biologically essential proteins, which comprises the following steps,
S1 Data download
The present invention obtains yeast protein information from the DIP and GAVIN data sets, respectively. After removal of self-interactions and repeated interactions, the DIP data set yields 5093 proteins, 24743 interaction pairs and 1167 essential proteins. The GAVIN data set provides 1855 proteins, 7669 interaction pairs and 714 essential proteins.
In addition, information on homologous proteins is downloaded from the InParanoid database (Version 7), which contains pairwise comparisons of 100 whole genomes. The yeast gene expression data are downloaded from the data set provided by Tu BP.
Finally, the invention downloads a data set containing 1285 Saccharomyces cerevisiae essential genes from the four databases MIPS, SGDP, DEG and SGD as a reference set.
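The removal of self-interactions and repeated interactions described for the DIP and GAVIN data can be sketched as follows. This is only an illustration under stated assumptions: the function name `clean_interactions` and the protein labels are hypothetical, and the sketch assumes interactions arrive as unordered pairs of identifiers.

```python
def clean_interactions(pairs):
    """Drop self-interactions and repeated (unordered) protein pairs."""
    seen = set()
    cleaned = []
    for a, b in pairs:
        if a == b:
            continue  # self-interaction: a protein paired with itself
        key = tuple(sorted((a, b)))  # undirected: (a, b) equals (b, a)
        if key in seen:
            continue  # repeated interaction
        seen.add(key)
        cleaned.append(key)
    return cleaned


# Hypothetical toy input; real input would be the downloaded interaction list.
raw = [("A", "B"), ("B", "A"), ("A", "A"), ("A", "C")]
edges = clean_interactions(raw)
proteins = sorted({p for edge in edges for p in edge})
```

Sorting each pair before the membership test is what makes the check order-independent, which matches the undirected graph used later in S31.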
S2 Supplementing absent data
Because the gene expression library covers only 95% of the proteins in the PPI networks built from the DIP and GAVIN databases, the following processing is used to increase the robustness of the algorithm. For a given gene u, its gene expression at different times is expressed by the vector
$Exp(u) = \{Exp(u,1), Exp(u,2), \ldots\}$, where $Exp(u,i)$ is the average expression level of gene u at time i, and $g_\theta(u)$ denotes the protein degree information $d_u$ and the homology information $Ort_u$ of gene u:
$$g_\theta(u) = \theta \times Exp(u) = \theta_0 + \theta_1 \times Exp(u,1) + \cdots$$
where $g_\theta(u) = [d_u \; Ort_u]$. The actual value of the protein corresponding to gene u is taken as the regression target, and $\theta_i$ is the coefficient of the expression of gene u at the i-th time. When the fitting error between $g_\theta(u)$ and the actual value takes its minimum value, the linear fit is best, that is, the regression model lies exactly on the boundary of gene expression;
the following can be used:
[equation not reproduced in the source text]
thus, it is possible to obtain:
[equation not reproduced in the source text]
the absent data may be expressed as:
[equation not reproduced in the source text]
where $N(\mu, \sigma^2)$ represents a Gaussian perturbation.
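The two ingredients named in this step, a linear model of the form θ0 + θ1 × Exp(u,1) + … fitted by minimizing the error, and a Gaussian perturbation N(μ, σ²), can be illustrated with the sketch below. It is not the patent's exact procedure, whose equations are not available in this text: the helper names `fit_theta` and `predict_value`, the toy data, and the noise scale are all assumptions, and ordinary least squares is used as one common way of minimizing the fitting error.

```python
import numpy as np


def fit_theta(exp_matrix, targets):
    """Least-squares fit of theta_0 + sum_i theta_i * Exp(u, i) to the targets."""
    # Prepend a column of ones so that theta[0] plays the role of theta_0.
    design = np.column_stack([np.ones(len(exp_matrix)), exp_matrix])
    theta, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return theta


def predict_value(theta, exp_row):
    """Evaluate the fitted linear model for one expression vector."""
    return float(theta[0] + np.dot(theta[1:], exp_row))


# Hypothetical toy data: 4 genes with 2 time points each, exactly linear targets.
exp_matrix = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
targets = 1.0 + 2.0 * exp_matrix[:, 0] - 1.0 * exp_matrix[:, 1]

theta = fit_theta(exp_matrix, targets)
fitted = predict_value(theta, np.array([5.0, 5.0]))

# ASSUMPTION: the supplemented value is the fitted value plus Gaussian noise
# with mean 0 and a small, arbitrarily chosen standard deviation.
rng = np.random.default_rng(0)
supplemented = fitted + rng.normal(loc=0.0, scale=0.1)
```

The perturbation keeps supplemented entries from lying exactly on the regression surface, which is one plausible reading of why the text adds N(μ, σ²).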
S3 Constructing the PPI network topology, the Pearson correlation coefficient and the homologous correlation coefficient as the input of the deep neural network
A PPI network structure is constructed from the yeast protein interactions in the GAVIN and DIP data sets, and the degree of each node is calculated; the gene influence of each node is calculated from the gene expression database; and the homologous influence of each node is calculated from the homology database. The gene expression profile, the protein interaction network and the subcellular location information are then taken as input features, and the information in the key protein library is taken as the output. The specific method is as follows:
S31 Calculating the degree of each PPI node
In the PPI network, let V represent the node (protein) set and E the edge (protein-protein interaction) set of the PPI network; an undirected graph $G = (V, E)$ based on the PPI network is then obtained.
Let $G = (V, E)$, $\mu \in V(G)$, $e_{\mu,\nu} \in E(G)$; the degree $d(\mu)$ of the node μ is
$$d(\mu) = \sum_{\nu \in \Gamma(\mu)} Num(e_{\mu,\nu})$$
where $\Gamma(\mu)$ represents the set of neighbor nodes of the node μ, and Num() is a counting function: if the edge $e_{\mu,\nu}$ exists, its value is 1; otherwise it is 0.
After normalization, the strength $Sd_u$ of the node is
[equation not reproduced in the source text]
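The degree computation in S31 can be sketched as below. The degree follows the definition in the text (count the existing edges to neighbors). The normalization is an assumption: the patent's normalization formula is not available here, so this sketch divides by the maximum degree purely for illustration. The function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def node_degrees(edges):
    """d(mu) = number of distinct neighbors of mu in the undirected graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(nbrs) for node, nbrs in neighbors.items()}


def normalized_strength(degrees):
    """ASSUMPTION: scale degrees into (0, 1] by dividing by the maximum degree."""
    d_max = max(degrees.values())
    return {node: d / d_max for node, d in degrees.items()}


# Hypothetical toy PPI network with four proteins.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
deg = node_degrees(edges)
sd = normalized_strength(deg)
```

Storing neighbors in a set means a duplicated edge cannot inflate the degree, which mirrors Num() taking the value 1 when an edge exists and 0 otherwise.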
S32 Calculating the correlation of gene expression
PCC is the Pearson correlation coefficient, which measures the linear correlation between two variables and takes values in [-1, 1]. The invention introduces PCC, which is widely applied in the natural sciences, to characterize the similarity of gene co-expression. For genes u and v, the PCC between them can be calculated as follows:
$$PCC(u,v) = \frac{cov(Exp(u), Exp(v))}{\sigma(u)\,\sigma(v)}$$
where cov() is the covariance function and σ(u) and σ(v) are the standard deviations of the expression levels of genes u and v over all times. If $PCC(u,v)$ is positive, gene u is positively correlated with gene v; if $PCC(u,v)$ is negative, genes u and v are negatively correlated.
The average gene strength of gene u over all nodes is calculated by equation (9).
[equation not reproduced in the source text]
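A small sketch of the Pearson correlation between two expression vectors, followed by an average over genes, is given below. The PCC part is the standard definition. The averaging rule is an assumption, since equation (9) is not available in this text: here the strength of gene u is taken as the mean PCC between u and every other gene. The names `pcc` and `average_gene_strength` and the toy expression values are hypothetical.

```python
import math


def pcc(x, y):
    """Pearson correlation: covariance divided by the two standard deviations."""
    t = len(x)
    mean_x = sum(x) / t
    mean_y = sum(y) / t
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / t
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / t)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / t)
    if std_x == 0.0 or std_y == 0.0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return cov / (std_x * std_y)


def average_gene_strength(expr):
    """ASSUMPTION: Gen_u is the mean PCC between u and all other genes."""
    genes = list(expr)
    strength = {}
    for u in genes:
        others = [v for v in genes if v != u]
        strength[u] = sum(pcc(expr[u], expr[v]) for v in others) / len(others)
    return strength


# Hypothetical expression profiles over four time points.
expr = {
    "u": [1.0, 2.0, 3.0, 4.0],
    "v": [2.0, 4.0, 6.0, 8.0],
    "w": [4.0, 3.0, 2.0, 1.0],
}
gen = average_gene_strength(expr)
```

In this toy data u and v rise together while w falls, so u and v correlate at +1 with each other and at -1 with w.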
S33 Calculating the homology correlation
Semantic similarity defined by the Gene Ontology (GO) aims at providing functional relationships between different biological processes, molecular functions or cellular components. The invention searches for the shortest path connecting two terms or annotations and uses the sum of the weights on the shortest path to measure the semantic similarity on GO. Based on the Tversky ratio model of similarity, the distance between μ and ν is
[equation not reproduced in the source text]
where τ is their lowest common ancestor and root is their oldest ancestor.
The average homologous strength of node u over all nodes is calculated by formula (11).
[equation not reproduced in the source text]
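The shortest-path part of this step, summing edge weights along the shortest path between two terms, can be sketched with Dijkstra's algorithm as below. This covers only the path-weight computation: the patent's distance formula involving the lowest common ancestor τ and the root is not available in this text and is not reproduced. The graph, its term names and its weights are hypothetical.

```python
import heapq


def shortest_path_weight(graph, source, target):
    """Dijkstra: minimum sum of edge weights between two terms."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry, a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")  # target not reachable from source


# Hypothetical ontology fragment: mu and nu share the ancestor tau below root.
go_graph = {
    "root": {"tau": 1.0},
    "tau": {"root": 1.0, "mu": 0.5, "nu": 0.7},
    "mu": {"tau": 0.5},
    "nu": {"tau": 0.7},
}
dist_mu_nu = shortest_path_weight(go_graph, "mu", "nu")
dist_tau_root = shortest_path_weight(go_graph, "tau", "root")
```

In this fragment the only route from mu to nu passes through their common ancestor tau, so the path weight is the sum of the two edges to tau.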
S4 IYEP deep neural network model training
Let X represent the processed protein data and Y the essential protein data. Given a training set $D = \{(x, y)\}$, $x \in X$, $y \in Y$, one can obtain:
y = f(∑ωx - θ) (12)
where f() is the activation function (the tanh function is used in this patent), ω is the weight and θ is the threshold. Each input in the training set D is described by 3 attributes, $x = [Sd, Gen, Ort]^T$, and the output is a 2-dimensional real-valued vector $y = [0/1 \;\; 0/1]$. The number of hidden layers is defined as L, and the number of nodes in each hidden layer as h. As can be seen from FIG. 1, the IYEP training model consists of three parts: from the input layer X to the hidden layers, between the hidden layers, and from the hidden layers to the output layer Y. Let y' be the predicted value of y; combining equation (12) gives:
the predicted value $y'_{1,j}$ of the j-th node $L_{1,j}$ of hidden layer 1, computed from the input layer, is
[equation not reproduced in the source text]
where $\theta_j$ represents the threshold of the j-th node of hidden layer 1.
The predicted value $y'_{d,j}$ of the j-th node $L_{d,j}$, computed from hidden layer c to hidden layer d, is
[equation not reproduced in the source text]
The predicted value $Y'_j$ of the j-th node of the output layer, computed from the hidden layers, is
[equation not reproduced in the source text]
In deep neural network model training, the aim is to find the model with the least error, and the mean square error is used as the loss function Mse:
$$Mse = \frac{1}{size(D)} \sum_{(x,y) \in D} (y' - y)^2 \quad (16)$$
where size() is the length of the training data of data set D. Let Δω be the update of the weight, i.e.
ω←ω+Δω (17)
Based on a gradient descent strategy, given a learning rate η, the parameters are adjusted in the direction of the negative gradient of the target:
$$\Delta\omega = -\eta \frac{\partial Mse}{\partial \omega} \quad (18)$$
Similarly to formula (18), the number L of hidden layers, the number h of hidden-layer nodes and the threshold θ can be obtained, and substituting these parameters into the IYEP training model gives the decision model of the deep neural network.
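The training procedure of S4 can be illustrated with a minimal fully connected network in which every layer applies y = tanh(Σωx - θ), the loss is the mean square error, and weights and thresholds are updated along the negative gradient with learning rate η. This is a sketch under stated assumptions, not the IYEP implementation: the layer sizes, initialization, learning rate, iteration count and toy data are all hypothetical, and the loss here averages over every output element.

```python
import numpy as np


def init_params(sizes, rng):
    """One (weights, thresholds) pair per layer, e.g. sizes = [3, h, h, 2]."""
    return [(rng.normal(0.0, 0.5, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def forward(params, x):
    """Return the activations of all layers, each computed as tanh(a @ w - theta)."""
    activations = [x]
    for w, theta in params:
        activations.append(np.tanh(activations[-1] @ w - theta))
    return activations


def mse(pred, y):
    """Mean square error over all output elements."""
    return float(np.mean((pred - y) ** 2))


def train_step(params, x, y, eta):
    """One gradient-descent update of every weight and threshold."""
    acts = forward(params, x)
    grad = 2.0 * (acts[-1] - y) / y.size           # dMse / d(prediction)
    updated = []
    for (w, theta), a_in, a_out in zip(reversed(params),
                                       reversed(acts[:-1]),
                                       reversed(acts[1:])):
        delta = grad * (1.0 - a_out ** 2)           # tanh'(z) = 1 - tanh(z)^2
        grad_w = a_in.T @ delta
        grad_theta = -delta.sum(axis=0)             # because z = a_in @ w - theta
        grad = delta @ w.T                          # gradient for the layer below
        updated.append((w - eta * grad_w, theta - eta * grad_theta))
    return list(reversed(updated))


# Hypothetical toy data: rows are [Sd, Gen, Ort]; targets are 2-dimensional 0/1 vectors.
x = np.array([[0.9, 0.8, 0.7], [0.1, 0.2, 0.1], [0.8, 0.9, 0.6], [0.2, 0.1, 0.3]])
y = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

rng = np.random.default_rng(0)
params = init_params([3, 4, 4, 2], rng)   # L = 2 hidden layers, h = 4 (assumed)
loss_before = mse(forward(params, x)[-1], y)
for _ in range(200):
    params = train_step(params, x, y, eta=0.1)
loss_after = mse(forward(params, x)[-1], y)
```

Because the threshold enters each node with a minus sign, its gradient has the opposite sign to that of a conventional bias term, which is why `grad_theta` is negated.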
The invention provides a method for identifying essential proteins that uses a deep neural network to relate multidimensional biological attribute information to PPI topological characteristics. To address incomplete gene expression data, absent data are supplemented to improve robustness. A PPI network topology, a Pearson correlation coefficient and a homologous correlation coefficient are constructed as inputs so as to shorten the convergence time of the deep neural network. Finally, the deep neural network searches for the optimal association among the three features (node degree, Pearson correlation coefficient and homologous correlation coefficient), thereby improving the identification precision of essential proteins.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A method for identifying biologically essential proteins, comprising the following steps:
S1: downloading data;
S2: supplementing absent data;
S3: constructing a PPI network topology, a Pearson correlation coefficient and a homologous correlation coefficient;
S4: training the deep neural network.
2. The method for identifying biologically essential proteins according to claim 1, wherein
step S1 comprises:
obtaining yeast protein information from the data set and removing self-interactions and repeated interactions;
downloading homologous protein information from a database, and downloading yeast gene expression data from a data set;
and downloading from the database a data set containing the essential genes of Saccharomyces cerevisiae as a benchmark set.
3. The method for identifying biologically essential proteins according to claim 1, wherein
step S2 comprises:
for a given gene u, its gene expression at different times is expressed by the vector
$Exp(u) = \{Exp(u,1), Exp(u,2), \ldots\}$, where $Exp(u,i)$ is the average expression level of gene u at time i, and $g_\theta(u)$ denotes the protein degree information $d_u$ and the homology information $Ort_u$ of gene u:
$$g_\theta(u) = \theta \times Exp(u) = \theta_0 + \theta_1 \times Exp(u,1) + \cdots$$
wherein $g_\theta(u) = [d_u \; Ort_u]$; the actual value of the protein corresponding to gene u is taken as the regression target, and $\theta_i$ is the coefficient of the expression of gene u at the i-th time; when the fitting error between $g_\theta(u)$ and the actual value takes its minimum value, the linear fit is best, that is, the regression model lies exactly on the boundary of gene expression;
the following is used:
[equation not reproduced in the source text]
which gives:
[equation not reproduced in the source text]
the absent data are represented as:
[equation not reproduced in the source text]
where $N(\mu, \sigma^2)$ represents a Gaussian perturbation.
4. The method for identifying biologically essential proteins according to claim 1, wherein
step S3 comprises:
S31: calculating the degree of each PPI node,
in the PPI network, let V represent the node set and E the edge set of the PPI network, which gives an undirected graph $G = (V, E)$ based on the PPI network,
let $G = (V, E)$, $\mu \in V(G)$, $e_{\mu,\nu} \in E(G)$; the degree $d(\mu)$ of the node μ is
$$d(\mu) = \sum_{\nu \in \Gamma(\mu)} Num(e_{\mu,\nu})$$
where $\Gamma(\mu)$ represents the set of neighbor nodes of the node μ, $e_{\mu,\nu}$ represents the edge between node μ and node ν, and Num() represents a counting function,
after normalization, the strength $Sd_u$ of the node is
[equation not reproduced in the source text]
S32: calculating the correlation of gene expression,
for genes u and v, the PCC between them is calculated as follows:
$$PCC(u,v) = \frac{cov(Exp(u), Exp(v))}{\sigma(u)\,\sigma(v)}$$
where $Exp(u,t)$ and $Exp(v,t)$ are the expression levels of genes u and v at time t, $\overline{Exp}(u)$ and $\overline{Exp}(v)$ are the average expression levels of genes u and v, $\sigma(u)$ and $\sigma(v)$ are the standard deviations of the expression levels of genes u and v over all times, cov() represents the covariance function taken about these average expression levels, T represents the total number of time points and t a specific time point; if $PCC(u,v)$ is positive, genes u and v are positively correlated, and if $PCC(u,v)$ is negative, genes u and v are negatively correlated; the average gene strength $Gen_u$ of gene u over all nodes is then calculated:
[equation not reproduced in the source text]
where $Gen_v$ represents the average gene strength of gene v over all nodes, and n represents the number of nodes;
S33: calculating the homology correlation,
the semantic similarity defined by the Gene Ontology (GO) aims to provide a functional relationship among different biological processes, molecular functions or cellular components; the shortest path connecting two terms or annotations is searched, and the sum of the weights on the shortest path is used to measure the semantic similarity on GO, wherein the distance between μ and ν is
[equation not reproduced in the source text]
where τ is their lowest common ancestor, root is their oldest ancestor, and dis() represents a distance function,
the average homologous strength $Ort_u$ of node u over all nodes is calculated,
[equation not reproduced in the source text]
where $Ort_v$ represents the homologous strength of node v.
5. The method for identifying biologically essential proteins according to claim 1, wherein
step S4 comprises:
let X denote the processed protein data and Y the essential protein data; given a training set $D = \{(x, y)\}$, $x \in X$, $y \in Y$, one obtains:
y = f(∑ωx - θ)
where f() is the activation function, ω is the weight and θ is the threshold; each input in the training set D is described by 3 attributes, $x = [Sd, Gen, Ort]^T$, and the output is a 2-dimensional real-valued vector $y = [0/1 \;\; 0/1]$; the number of hidden layers is defined as L and the number of nodes in each hidden layer as h; letting y' be the predicted value of y gives:
the predicted value $y'_{1,j}$ of the j-th node $L_{1,j}$ of hidden layer 1, computed from the input layer, is
[equation not reproduced in the source text]
where $\theta_j$ denotes the threshold of the j-th node of hidden layer 1, h denotes the number of nodes of the hidden layer, $\omega_{i,j}$ represents the weight between the i-th and j-th nodes, and $\omega_{i,j,1}$ represents the weight between the i-th and j-th nodes of hidden layer 1;
the predicted value $y'_{d,j}$ of the j-th node $L_{d,j}$, computed from hidden layer c to hidden layer d, is
[equation not reproduced in the source text]
the predicted value $Y'_j$ of the j-th node of the output layer, computed from the hidden layers, is
[equation not reproduced in the source text]
where $y'_{L,i}$ indicates the predicted value of the i-th node of the L-th hidden layer,
the mean square error is used as the loss function Mse,
$$Mse = \frac{1}{size(D)} \sum_{(x,y) \in D} (y' - y)^2$$
where size() is the length of the training data of the data set D; let Δω be the update of the weight,
ω←ω+Δω
given the learning rate η, the parameters are adjusted in the direction of the negative gradient of the target, as follows
$$\Delta\omega = -\eta \frac{\partial Mse}{\partial \omega}$$
where y represents an actual value, y' represents a predicted value, and $y'_L$ represents the predicted value of the L-th hidden layer, so as to obtain the number L of hidden layers, the number h of hidden-layer nodes and the threshold θ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111548530.6A CN114242168B (en) | 2021-12-17 | 2021-12-17 | Method for identifying biological essential protein |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114242168A (en) | 2022-03-25 |
CN114242168B (en) | 2024-06-14 |
Family
ID=80757760
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN202111548530.6A CN114242168B (en) | Active | 2021-12-17 | 2021-12-17 | Method for identifying biological essential protein |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114242168B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115631808A (en) * | 2022-10-25 | 2023-01-20 | 贵州大学 | Molecular target rapid prediction and correlation mechanism analysis method |
CN115935249A (en) * | 2022-10-28 | 2023-04-07 | 华北理工大学 | Heartbeat abnormity monitoring method, electronic equipment and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130253894A1 (en) * | 2012-03-07 | 2013-09-26 | The Trustees Of Columbia University In The City Of New York | Systems And Methods For Predicting Protein-Protein Interactions |
CN109801674A (en) * | 2019-01-30 | 2019-05-24 | 长沙学院 | A kind of key protein matter recognition methods based on the fusion of isomery bio-networks |
CN110070909A (en) * | 2019-03-21 | 2019-07-30 | 中南大学 | A kind of protein function prediction technique of the fusion multiple features based on deep learning |
US20190304568A1 (en) * | 2018-03-30 | 2019-10-03 | Board Of Trustees Of Michigan State University | System and methods for machine learning for drug design and discovery |
Non-Patent Citations (1)
Title |
---|
Tang Jiaqi; Wu Li: "Protein function prediction method based on PPI network and machine learning", Journal of Computer Applications (计算机应用), no. 03, 10 March 2018 (2018-03-10) * |
Also Published As
Publication number | Publication date |
---|---|
CN114242168B (en) | 2024-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Pan et al. | A classification-based surrogate-assisted evolutionary algorithm for expensive many-objective optimization | |
CN105279397B (en) | A kind of method of key protein matter in identification of protein interactive network | |
Gupta et al. | Half a dozen real-world applications of evolutionary multitasking, and more | |
CN104008165B (en) | Club detecting method based on network topology and node attribute | |
CN106980648B (en) | Personalized recommendation method based on probability matrix decomposition and combined with similarity | |
CN114242168A (en) | Method for identifying biologically essential protein | |
CN104992078B (en) | A kind of protein network complex recognizing method based on semantic density | |
CN105930688A (en) | Improved PSO algorithm based protein function module detection method | |
CN109637579B (en) | Tensor random walk-based key protein identification method | |
Shi et al. | Protein complex detection with semi-supervised learning in protein interaction networks | |
Zanghi et al. | Strategies for online inference of model-based clustering in large and growing networks | |
CN109727637B (en) | Method for identifying key proteins based on mixed frog-leaping algorithm | |
CN107784196B (en) | Method for identifying key protein based on artificial fish school optimization algorithm | |
CN108229643B (en) | Method for identifying key protein by using drosophila optimization algorithm | |
Tembusai et al. | K-nearest neighbor with k-fold cross validation and analytic hierarchy process on data classification | |
CN111553140B (en) | Data processing method, data processing apparatus, and computer storage medium | |
CN103455612A (en) | Method for detecting non-overlapping network communities and overlapping network communities based on two-stage strategy | |
CN110109005B (en) | Analog circuit fault testing method based on sequential testing | |
CN102779241B (en) | PPI (Point-Point Interaction) network clustering method based on artificial swarm reproduction mechanism | |
CN111128292B (en) | Key protein identification method based on protein clustering characteristic and active co-expression | |
CN111639712A (en) | Positioning method and system based on density peak clustering and gradient lifting algorithm | |
CN116631496A (en) | miRNA target prediction method and system based on multilayer heterograms and application | |
CN113470738B (en) | Overlapping protein complex identification method and system based on fuzzy clustering and gene ontology semantic similarity | |
CN114840717B (en) | Graph data-oriented mining method and device, electronic equipment and readable storage medium | |
CN108519881A (en) | A kind of component identification method based on more rules cluster |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||