CN112820347B - Disease gene prediction method based on multiple protein network pulse dynamics process
- Publication number
- CN112820347B (application number CN202110141656.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- protein
- multiple protein
- pulse
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Public Health (AREA)
- Evolutionary Computation (AREA)
- Epidemiology (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a disease gene prediction method based on a multiple protein network pulse dynamics process, which mainly comprises the following steps: 1. constructing a standardized multiple protein network; 2. constructing a multiple protein network pulse dynamics model; 3. extracting the pulse dynamics characteristics of the multiple protein network; 4. fusing the pulse dynamics characteristics of the multiple protein network nodes to predict disease genes by ranking. The prediction method fuses multiple protein networks more effectively and mines the characteristics hidden in them, thereby improving the ability to identify disease genes; it requires little computation and is suitable for analyzing large-scale biological information data through software.
Description
Technical Field
The invention belongs to the field of bioinformatics analysis, and relates to a disease gene prediction method based on a multiple protein network pulse dynamics process.
Background
The identification of disease-related genes is of great importance for the study of disease. Traditional methods such as linkage analysis help to identify disease-related genes, but often cannot accurately locate disease-causing genes. Because identifying disease genes through biological experiments is costly and time-consuming, the development of efficient computational methods for predicting and screening disease-related genes from a large number of candidate genes has become critical.
Genes associated with similar or identical diseases are functionally related and tend to cluster in neighboring regions of biological networks such as protein-protein interaction (PPI) networks. Network-based algorithms are therefore very popular in disease gene prediction and related fields, and network propagation, one of the most widely applied strategies, has become a leading-edge method for genetic association research. Traditional network propagation is useful, but it tends to focus on the steady-state solution of the dynamics and may therefore lose useful information hidden in the dynamic process. It is thus necessary to directly mine the hidden information in the dynamic process that helps reveal disease gene associations.
Neglecting the coexistence of different types of interactions/associations in a networked system, e.g., by aggregating these relationships into a single network, can change the topological properties of the overall system and significantly affect the modeling and predictive capabilities of the system. Making full use of various types of biological networks to effectively predict disease-related genes remains a challenge, because these networks, such as metabolic enzyme-coupled interactions and signal transduction, often have different meanings and reliability. The efficient use of multi-source biomolecular networks will help to enhance the capability of disease gene prediction methods.
Based on this, it is highly desirable to design a disease gene prediction method that can effectively utilize the cross-linking effects of different types of network layers in a networked system.
Disclosure of Invention
(I) Technical problem to be solved
Based on the above, the invention discloses a disease gene prediction method based on a multiple protein network pulse dynamics process, which improves the ability of disease gene prediction to reveal hidden information related to disease genes. The analysis method based on the pulse dynamics process effectively utilizes the cross-linking influence of multiple protein networks, thereby improving prediction accuracy, and is suitable for batch software analysis of biological data.
(II) Technical scheme
The invention discloses a disease gene prediction method based on a multiple protein network pulse dynamics process, which comprises the following steps:
Step 1: after biological data preprocessing, the nodes corresponding to the same protein in a plurality of different types of protein networks are connected, so that a multiple protein network is constructed and multi-network fusion is realized; the edge weights of the multiple protein network are normalized by calculating the average degree of the network nodes; the protein identifiers are uniformly mapped to standard gene symbols;
Step 2: periodic pulse signals are applied to the seed nodes of each network layer of the multiple protein network of step 1 to excite the pulse dynamics process of the multiple protein network, the pulse response curves of the multiple protein network nodes are calculated, and the hidden characteristics of the network nodes are mined;
Step 3: the association strength between each network node and the seed nodes is obtained by calculating the dynamic characteristics of the multiple protein network nodes in response to the pulse signals;
Step 4: based on the dynamic characteristics of step 3, a comprehensive protein score is obtained by calculating the reciprocal of the geometric mean of the ranks of the nodes corresponding to the same protein in each network layer of the multiple protein network; disease genes are screened by ranking the proteins in descending order of comprehensive score.
Further, step 1 specifically includes the following steps:
(1) Biological data preprocessing: acquiring known disease gene-related data, disease phenotype-related annotation data, and human phenotype ontology data; acquiring a protein physical interaction network; constructing a protein function association network; uniformly mapping protein identifiers to standard gene symbols;
(2) Multiple protein network construction: M network layers, each having N nodes, are interconnected through a multiple protein network model so as to integrate multiple types of protein association networks. The specific operation is as follows: given M network layers, each comprising N nodes, the nodes corresponding to the same protein in the M different types of protein networks are connected, and the connection weight between network layers is 1/M. To facilitate matrix operations, let $A^{(\alpha)} \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of network layer $\alpha$; the multiple protein network is then represented by the super-adjacency matrix $\mathcal{A} = \mathcal{A}^{\mathrm{intra}} + \mathcal{A}^{\mathrm{inter}}$, where $\mathcal{A}^{\mathrm{intra}}$ is the intra-layer super-adjacency matrix corresponding to the independent network layers, defined as
$\mathcal{A}^{\mathrm{intra}} = \bigoplus_{\alpha=1}^{M} A^{(\alpha)} = \mathrm{diag}\left(A^{(1)}, \dots, A^{(M)}\right)$,
and $\mathcal{A}^{\mathrm{inter}}$ is the inter-layer super-adjacency matrix, defined as
$\mathcal{A}^{\mathrm{inter}} = A^{L} \otimes I$,
wherein $A^{L} \in \mathbb{R}^{M \times M}$ is the inter-layer link matrix, whose nodes represent the network layers and whose edge weights are the link strengths between network layers, set to 1/M; $\otimes$ denotes the Kronecker product; and $I \in \mathbb{R}^{N \times N}$ denotes the identity matrix;
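(3) Normalization of the multiple protein network: the weights of all edges of the multiple protein network are divided by the average degree of the network nodes to normalize the multiple protein network. The average degree of the network nodes is calculated as $\langle k \rangle = \frac{1}{NM} \sum_{a,b} \mathcal{A}_{ab}$, where $\mathcal{A}$ is the super-adjacency matrix of the multiple protein network, and the normalized network is recorded in a fourth-order tensor $C$, wherein $C_{ij}^{\alpha\beta} = \left( A_{ij}^{(\alpha)} \delta(\alpha,\beta) + A_{\alpha\beta}^{L} I_{ij} \right) / \langle k \rangle$; $I \in \mathbb{R}^{N \times N}$ denotes the identity matrix, and $\delta(\alpha,\beta)$ denotes the Kronecker delta function, with $\delta(\alpha,\beta) = 1$ when $\alpha = \beta$ and 0 otherwise.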
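As a small illustration of the construction and normalization above, consider a toy case with M = 2 layers and N = 2 proteins in which each layer contains a single edge of weight 1 between the two proteins. The example assumes that the inter-layer link matrix has a zero diagonal and that the average degree is the total edge weight of the super-adjacency matrix divided by NM; neither detail is spelled out in the text, so this is a sketch rather than a definitive formulation.

```latex
A^{(1)} = A^{(2)} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
A^{L} = \begin{pmatrix} 0 & \tfrac12 \\ \tfrac12 & 0 \end{pmatrix}

\mathcal{A} = \mathrm{diag}\left(A^{(1)}, A^{(2)}\right) + A^{L} \otimes I_{2}
= \begin{pmatrix}
0 & 1 & \tfrac12 & 0 \\
1 & 0 & 0 & \tfrac12 \\
\tfrac12 & 0 & 0 & 1 \\
0 & \tfrac12 & 1 & 0
\end{pmatrix}

\langle k \rangle = \frac{1}{NM} \sum_{a,b} \mathcal{A}_{ab} = \frac{6}{4} = \frac{3}{2},
\qquad
\frac{\mathcal{A}}{\langle k \rangle}
= \begin{pmatrix}
0 & \tfrac23 & \tfrac13 & 0 \\
\tfrac23 & 0 & 0 & \tfrac13 \\
\tfrac13 & 0 & 0 & \tfrac23 \\
0 & \tfrac13 & \tfrac23 & 0
\end{pmatrix}
```

After normalization every node in this toy network has weighted degree 1, so the average degree of the normalized network is 1.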
Further, step 2 specifically includes: when the pulse dynamics process is excited on the multiple protein network, the pulse dynamics equation on the normalized multiple protein network is defined as:
$\frac{d x_i^{\alpha}(t)}{dt} = f\left(x_i^{\alpha}(t)\right) + \sum_{j=1}^{N} C_{ij}^{\alpha\alpha} \left( x_j^{\alpha}(t) - x_i^{\alpha}(t) \right) + \sum_{\beta=1}^{M} C_{ii}^{\alpha\beta} \left( x_i^{\beta}(t) - x_i^{\alpha}(t) \right) + b_i^{\alpha} u_t$,
wherein $x_i^{\alpha}(t)$ is the state of node i (i = 1 to N, N being the total number of nodes) in network layer α (α = 1 to M, M being the total number of network layers) at time t; $f(\cdot)$ is a continuously differentiable function describing the self-evolution of a node in the absence of influence from other nodes, defined as $f(x) = -\theta x$, where $\theta > 0$ is the self-evolution weight parameter; $C_{ij}^{\alpha\beta}$ is the diffusion coefficient between node i of network layer α and node j of network layer β, i.e., the connection weight between the nodes after network normalization, C being the fourth-order tensor; $b_i^{\alpha} = 1$ if node i of network layer α is a control node to which the periodic pulse signal is applied, i.e., a known disease gene, and $b_i^{\alpha} = 0$ otherwise; $u_t = \sum_{\sigma} \delta(t - t_{\sigma})$ is the periodic activation function, where $t_{\sigma}$ is the time of the σ-th pulse and $\delta(t - t_{\sigma})$ is the Dirac delta function (a unit impulse at $t = t_{\sigma}$ and zero elsewhere).
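Two new fourth-order tensors are defined from the fourth-order tensor C to represent the Laplacian matrices of the intra-layer sub-network and the inter-layer sub-network of the multiple network, respectively, as follows:
$(L^{\mathrm{intra}})_{ij}^{\alpha\beta} = \delta(\alpha,\beta) \left[ \delta(i,j) \sum_{k=1}^{N} C_{ik}^{\alpha\alpha} - C_{ij}^{\alpha\alpha} \right]$, $(L^{\mathrm{inter}})_{ij}^{\alpha\beta} = \delta(i,j) \left[ \delta(\alpha,\beta) \sum_{\gamma=1}^{M} C_{ii}^{\alpha\gamma} - C_{ii}^{\alpha\beta} \right]$,
wherein $\delta(\cdot,\cdot)$ denotes the Kronecker delta function, equal to 1 when its two arguments coincide and 0 otherwise; expanding the two tensors yields the intra-layer and inter-layer super-Laplacian matrices of the multiple network, $\mathcal{L}^{\mathrm{intra}}$ and $\mathcal{L}^{\mathrm{inter}}$;
with these super-Laplacian matrices, the multiple network pulse dynamics equation is expressed in matrix form as
$\frac{dX(t)}{dt} = -\left( \theta I + \mathcal{L} \right) X(t) + B u_t$,
wherein $X(t)$ is the state vector, $\mathcal{L} = \mathcal{L}^{\mathrm{intra}} + \mathcal{L}^{\mathrm{inter}}$ is the super-Laplacian matrix of the multiple network, $B$ is the vector indicating the control nodes, and $u_t$ is the aforementioned periodic activation function; based on this matrix equation, the characteristic time of the kinetic equation is $\tau = 1/\lambda_m$, wherein $\lambda_m$ is the smallest eigenvalue of the matrix $\theta I + \mathcal{L}$, I is the identity matrix, and $\theta > 0$; according to the characteristic time τ, the pulse period is set to at least 5 times the characteristic time constant.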
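The choice of a pulse period of at least 5τ can be motivated by solving the matrix equation between pulses. The sketch below writes the equation as $dX/dt = -(\theta I + \mathcal{L})X + B u_t$ and assumes that $\lambda_m$ is the smallest eigenvalue of $\theta I + \mathcal{L}$ and that the network is undirected, so that this matrix is symmetric. The symbols $T$ (pulse period), $v_k$ and $\lambda_k$ (orthonormal eigenvectors and eigenvalues) are introduced here for illustration only.

```latex
% Jump at each pulse time t_\sigma, free decay until the next pulse
X(t_\sigma^{+}) = X(t_\sigma^{-}) + B, \qquad
X(t) = e^{-(\theta I + \mathcal{L})(t - t_\sigma)} \, X(t_\sigma^{+}),
\quad t_\sigma < t < t_{\sigma+1}

% Eigen-decomposition: (\theta I + \mathcal{L}) v_k = \lambda_k v_k, \ \lambda_k \ge \lambda_m > 0
X(t) = \sum_{k} e^{-\lambda_k (t - t_\sigma)} \left( v_k^{\top} X(t_\sigma^{+}) \right) v_k

% After one period T \ge 5\tau = 5/\lambda_m every mode has decayed by at least
e^{-\lambda_k T} \le e^{-\lambda_m T} \le e^{-5} \approx 0.0067
```

Under these assumptions each mode retains less than 1% of its amplitude when the next pulse arrives, so the responses to successive pulses are nearly independent.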
Further, step 3 specifically includes: to extract the pulse dynamics characteristics of the multiple protein network, the known disease-related genes are used as pulse excitation points to excite the pulse dynamics process in the multiple protein network according to the multiple protein network pulse dynamics model, and the pulse response curves of the network nodes are calculated according to the multiple protein network pulse dynamics equation; the dynamic characteristic (S) of a network node in response to the pulse signal during the multiple protein network pulse dynamics process is defined as $S_i^{\alpha} = \max_{t} x_i^{\alpha}(t)$, i.e., the maximum value of the node's pulse dynamics response; the magnitude of this dynamic characteristic is calculated for each network node according to the definition and describes the association strength between the node and the control nodes.
Further, step 4 specifically includes: in a multiple protein network comprising M network layers of N nodes, each protein has M corresponding replica nodes, i.e., M pulse dynamics characteristic values $S_i^{\alpha}$ (α = 1 to M); in each network layer, the nodes are ranked in descending order of their dynamic characteristic values $S_i^{\alpha}$, giving the rank $R_i^{\alpha}$ of node i in layer α; then the reciprocal of the geometric mean of the ranks of the nodes corresponding to the same protein in the M network layers of the multiple protein network is calculated to obtain the comprehensive score of the protein: $\mathrm{Score}_i = 1 \big/ \left( \prod_{\alpha=1}^{M} R_i^{\alpha} \right)^{1/M}$; finally, the proteins are ranked in descending order of comprehensive score, and higher-ranked proteins are more likely to correspond to disease-related candidate genes, whereby disease genes are identified or predicted and effective guidance is provided for biological experimental research on disease genes.
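A short worked example may make the fusion rule concrete. The numbers below are invented for illustration: with M = 3 layers, protein A is ranked 1st, 2nd and 4th in the three layers, while protein B is ranked 3rd in every layer.

```latex
\mathrm{Score}_A = \frac{1}{(1 \cdot 2 \cdot 4)^{1/3}} = \frac{1}{8^{1/3}} = \frac{1}{2} = 0.5,
\qquad
\mathrm{Score}_B = \frac{1}{(3 \cdot 3 \cdot 3)^{1/3}} = \frac{1}{3} \approx 0.333
```

Protein A therefore precedes protein B in the final descending order. Because the geometric mean is a product of ranks, a single top rank in one layer pulls the comprehensive score up more strongly than it would under an arithmetic mean.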
Further, the protein physical interaction network acquired in step (1) specifically includes one or more of a regulatory network, a metabolic network, a signaling network, a protein complex network, a protein kinase network, a high-throughput binary interaction network, and a literature-validated protein interaction network.
Further, the protein function association network constructed in step (1) specifically includes a gene co-expression network and/or a gene semantic association network based on disease gene association.
In another aspect, the invention also discloses a disease gene prediction system based on multiple protein network pulse dynamics process, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor to invoke the program instructions to perform the disease gene prediction method based on the multiple protein network pulse dynamics process as described in any one of the above.
In a further aspect, the invention also discloses a non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the disease gene prediction method based on multiple protein network pulse dynamics process as described in any one of the above.
(III) Beneficial effects
The technical scheme of the invention has the following advantages: the method fuses multiple types of protein networks more effectively, and the information hidden in the multiple protein network structure is mined through the pulse dynamics process of the multiple protein network, so that disease-related genes are identified more effectively. Experimental results on real data sets show that, compared with several existing methods, the prediction method provided by the invention has stronger and more accurate prediction capability, requires little computation, and is suitable for batch analysis of biological information data through software.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:
FIG. 1 is a flowchart of a disease gene prediction method NIDM of the present invention;
FIG. 2 is a graph showing the percentage improvement of the performance of the disease gene prediction method NIDM according to the invention in different data sets when a leave-one-out verification strategy is adopted;
FIG. 3 is a graph of percentage improvement of performance of a disease gene prediction method NIDM of the present invention in different data sets using a five-fold cross-validation strategy;
FIG. 4 is a graph comparing the performance index of the disease gene prediction method NIDM of the present invention with the performance index of the existing RWRMP, RWRMG, DRS, endeavour, RWR and KS methods when a leave-one-out verification strategy is adopted;
FIG. 5 is a graph comparing performance metrics of the disease gene prediction method NIDM of the present invention with the existing RWRMP, RWRMG, DRS, endeavour, RWR and KS methods when a five-fold cross-validation strategy is employed.
Detailed Description
The technical problems and advantages of the technical solution of the present invention will be described in detail with reference to the accompanying drawings and examples, and it should be noted that the described examples are only intended to facilitate understanding of the present invention and are not intended to limit the present invention in any way.
As shown in FIG. 1, the invention provides a disease gene prediction method based on multiple protein network pulse dynamics process, which comprises the following steps:
Step 1: Construction of a standardized multiple protein network
After biological data preprocessing, the nodes corresponding to the same protein in a plurality of different types of protein networks are connected to construct a multiple protein network, so that multi-network fusion is realized; the edge weights of the multiple protein network are normalized by calculating the average degree of the network nodes.
Step 1 specifically comprises the following steps:
(1) Biological data preprocessing: acquiring known disease gene-related data, disease phenotype-related annotation data, and human phenotype ontology data; acquiring a protein physical interaction network (e.g., regulatory network, metabolic network, signaling network, protein complex network, protein kinase network, high-throughput binary interaction network, and literature-validated protein interaction network); constructing a protein function association network (such as a gene co-expression network and a gene semantic association network based on disease gene association); uniformly mapping protein identifiers to standard gene symbols;
(2) Multiple protein network construction: M network layers, each having N nodes, are interconnected through a multiple protein network model so as to integrate multiple types of protein association networks. The specific operation is as follows: given M network layers, each comprising N nodes, the nodes corresponding to the same protein in the M different types of protein networks are connected, and the connection weight between network layers is 1/M. To facilitate matrix operations, let $A^{(\alpha)} \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of network layer $\alpha$; the multiple protein network is then represented by the super-adjacency matrix $\mathcal{A} = \mathcal{A}^{\mathrm{intra}} + \mathcal{A}^{\mathrm{inter}}$, where $\mathcal{A}^{\mathrm{intra}}$ is the intra-layer super-adjacency matrix corresponding to the independent network layers, defined as
$\mathcal{A}^{\mathrm{intra}} = \bigoplus_{\alpha=1}^{M} A^{(\alpha)} = \mathrm{diag}\left(A^{(1)}, \dots, A^{(M)}\right)$,
and $\mathcal{A}^{\mathrm{inter}}$ is the inter-layer super-adjacency matrix, defined as
$\mathcal{A}^{\mathrm{inter}} = A^{L} \otimes I$,
wherein $A^{L} \in \mathbb{R}^{M \times M}$ is the inter-layer link matrix, whose nodes represent the network layers and whose edge weights are the link strengths between network layers, i.e., 1/M; $\otimes$ denotes the Kronecker product; and $I \in \mathbb{R}^{N \times N}$ denotes the identity matrix;
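(3) Normalization of the multiple protein network: the weights of all edges of the multiple protein network are divided by the average degree of the network nodes to normalize the multiple protein network. The average degree of the network nodes is calculated as $\langle k \rangle = \frac{1}{NM} \sum_{a,b} \mathcal{A}_{ab}$, where $\mathcal{A}$ is the super-adjacency matrix of the multiple protein network, and the normalized network is recorded in a fourth-order tensor $C$, wherein $C_{ij}^{\alpha\beta} = \left( A_{ij}^{(\alpha)} \delta(\alpha,\beta) + A_{\alpha\beta}^{L} I_{ij} \right) / \langle k \rangle$; $I \in \mathbb{R}^{N \times N}$ denotes the identity matrix, and $\delta(\alpha,\beta)$ denotes the Kronecker delta function, with $\delta(\alpha,\beta) = 1$ when $\alpha = \beta$ and 0 otherwise.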
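The construction and normalization of step 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function names and the two toy layers are hypothetical, the layers are stacked in layer-major order, the inter-layer link matrix is assumed to have a zero diagonal, and the average degree is taken as the total edge weight divided by the number of replica nodes NM.

```python
import numpy as np

def build_super_adjacency(layers):
    """Combine M layer adjacency matrices (each N x N) into an NM x NM matrix."""
    n_layers = len(layers)
    n_nodes = layers[0].shape[0]
    size = n_layers * n_nodes
    # Intra-layer part: block-diagonal arrangement of the layer matrices.
    intra = np.zeros((size, size))
    for a, adjacency in enumerate(layers):
        start = a * n_nodes
        intra[start:start + n_nodes, start:start + n_nodes] = adjacency
    # Inter-layer part: A_L (Kronecker product) I, with weight 1/M between
    # different layers; the zero diagonal of A_L is an assumption.
    a_l = (np.ones((n_layers, n_layers)) - np.eye(n_layers)) / n_layers
    inter = np.kron(a_l, np.eye(n_nodes))
    return intra + inter

def normalize_by_average_degree(super_adj):
    """Divide all edge weights by the average weighted node degree."""
    average_degree = super_adj.sum() / super_adj.shape[0]
    return super_adj / average_degree

# Toy data (hypothetical): two layers over the same three proteins.
ppi = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
coexpression = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
super_adj = build_super_adjacency([ppi, coexpression])
normalized = normalize_by_average_degree(super_adj)
```

Storing the fourth-order tensor C as a flat NM x NM matrix keeps the later dynamics a plain matrix-vector computation; entry (αN + i, βN + j) of the matrix plays the role of the tensor entry for node i of layer α and node j of layer β.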
Step 2: construction of multiple protein network pulse dynamics model
Applying periodic pulse signals to seed nodes of each network layer of the multiple protein network in the step 1, exciting the pulse dynamics process of the multiple protein network, calculating the pulse response curve of the multiple protein network nodes, and mining the hidden characteristics of the network nodes;
Step 2 specifically comprises the following: when the pulse dynamics process is excited on the multiple protein network, the pulse dynamics equation on the normalized multiple protein network is defined as:
$\frac{d x_i^{\alpha}(t)}{dt} = f\left(x_i^{\alpha}(t)\right) + \sum_{j=1}^{N} C_{ij}^{\alpha\alpha} \left( x_j^{\alpha}(t) - x_i^{\alpha}(t) \right) + \sum_{\beta=1}^{M} C_{ii}^{\alpha\beta} \left( x_i^{\beta}(t) - x_i^{\alpha}(t) \right) + b_i^{\alpha} u_t$,
wherein $x_i^{\alpha}(t)$ is the state of node i (i = 1 to N, N being the total number of nodes) in network layer α (α = 1 to M, M being the total number of network layers) at time t; $f(\cdot)$ is a continuously differentiable function describing the self-evolution of a node in the absence of influence from other nodes, defined as $f(x) = -\theta x$, where $\theta > 0$ is the self-evolution weight parameter; $C_{ij}^{\alpha\beta}$ is the diffusion coefficient between node i of network layer α and node j of network layer β, i.e., the connection weight between the nodes after network normalization, C being the fourth-order tensor, which determines the diffusion behavior of the pulse signals between the nodes of each network layer; $b_i^{\alpha} = 1$ if node i of network layer α is a control node to which the periodic pulse signal is applied, i.e., a known disease gene, and $b_i^{\alpha} = 0$ otherwise; $u_t = \sum_{\sigma} \delta(t - t_{\sigma})$ is the periodic activation function, where $t_{\sigma}$ is the time of the σ-th pulse and $\delta(t - t_{\sigma})$ is the Dirac delta function (a unit impulse at $t = t_{\sigma}$ and zero elsewhere); the four terms of the pulse dynamics equation describe, respectively, the self-evolution of the node, the influence of intra-layer interactions, the influence of inter-layer interactions in the multiple protein network, and the influence of the periodic pulse signal;
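two new fourth-order tensors are defined from the fourth-order tensor C to represent the Laplacian matrices of the intra-layer sub-network and the inter-layer sub-network of the multiple network, respectively, as follows:
$(L^{\mathrm{intra}})_{ij}^{\alpha\beta} = \delta(\alpha,\beta) \left[ \delta(i,j) \sum_{k=1}^{N} C_{ik}^{\alpha\alpha} - C_{ij}^{\alpha\alpha} \right]$, $(L^{\mathrm{inter}})_{ij}^{\alpha\beta} = \delta(i,j) \left[ \delta(\alpha,\beta) \sum_{\gamma=1}^{M} C_{ii}^{\alpha\gamma} - C_{ii}^{\alpha\beta} \right]$,
wherein $\delta(\cdot,\cdot)$ denotes the Kronecker delta function, equal to 1 when its two arguments coincide and 0 otherwise; expanding the two tensors yields the intra-layer and inter-layer super-Laplacian matrices of the multiple network, $\mathcal{L}^{\mathrm{intra}}$ and $\mathcal{L}^{\mathrm{inter}}$;
with these super-Laplacian matrices, the multiple network pulse dynamics equation is expressed in matrix form as
$\frac{dX(t)}{dt} = -\left( \theta I + \mathcal{L} \right) X(t) + B u_t$,
wherein $X(t)$ is the state vector, $\mathcal{L} = \mathcal{L}^{\mathrm{intra}} + \mathcal{L}^{\mathrm{inter}}$ is the super-Laplacian matrix of the multiple network, $B$ is the vector indicating the control nodes, and $u_t$ is the aforementioned periodic activation function; based on this matrix equation, the characteristic time of the kinetic equation is $\tau = 1/\lambda_m$, wherein $\lambda_m$ is the smallest eigenvalue of the matrix $\theta I + \mathcal{L}$, I is the identity matrix, and $\theta > 0$; according to the characteristic time τ, the pulse period is set to at least 5 times the characteristic time constant.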
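The pulse dynamics of step 2 can be simulated directly from the matrix form. The sketch below is one possible realization under stated assumptions, not the patented implementation: the normalized network is stored as a symmetric NM x NM matrix in layer-major order, the super-Laplacian is taken as the degree matrix minus that matrix, $\lambda_m$ is taken to be the smallest eigenvalue, each pulse is modelled as an instantaneous jump of the seed states by 1, and the decay between pulses uses the exact matrix exponential obtained by eigen-decomposition. The function name, `theta`, `n_pulses`, `steps_per_period` and the toy network are hypothetical.

```python
import numpy as np

def simulate_pulse_response(normalized, seeds, theta=1.0, n_pulses=3,
                            steps_per_period=200):
    """Return sampled node states (time x NM) and the characteristic time tau."""
    n = normalized.shape[0]
    laplacian = np.diag(normalized.sum(axis=1)) - normalized
    system = theta * np.eye(n) + laplacian
    # The system matrix is symmetric for undirected networks, so eigh applies.
    eigvals, eigvecs = np.linalg.eigh(system)
    tau = 1.0 / eigvals.min()          # assumed: smallest eigenvalue
    period = 5.0 * tau                 # pulse period of 5 characteristic times
    dt = period / steps_per_period
    # Exact propagator exp(-system * dt) for one sampling step.
    propagator = (eigvecs * np.exp(-eigvals * dt)) @ eigvecs.T
    control = np.zeros(n)
    control[list(seeds)] = 1.0
    state = np.zeros(n)
    trajectory = []
    for _ in range(n_pulses):
        state = state + control        # unit impulse applied to seed nodes
        trajectory.append(state.copy())
        for _ in range(steps_per_period):
            state = propagator @ state  # free decay and diffusion
            trajectory.append(state.copy())
    return np.array(trajectory), tau

# Toy network (hypothetical): 2 layers x 3 proteins, inter-layer weight 1/2.
ppi = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
coexpression = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]], dtype=float)
super_adj = np.block([[ppi, 0.5 * np.eye(3)], [0.5 * np.eye(3), coexpression]])
normalized = super_adj / (super_adj.sum() / super_adj.shape[0])

# Protein 0 is the seed; its replicas are index 0 (layer 1) and 3 (layer 2).
trajectory, tau = simulate_pulse_response(normalized, seeds=[0, 3])
```

Using the eigen-decomposition once and reusing the propagator avoids step-size stability concerns of explicit time stepping; for large networks a sparse ODE solver would likely be preferable, which is outside the scope of this sketch.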
Step 3: extraction of multiple protein network pulse dynamics features
Obtaining the association strength between the network node and the seed node by calculating the dynamic characteristics (S) of the multiple protein network node to the pulse signals;
Step 3 specifically comprises the following: to extract the pulse dynamics characteristics of the multiple protein network, the known disease-related genes are used as pulse excitation points to excite the pulse dynamics process in the multiple protein network according to the multiple protein network pulse dynamics model, and the pulse response curves of the network nodes are calculated according to the multiple protein network pulse dynamics equation; the dynamic characteristic (S) of a network node in response to the pulse signal during the multiple protein network pulse dynamics process is defined as $S_i^{\alpha} = \max_{t} x_i^{\alpha}(t)$, i.e., the maximum value of the node's pulse dynamics response; the magnitude of this dynamic characteristic is calculated for each network node according to the definition and describes the association strength between the node and the control nodes.
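Given sampled response curves, the characteristic S of step 3 is simply the peak of each curve. The following sketch assumes the curves are stored as a (time x NM) array in layer-major order; the function name and the small hand-written trajectory are hypothetical and serve only to show the reshaping into one row per layer.

```python
import numpy as np

def pulse_features(trajectory, n_layers):
    """Peak response of every replica node, returned as an (M, N) array."""
    peak = trajectory.max(axis=0)        # maximum over time for each node
    return peak.reshape(n_layers, -1)    # row alpha holds layer alpha

# Hypothetical samples for M = 2 layers and N = 2 proteins (columns are
# layer-major: protein 0 and 1 in layer 1, then protein 0 and 1 in layer 2).
trajectory = np.array([
    [1.0, 0.00, 1.0, 0.00],
    [0.6, 0.20, 0.5, 0.10],
    [0.3, 0.25, 0.2, 0.15],
])
features = pulse_features(trajectory, n_layers=2)
```

Because the feature is a maximum over sampled times, the sampling step should be fine enough to resolve the peak of nodes far from the seeds, whose response rises and falls more slowly.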
Step 4: Fusion of the pulse dynamics characteristics of the multiple protein network nodes to predict disease genes
Based on the dynamic characteristics of step 3, a comprehensive protein score is obtained by calculating the reciprocal of the geometric mean of the ranks of the nodes corresponding to the same protein in each network layer of the multiple protein network; disease genes are screened by ranking the proteins in descending order of comprehensive score.
Step 4 specifically includes: in a multiple protein network comprising M network layers, each protein has M corresponding replica nodes, that is, M pulse dynamics characteristic values $S_i^{\alpha}$ (α = 1 to M); in each network layer, the nodes are ranked separately in descending order of their dynamic characteristic values $S_i^{\alpha}$, giving the rank $R_i^{\alpha}$ of node i in layer α; then the reciprocal of the geometric mean of the ranks of the nodes corresponding to the same protein in the M network layers of the multiple protein network is calculated to obtain the comprehensive score of the protein: $\mathrm{Score}_i = 1 \big/ \left( \prod_{\alpha=1}^{M} R_i^{\alpha} \right)^{1/M}$; finally, the proteins are ranked in descending order of comprehensive score, and higher-ranked proteins are more likely to correspond to disease-related candidate genes, whereby disease genes are identified or predicted and effective guidance is provided for biological experimental research on disease genes.
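The rank fusion of step 4 can be sketched as follows. The function name and the feature values are hypothetical, and ties within a layer are broken by node index, which is an assumption since the text does not address ties.

```python
import numpy as np

def fuse_layer_ranks(features):
    """features: (M, N) peak responses. Returns (scores, descending order)."""
    n_layers, n_nodes = features.shape
    # Descending rank within each layer: the largest feature gets rank 1.
    order = np.argsort(-features, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(n_layers)[:, None]
    ranks[rows, order] = np.arange(1, n_nodes + 1)
    # Reciprocal of the geometric mean of the M ranks of each protein.
    geometric_mean = np.exp(np.log(ranks).mean(axis=0))
    scores = 1.0 / geometric_mean
    return scores, np.argsort(-scores, kind="stable")

# Hypothetical features for M = 2 layers and N = 4 proteins.
features = np.array([
    [0.9, 0.2, 0.5, 0.1],   # layer 1 ranks: 1, 3, 2, 4
    [0.8, 0.1, 0.3, 0.6],   # layer 2 ranks: 1, 4, 3, 2
])
scores, ranking = fuse_layer_ranks(features)
```

In this toy case protein 0 is ranked first in both layers and receives the maximum score of 1, followed by proteins 2, 3 and 1. Computing the geometric mean through logarithms avoids overflow when the number of layers or nodes is large.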
To demonstrate the advantages of the invention, in another embodiment the validity of the prediction method is further verified experimentally: known disease-gene association data serve as the test platform, and leave-one-out validation and five-fold cross-validation are used to evaluate the performance of the method comprehensively;
(1) Biological data tested: disease gene data come from the OMIM database (https://omim.org/); protein physical interaction data come from published literature data (https://science.sciencemag.org/content/suppl/2015/02/18/347.6224.1257601.DC1); gene expression data come from GTEx; disease phenotype data and phenotype ontology data come from the HPO database;
(2) Evaluation strategy: for leave-one-out validation, one known disease-gene association at a time is held out as the positive test set and the remaining associations serve as the training set; for five-fold cross-validation, the known disease gene set of each disease is randomly split into 5 parts, each part in turn serving as the positive test set while the other parts serve as the training set, and the splitting process is repeated multiple times; for the control set, for each gene of the positive test set, the 99 genes closest to it on the same chromosome that do not belong to the training set are selected;
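The evaluation strategy can be sketched as below. This is a hedged example, not the code used in the experiments: the function names, the `gene_positions` mapping from gene symbol to (chromosome, coordinate), the use of a single coordinate to measure closeness, and the toy genes `G0` to `G9` are all assumptions made for illustration.

```python
import random


def five_fold_splits(disease_genes, seed=0):
    """Return five (training set, positive test set) pairs for one disease."""
    genes = sorted(disease_genes)
    random.Random(seed).shuffle(genes)
    folds = [genes[k::5] for k in range(5)]
    return [
        ([g for j, fold in enumerate(folds) if j != k for g in fold], folds[k])
        for k in range(5)
    ]


def select_control_set(test_gene, gene_positions, training_genes, size=99):
    """Pick the `size` genes nearest to test_gene on its chromosome, excluding training genes."""
    chrom, pos = gene_positions[test_gene]
    candidates = [
        (abs(p - pos), gene)
        for gene, (c, p) in gene_positions.items()
        if c == chrom and gene != test_gene and gene not in training_genes
    ]
    candidates.sort()
    return [gene for _, gene in candidates[:size]]


splits = five_fold_splits([f"G{i}" for i in range(10)])

toy_positions = {
    "G0": ("chr1", 100), "G1": ("chr1", 150), "G2": ("chr1", 300),
    "G3": ("chr1", 90), "G4": ("chr2", 105), "G5": ("chr1", 1000),
}
controls = select_control_set("G0", toy_positions, training_genes={"G1"}, size=2)
```

Repeating the split with different `seed` values corresponds to repeating the splitting process multiple times.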
(3) Evaluation index: AUROC and AUPRC are used as the evaluation indexes of prediction performance. AUROC, also known as AUC, is the area under the receiver operating characteristic (ROC) curve, a performance curve with the true positive rate (also known as recall or sensitivity) on the ordinate and the false positive rate on the abscissa; it is widely used to measure the global performance of prediction algorithms. AUPRC is the area under the precision-recall curve (PRC), which has precision on the ordinate and recall on the abscissa;
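Both indexes can be computed from a list of labels and prediction scores. The sketch below is an assumption-laden illustration: AUROC is computed through its pairwise-comparison (Mann-Whitney) form, and AUPRC is estimated by average precision, which is one common estimator; the text does not say which interpolation of the precision-recall curve was used. The toy labels and scores are invented, and at least one positive and one negative example are assumed.

```python
def auroc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k
    return total / hits


toy_labels = [1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
roc_value = auroc(toy_labels, toy_scores)
pr_value = average_precision(toy_labels, toy_scores)
```

With one positive test gene and 99 control genes per trial, as in the evaluation strategy above, AUROC reduces to the normalized rank of the test gene among its controls.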
(4) Evaluation results
As can be seen from Figs. 2 and 3, when multiple types of physical interaction networks are used, the multiple protein network pulse dynamics approach outperforms the aggregated-network approach; the same holds when multiple types of physical interaction networks and a gene co-expression network are used. Relative to the multiple protein network pulse dynamics method using only multiple types of physical interaction networks, adding the gene co-expression network enhances predictive ability; relative to the method using multiple types of physical interaction networks and the gene co-expression network, adding the gene semantic similarity network further improves predictive ability;
As can be seen from Fig. 4, in the leave-one-out experiment, for the multiple protein network of multiple types of physical interactions ((a) and (e) in Fig. 4), the multiple protein network combining the physical interaction networks with the gene co-expression network ((b) and (f) in Fig. 4), and the multiple protein network combining the physical interaction networks with the gene co-expression network and the gene semantic similarity network ((c) and (g) in Fig. 4), both the AUROC and AUPRC values of the multiple protein network pulse dynamics method (abbreviated as NIDM) are superior to those of the other methods;
As can be seen from Fig. 5, in the five-fold cross-validation experiment, for the multiple protein network of multiple types of physical interactions ((a) and (e) in Fig. 5), the multiple protein network combining the physical interaction networks with the gene co-expression network ((b) and (f) in Fig. 5), and the multiple protein network combining the physical interaction networks with the gene co-expression network and the gene semantic similarity network ((c) and (g) in Fig. 5), both the AUROC and AUPRC values of NIDM are again superior to those of the other methods;
Therefore, the prediction method NIDM of the invention fuses multiple protein networks more effectively and, through the pulse dynamics process on the multiple protein network, extracts the hidden information in the network more effectively, thereby identifying disease genes more effectively.
It should be further noted that the above-mentioned prediction method of the present invention may be implemented as a software program or a computer instruction in a non-transitory computer readable storage medium or in a control system with a memory and a processor, and the calculation program thereof is simple and fast. The functional units in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units. The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform part of the steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (9)
1. A disease gene prediction method based on multiple protein network pulse dynamics process is characterized by comprising the following steps:
step 1: after biological data preprocessing, a plurality of protein networks of different types are connected with nodes corresponding to the same protein, so that a multi-protein network is constructed, multi-network fusion is realized, and the edge weight of the multi-protein network is standardized by calculating the average degree of network nodes;
step 2: applying periodic pulse signals to seed nodes of each network layer of the multiple protein network in the step 1 to excite the pulse dynamics process of the multiple protein network, calculating the pulse response curve of the multiple protein network nodes, and mining the hidden characteristics of the network nodes;
step 3: acquiring the association strength between the network node and the seed node by calculating the dynamic characteristics of the multiple protein network nodes on the pulse signals;
step 4: based on the dynamic characteristics in the step 3, obtaining a comprehensive protein score by calculating the reciprocal of the geometric average of node ranking values corresponding to the same protein in each network layer of the multiple protein network; disease genes are screened by calculating a descending order of protein composite scores.
2. The disease gene prediction method based on multiple protein network pulse dynamics process according to claim 1, wherein the step 1 specifically comprises the following steps:
(1) Biological data pretreatment: acquiring known disease gene-related data, disease phenotype-related annotation data, and human phenotype ontology data; acquiring a protein physical interaction network; constructing a protein function association network; uniformly mapping protein numbers into standard gene symbols;
(2) Multiple protein network construction: M network layers of N nodes each are interconnected through a multiple protein network model so as to integrate multiple types of protein association networks. Specifically, given M network layers, each containing N nodes, the nodes corresponding to the same protein in the M different types of protein networks are connected, and the connection weight between network layers is 1/M. To facilitate matrix operations, let $A^{(\alpha)} \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of network layer $\alpha$; the multiple protein network is represented by the super-adjacency matrix

$$\mathcal{A} = \mathcal{A}^{\mathrm{intra}} + \mathcal{A}^{\mathrm{inter}},$$

where $\mathcal{A}^{\mathrm{intra}}$ is the intra-layer super-adjacency matrix corresponding to the independent network layers, defined as

$$\mathcal{A}^{\mathrm{intra}} = \bigoplus_{\alpha=1}^{M} A^{(\alpha)} = \mathrm{diag}\left(A^{(1)}, \dots, A^{(M)}\right),$$

and $\mathcal{A}^{\mathrm{inter}}$ is the inter-layer super-adjacency matrix, defined as

$$\mathcal{A}^{\mathrm{inter}} = A^{L} \otimes I,$$

wherein $A^{L} \in \mathbb{R}^{M \times M}$ is the inter-layer link matrix, whose nodes represent the network layers and whose edge weights are the link strengths between network layers, set to 1/M; $\otimes$ denotes the Kronecker product; and $I \in \mathbb{R}^{N \times N}$ is the identity matrix;
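(3) Normalization of the multiple protein network: the weights of all edges of the multiple protein network are divided by the average degree of the network nodes to normalize the multiple protein network. The average degree of the network nodes is

$$\langle k \rangle = \frac{1}{MN} \sum_{\alpha,\beta=1}^{M} \sum_{i,j=1}^{N} \left[ A^{(\alpha)}_{ij}\,\delta(\alpha,\beta) + A^{L}_{\alpha\beta}\, I_{ij} \right],$$

and the normalized network is recorded in a fourth-order tensor C, wherein

$$C^{i\alpha}_{j\beta} = \frac{1}{\langle k \rangle} \left[ A^{(\alpha)}_{ij}\,\delta(\alpha,\beta) + A^{L}_{\alpha\beta}\, I_{ij} \right],$$

$I \in \mathbb{R}^{N \times N}$ is the identity matrix, and $\delta(\alpha,\beta)$ is the Kronecker delta function, with $\delta(\alpha,\beta) = 1$ when $\alpha = \beta$ and 0 otherwise.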
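The construction and normalization in step 1 can be sketched as follows. This is only an illustration under assumptions: matrices are plain nested lists, rows and columns of the super-adjacency matrix are ordered layer by layer, the inter-layer link matrix is taken to have 1/M off its diagonal and 0 on it, and the average degree is taken as the mean weighted row sum of the super-adjacency matrix. The function names and the two toy layers are hypothetical.

```python
def supra_adjacency(layers):
    """layers: list of M symmetric N x N adjacency matrices (lists of lists)."""
    m, n = len(layers), len(layers[0])
    size = m * n
    supra = [[0.0] * size for _ in range(size)]
    for a in range(m):
        for b in range(m):
            for i in range(n):
                for j in range(n):
                    if a == b:
                        value = float(layers[a][i][j])   # intra-layer block
                    elif i == j:
                        value = 1.0 / m                  # link between replicas of one protein
                    else:
                        value = 0.0
                    supra[a * n + i][b * n + j] = value
    return supra


def normalize_by_average_degree(supra):
    """Divide every edge weight by the mean weighted degree of the nodes."""
    size = len(supra)
    average_degree = sum(sum(row) for row in supra) / size
    return [[w / average_degree for w in row] for row in supra]


layer_1 = [[0, 1], [1, 0]]
layer_2 = [[0, 2], [2, 0]]
supra = supra_adjacency([layer_1, layer_2])
normalized = normalize_by_average_degree(supra)
```

After normalization the mean weighted degree of the nodes equals 1, which keeps the time scale of the dynamics comparable across networks of different density.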
3. The disease gene prediction method based on multiple protein network pulse dynamics process according to claim 1, wherein the step 2 specifically comprises: when the pulse dynamics process is excited on the multiple protein network, the pulse dynamics equation on the normalized multiple protein network is defined as:

$$\frac{dx_i^{\alpha}(t)}{dt} = f\left(x_i^{\alpha}(t)\right) + \sum_{\beta=1}^{M} \sum_{j=1}^{N} C^{i\alpha}_{j\beta} \left( x_j^{\beta}(t) - x_i^{\alpha}(t) \right) + b_i^{\alpha} u_t,$$

wherein $x_i^{\alpha}(t)$ represents the state of node $i$ in network layer $\alpha$ at time $t$, $\alpha = 1, \dots, M$, M is the total number of network layers, $i = 1, \dots, N$, and N is the total number of nodes; $f(\cdot)$ is a continuously differentiable function describing the self-evolution of a node when it is not influenced by other nodes, defined as $f(x_i^{\alpha}) = -\theta x_i^{\alpha}$, where $\theta > 0$ is the self-evolution weight parameter; $C^{i\alpha}_{j\beta}$ is the diffusion coefficient between node $i$ of network layer $\alpha$ and node $j$ of network layer $\beta$, i.e. the connection weight between the nodes after network normalization, with C the fourth-order tensor; if node $i$ of network layer $\alpha$ is a control node to which the periodic pulse signal is applied, i.e. a known disease gene, then $b_i^{\alpha} = 1$, otherwise $b_i^{\alpha} = 0$; $u_t = \sum_{\sigma} \delta(t - t_{\sigma})$ is the periodic activation function, where $t_{\sigma}$ is the pulse time constant and $\delta(t - t_{\sigma})$ is the Dirac delta function, i.e. $\delta(t - t_{\sigma}) = 1$ when $t - t_{\sigma} = 0$ and 0 otherwise;
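two new fourth-order tensors are defined from the fourth-order tensor C to represent the Laplace matrices of the intra-layer sub-network and the inter-layer sub-network of the multiple network, respectively:

$$\left(L^{\mathrm{intra}}\right)^{i\alpha}_{j\beta} = \delta(\alpha,\beta) \left[ \delta(i,j) \sum_{k=1}^{N} C^{i\alpha}_{k\alpha} - C^{i\alpha}_{j\alpha} \right],$$

$$\left(L^{\mathrm{inter}}\right)^{i\alpha}_{j\beta} = \delta(i,j) \left[ \delta(\alpha,\beta) \sum_{\gamma=1}^{M} C^{i\alpha}_{i\gamma} - C^{i\alpha}_{i\beta} \right],$$

wherein $\delta(\cdot,\cdot)$ represents the Kronecker delta function, e.g. $\delta(\alpha,\beta) = 1$ when $\alpha = \beta$ and 0 otherwise; expanding the two tensors gives the intra-layer and inter-layer super-Laplace matrices of the multiple network, $\mathcal{L}^{\mathrm{intra}} \in \mathbb{R}^{MN \times MN}$ and $\mathcal{L}^{\mathrm{inter}} \in \mathbb{R}^{MN \times MN}$;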
using the intra-layer and inter-layer super-Laplace matrices of the multiple network, the multiple network pulse dynamics equation is expressed in matrix form:

$$\frac{dX(t)}{dt} = -\left(\theta I + \mathcal{L}\right) X(t) + B u_t,$$

wherein $X(t) = \left[x_1^{1}(t), \dots, x_N^{1}(t), \dots, x_1^{M}(t), \dots, x_N^{M}(t)\right]^{T}$ is the state vector, $\mathcal{L} = \mathcal{L}^{\mathrm{intra}} + \mathcal{L}^{\mathrm{inter}}$ is the super-Laplace matrix of the multiple network, $B$ is the vector indicating the control nodes, and $u_t$ is the aforementioned periodic activation function; based on the matrix equation, the characteristic time of the kinetic equation is obtained as $\tau = 1/\lambda_m$, wherein $\lambda_m$ is the smallest eigenvalue of the matrix $\theta I + \mathcal{L}$, $I$ is the identity matrix, and $\theta > 0$; the pulse period is set to 5 times or more of the characteristic time $\tau$.
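The choice of pulse period can be illustrated with a short sketch. It rests on assumptions: λ_m is read as the smallest eigenvalue of θI + L, the matrix is symmetric, and the eigenvalue is found by power iteration on a shifted matrix rather than by whatever solver the authors used. The function names, the iteration count and the toy Laplace matrix are hypothetical.

```python
def smallest_eigenvalue(matrix, iterations=500):
    """Smallest eigenvalue of a symmetric matrix via power iteration on (c*I - matrix)."""
    n = len(matrix)
    c = max(sum(abs(v) for v in row) for row in matrix)   # Gershgorin bound on the largest eigenvalue
    shifted = [[(c if i == j else 0.0) - matrix[i][j] for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]                  # arbitrary non-uniform start vector
    for _ in range(iterations):
        w = [sum(shifted[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    w = [sum(shifted[i][j] * v[j] for j in range(n)) for i in range(n)]
    mu = sum(v[i] * w[i] for i in range(n))               # Rayleigh quotient of the unit vector v
    return c - mu


def pulse_period(laplacian, theta, factor=5.0):
    """Return (tau, period) with tau = 1/lambda_m and period = factor * tau."""
    n = len(laplacian)
    system = [
        [laplacian[i][j] + (theta if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    tau = 1.0 / smallest_eigenvalue(system)
    return tau, factor * tau


# Toy Laplace matrix of a path graph 0 - 1 - 2 with unit weights.
toy_laplacian = [
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
]
tau, period = pulse_period(toy_laplacian, theta=0.5)
```

Because a Laplace matrix with zero row sums has 0 as its smallest eigenvalue, λ_m equals θ in this example, so τ = 1/θ and the period is 5/θ; waiting at least five characteristic times lets each response decay to under 1% of its peak before the next pulse.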
4. The method for predicting disease genes based on the pulse dynamics process of multiple protein networks according to claim 3, wherein the step 3 specifically comprises: to extract the pulse dynamics features of the multiple protein network, the known disease-related genes serve as the pulse excitation points; they excite the pulse dynamics process in the multiple protein network according to the multiple protein network pulse dynamics model, and the impulse response curve of each network node is calculated from the multiple protein network pulse dynamics equation; the kinetic feature $S$ of a network node with respect to the pulse signal during the multiple protein network pulse dynamics process is defined as $S_i^{\alpha} = \max_t x_i^{\alpha}(t)$, i.e. the maximum value of the node's impulse response; and the magnitude of the kinetic feature of each network node is calculated from this definition and describes the association strength between the node and the control nodes.
5. The method for predicting disease genes based on the pulse dynamics process of multiple protein networks according to claim 4, wherein the step 4 specifically comprises: in a multiple protein network comprising M network layers of N nodes, each protein has M corresponding replica nodes, i.e. M pulse dynamics feature values $S_i^{\alpha}$ ($\alpha = 1, \dots, M$); in each network layer $\alpha$, the nodes are sorted in descending order of their kinetic feature values $S_i^{\alpha}$, which gives each node a rank $R_i^{\alpha}$; then the reciprocal of the geometric mean of the ranks of the nodes corresponding to the same protein in the M network layers of the multiple protein network is calculated to obtain the composite score of the protein:

$$\mathrm{Score}_i = \frac{1}{\left(\prod_{\alpha=1}^{M} R_i^{\alpha}\right)^{1/M}};$$

finally, the proteins are sorted in descending order of composite score, and proteins ranked earlier are more likely to correspond to candidate disease-related genes, so that disease genes are identified or predicted and effective guidance is provided for biological experimental research on disease genes.
6. The method of claim 2, wherein the protein physical interaction network acquired in step (1) comprises one or more of a regulatory network, a metabolic network, a signaling network, a protein complex network, a protein kinase network, a high-throughput binary interaction network, and a literature-validated protein interaction network.
7. The disease gene prediction method based on multiple protein network pulse dynamics process according to claim 2, wherein the protein function association network constructed in step (1) specifically includes a gene co-expression network and/or a gene semantic association network based on disease gene association.
8. A disease gene prediction system based on multiple protein network pulse dynamics process, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions capable of performing the multiple protein network pulse dynamics-based disease gene prediction method of any one of claims 1 to 7.
9. A non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the disease gene prediction method based on multiple protein network pulse dynamics process according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110141656.5A CN112820347B (en) | 2021-02-02 | 2021-02-02 | Disease gene prediction method based on multiple protein network pulse dynamics process |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112820347A (en) | 2021-05-18 |
CN112820347B (en) | 2023-09-22 |
Family
ID=75860547
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8168568B1 (en) * | 2003-03-10 | 2012-05-01 | The United States Of America, As Represented By The Secretary Of The Department Of Health And Human Services | Combinatorial therapy for protein signaling diseases |
CN107887023A (en) * | 2017-12-08 | 2018-04-06 | 中南大学 | A kind of microbial diseases Relationship Prediction method based on similitude and double random walks |
CN108877953A (en) * | 2018-06-06 | 2018-11-23 | 中南大学 | A kind of drug sensitivity prediction method based on more similitude networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |