CN112820347B - Disease gene prediction method based on multiple protein network pulse dynamics process - Google Patents

Info

Publication number
CN112820347B
Authority
CN
China
Prior art keywords
network
protein
multiple protein
pulse
node
Legal status
Active
Application number
CN202110141656.5A
Other languages
Chinese (zh)
Other versions
CN112820347A (en
Inventor
李敏
项炬
Current Assignee
Central South University
Original Assignee
Central South University

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a disease gene prediction method based on a multiple protein network pulse dynamics process, which mainly comprises the following steps: 1. constructing a standardized multiple protein network; 2. constructing a pulse dynamics model on the multiple protein network; 3. extracting the pulse dynamics features of the multiple protein network; 4. fusing the pulse dynamics features of the multiple protein network nodes and ranking the proteins to predict disease genes. The prediction method fuses multiple protein networks more effectively and mines the features hidden in them, which improves the ability to identify disease genes; its computational cost is low, and it is suitable for software-based analysis of large-scale bioinformatics data.

Description

Disease gene prediction method based on multiple protein network pulse dynamics process
Technical Field
The invention belongs to the field of bioinformatics analysis, and relates to a disease gene prediction method based on a multiple protein network pulse dynamics process.
Background
The identification of disease-related genes is of great importance for the study of disease. Traditional methods such as linkage analysis help to identify disease-related genes, but often cannot accurately locate the disease-causing genes. Because biological experiments are costly and time-consuming, efficient computational methods for predicting and screening disease-related genes from a large number of candidate genes have become critical.
Genes associated with the same or similar diseases are functionally related and tend to cluster close to one another in biological networks such as protein-protein interaction (PPI) networks. Network-based algorithms are therefore very popular in disease gene prediction and related fields, and network propagation, one of the most widely applied strategies, has become a leading method for genetic association research. Traditional network propagation is useful, but it tends to focus on the steady-state solution of the dynamics and may therefore lose useful information hidden in the dynamic process. It is thus necessary to mine directly the hidden information in the dynamic process that helps reveal disease-gene associations.
Neglecting the coexistence of different types of interactions or associations in a networked system, for example by aggregating these relationships into a single network, can change the topological properties of the overall system and significantly affect how well the system can be modeled and predicted. Making full use of various types of biological networks to predict disease-related genes effectively remains a challenge, because the networks (for example metabolic enzyme-coupled interactions and signal transduction) often differ in meaning and reliability. Efficient use of multi-source biomolecular networks will help to enhance the ability of disease gene prediction methods.
Based on this, it is highly desirable to design a disease gene prediction method that can effectively utilize the cross-linking effects of different types of network layers in a networked system.
Disclosure of Invention
(I) Technical problem to be solved
Based on the above, the invention discloses a disease gene prediction method based on a multiple protein network pulse dynamics process. The method improves the ability of disease gene prediction to reveal hidden information related to disease genes; because the analysis based on the pulse dynamics process effectively exploits the cross-linking influence of multiple protein networks, prediction accuracy is improved, and the method is suitable for batch software analysis of biological data.
(II) Technical solution
The invention discloses a disease gene prediction method based on a multiple protein network pulse dynamics process, which comprises the following steps:
Step 1: after biological data preprocessing, in which protein identifiers are mapped uniformly to standard gene symbols, the nodes corresponding to the same protein in several protein networks of different types are connected, so that a multiple protein network is constructed and multi-network fusion is realized; the edge weights of the multiple protein network are standardized by dividing them by the average degree of the network nodes;
Step 2: periodic pulse signals are applied to the seed nodes of each network layer of the multiple protein network from step 1 to excite the pulse dynamics process of the multiple protein network; the pulse response curves of the multiple protein network nodes are calculated, and the hidden features of the network nodes are mined;
Step 3: the association strength between each network node and the seed nodes is obtained by calculating the dynamic feature of the multiple protein network node in response to the pulse signals;
Step 4: based on the dynamic features from step 3, a composite protein score is obtained by calculating the reciprocal of the geometric mean of the ranking values of the nodes corresponding to the same protein in each network layer of the multiple protein network; disease genes are screened by sorting the proteins in descending order of composite score.
Further, the step 1 specifically includes the following steps:
(1) Biological data preprocessing: acquiring known disease-gene association data, disease phenotype annotation data, and human phenotype ontology data; acquiring protein physical interaction networks; constructing protein function association networks; uniformly mapping protein identifiers to standard gene symbols;
(2) Multiple protein network construction: M network layers, each with N nodes, are interconnected through a multiple protein network model so as to integrate multiple types of protein association networks. Specifically, given M network layers that each contain N nodes, the nodes corresponding to the same protein in the M different types of protein networks are connected, and the connection weight between network layers is 1/M. To facilitate matrix operations, let $A^{(\alpha)} \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of network layer $\alpha$; the multiple protein network is then represented by the super-adjacency matrix
$$\mathcal{A} = \mathcal{A}^{\mathrm{intra}} + \mathcal{A}^{\mathrm{inter}} \in \mathbb{R}^{NM \times NM},$$
where $\mathcal{A}^{\mathrm{intra}}$ is the intra-layer super-adjacency matrix corresponding to the independent network layers, defined as
$$\mathcal{A}^{\mathrm{intra}} = \bigoplus_{\alpha=1}^{M} A^{(\alpha)} = \mathrm{diag}\left(A^{(1)}, A^{(2)}, \ldots, A^{(M)}\right),$$
and $\mathcal{A}^{\mathrm{inter}}$ is the inter-layer super-adjacency matrix, defined as
$$\mathcal{A}^{\mathrm{inter}} = A_L \otimes I,$$
wherein $A_L \in \mathbb{R}^{M \times M}$ is the inter-layer link matrix, whose nodes are the network layers and whose edge weights are the link strengths between network layers, set to 1/M; $\otimes$ denotes the Kronecker product; and $I \in \mathbb{R}^{N \times N}$ is the identity matrix (N×N, so that $A_L \otimes I$ has the same size NM×NM as $\mathcal{A}^{\mathrm{intra}}$);
(3) Normalization of the multiple protein network: the weights of all edges of the multiple protein network are divided by the average degree of the network nodes, $\langle k \rangle$, to standardize the multiple protein network. The normalized network is recorded in a fourth-order tensor C, wherein
$$C^{i\alpha}_{j\beta} = \frac{1}{\langle k \rangle}\left[A^{(\alpha)}_{ij}\,\delta(\alpha,\beta) + (A_L)_{\alpha\beta}\,I_{ij}\right],$$
$I \in \mathbb{R}^{N \times N}$ is the identity matrix, and $\delta(\alpha,\beta)$ is the Kronecker delta function: $\delta(\alpha,\beta) = 1$ when $\alpha = \beta$, otherwise 0.
Further, the step 2 specifically includes: when the pulse dynamics process is excited on the multiple protein network, the pulse dynamics equation on the normalized multiple protein network is defined as
$$\frac{dx_i^{\alpha}(t)}{dt} = f\left(x_i^{\alpha}(t)\right) + \sum_{j=1}^{N} C^{i\alpha}_{j\alpha}\left(x_j^{\alpha}(t) - x_i^{\alpha}(t)\right) + \sum_{\beta=1}^{M} C^{i\alpha}_{i\beta}\left(x_i^{\beta}(t) - x_i^{\alpha}(t)\right) + b_i^{\alpha}\,u_t,$$
wherein $x_i^{\alpha}(t)$ is the state at time t of node i (i = 1 to N, N being the total number of nodes) in network layer $\alpha$ ($\alpha$ = 1 to M, M being the total number of network layers); $f(\cdot)$ is a continuously differentiable function describing the self-evolution of a node in the absence of influence from other nodes, defined as $f(x) = -\theta x$, where $\theta > 0$ is the self-evolution weight parameter; $C^{i\alpha}_{j\beta}$ is the diffusion coefficient between node i of network layer $\alpha$ and node j of network layer $\beta$, namely the connection weight between the nodes after network normalization, C being the fourth-order tensor defined above; $b_i^{\alpha} = 1$ if node i of network layer $\alpha$ is a control node to which the periodic pulse signal is applied, i.e. a known disease gene, and $b_i^{\alpha} = 0$ otherwise; $u_t = \sum_{\sigma} \delta(t - t_{\sigma})$ is the periodic activation function, where $t_{\sigma}$ are the pulse times and $\delta(t - t_{\sigma})$ is the Dirac delta function (a unit impulse at $t = t_{\sigma}$ that is 0 elsewhere).
Two new fourth-order tensors are defined from the fourth-order tensor C to represent the Laplace matrices of the intra-layer sub-networks and the inter-layer sub-network of the multiple network, respectively:
$$\left(L^{\mathrm{intra}}\right)^{i\alpha}_{j\beta} = \delta(\alpha,\beta)\left[\delta(i,j)\sum_{k=1}^{N} C^{i\alpha}_{k\alpha} - C^{i\alpha}_{j\alpha}\right], \qquad \left(L^{\mathrm{inter}}\right)^{i\alpha}_{j\beta} = \delta(i,j)\left[\delta(\alpha,\beta)\sum_{\gamma=1}^{M} C^{i\alpha}_{i\gamma} - C^{i\alpha}_{i\beta}\right],$$
wherein $\delta(\cdot,\cdot)$ is the Kronecker delta function, equal to 1 when its two arguments are equal and 0 otherwise. Expanding the two tensors gives the intra-layer and inter-layer super-Laplace matrices of the multiple network, $\mathcal{L}^{\mathrm{intra}}, \mathcal{L}^{\mathrm{inter}} \in \mathbb{R}^{NM \times NM}$.
With these super-Laplace matrices, the multiple network pulse dynamics equation is expressed in matrix form as
$$\frac{dX(t)}{dt} = -\left(\theta I + \mathcal{L}\right)X(t) + B\,u_t,$$
wherein $X(t) \in \mathbb{R}^{NM}$ is the state vector, $\mathcal{L} = \mathcal{L}^{\mathrm{intra}} + \mathcal{L}^{\mathrm{inter}}$ is the super-Laplace matrix of the multiple network, $B \in \{0,1\}^{NM}$ is the vector indicating the control nodes, and $u_t$ is the aforementioned periodic activation function. From the matrix equation, the characteristic time of the dynamics is $\tau = 1/\lambda_m$, where $\lambda_m$ is the minimum eigenvalue of the matrix $\theta I + \mathcal{L}$, I is the NM×NM identity matrix, and $\theta > 0$. The pulse period is set to at least 5 times the characteristic time $\tau$.
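As a worked consequence of the matrix equation, the response between pulses can be written in closed form. This is a sketch that assumes symmetric layer adjacency matrices (so that $\theta I + \mathcal{L}$ is symmetric) and takes $\lambda_m$ to be its minimum eigenvalue; the notation $t_\sigma^{\pm}$ for the instants just after and just before a pulse is introduced here for illustration.

```latex
% Between two consecutive pulses the input term vanishes, so the state decays freely:
X(t) = e^{-(\theta I + \mathcal{L})(t - t_\sigma)}\, X(t_\sigma^{+}),
\qquad t_\sigma < t < t_{\sigma+1}.
% Each pulse adds the control-node indicator vector to the state:
X(t_\sigma^{+}) = X(t_\sigma^{-}) + B.
% Every mode decays at least as fast as e^{-\lambda_m t}, so after one period
% T \ge 5\tau = 5/\lambda_m the residual of the previous pulse satisfies
\left\lVert e^{-(\theta I + \mathcal{L})T}\, X(t_\sigma^{+}) \right\rVert_2
\le e^{-5}\, \left\lVert X(t_\sigma^{+}) \right\rVert_2
\approx 0.0067\, \left\lVert X(t_\sigma^{+}) \right\rVert_2 .
```

Under these assumptions, a period of at least $5\tau$ means each pulse acts on a network that has almost fully relaxed, which is one way to read the rule for setting the pulse period.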
Further, the step 3 specifically includes: to extract the pulse dynamics features of the multiple protein network, the known disease-related genes act as pulse excitation points that excite the pulse dynamics process in the multiple protein network according to the multiple protein network pulse dynamics model, and the pulse response curves of the network nodes are calculated from the multiple protein network pulse dynamics equation. The dynamic feature (S) of a network node in response to the pulse signal during the pulse dynamics process is defined as
$$S_i^{\alpha} = \max_{t}\, x_i^{\alpha}(t),$$
i.e. the maximum value of the node's state in the pulse dynamics response. The magnitude of the dynamic feature of each network node is calculated according to this definition and describes the association strength between the node and the control nodes.
Further, the step 4 specifically includes: in a multiple protein network comprising M network layers of N nodes, each protein has M corresponding replica nodes, i.e. M pulse dynamics feature values $S_i^{\alpha}$ ($\alpha$ = 1 to M). In each network layer, the nodes are sorted in descending order of their dynamic feature $S_i^{\alpha}$, which gives each node a ranking value $R_i^{\alpha}$. The composite score of a protein is then the reciprocal of the geometric mean of the ranking values of its replica nodes in the M network layers:
$$\mathrm{Score}_i = \frac{1}{\left(\prod_{\alpha=1}^{M} R_i^{\alpha}\right)^{1/M}}.$$
Finally, the proteins are sorted in descending order of composite score; the higher a protein ranks, the more likely it corresponds to a candidate gene related to the disease. Disease genes are thereby identified or predicted, which provides effective guidance for biological experimental research on disease genes.
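A small worked example of the composite score may help. The numbers are invented for illustration, and the symbols $R_i^{\alpha}$ and $\mathrm{Score}_i$ are labels chosen here for the ranking values and the composite score.

```latex
% Two layers (M = 2). Protein a is ranked 1st and 4th, protein b is ranked 2nd and 3rd.
\mathrm{Score}_a = \frac{1}{\sqrt{1 \cdot 4}} = \frac{1}{2} = 0.5,
\qquad
\mathrm{Score}_b = \frac{1}{\sqrt{2 \cdot 3}} = \frac{1}{\sqrt{6}} \approx 0.408 .
% Protein a therefore precedes protein b in the final descending order,
% even though both have the same arithmetic mean rank of 2.5.
```

The geometric mean rewards a protein that ranks very high in at least one layer more than the arithmetic mean would.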
Further, the protein physical interaction network acquired in the step (1) specifically includes one or more of a regulatory network, a metabolic network, a signaling network, a protein complex network, a protein kinase network, a high-throughput binary interaction network, and a literature-validated protein interaction network.
Further, the protein function association network constructed in the step (1) specifically includes a gene co-expression network and/or a gene semantic association network based on disease-gene associations.
In another aspect, the invention also discloses a disease gene prediction system based on multiple protein network pulse dynamics process, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor to invoke the program instructions to perform the disease gene prediction method based on the multiple protein network pulse dynamics process as described in any one of the above.
In a further aspect, the invention also discloses a non-transitory computer readable storage medium storing computer instructions that cause the computer to perform the disease gene prediction method based on multiple protein network pulse dynamics process as described in any one of the above.
(III) Beneficial effects
The technical solution of the invention fuses multiple types of protein networks more effectively and mines the information hidden in the multiple protein network structure through the pulse dynamics process on the multiple protein network, so that disease-related genes are identified more effectively. Experimental results on real data sets show that, compared with several existing methods, the prediction method provided by the invention predicts more accurately, has a low computational cost, and is suitable for batch analysis and processing of bioinformatics data by software.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are illustrative and should not be construed as limiting the invention in any way, in which:
FIG. 1 is a flowchart of a disease gene prediction method NIDM of the present invention;
FIG. 2 is a graph showing the percentage improvement of the performance of the disease gene prediction method NIDM according to the invention in different data sets when a leave-one-out verification strategy is adopted;
FIG. 3 is a graph of percentage improvement of performance of a disease gene prediction method NIDM of the present invention in different data sets using a five-fold cross-validation strategy;
FIG. 4 is a graph comparing the performance indexes of the disease gene prediction method NIDM of the present invention with those of the existing RWRMP, RWRMG, DRS, Endeavour, RWR and KS methods when a leave-one-out verification strategy is adopted;
FIG. 5 is a graph comparing the performance indexes of the disease gene prediction method NIDM of the present invention with those of the existing RWRMP, RWRMG, DRS, Endeavour, RWR and KS methods when a five-fold cross-validation strategy is adopted.
Detailed Description
The technical problems and advantages of the technical solution of the present invention will be described in detail with reference to the accompanying drawings and examples, and it should be noted that the described examples are only intended to facilitate understanding of the present invention and are not intended to limit the present invention in any way.
As shown in FIG. 1, the invention provides a disease gene prediction method based on multiple protein network pulse dynamics process, which comprises the following steps:
Step 1: Construction of a normalized multiple protein network
After biological data preprocessing, the nodes corresponding to the same protein in several protein networks of different types are connected to construct a multiple protein network, so that multi-network fusion is realized; the edge weights of the multiple protein network are normalized by dividing them by the average degree of the network nodes.
The step 1 specifically comprises the following steps:
(1) Biological data preprocessing: acquiring known disease-gene association data, disease phenotype annotation data, and human phenotype ontology data; acquiring protein physical interaction networks (e.g., regulatory network, metabolic network, signaling network, protein complex network, protein kinase network, high-throughput binary interaction network, and literature-validated protein interaction network); constructing protein function association networks (such as a gene co-expression network and a gene semantic association network based on disease-gene associations); uniformly mapping protein identifiers to standard gene symbols;
(2) Multiple protein network construction: M network layers, each with N nodes, are interconnected through a multiple protein network model so as to integrate multiple types of protein association networks. Specifically, given M network layers that each contain N nodes, the nodes corresponding to the same protein in the M different types of protein networks are connected, and the connection weight between network layers is 1/M. To facilitate matrix operations, let $A^{(\alpha)} \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of network layer $\alpha$; the multiple protein network is then represented by the super-adjacency matrix
$$\mathcal{A} = \mathcal{A}^{\mathrm{intra}} + \mathcal{A}^{\mathrm{inter}} \in \mathbb{R}^{NM \times NM},$$
where $\mathcal{A}^{\mathrm{intra}}$ is the intra-layer super-adjacency matrix corresponding to the independent network layers, defined as
$$\mathcal{A}^{\mathrm{intra}} = \bigoplus_{\alpha=1}^{M} A^{(\alpha)} = \mathrm{diag}\left(A^{(1)}, A^{(2)}, \ldots, A^{(M)}\right),$$
and $\mathcal{A}^{\mathrm{inter}}$ is the inter-layer super-adjacency matrix, defined as
$$\mathcal{A}^{\mathrm{inter}} = A_L \otimes I,$$
wherein $A_L \in \mathbb{R}^{M \times M}$ is the inter-layer link matrix, whose nodes are the network layers and whose edge weights are the link strengths between network layers, i.e. 1/M; $\otimes$ denotes the Kronecker product; and $I \in \mathbb{R}^{N \times N}$ is the identity matrix (N×N, so that $A_L \otimes I$ has the same size NM×NM as $\mathcal{A}^{\mathrm{intra}}$);
(3) Normalization of the multiple protein network: the weights of all edges of the multiple protein network are divided by the average degree of the network nodes, $\langle k \rangle$, to standardize the multiple protein network. The normalized network is recorded in a fourth-order tensor C, wherein
$$C^{i\alpha}_{j\beta} = \frac{1}{\langle k \rangle}\left[A^{(\alpha)}_{ij}\,\delta(\alpha,\beta) + (A_L)_{\alpha\beta}\,I_{ij}\right],$$
$I \in \mathbb{R}^{N \times N}$ is the identity matrix, and $\delta(\alpha,\beta)$ is the Kronecker delta function: $\delta(\alpha,\beta) = 1$ when $\alpha = \beta$, otherwise 0.
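The construction and normalization in step 1 can be sketched in NumPy as follows. This is a minimal illustration, not the patented implementation: it assumes symmetric layer adjacency matrices that share one node ordering, an inter-layer link matrix with zero diagonal, and an average degree computed as the mean row sum of the super-adjacency matrix, none of which the text spells out. The function name `build_normalized_supra_adjacency` and the toy layers are hypothetical.

```python
import numpy as np

def build_normalized_supra_adjacency(layers):
    """Return the (N*M, N*M) super-adjacency matrix divided by the average degree.

    layers: list of M symmetric (N, N) adjacency matrices, same node order.
    """
    layers = [np.asarray(A, dtype=float) for A in layers]
    M = len(layers)
    N = layers[0].shape[0]

    # Intra-layer part: block-diagonal matrix holding each layer's adjacency.
    intra = np.zeros((N * M, N * M))
    for a, A in enumerate(layers):
        intra[a * N:(a + 1) * N, a * N:(a + 1) * N] = A

    # Inter-layer part: A_L (Kronecker) I, weight 1/M between distinct layers.
    # The zero diagonal of A_L is an assumption of this sketch.
    A_L = (np.ones((M, M)) - np.eye(M)) / M
    inter = np.kron(A_L, np.eye(N))

    supra = intra + inter
    # Assumed definition: average degree = mean weighted degree over all N*M replicas.
    mean_degree = supra.sum() / (N * M)
    return supra / mean_degree

# Toy example: two layers over the same three proteins.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
A2 = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
W = build_normalized_supra_adjacency([A1, A2])
```

Rows and columns of `W` are ordered layer by layer, so entry `W[0, 3]` is the link between the two replicas of the first protein. After normalization the mean weighted degree of `W` equals 1.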
Step 2: construction of multiple protein network pulse dynamics model
Applying periodic pulse signals to seed nodes of each network layer of the multiple protein network in the step 1, exciting the pulse dynamics process of the multiple protein network, calculating the pulse response curve of the multiple protein network nodes, and mining the hidden characteristics of the network nodes;
The step 2 specifically comprises: when the pulse dynamics process is excited on the multiple protein network, the pulse dynamics equation on the normalized multiple protein network is defined as
$$\frac{dx_i^{\alpha}(t)}{dt} = f\left(x_i^{\alpha}(t)\right) + \sum_{j=1}^{N} C^{i\alpha}_{j\alpha}\left(x_j^{\alpha}(t) - x_i^{\alpha}(t)\right) + \sum_{\beta=1}^{M} C^{i\alpha}_{i\beta}\left(x_i^{\beta}(t) - x_i^{\alpha}(t)\right) + b_i^{\alpha}\,u_t,$$
wherein $x_i^{\alpha}(t)$ is the state at time t of node i (i = 1 to N, N being the total number of nodes) in network layer $\alpha$ ($\alpha$ = 1 to M, M being the total number of network layers); $f(\cdot)$ is a continuously differentiable function describing the self-evolution of a node in the absence of influence from other nodes, defined as $f(x) = -\theta x$, where $\theta > 0$ is the self-evolution weight parameter; $C^{i\alpha}_{j\beta}$ is the diffusion coefficient between node i of network layer $\alpha$ and node j of network layer $\beta$, namely the connection weight between the nodes after network normalization, C being the fourth-order tensor defined above, which determines how pulse signals diffuse between the nodes of the network layers; $b_i^{\alpha} = 1$ if node i of network layer $\alpha$ is a control node to which the periodic pulse signal is applied, i.e. a known disease gene, and $b_i^{\alpha} = 0$ otherwise; $u_t = \sum_{\sigma} \delta(t - t_{\sigma})$ is the periodic activation function, where $t_{\sigma}$ are the pulse times and $\delta(t - t_{\sigma})$ is the Dirac delta function (a unit impulse at $t = t_{\sigma}$ that is 0 elsewhere). The four terms of the pulse dynamics equation describe, respectively, the self-evolution of the node, the influence of interactions within a layer, the influence of interactions between layers of the multiple protein network, and the influence of the periodic pulse signal.
Two new fourth-order tensors are defined from the fourth-order tensor C to represent the Laplace matrices of the intra-layer sub-networks and the inter-layer sub-network of the multiple network, respectively:
$$\left(L^{\mathrm{intra}}\right)^{i\alpha}_{j\beta} = \delta(\alpha,\beta)\left[\delta(i,j)\sum_{k=1}^{N} C^{i\alpha}_{k\alpha} - C^{i\alpha}_{j\alpha}\right], \qquad \left(L^{\mathrm{inter}}\right)^{i\alpha}_{j\beta} = \delta(i,j)\left[\delta(\alpha,\beta)\sum_{\gamma=1}^{M} C^{i\alpha}_{i\gamma} - C^{i\alpha}_{i\beta}\right],$$
wherein $\delta(\cdot,\cdot)$ is the Kronecker delta function, equal to 1 when its two arguments are equal and 0 otherwise. Expanding the two tensors gives the intra-layer and inter-layer super-Laplace matrices of the multiple network, $\mathcal{L}^{\mathrm{intra}}, \mathcal{L}^{\mathrm{inter}} \in \mathbb{R}^{NM \times NM}$.
With these super-Laplace matrices, the multiple network pulse dynamics equation is expressed in matrix form as
$$\frac{dX(t)}{dt} = -\left(\theta I + \mathcal{L}\right)X(t) + B\,u_t,$$
wherein $X(t) \in \mathbb{R}^{NM}$ is the state vector, $\mathcal{L} = \mathcal{L}^{\mathrm{intra}} + \mathcal{L}^{\mathrm{inter}}$ is the super-Laplace matrix of the multiple network, $B \in \{0,1\}^{NM}$ is the vector indicating the control nodes, and $u_t$ is the aforementioned periodic activation function. From the matrix equation, the characteristic time of the dynamics is $\tau = 1/\lambda_m$, where $\lambda_m$ is the minimum eigenvalue of the matrix $\theta I + \mathcal{L}$, I is the NM×NM identity matrix, and $\theta > 0$. The pulse period is set to at least 5 times the characteristic time $\tau$.
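The matrix form of step 2 can be illustrated by the response to a single pulse. The sketch below assumes a symmetric normalized super-adjacency matrix `W`, takes the characteristic time from the minimum eigenvalue of θI + L, and evaluates X(t) = exp(-(θI + L)t)·B by eigendecomposition over one window of 5τ. The function name `single_pulse_response`, the time grid, and the toy network are hypothetical, and simulating one pulse instead of a periodic train is a simplification that ignores the small residual left by earlier pulses.

```python
import numpy as np

def single_pulse_response(W, seeds, theta=1.0, n_times=200):
    """Response curves X(t) = exp(-(theta*I + L) t) B after one unit pulse.

    W: symmetric (n, n) normalized super-adjacency matrix.
    seeds: indices of the control (seed) replica nodes.
    Returns (times, X, tau) with X of shape (n_times, n).
    """
    n = W.shape[0]
    laplacian = np.diag(W.sum(axis=1)) - W      # super-Laplace matrix
    H = theta * np.eye(n) + laplacian           # symmetric because W is symmetric
    evals, evecs = np.linalg.eigh(H)
    tau = 1.0 / evals.min()                     # characteristic time

    B = np.zeros(n)
    B[list(seeds)] = 1.0                        # indicator of control nodes

    times = np.linspace(0.0, 5.0 * tau, n_times)  # one pulse period of 5*tau
    coeffs = evecs.T @ B                          # B expressed in the eigenbasis
    decay = np.exp(-np.outer(times, evals))       # shape (n_times, n)
    X = (decay * coeffs) @ evecs.T                # back to node coordinates
    return times, X, tau

# Toy example: a 3-node path graph with node 0 as the seed.
W = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
times, X, tau = single_pulse_response(W, seeds=[0], theta=1.0)
```

Because a graph Laplacian always has eigenvalue 0, the minimum eigenvalue of θI + L equals θ in this setting, so τ = 1/θ under the assumptions above.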
Step 3: extraction of multiple protein network pulse dynamics features
Obtaining the association strength between the network node and the seed node by calculating the dynamic characteristics (S) of the multiple protein network node to the pulse signals;
The step 3 specifically comprises: to extract the pulse dynamics features of the multiple protein network, the known disease-related genes act as pulse excitation points that excite the pulse dynamics process in the multiple protein network according to the multiple protein network pulse dynamics model, and the pulse response curves of the network nodes are calculated from the multiple protein network pulse dynamics equation. The dynamic feature (S) of a network node in response to the pulse signal during the pulse dynamics process is defined as
$$S_i^{\alpha} = \max_{t}\, x_i^{\alpha}(t),$$
i.e. the maximum value of the node's state in the pulse dynamics response. The magnitude of the dynamic feature of each network node is calculated according to this definition and describes the association strength between the node and the control nodes;
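The feature extraction of step 3 can be sketched by integrating the dynamics under a periodic pulse train and recording each node's maximum state. The code below is an illustrative assumption rather than the patented procedure: it uses forward-Euler time stepping, a pulse period of exactly 5τ, three pulses, and a hypothetical function name `impulse_features`; the step size `dt` must be small relative to the largest eigenvalue of θI + L for the integration to be stable.

```python
import numpy as np

def impulse_features(W, seeds, theta=1.0, n_pulses=3, dt=1e-3):
    """Return S, the maximum state of every node under periodic unit pulses."""
    n = W.shape[0]
    laplacian = np.diag(W.sum(axis=1)) - W
    H = theta * np.eye(n) + laplacian
    tau = 1.0 / np.linalg.eigvalsh(H).min()   # characteristic time
    period = 5.0 * tau                        # pulse period (assumed exactly 5*tau)
    steps_per_period = int(round(period / dt))

    B = np.zeros(n)
    B[list(seeds)] = 1.0

    x = np.zeros(n)   # network at rest before the first pulse
    S = np.zeros(n)   # running maximum of each node's state
    for _ in range(n_pulses):
        x = x + B                 # a Dirac pulse adds B to the state
        S = np.maximum(S, x)
        for _ in range(steps_per_period):
            x = x - dt * (H @ x)  # free decay and diffusion between pulses
            S = np.maximum(S, x)
    return S

# Toy example: a 3-node path graph with node 0 as the seed.
W = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
S = impulse_features(W, seeds=[0])
```

In this toy case the seed has the largest feature and the value falls with distance from the seed, which matches the reading of S as an association strength with the control nodes.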
Step 4: Fusion of pulse dynamics characteristics of multiple protein network nodes to predict disease genes
Based on the dynamic characteristics in the step 3, obtaining a comprehensive protein score by calculating the reciprocal of the geometric mean of the node ranking values of the corresponding proteins in each network layer of the multiple protein network; disease genes are screened by calculating a descending order of protein composite scores.
The step 4 specifically includes: in a multiple protein network comprising M network layers, each protein has M corresponding replica nodes, that is to say M pulse dynamics feature values $S_i^{\alpha}$ ($\alpha$ = 1 to M). In each network layer, the nodes are sorted in descending order of their dynamic feature $S_i^{\alpha}$, which gives each node a ranking value $R_i^{\alpha}$. The composite score of a protein is then the reciprocal of the geometric mean of the ranking values of its replica nodes in the M network layers:
$$\mathrm{Score}_i = \frac{1}{\left(\prod_{\alpha=1}^{M} R_i^{\alpha}\right)^{1/M}}.$$
Finally, the proteins are sorted in descending order of composite score; the higher a protein ranks, the more likely it corresponds to a candidate gene related to the disease. Disease genes are thereby identified or predicted, which provides effective guidance for biological experimental research on disease genes.
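The rank fusion of step 4 can be sketched as follows. The function name `fuse_layer_ranks` and the feature values are made up for illustration, and ties are broken by node index, which is an assumption since the text does not say how ties are handled.

```python
import numpy as np

def fuse_layer_ranks(S):
    """Composite score = 1 / geometric mean of per-layer descending ranks.

    S: (M, N) array, S[a, i] is the pulse dynamics feature of protein i in layer a.
    """
    S = np.asarray(S, dtype=float)
    M, N = S.shape
    # Rank 1 goes to the largest feature value within each layer.
    order = np.argsort(-S, axis=1, kind="stable")
    ranks = np.empty((M, N), dtype=float)
    ranks[np.arange(M)[:, None], order] = np.arange(1, N + 1)
    # Geometric mean across layers, computed in log space.
    geometric_mean = np.exp(np.log(ranks).mean(axis=0))
    return 1.0 / geometric_mean

# Toy example: 2 layers, 3 proteins.
S = np.array([[0.9, 0.5, 0.1],
              [0.8, 0.2, 0.4]])
scores = fuse_layer_ranks(S)
ranking = np.argsort(-scores, kind="stable")  # candidate proteins, best first
```

Here the first protein ranks first in both layers and gets score 1, while the other two have ranks (2, 3) and (3, 2) and share the score 1/√6.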
To demonstrate the advantages of the invention, in another embodiment the validity of the prediction method is further verified through experiments: known disease-gene association data are used as the test platform, and leave-one-out validation and five-fold cross-validation are used to comprehensively evaluate the performance of the method;
(1) Biological data tested: disease gene data are from the OMIM database (https://omim.org/); protein physical interaction data are from published literature data (https://science.sciencemag.org/content/suppl/2015/02/18/347.6224.1257601.DC1); gene expression data are from GTEx; disease phenotype data and phenotype ontology data are from the HPO database;
(2) Evaluation strategy: for leave-one-out validation, one known disease-gene association at a time is held out as the positive test set, and the others serve as the training set; for five-fold cross-validation, the known disease gene set of each disease is randomly split into 5 parts, each part is used in turn as the positive test set and the other parts as the training set, and the splitting process is repeated multiple times; to select the control set, for each gene of the positive test set, the 99 genes closest to it on the same chromosome that do not belong to the training set are selected as the control set;
(3) Evaluation index: AUROC and AUPRC are used as the evaluation indexes of prediction performance; AUROC, also known as AUC, is the area under the receiver operating characteristic (ROC) curve, a performance curve with the true positive rate (also known as recall or sensitivity) as the ordinate and the false positive rate as the abscissa, which has been widely used to measure the global performance of prediction algorithms; AUPRC is the area under the precision-recall curve (PRC), which has precision as the ordinate and recall as the abscissa;
(4) Evaluation results
As can be seen from fig. 2 and 3, when multiple types of physical interaction networks are used, the multiple protein network pulse dynamics method is superior to the aggregated-network method; when multiple types of physical interaction networks and the gene co-expression network are used, the multiple protein network pulse dynamics method is likewise superior to the aggregated-network method; relative to the multiple protein network pulse dynamics method using only multiple types of physical interaction networks, adding the gene co-expression network enhances predictive ability; and relative to the method using multiple types of physical interaction networks and the gene co-expression network, adding the gene semantic similarity network further improves predictive ability;
as can be seen from fig. 4, in the leave-one-out experiment, in the multiple protein networks of the multiple types of physical interactions ((a) and (e) in fig. 4), the multiple protein networks of the physical interaction network combined gene co-expression network ((b) and (f) in fig. 4), the multiple protein networks of the physical interaction network combined gene co-expression network and the gene semantic similarity network ((c) and (g) in fig. 4), both AUROC values and AUPRC values of the multiple protein network pulse dynamics method (method abbreviated as NIDM) are superior to other methods;
as can be seen from fig. 5, in the five-fold cross-validation experiment, in the multiple protein networks of the multiple types of physical interactions ((a) and (e) in fig. 5), the multiple protein networks of the physical interaction network combined gene co-expression network ((b) and (f) in fig. 5), the multiple protein networks of the physical interaction network combined gene co-expression network and the gene semantic similarity network ((c) and (g) in fig. 5), both the AUROC value and the AUPRC value of the multiple protein network pulse dynamics method NIDM are also superior to those of the other methods;
therefore, the prediction method NIDM of the invention can more effectively fuse multiple protein networks, and can more effectively extract the hidden information in the network through the pulse dynamics process of the multiple protein networks, thereby more effectively identifying the disease genes.
It should be further noted that the above prediction method of the present invention may be implemented as a software program or as computer instructions in a non-transitory computer readable storage medium, or in a control system with a memory and a processor, and that its computation is simple and fast. The functional units in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units. Integrated units implemented in the form of software functional units may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to perform part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A disease gene prediction method based on a multiple protein network pulse dynamics process, characterized by comprising the following steps:
step 1: after biological data preprocessing, nodes corresponding to the same protein in a plurality of protein networks of different types are connected, so that a multiple protein network is constructed and multi-network fusion is realized, and the edge weights of the multiple protein network are normalized by calculating the average degree of the network nodes;
step 2: periodic pulse signals are applied to the seed nodes of each network layer of the multiple protein network of step 1 to excite the pulse dynamics process of the multiple protein network, the pulse response curves of the nodes of the multiple protein network are calculated, and the hidden features of the network nodes are mined;
step 3: the association strength between each network node and the seed nodes is obtained by calculating the dynamic characteristics of the nodes of the multiple protein network in response to the pulse signals;
step 4: based on the dynamic characteristics of step 3, a comprehensive protein score is obtained by calculating the reciprocal of the geometric mean of the ranking values of the nodes corresponding to the same protein in each network layer of the multiple protein network; disease genes are screened by sorting the proteins in descending order of comprehensive score.
2. The disease gene prediction method based on multiple protein network pulse dynamics process according to claim 1, wherein the step 1 specifically comprises the following steps:
(1) Biological data preprocessing: acquiring known disease-gene association data, disease phenotype annotation data, and human phenotype ontology data; acquiring protein physical interaction networks; constructing protein function association networks; uniformly mapping protein identifiers to standard gene symbols;
(2) Multiple protein network construction: M network layers, each with N nodes, are interconnected through a multiple protein network model so as to integrate multiple types of protein association networks. The specific operation is as follows: given M network layers, each comprising N nodes, the nodes corresponding to the same protein in the M different types of protein networks are connected, and the connection weight between network layers is 1/M; to facilitate matrix operations, let $A^{(\alpha)} \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of network layer $\alpha$; the multiple protein network is represented by a super-adjacency matrix $\mathcal{A} = \mathcal{A}^{\mathrm{intra}} + \mathcal{A}^{\mathrm{inter}}$, where $\mathcal{A}^{\mathrm{intra}}$ is the intra-layer super-adjacency matrix corresponding to the independent network layers, defined as $\mathcal{A}^{\mathrm{intra}} = \bigoplus_{\alpha} A^{(\alpha)}$,
and $\mathcal{A}^{\mathrm{inter}}$ is the inter-layer super-adjacency matrix, defined as $\mathcal{A}^{\mathrm{inter}} = A^{L} \otimes I$,
wherein $A^{L} \in \mathbb{R}^{M \times M}$ denotes the inter-layer link matrix, whose nodes are the network layers and whose edge weights are the link strengths between network layers, set to 1/M; $\otimes$ denotes the Kronecker product; and $I \in \mathbb{R}^{N \times N}$ denotes the identity matrix;
(3) Normalization of the multiple protein network: the weights of all edges of the multiple protein network are divided by the average degree of the network nodes, thereby normalizing the multiple protein network. The calculation is as follows: the average degree $\langle k \rangle$ of the network nodes is computed, and the normalized network is recorded in a fourth-order tensor $C$ with entries $C_{ij}^{\alpha\beta}$, where $I \in \mathbb{R}^{N \times N}$ denotes the identity matrix and $\delta(\alpha, \beta)$ denotes the Kronecker delta function, with $\delta(\alpha, \beta) = 1$ when $\alpha = \beta$ and 0 otherwise.
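The construction and normalization in claim 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names `build_super_adjacency` and `normalize_by_average_degree` are hypothetical, the inter-layer link matrix is assumed to have a zero diagonal, and the average degree is assumed to be the mean weighted row sum of the super-adjacency matrix.

```python
import numpy as np

def build_super_adjacency(layers):
    """Combine M layer adjacency matrices (each N x N) into one NM x NM matrix."""
    M = len(layers)
    N = layers[0].shape[0]
    # Intra-layer part: block-diagonal arrangement of the layer adjacency matrices.
    intra = np.zeros((N * M, N * M))
    for a, A in enumerate(layers):
        intra[a * N:(a + 1) * N, a * N:(a + 1) * N] = A
    # Inter-layer part: copies of the same protein are linked across layers
    # with weight 1/M (assumption: a layer is not linked to itself).
    A_L = (np.ones((M, M)) - np.eye(M)) / M
    inter = np.kron(A_L, np.eye(N))
    return intra + inter

def normalize_by_average_degree(super_adj):
    """Divide every edge weight by the average weighted node degree (assumed definition)."""
    avg_degree = super_adj.sum(axis=1).mean()
    return super_adj / avg_degree

# Toy example: two layers over the same three proteins.
layers = [
    np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float),
    np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float),
]
super_adj = build_super_adjacency(layers)
C = normalize_by_average_degree(super_adj)
```

In this sketch the normalized network is kept as an NM x NM matrix rather than a fourth-order tensor; entry `C[a * N + i, b * N + j]` plays the role of the tensor entry for node i of layer a and node j of layer b.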
3. The disease gene prediction method based on multiple protein network pulse dynamics process according to claim 1, wherein the step 2 specifically comprises: to excite the pulse dynamics process on the multiple protein network, a pulse dynamics equation is defined on the normalized multiple protein network,
wherein the state variable denotes the state of node i in network layer $\alpha$ at time t, with $\alpha$ = 1 to M, M being the total number of network layers, and i = 1 to N, N being the total number of nodes; the self-evolution function is a continuously differentiable function describing the self-evolution of a node when it is not influenced by other nodes, and is defined with a self-evolution weight parameter $\theta > 0$; the diffusion coefficient between node i of network layer $\alpha$ and node j of network layer $\beta$ is the connection weight between the two nodes after network normalization, i.e. the corresponding entry of the fourth-order tensor C; a control indicator marks whether node i of network layer $\alpha$ is a control node to which the periodic pulse signal is applied, i.e. a known disease gene; $u_t = \sum_{\sigma} \delta(t - t_{\sigma})$ is a periodic activation function, where $t_{\sigma}$ is the pulse time constant and $\delta(t - t_{\sigma})$ is the Dirac delta function, i.e. $\delta(t - t_{\sigma}) = 1$ when $t - t_{\sigma} = 0$, and 0 otherwise;
two new fourth-order tensors are defined from the fourth-order tensor C to represent the Laplacian matrices of the intra-layer sub-network and the inter-layer sub-network of the multiple network, respectively,
using the Kronecker delta function $\delta(\alpha, \beta)$, where $\delta(\alpha, \beta) = 1$ when $\alpha = \beta$ and 0 otherwise; unfolding the two tensors yields the intra-layer and inter-layer super-Laplacian matrices of the multiple network;
with the intra-layer and inter-layer super-Laplacian matrices of the multiple network, the multiple network pulse dynamics equation is expressed in matrix form,
in terms of a state vector, the super-Laplacian matrix of the multiple network, a vector indicating the control nodes, and the aforementioned periodic activation function $u_t$; based on the matrix equation, the characteristic time of the kinetic equation is obtained as $\tau = 1/\lambda_m$, where $\lambda_m$ is obtained from the system matrix of this equation, I denotes an identity matrix, and $\theta > 0$; the pulse period is set to at least 5 times the characteristic time $\tau$.
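As a rough illustration of the characteristic-time step in claim 3, the sketch below builds intra-layer and inter-layer super-Laplacian matrices from a normalized NM x NM network matrix and derives a pulse period from them. Several points are assumptions rather than statements of the claimed formulas: the self-evolution term is taken to be linear decay with rate θ, each Laplacian is taken as degree matrix minus weight matrix, the system matrix is taken as their sum plus θ times the identity, and λ_m is taken to be its smallest eigenvalue (the slowest-decaying mode). The function names are hypothetical.

```python
import numpy as np

def super_laplacians(C, N):
    """Split a normalized NM x NM network into intra- and inter-layer Laplacians."""
    layer_of = np.arange(C.shape[0]) // N          # layer index of each row/column
    same_layer = layer_of[:, None] == layer_of[None, :]
    C_intra = np.where(same_layer, C, 0.0)         # edges inside a layer
    C_inter = np.where(same_layer, 0.0, C)         # edges between layers
    L_intra = np.diag(C_intra.sum(axis=1)) - C_intra
    L_inter = np.diag(C_inter.sum(axis=1)) - C_inter
    return L_intra, L_inter

def characteristic_time(C, N, theta):
    """tau = 1 / lambda_m, with lambda_m assumed to be the smallest eigenvalue."""
    L_intra, L_inter = super_laplacians(C, N)
    system = L_intra + L_inter + theta * np.eye(C.shape[0])
    lam = np.linalg.eigvalsh(system).min()         # system is symmetric here
    return 1.0 / lam

# Toy network: two layers of two nodes, inter-layer weight 1/2,
# all weights divided by the average degree 1.5.
C = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.0, 0.5],
              [0.5, 0.0, 0.0, 1.0],
              [0.0, 0.5, 1.0, 0.0]]) / 1.5
L_intra, L_inter = super_laplacians(C, N=2)
tau = characteristic_time(C, N=2, theta=0.5)
pulse_period = 5.0 * tau   # at least 5 times the characteristic time
```

Under these assumptions the Laplacian of a symmetric network has a zero eigenvalue, so the smallest eigenvalue of the system matrix equals θ and the characteristic time reduces to 1/θ.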
4. The method for predicting disease genes based on the pulse dynamics process of multiple protein networks according to claim 3, wherein the step 3 specifically comprises: to extract the pulse dynamics characteristics of the multiple protein network, the known disease-related genes are used as pulse excitation points to excite the pulse dynamics process in the multiple protein network according to the multiple protein network pulse dynamics model, and the impulse response curves of the network nodes are calculated according to the multiple protein network pulse dynamics equation; the dynamic characteristic S of a network node in response to the pulse signal during the multiple protein network pulse dynamics process is defined as $S_i^{\alpha} = \max_t x_i^{\alpha}(t)$, where $x_i^{\alpha}(t)$ is the state of node i in network layer $\alpha$ at time t, i.e. the maximum value of the node's impulse dynamics response; the magnitude of the dynamic characteristic of each network node is calculated according to this definition and describes the association strength between the node and the control nodes.
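The feature extraction in claim 4 can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the claimed implementation: it assumes linear dynamics (decay rate θ plus Laplacian diffusion over the normalized network), treats each impulse as an instantaneous unit increase of the seed nodes' states, uses a pulse period of 5 times the characteristic time with the smallest eigenvalue as λ_m, and samples the response on a uniform time grid. The function name `impulse_response_max` and its parameters are hypothetical.

```python
import numpy as np

def impulse_response_max(C, seeds, theta=0.5, n_pulses=5, steps_per_period=50):
    """Return the maximum sampled response of every node to periodic pulses on the seeds."""
    n = C.shape[0]
    # Assumed system matrix: Laplacian of the normalized network plus theta * I.
    K = np.diag(C.sum(axis=1)) - C + theta * np.eye(n)
    w, V = np.linalg.eigh(K)                 # K is symmetric
    period = 5.0 / w.min()                   # 5 times the characteristic time
    dt = period / steps_per_period
    # Exact propagator exp(-K * dt) from the eigen-decomposition.
    step = (V * np.exp(-w * dt)) @ V.T
    b = np.zeros(n)
    b[list(seeds)] = 1.0                     # control (seed) nodes
    x = np.zeros(n)
    S = np.zeros(n)
    for _ in range(n_pulses):
        x = x + b                            # impulse: seed states jump by 1
        S = np.maximum(S, x)
        for _ in range(steps_per_period):
            x = step @ x                     # free evolution between pulses
            S = np.maximum(S, x)
    return S

# Toy single-layer path network 0 - 1 - 2 with node 0 as the only seed.
C = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
S = impulse_response_max(C, seeds=[0])
```

In this toy case the node adjacent to the seed reaches a higher peak than the node two steps away, which matches the reading of S as an association strength with the control nodes.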
5. The method for predicting disease genes based on the pulse dynamics process of multiple protein networks according to claim 4, wherein the step 4 specifically comprises: in a multiple protein network comprising M network layers of N nodes, each protein has M corresponding replica nodes, i.e. M pulse dynamics feature magnitudes $S_i^{\alpha}$ ($\alpha$ = 1 to M); in each network layer, the nodes are sorted in descending order of their dynamic feature magnitude $S_i^{\alpha}$ to obtain the rank $R_i^{\alpha}$ of each node; then the reciprocal of the geometric mean of the ranking values of the nodes corresponding to the same protein in the M network layers of the multiple protein network is calculated to obtain the comprehensive score of the protein, calculated as $\mathrm{Score}_i = 1 / \left( \prod_{\alpha=1}^{M} R_i^{\alpha} \right)^{1/M}$; finally, the proteins are sorted in descending order of comprehensive score, and proteins ranked earlier are more likely to correspond to candidate disease-related genes, so that disease genes are identified or predicted and effective guidance is provided for biological experimental research on disease genes.
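The rank aggregation in claim 5 is straightforward to sketch. In the example below, the array layout (one row per network layer, one column per protein), the function name `comprehensive_scores`, and the tie-breaking rule (ties keep their original order) are assumptions made for illustration; the claim itself does not specify how ties are handled.

```python
import numpy as np

def comprehensive_scores(S):
    """S: M x N array of dynamic feature magnitudes (layers x proteins)."""
    M, N = S.shape
    ranks = np.empty((M, N), dtype=float)
    for a in range(M):
        # Rank 1 goes to the largest feature value in layer a.
        order = np.argsort(-S[a], kind="stable")
        ranks[a, order] = np.arange(1, N + 1)
    # Geometric mean of the M ranks of each protein, computed in log space.
    geo_mean = np.exp(np.log(ranks).mean(axis=0))
    return 1.0 / geo_mean

# Toy example: two layers, three proteins.
S = np.array([[0.9, 0.2, 0.5],
              [0.7, 0.1, 0.3]])
scores = comprehensive_scores(S)
ranking = np.argsort(-scores)   # protein indices, best candidate first
```

Here protein 0 ranks first in both layers and receives score 1, protein 2 ranks second in both layers and receives score 1/2, and protein 1 ranks third in both and receives score 1/3.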
6. The method of claim 2, wherein the protein physical interaction network acquired in step (1) comprises one or more of a regulatory network, a metabolic network, a signaling network, a protein complex network, a protein kinase network, a high-throughput binary interaction network, and a literature-validated protein interaction network.
7. The disease gene prediction method based on multiple protein network pulse dynamics process according to claim 2, wherein the protein function association network constructed in step (1) specifically comprises a gene co-expression network and/or a gene semantic association network based on disease-gene associations.
8. A disease gene prediction system based on a multiple protein network pulse dynamics process, comprising:
at least one processor; and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, and the processor, by invoking the program instructions, is capable of performing the disease gene prediction method based on multiple protein network pulse dynamics process of any one of claims 1 to 7.
9. A non-transitory computer readable storage medium storing computer instructions that cause a computer to perform the disease gene prediction method based on multiple protein network pulse dynamics process according to any one of claims 1 to 7.
CN202110141656.5A 2021-02-02 2021-02-02 Disease gene prediction method based on multiple protein network pulse dynamics process Active CN112820347B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110141656.5A CN112820347B (en) 2021-02-02 2021-02-02 Disease gene prediction method based on multiple protein network pulse dynamics process

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110141656.5A CN112820347B (en) 2021-02-02 2021-02-02 Disease gene prediction method based on multiple protein network pulse dynamics process

Publications (2)

Publication Number Publication Date
CN112820347A (en) 2021-05-18
CN112820347B (en) 2023-09-22

Family

ID=75860547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110141656.5A Active CN112820347B (en) 2021-02-02 2021-02-02 Disease gene prediction method based on multiple protein network pulse dynamics process

Country Status (1)

Country Link
CN (1) CN112820347B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8168568B1 (en) * 2003-03-10 2012-05-01 The United States Of America, As Represented By The Secretary Of The Department Of Health And Human Services Combinatorial therapy for protein signaling diseases
CN107887023A (en) * 2017-12-08 2018-04-06 Central South University A kind of microbial diseases Relationship Prediction method based on similitude and double random walks
CN108877953A (en) * 2018-06-06 2018-11-23 Central South University A kind of drug sensitivity prediction method based on more similitude networks

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170098030A1 (en) * 2014-05-11 2017-04-06 Ofek - Eshkolot Research And Development Ltd System and method for generating detection of hidden relatedness between proteins via a protein connectivity network
EP3574096A4 (en) * 2017-01-25 2020-11-04 Whitehead Institute for Biomedical Research Methods for building genomic networks and uses thereof
KR20180117529A (en) * 2017-04-19 2018-10-29 주식회사 프로티나 Method for predicting drug responsiveness by protein-protein interaction analysis
US11994512B2 (en) * 2018-01-04 2024-05-28 Massachusetts Institute Of Technology Single-cell genomic methods to generate ex vivo cell systems that recapitulate in vivo biology with improved fidelity
WO2020006409A1 (en) * 2018-06-28 2020-01-02 Trustees Of Boston University Systems and methods for control of gene expression

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Disease Gene Prediction by Integrating PPI Networks, Clinical RNA-Seq Data and OMIM Data; Ping Luo et al.; IEEE/ACM Trans Comput Biol Bioinform; 222-232 *
NIDM: network impulsive dynamics on multiplex biological network for disease-gene prediction; Ju Xiang et al.; Briefings in Bioinformatics; Vol. 22 (No. 5); 1-18 *
Predicting disease-related genes by path structure and community structure in protein–protein networks; Ke Hu et al.; Manuscript; 1-15 *
Research on protein complex identification algorithms based on dynamic protein interaction networks; 苏令涛; China Master's Theses Full-text Database, Basic Sciences; A006-49 *
Prediction of tuberculosis-related genes based on multi-information fusion and its network analysis; 孙隽; China Doctoral Dissertations Full-text Database, Basic Sciences (No. 01); A006-223 *

Also Published As

Publication number Publication date
CN112820347A (en) 2021-05-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant