CN114582418A - Biomarker identification system based on network maximum information flow model - Google Patents

Publication number
CN114582418A
Authority
CN
China
Prior art keywords
network
information flow
gene
node
target gene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210227861.8A
Other languages
Chinese (zh)
Inventor
刘治平
高子玉
杨佳新
高瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation

Landscapes

  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Genetics & Genomics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention belongs to the technical field of biological information processing and provides a biomarker identification system based on a network maximum information flow model. The system performs the following: acquiring gene expression data related to a preset disease and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node across the different resistor networks and keeping the node only in the network where its information flow score is largest, finally obtaining the qualified modules; and fitting a logistic model to the obtained qualified modules to obtain the probability of disease given that a certain module is present.

Description

Biomarker identification system based on network maximum information flow model
Technical Field
The invention belongs to the technical field of biological information processing, and particularly relates to a biomarker identification system based on a network maximum information flow model.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In the diagnosis of complex diseases, accurate diagnosis often requires molecular markers with high accuracy and good specificity. However, obtaining biomacromolecular markers such as genes, RNAs, and proteins by experimental methods alone consumes a great deal of effort, money, and materials. The number of candidate markers obtained from experiments is usually small, high-quality markers are difficult to obtain, and follow-up experiments face considerable obstacles. Identifying biomarkers by analyzing large-scale biological data therefore offers a new method and strategy for biomarker discovery.
Congenital heart disease is one of the five leading causes of death in infants. Multiple studies have shown that both genetic and environmental factors can lead to congenital heart disease, so the identification of candidate genes and biomarkers has been one of the central topics in congenital heart disease research. Congenital heart disease has a very wide spectrum: hundreds of types can be distinguished, and some patients have multiple malformations. The mildest cases may remain asymptomatic for life, whereas severe cases may present serious symptoms such as hypoxia and shock soon after birth. Although congenital heart disease is known to be caused by abnormal development of the heart during embryonic development, its molecular mechanism is not clear. Currently, about 30 different genes are known to cause congenital heart disease. Understanding the molecular functions, intermolecular interactions, and expression pathways of these genes helps to explain the pathogenesis of congenital heart disease and thereby contributes to improved clinical diagnosis and treatment.
Molecular network-based methods are powerful tools for systematic analysis of complex diseases, identification of major pathways, response modules and candidate genes. Previous work in the area of high-throughput omics data on the heart has focused mainly on cardiac development and cardiovascular disease. Due to the lack of whole genome expression data, there is currently no such study for congenital heart disease.
Disclosure of Invention
In order to solve the technical problems in the background art described above, the present invention provides a biomarker identification system based on a network maximum information flow model, which clarifies the role of local network structure in the progression of a complex disease by using high-throughput gene expression data, thereby finding biomarkers useful for disease diagnosis in a molecular network.
To achieve this purpose, the invention adopts the following technical solution:
A first aspect of the invention provides a biomarker identification system based on a network maximum information flow model.
A biomarker identification system based on a network maximum information flow model, comprising:
a data acquisition module configured to: acquire gene expression data related to a preset disease and construct a protein interaction network;
a gene processing module configured to: process the gene expression data to obtain target genes and differentially expressed genes;
a network generation module configured to: construct a specific protein interaction network based on the protein interaction network, select the shortest paths from each target gene to the differentially expressed genes, and construct a sub-network for each target gene;
a network modeling module configured to: model each target-gene sub-network as a resistor network and calculate the information flow score of each node in each resistor network using Kirchhoff's current law;
a comparison screening module configured to: compare the information flow scores of the same node across the different resistor networks and keep the node only in the network where its information flow score is largest, thereby obtaining candidate modules, and then eliminate the modules whose number of nodes is less than a set value to obtain the qualified modules;
a fitting prediction module configured to: fit a logistic model to the obtained qualified modules to obtain the probability of disease given that a certain module is present.
A second aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node across the different resistor networks and keeping the node only in the network where its information flow score is largest, thereby obtaining candidate modules, and then eliminating the modules whose number of nodes is less than a set value to obtain the qualified modules; and fitting a logistic model to the obtained qualified modules to obtain the probability of disease given that a certain module is present.
A third aspect of the invention provides a computer apparatus.
A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node across the different resistor networks and keeping the node only in the network where its information flow score is largest, thereby obtaining candidate modules, and then eliminating the modules whose number of nodes is less than a set value to obtain the qualified modules; and fitting a logistic model to the obtained qualified modules to obtain the probability of disease given that a certain module is present.
Compared with the prior art, the invention has the beneficial effects that:
The invention incorporates the topology of the protein interaction network and combines it with information flow scores to mine multiple dysregulated network paths and gene modules related to the disease. Analysis of the modules reveals abnormalities of the complex disease (congenital heart disease) at the molecular level. The modules distinguish disease samples from normal samples accurately, can serve as molecular markers for the accurate diagnosis of the complex disease, and provide a new approach to revealing its pathogenic mechanism.
The maximum information flow model provided by the invention weights the proteins on all possible paths and therefore measures the role of a given node in the network more accurately than degree or betweenness.
The model provided by the invention requires a larger amount of training and is therefore better suited to fields with large amounts of data, such as congenital heart disease prediction; the trained integrated model performs better than the original algorithm.
The prediction method provided by the invention has the advantages that the final training set does not participate in the integrated model training, the training result is basically the same as the test result, and the overfitting problem does not exist.
The invention provides an information processing method aiming at omics data, which realizes the discovery of possible complex disease biomarkers from the omics data, thereby realizing the accurate prediction and classification of diseases.
The invention is described in detail using the processing and analysis of omics data related to congenital heart disease in order to illustrate the technique. The method and system are general and can readily be extended to the study of other complex diseases.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention. They illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a model architecture diagram of a network maximum information flow model based biomarker recognition system in accordance with an embodiment of the present invention;
FIG. 2 is a diagram of a shortest path model structure between a selected pathogenic gene Si and a differentially expressed gene Tj in an embodiment of the present invention;
FIG. 3 is a diagram of a model structure for modeling a protein interaction network as a resistor network in an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
It should be noted that the flowchart and block diagrams in the figures represent the architecture, functionality, and operation of possible implementations of systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example one
As shown in FIG. 1, the present embodiment provides a biomarker identification system based on a network maximum information flow model, including:
a data acquisition module configured to: acquire gene expression data related to a preset disease and construct a protein interaction network;
a gene processing module configured to: process the gene expression data to obtain target genes and differentially expressed genes;
a network generation module configured to: construct a specific protein interaction network based on the protein interaction network, select the shortest paths from each target gene to the differentially expressed genes, and construct a sub-network for each target gene;
a network modeling module configured to: model each target-gene sub-network as a resistor network and calculate the information flow score of each node in each resistor network using Kirchhoff's current law;
a comparison screening module configured to: compare the information flow scores of the same node across the different resistor networks and keep the node only in the network where its information flow score is largest, thereby obtaining candidate modules, and then eliminate the modules whose number of nodes is less than a set value to obtain the qualified modules;
a fitting prediction module configured to: fit a logistic model to the obtained qualified modules to obtain the probability of disease given that a certain module is present.
This embodiment discloses a biomarker identification system based on a network maximum information flow model. The system is based on transcriptome data and proteome interaction data and is applicable to diseases in general; for clarity, it is described here using congenital heart disease as an example. As shown in FIG. 1, it comprises:
S1: acquiring gene expression data related to tetralogy of Fallot, a congenital heart disease, together with gene expression data of a reconstructive-surgery control group; constructing a protein interaction network; and weighting each interaction.
S2: applying Z-score normalization to the gene expression data and performing a t-test to obtain the target genes and the differentially expressed genes.
S3: obtaining the shortest paths between each target gene Si and the differentially expressed genes Tj, and constructing the sub-networks.
S4: modeling each constructed protein sub-network as a resistor network and calculating the information flow score of each node in each resistor network using Kirchhoff's current law.
S5: comparing the information flow scores of the same node across the different resistor networks and keeping the node only in the network where its information flow score is largest, thereby obtaining candidate modules, and then eliminating the modules with few nodes to obtain the qualified modules.
S6: classifying with a logistic regression model and plotting an ROC curve to evaluate the classification performance.
In step S1, the GEO (Gene Expression Omnibus) series GSE26125 and GSE14970 are selected as raw data. Both are preprocessed as described in their original publications, and the genes common to the two gene expression data sets are retained for subsequent processing. The genes common to the PPI (protein-protein interaction) network and the Affymetrix Human Genome U133 Plus 2.0 Array microarray are used, and the largest connected component of the network is extracted. The weight of the interaction between two protein nodes is defined as their correlation value, given by the following formula:
Figure BDA0003537011050000081
The distance between two nodes is defined as distance = 1 - |corr|.
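The weighting and distance definitions above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the correlation is the Pearson coefficient (the text only says "correlation value"), and the helper names `pearson` and `weight_edges` and the toy expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def weight_edges(edges, expression):
    """Attach corr and distance = 1 - |corr| to every PPI edge."""
    weighted = {}
    for u, v in edges:
        corr = pearson(expression[u], expression[v])
        weighted[(u, v)] = {"corr": corr, "distance": 1.0 - abs(corr)}
    return weighted


# Toy expression profiles (hypothetical values, four samples per gene).
expression = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with A
    "C": [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with A
    "D": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with A
}
edges = [("A", "B"), ("A", "C"), ("A", "D")]
weighted = weight_edges(edges, expression)
```

Because the distance uses the absolute value, strongly co-expressed and strongly anti-expressed pairs both end up close together, while uncorrelated pairs are at the maximum distance of 1.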
In step S2, the gene expression data are Z-score normalized so that data of different magnitudes are brought to a common scale, and the transformed Z-score values are used to measure gene expression. A t-test is then performed on the gene expression data: for each gene, the expression values in the disease samples and in the healthy samples are extracted to form vectors, a t statistic is calculated for each gene to compare its expression between the two groups of samples, and a significance p-value is calculated from the t distribution to measure the significance of the difference more rigorously. The calculation formula is as follows:
Figure BDA0003537011050000082
The numerator on the right-hand side of the equation is the difference between the average expression values of gene i in the two classes of samples, and the denominator is the standard error of gene i calculated over all samples. A gene is defined as a differentially expressed gene in CHD (congenital heart disease) if its normalized expression has a p-value < 0.01 in CHD cases compared with all controls; a gene is defined as a target gene if the p-value is < 0.01 in more than 75% of the cases.
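As a rough illustration of the Z-score and t statistic described above, the sketch below normalizes one gene across all samples and compares the two groups. It makes two assumptions that the text does not fix: the Z-score is taken per gene across samples, and the standard error is the unpooled (Welch) form sqrt(s1^2/n1 + s2^2/n2). The function names and expression values are hypothetical. The p-value step is omitted because the Python standard library has no t distribution; a statistics package would be needed for the p < 0.01 thresholds.

```python
import math
from statistics import mean, stdev


def zscore(values):
    """Scale one gene's expression values to zero mean and unit sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def t_statistic(case, control):
    """Difference of group means divided by the (assumed Welch) standard error."""
    se = math.sqrt(stdev(case) ** 2 / len(case)
                   + stdev(control) ** 2 / len(control))
    return (mean(case) - mean(control)) / se


# Hypothetical expression of one gene: three disease and three control samples.
raw_case = [5.0, 6.0, 7.0]
raw_control = [1.0, 2.0, 3.0]

z = zscore(raw_case + raw_control)
z_case, z_control = z[:3], z[3:]
t_value = t_statistic(z_case, z_control)
```

The t statistic is unchanged by the per-gene Z-score, since shifting and rescaling affect the numerator and the denominator equally; the normalization matters for comparing magnitudes across genes, not for the test on a single gene.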
In step S3, the igraph package for the R language is used to construct the specific protein interaction network, with the nodes numbered in input order. The shortest paths from each of the 14 CHD pathogenic genes Si to the 85 differentially expressed genes Tj are selected, and all paths belonging to each CHD pathogenic gene are then merged into one sub-network (the model is shown in FIG. 2), giving 14 sub-networks in total.
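The embodiment uses igraph in R for this step; the following dependency-free Python sketch only illustrates the idea of merging the shortest paths from one source gene into a sub-network. It assumes edge lengths are the distances 1 - |corr|, uses Dijkstra's algorithm, and keeps a single shortest path per target when ties exist. The graph, node names, and function names are hypothetical.

```python
import heapq


def dijkstra(adj, source):
    """Return shortest distances and predecessor links from source."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def subnetwork(adj, source, targets):
    """Union of one shortest path from source to each reachable target."""
    _, prev = dijkstra(adj, source)
    nodes, edges = {source}, set()
    for target in targets:
        if target != source and target not in prev:
            continue  # target is unreachable from source
        node = target
        while node != source:
            parent = prev[node]
            edges.add(frozenset((parent, node)))
            nodes.add(node)
            node = parent
    return nodes, edges


# Toy undirected network; values are edge distances (hypothetical).
adj = {
    "S": {"a": 0.2, "b": 0.9},
    "a": {"S": 0.2, "T1": 0.3, "b": 0.1},
    "b": {"S": 0.9, "a": 0.1, "T2": 0.4},
    "T1": {"a": 0.3},
    "T2": {"b": 0.4},
}
sub_nodes, sub_edges = subnetwork(adj, "S", ["T1", "T2"])
```

In the toy graph the direct edge S-b is left out of the sub-network because the route S-a-b is shorter, which is the pruning effect the shortest-path selection is meant to have.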
In step S4, this embodiment models the interaction network as a resistor network in which proteins are represented as interconnected nodes and interactions are represented as resistors (the model is shown in FIG. 3). Kirchhoff's current law states that, at any node in a circuit and at any time, the sum of the currents flowing into the node equals the sum of the currents flowing out of it; the equations of the information flow model are defined on this basis. In the model of this embodiment, if a node lies on the current paths of many node pairs, the sum of the absolute values of the currents obtained at that node is high, and the information flow score of the corresponding protein is correspondingly high. On this basis, candidate proteins that play an important role in the transfer of biological information within the protein sub-network can be screened.
Node i is connected to a unit current source, node j is grounded, and Kirchhoff's law is used to calculate how much current flows through node k. The information flow score of node k is defined as the sum of the currents through node k over all pairwise combinations of target gene nodes and differentially expressed gene nodes. Swapping the source node and the ground node does not change the magnitude of the current at any node, so the information flow score is calculated only for i > j. The number of pairwise combinations of nodes (i, j) is (N-2)(N-3)/2. The current flowing through node k (provided i ≠ k, j ≠ k, i > j) is:
Figure BDA0003537011050000091
where I_km is the current between node k and node m, and Σ_m denotes the sum over all resistors connected to node k. For a given combination of one source node and one ground node, a convenient way to calculate the currents in the resistor circuit is to use nodal analysis and solve an (N-1)-dimensional system of linear equations for the node voltages. Every node m that is not the ground node satisfies the following equation:
Figure BDA0003537011050000092
where V_l is the voltage at node l, and Σ_l denotes the sum over all nodes l directly connected to node m. When node m is the source node, the right-hand side of the equation is the unit current. The node voltages can be calculated by solving the following system of linear equations:
Gv=J
where G is a symmetric (N-1) × (N-1) conductance matrix, v is the node voltage vector, and J is the node current vector.
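The nodal analysis above can be sketched in plain Python as follows. Several points are assumptions rather than statements of the patent: each edge carries a conductance (the reciprocal of its resistance), the current through node k for one source/ground pair is taken as half the sum of the absolute branch currents at k (by Kirchhoff's current law the inflow equals the outflow at an interior node), every unordered pair of nodes other than k is used as a source/ground pair, and the network is connected so that G is non-singular. All function names are hypothetical.

```python
from itertools import combinations


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for col in range(c, n + 1):
                M[r][col] -= f * M[c][col]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][col] * x[col] for col in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def node_voltages(cond, source, ground):
    """Solve G v = J for a unit current injected at source, ground at 0 V."""
    free = [n for n in cond if n != ground]
    index = {n: i for i, n in enumerate(free)}
    G = [[0.0] * len(free) for _ in free]
    for m in free:
        for nb, g in cond[m].items():
            G[index[m]][index[m]] += g       # diagonal: total conductance at m
            if nb != ground:
                G[index[m]][index[nb]] -= g  # off-diagonal: -g between m and nb
    J = [1.0 if n == source else 0.0 for n in free]
    v = solve(G, J)
    volts = {n: v[index[n]] for n in free}
    volts[ground] = 0.0
    return volts


def information_flow(cond):
    """Sum, over all source/ground pairs, of the current passing through each node."""
    nodes = sorted(cond)
    score = {k: 0.0 for k in nodes}
    for j, i in combinations(nodes, 2):  # each unordered pair once (i > j)
        volts = node_voltages(cond, source=i, ground=j)
        for k in nodes:
            if k in (i, j):
                continue
            branch = sum(abs(g * (volts[k] - volts[m]))
                         for m, g in cond[k].items())
            score[k] += 0.5 * branch
    return score


# Toy chain A - B - C with unit conductances (hypothetical).
cond = {"A": {"B": 1.0}, "B": {"A": 1.0, "C": 1.0}, "C": {"B": 1.0}}
flow = information_flow(cond)
```

In the chain, only B lies between another pair of nodes, so it carries the full unit current for the pair (A, C) and scores 1, while the end nodes score 0. For networks of realistic size, a sparse linear solver would replace the hand-written elimination.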
In step S5, for nodes that appear in several resistor networks, this embodiment compares their information flow scores and assigns each such node to the network in which its score is highest. After this screening, 14 candidate modules containing 423 nodes in total are obtained. Modules with few nodes are eliminated, and the 12 remaining modules with more nodes are selected for subsequent study.
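The screening in step S5 can be sketched as a simple assignment followed by a size filter. The function name `assign_modules`, the tie-breaking rule (the first network seen wins), the threshold parameter `min_size`, and the scores are all hypothetical choices made for illustration.

```python
def assign_modules(scores_by_network, min_size):
    """Keep each node only in the network where its score is highest,
    then drop modules with fewer than min_size nodes."""
    best = {}  # node -> (network id, score)
    for net, scores in scores_by_network.items():
        for node, score in scores.items():
            if node not in best or score > best[node][1]:
                best[node] = (net, score)

    modules = {}
    for node, (net, _) in best.items():
        modules.setdefault(net, set()).add(node)

    return {net: members for net, members in modules.items()
            if len(members) >= min_size}


# Hypothetical information flow scores in three sub-networks.
scores_by_network = {
    "S1": {"g1": 0.9, "g2": 0.4, "g3": 0.7},
    "S2": {"g2": 0.8, "g4": 0.5},
    "S3": {"g3": 0.1},
}
qualified = assign_modules(scores_by_network, min_size=2)
```

Here g2 moves to S2 and g3 stays in S1, so S3 ends up empty and is dropped, which mirrors how the embodiment goes from 14 candidate modules to 12 qualified ones.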
In step S6, a logistic model is fitted to the obtained qualified modules to obtain the probability of disease given that a certain gene is present, and this probability is compared with the true labels to evaluate the ability of each module obtained in this embodiment to distinguish phenotypes.
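A minimal sketch of the fit-and-evaluate idea in step S6 follows. It assumes each module is summarized as a single feature per sample (for example, a mean Z-score of the module's genes), which the text does not specify, and it fits the logistic model by plain gradient descent. The area under the ROC curve is computed from pairwise comparisons of the predicted probabilities. The data, learning rate, and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, lr=0.1, epochs=2000):
    """Fit p(disease) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(w * xi + b) - yi
            grad_w += err * xi
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """Area under the ROC curve: the fraction of (case, control) pairs
    in which the case receives the higher score (ties count one half)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical module activity per sample and disease labels (1 = case).
activity = [-1.2, -0.8, -0.3, 0.1, 0.4, 0.9, 1.3, -0.1]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(activity, labels)
probabilities = [sigmoid(w * a + b) for a in activity]
module_auc = auc(probabilities, labels)
```

In practice the evaluation would be done on samples held out from fitting, and the full ROC curve would be plotted rather than reduced to a single number.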
This embodiment comprises three model layers: the shortest paths between the target genes and the differentially expressed genes are selected to form sub-networks; each protein interaction sub-network is modeled as a circuit network to calculate information flow scores; and a logistic regression model is fitted to the resulting modules.
Example two
The present embodiment provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node across the different resistor networks and keeping the node only in the network where its information flow score is largest, thereby obtaining candidate modules, and then eliminating the modules whose number of nodes is less than a set value to obtain the qualified modules; and fitting a logistic model to the obtained qualified modules to obtain the probability of disease given that a certain module is present.
Example three
The present embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node across the different resistor networks and keeping the node only in the network where its information flow score is largest, thereby obtaining candidate modules, and then eliminating the modules whose number of nodes is less than a set value to obtain the qualified modules; and fitting a logistic model to the obtained qualified modules to obtain the probability of disease given that a certain module is present.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A biomarker identification system based on a network maximum information flow model, comprising:
a data acquisition module configured to: acquire gene expression data related to a preset disease, and construct a protein interaction network;
a gene processing module configured to: process the gene expression data to obtain target genes and differentially expressed genes;
a network generation module configured to: construct a specific protein interaction network based on the protein interaction network, select the shortest paths from each target gene to the differentially expressed genes, and construct a sub-network related to that target gene;
a network modeling module configured to: model each target-gene sub-network as a resistive network, and calculate the information flow score of each node in each resistive network using Kirchhoff's current law;
a comparison screening module configured to: compare the information flow scores of the same node in different resistive networks, retain each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminate modules whose number of nodes is less than a set value to obtain qualified modules;
a fitting prediction module configured to: perform logistic model fitting on each qualified module to obtain the likelihood of disease when that module is included.
2. The network maximum information flow model-based biomarker identification system according to claim 1, wherein the gene expression data related to a preset disease comprises: gene expression data associated with the preset disease and gene expression data of a reconstructed surgical control group;
and constructing the protein interaction network includes assigning a weight to each interaction.
3. The network maximum information flow model-based biomarker identification system according to claim 1, wherein the process of obtaining the target gene and the differentially expressed gene comprises: applying Z-score normalization to the gene expression data and performing a t-test to obtain a t-distribution result; calculating a significance p-value from the t-distribution result; if the p-value of a normalized gene in a CHD case compared with all controls is < 0.01, defining the gene as a differentially expressed gene in that CHD case; and defining a gene as a target gene if its p-value is < 0.01 in more than 75% of the cases.
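The thresholding rule of claim 3 can be illustrated with a minimal Python sketch. It assumes the per-case p-values (each disease case compared with all controls) have already been computed by the Z-score and t-test steps; the function name `select_genes`, the input layout, and the gene labels are hypothetical and not taken from the patent.

```python
from typing import Dict, List, Set, Tuple

P_CUTOFF = 0.01       # significance threshold stated in claim 3
CASE_FRACTION = 0.75  # "more than 75% of the cases"


def select_genes(
    p_values: Dict[str, List[float]],
) -> Tuple[Dict[int, Set[str]], Set[str]]:
    """p_values maps a gene to one p-value per disease case (hypothetical layout).

    Returns (differentially expressed genes per case index, target genes).
    """
    degs_per_case: Dict[int, Set[str]] = {}
    targets: Set[str] = set()
    for gene, ps in p_values.items():
        # Case indices in which this gene passes the significance cutoff.
        hits = [i for i, p in enumerate(ps) if p < P_CUTOFF]
        for i in hits:
            degs_per_case.setdefault(i, set()).add(gene)
        # Strictly more than 75% of the cases must be significant.
        if ps and len(hits) / len(ps) > CASE_FRACTION:
            targets.add(gene)
    return degs_per_case, targets


# Illustrative p-values only; not data from the patent.
example = {
    "GENE_A": [0.001, 0.002, 0.005, 0.003],  # significant in 4 of 4 cases
    "GENE_B": [0.001, 0.200, 0.004, 0.300],  # significant in 2 of 4 cases
    "GENE_C": [0.500, 0.600, 0.700, 0.800],  # never significant
}
degs, targets = select_genes(example)
```

In this toy input only GENE_A qualifies as a target gene, while GENE_B is still differentially expressed in the individual cases where its p-value is below the cutoff.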
4. The network maximum information flow model-based biomarker identification system according to claim 1, wherein the selecting the shortest path from the target gene to the differentially expressed gene specifically comprises: numbering each node in the specific protein interaction network in input order, selecting the shortest path from the selected target gene to each differentially expressed gene, and combining all paths of the target gene into a sub-network.
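One possible way to realize the path selection of claim 4 is sketched below in Python. The claim does not name a shortest-path algorithm; this sketch uses Dijkstra's algorithm and assumes that each interaction weight has already been converted into a non-negative edge cost. The names `shortest_path` and `target_subnetwork` and the toy network are hypothetical.

```python
import heapq
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Dict[str, float]]  # node -> {neighbour: edge cost}


def shortest_path(graph: Graph, source: str, goal: str) -> List[str]:
    """Dijkstra's algorithm; returns [] if the goal cannot be reached."""
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return []
    path = [goal]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def target_subnetwork(
    graph: Graph, target: str, degs: Iterable[str]
) -> Set[Tuple[str, ...]]:
    """Union of the edges on the shortest path from the target to each DEG."""
    edges: Set[Tuple[str, ...]] = set()
    for deg in degs:
        path = shortest_path(graph, target, deg)
        edges.update(tuple(sorted(e)) for e in zip(path, path[1:]))
    return edges


# Toy undirected network: T is the target gene, D1 and D2 are DEGs.
ppi: Graph = {
    "T": {"A": 1.0, "B": 1.0, "D1": 5.0},
    "A": {"T": 1.0, "D1": 1.0},
    "B": {"T": 1.0, "D2": 1.0},
    "D1": {"A": 1.0, "T": 5.0},
    "D2": {"B": 1.0},
}
sub = target_subnetwork(ppi, "T", ["D1", "D2"])
```

The costly direct edge between T and D1 is left out of the resulting sub-network because the route through A is shorter.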
5. The network maximum information flow model-based biomarker identification system according to claim 1, wherein the modeling of the protein interaction network as a resistive network specifically comprises: the protein interaction network is modeled as a resistive network, in which proteins are represented as interconnected nodes and interactions are represented as resistors.
6. The network maximum information flow model-based biomarker identification system according to claim 5, wherein the information flow score of a node k is defined as the sum of the currents passing through node k over all pairwise combinations of source nodes and differentially expressed gene nodes, wherein a source node is a node in the sub-network of the target gene.
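The score defined in claims 5 and 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: each interaction is given a conductance, a unit current is injected at a source node and drawn out at a differentially expressed gene node (treated as ground), node voltages follow from Kirchhoff's current law, and the current through a node is taken as half the sum of absolute currents on its edges plus half of any externally injected current. The network is assumed to be connected, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (node, node, conductance)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - acc) / m[r][r]
    return x


def node_currents(
    nodes: List[str], edges: List[Edge], source: str, sink: str
) -> Dict[str, float]:
    """Current through every node for one unit of current from source to sink."""
    free = [n for n in nodes if n != sink]  # the sink is grounded at 0 V
    idx = {n: i for i, n in enumerate(free)}
    lap = [[0.0] * len(free) for _ in free]  # reduced Laplacian (KCL equations)
    for u, v, g in edges:
        for p, q in ((u, v), (v, u)):
            if p in idx:
                lap[idx[p]][idx[p]] += g
                if q in idx:
                    lap[idx[p]][idx[q]] -= g
    rhs = [1.0 if n == source else 0.0 for n in free]
    volt = dict(zip(free, solve(lap, rhs)))
    volt[sink] = 0.0
    through = {n: 0.0 for n in nodes}
    for u, v, g in edges:
        i = abs(g * (volt[u] - volt[v]))  # Ohm's law on each edge
        through[u] += i / 2
        through[v] += i / 2
    through[source] += 0.5  # injected current counts toward the source
    through[sink] += 0.5    # extracted current counts toward the sink
    return through


def information_flow(
    nodes: List[str], edges: List[Edge], sources: List[str], sinks: List[str]
) -> Dict[str, float]:
    """Sum node currents over all source / DEG pairs."""
    score = {n: 0.0 for n in nodes}
    for s in sources:
        for t in sinks:
            if s == t:
                continue
            for n, i in node_currents(nodes, edges, s, t).items():
                score[n] += i
    return score


# Diamond-shaped toy network: two parallel routes from S to T.
nodes = ["S", "A", "B", "T"]
edges = [("S", "A", 1.0), ("A", "T", 1.0), ("S", "B", 1.0), ("B", "T", 1.0)]
flow = information_flow(nodes, edges, ["S"], ["T"])
```

In this symmetric example the unit current splits evenly, so nodes A and B each carry half of it while S and T carry all of it.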
7. The network maximum information flow model-based biomarker identification system according to claim 1, wherein the comparing of information flow scores of the same node in different resistive networks, and only retaining the node with the largest information flow score in the original network, thereby obtaining the candidate module, specifically comprises: for nodes that exist in more than one sub-network, comparing the node's information flow scores in the different resistive networks and placing the node in the network in which its information flow score is highest, to obtain a candidate network.
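The node assignment of claim 7, followed by the size filter recited in claim 1, might look like the Python sketch below. The function name `build_modules`, the input layout, and the tie-breaking rule (the first network seen wins) are assumptions made for illustration; the patent only says that modules with fewer nodes than a set value are eliminated.

```python
from typing import Dict, Set, Tuple


def build_modules(
    scores: Dict[str, Dict[str, float]], min_size: int
) -> Dict[str, Set[str]]:
    """scores maps a sub-network id to {node: information flow score}.

    Each node is kept only in the sub-network where its score is highest;
    modules with fewer than min_size nodes are then discarded.
    """
    best: Dict[str, Tuple[float, str]] = {}  # node -> (best score, network id)
    for net, node_scores in scores.items():
        for node, s in node_scores.items():
            if node not in best or s > best[node][0]:
                best[node] = (s, net)
    modules: Dict[str, Set[str]] = {}
    for node, (_, net) in best.items():
        modules.setdefault(net, set()).add(node)
    return {net: m for net, m in modules.items() if len(m) >= min_size}


# Illustrative scores only; B and C appear in more than one sub-network.
scores = {
    "net1": {"A": 3.0, "B": 1.0, "C": 2.0},
    "net2": {"B": 4.0, "D": 1.5, "C": 0.5},
    "net3": {"E": 1.0},
}
modules = build_modules(scores, min_size=2)
```

Here B moves to net2 and C stays in net1, and net3 is dropped because it holds a single node.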
8. The network maximum information flow model-based biomarker identification system according to claim 1, wherein the fitting prediction module is further configured to: assess the ability of each module to distinguish between phenotypes by comparing the predicted likelihood of disease with the actual value for the gene.
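The logistic fitting and the comparison of predicted and actual values can be illustrated with a minimal Python sketch. It assumes that each sample is summarized by a single module activity value (for example the mean normalized expression of the module's genes) and that labels are 1 for disease and 0 for control; this feature choice, the plain gradient-descent fit, and the use of accuracy as the comparison measure are assumptions, not details given in the patent.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(
    x: Sequence[float], y: Sequence[int], lr: float = 0.5, epochs: int = 2000
) -> Tuple[float, float]:
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        gw = gb = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(w * xi + b) - yi  # gradient of the log-loss
            gw += err * xi
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def predict(x: Sequence[float], w: float, b: float) -> List[float]:
    """Predicted probability of disease for each sample."""
    return [sigmoid(w * xi + b) for xi in x]


# Hypothetical module activity per sample and true phenotype labels.
activity = [-1.2, -0.8, -0.5, 0.6, 0.9, 1.4]
labels = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(activity, labels)
probs = predict(activity, w, b)

# Compare predicted and actual values: fraction of samples classified correctly.
accuracy = sum((p >= 0.5) == bool(t) for p, t in zip(probs, labels)) / len(labels)
```

A module whose predicted probabilities track the true labels closely separates the phenotypes well; a module with near-chance agreement would be a weak candidate.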
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, performs the steps of a network maximum information flow model-based biomarker identification method, comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network related to that target gene; modeling each target-gene sub-network as a resistive network, and calculating the information flow score of each node in each resistive network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistive networks, retaining each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to obtain qualified modules; and performing logistic model fitting on each qualified module to obtain the likelihood of disease when that module is included.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of a network maximum information flow model-based biomarker identification method, the method comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network related to that target gene; modeling each target-gene sub-network as a resistive network, and calculating the information flow score of each node in each resistive network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistive networks, retaining each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to obtain qualified modules; and performing logistic model fitting on each qualified module to obtain the likelihood of disease when that module is included.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210227861.8A CN114582418A (en) 2022-03-08 2022-03-08 Biomarker identification system based on network maximum information flow model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210227861.8A CN114582418A (en) 2022-03-08 2022-03-08 Biomarker identification system based on network maximum information flow model

Publications (1)

Publication Number Publication Date
CN114582418A true CN114582418A (en) 2022-06-03

Family

ID=81773452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210227861.8A Pending CN114582418A (en) 2022-03-08 2022-03-08 Biomarker identification system based on network maximum information flow model

Country Status (1)

Country Link
CN (1) CN114582418A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103093119A (en) * 2013-01-24 2013-05-08 南京大学 Method for recognizing significant biologic pathway through utilization of network structural information
US20140040264A1 (en) * 2011-02-04 2014-02-06 Hgh Tech Campus Method for estimation of information flow in biological networks
CN108604221A (en) * 2015-11-25 2018-09-28 安东·弗兰茨·约瑟夫·弗利里 Method and descriptor for the information flow that the object in more multiple Internets induces
CN108830045A (en) * 2018-06-29 2018-11-16 深圳先进技术研究院 A kind of biomarker screening system method based on multiple groups
CN109033748A (en) * 2018-08-14 2018-12-18 齐齐哈尔大学 A kind of miRNA identification of function method based on multiple groups
CN109063837A (en) * 2018-07-02 2018-12-21 南京邮电大学 Genetic algorithm information flow network property analysis method based on complex network structures entropy
CN111816246A (en) * 2020-05-27 2020-10-23 上海大学 Method for identifying driving gene from difference network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DANNING HE et al.: "Identification of dysfunctional modules and disease genes in congenital heart disease by a network-based approach", BMC GENOMICS, 2 December 2011 (2011-12-02), pages 1 - 16 *
PATRYCJA VASILYEV MISSIURO et al.: "Information Flow Analysis of Interactome Networks", PLOS COMPUTATIONAL BIOLOGY, 10 April 2009 (2009-04-10), pages 1 - 15 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination