CN114582418A

CN114582418A - Biomarker identification system based on network maximum information flow model

Info

Publication number: CN114582418A
Application number: CN202210227861.8A
Authority: CN
Inventors: 刘治平; 高子玉; 杨佳新; 高瑞
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-03-08
Filing date: 2022-03-08
Publication date: 2022-06-03

Abstract

The invention belongs to the technical field of biological information processing and provides a biomarker identification system based on a network maximum information flow model. The system: acquires gene expression data related to a preset disease and constructs a protein interaction network; processes the gene expression data to obtain target genes and differentially expressed genes; constructs a specific protein interaction network based on the protein interaction network, selects the shortest paths from each target gene to the differentially expressed genes, and constructs a sub-network for each target gene; models each target-gene sub-network as a resistor network and calculates the information flow score of each node in each resistor network using Kirchhoff's current law; compares the information flow scores of the same node in different resistor networks and keeps each node only in the network in which its information flow score is largest, finally obtaining qualified modules; and performs logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

Description

Biomarker identification system based on network maximum information flow model

Technical Field

The invention belongs to the technical field of biological information processing, and particularly relates to a biomarker identification system based on a network maximum information flow model.

Background

The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.

In the diagnosis of complex diseases, accurate diagnosis often requires molecular markers with high accuracy and good specificity. However, obtaining biomacromolecule markers such as genes, RNAs and proteins by experimental methods alone consumes a great deal of effort, money and material resources; the number of candidate biomolecular markers obtained from experiments is usually small, high-quality markers are difficult to obtain, and subsequent experiments face considerable obstacles. Identifying and discovering biomarkers by analyzing large-scale biological data therefore offers a new method and strategy.

Congenital heart disease is one of the five leading causes of death in infants. Multiple studies have shown that both genetic and environmental factors can lead to congenital heart disease; the identification of candidate genes and biomarkers has therefore been one of the central topics in congenital heart disease research. Congenital heart disease has a very wide spectrum: hundreds of types can be distinguished, and some patients have multiple malformations. The mildest cases may remain asymptomatic for life, whereas severe cases can present serious symptoms such as hypoxia and shock soon after birth. Although congenital heart disease is known to be caused by abnormal development of the heart during embryonic development, its molecular mechanism is not clear. Currently, about 30 different genes are known to cause congenital heart disease. Understanding the molecular functions, intermolecular interactions and expression pathways of these genes helps to explain the pathogenesis of congenital heart disease, thereby contributing to improved clinical diagnosis and treatment.

Molecular network-based methods are powerful tools for systematic analysis of complex diseases, identification of major pathways, response modules and candidate genes. Previous work in the area of high-throughput omics data on the heart has focused mainly on cardiac development and cardiovascular disease. Due to the lack of whole genome expression data, there is currently no such study for congenital heart disease.

Disclosure of Invention

In order to solve the technical problems in the background art described above, the present invention provides a biomarker identification system based on a network maximum information flow model, which clarifies the role of local network structure in the progression of a complex disease by using high-throughput gene expression data, thereby finding biomarkers useful for disease diagnosis in a molecular network.

To achieve this purpose, the invention adopts the following technical solution:

A first aspect of the invention provides a biomarker identification system based on a network maximum information flow model.

A biomarker identification system based on a network maximum information flow model, comprising:

a data acquisition module configured to acquire gene expression data related to a preset disease and to construct a protein interaction network;

a gene processing module configured to process the gene expression data to obtain target genes and differentially expressed genes;

a network generation module configured to construct a specific protein interaction network based on the protein interaction network, select the shortest paths from each target gene to the differentially expressed genes, and construct a sub-network for each target gene;

a network modeling module configured to model each target-gene sub-network as a resistor network and to calculate the information flow score of each node in each resistor network using Kirchhoff's current law;

a comparison screening module configured to compare the information flow scores of the same node in different resistor networks, keep each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminate modules whose number of nodes is less than a set value so as to finally obtain qualified modules; and

a fitting prediction module configured to perform logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

A second aspect of the invention provides a computer-readable storage medium.

A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network, and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistor networks, keeping each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to finally obtain qualified modules; and performing logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

A third aspect of the invention provides a computer device.

A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network, and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistor networks, keeping each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to finally obtain qualified modules; and performing logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

Compared with the prior art, the invention has the beneficial effects that:

The invention introduces the topological structure of the protein interaction network and combines it with the information flow score to mine multiple dysregulated network paths and gene modules related to the disease. Module analysis reveals the abnormality of the complex disease (congenital heart disease) at the molecular level; the modules accurately distinguish disease samples from normal samples, can serve as molecular markers for the diagnosis of the complex disease, and offer a new approach to revealing its pathogenic mechanism.

The maximum information flow model provided by the invention weights the proteins on all possible paths and, compared with degree or betweenness, measures the role of a given node in the network more accurately.

The model provided by the invention requires a larger amount of training and is therefore better suited to fields with abundant data, such as congenital heart disease prediction; the trained integrated model performs better than the original algorithm.

In the prediction method provided by the invention, the final test set does not participate in the training of the integrated model; the training result is essentially the same as the test result, so no overfitting problem arises.

The invention provides an information processing method aiming at omics data, which realizes the discovery of possible complex disease biomarkers from the omics data, thereby realizing the accurate prediction and classification of diseases.

The invention specifically describes the processing and analysis of omics data related to congenital heart disease in order to illustrate the technology developed by the invention; the method and system provided are general and can readily be extended to the study of other complex diseases.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is a model architecture diagram of a network maximum information flow model based biomarker recognition system in accordance with an embodiment of the present invention;

FIG. 2 is a diagram of a shortest path model structure between a selected pathogenic gene Si and a differentially expressed gene Tj in an embodiment of the present invention;

FIG. 3 is a diagram of a model structure for modeling a protein interaction network as a resistor network in an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and examples.

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

It should be noted that the flowchart and block diagrams in the figures represent the architecture, functionality, and operation of possible implementations of systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Example one

As shown in fig. 1, the present embodiment provides a biomarker recognition system based on a network maximum information flow model, including:

a comparison screening module configured to: comparing the information flow scores of the same nodes in different resistance networks, only keeping the node with the largest information flow score in the original network so as to obtain a candidate module, and then eliminating the modules with the node number less than a set value to finally obtain a qualified module;

This embodiment discloses a biomarker recognition system based on a network maximum information flow model. The system is based on transcriptome and proteome interaction data and is applicable to diseases in general; for clarity, it is described here taking congenital heart disease as an example. As shown in fig. 1, it comprises:

S1: acquiring gene expression data related to tetralogy of Fallot, a form of congenital heart disease, together with gene expression data of a reconstructive surgery control group, constructing a protein interaction network, and weighting each interaction.

S2: applying Z-score normalization to the gene expression data and performing a t test to obtain the target genes and the differentially expressed genes.

S3: obtaining the shortest paths between each target gene Si and the differentially expressed genes Tj, and constructing a sub-network.

S4: modeling each constructed protein sub-network as a resistor network, and calculating the information flow score of each node in each resistor network using Kirchhoff's current law.

S5: comparing the information flow scores of the same node in different resistor networks, keeping each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules with a small number of nodes to finally obtain qualified modules.

S6: classifying with a logistic regression model, and plotting the ROC curve to evaluate the classification performance.

In step S1, GSE26125 and GSE14970 from the GEO database are selected as raw data; both are preprocessed as described in their original publications, and the part common to the two gene expression datasets is retained for subsequent processing. The genes common to the PPI (protein-protein interaction) network and the Affymetrix Human Genome U133 Plus 2.0 Array microarray are used, and the largest connected component of the network is extracted. The weight of the interaction between two protein nodes is defined as their correlation value, corr, and the distance between the nodes is defined as

$$\mathrm{distance} = 1 - |\mathrm{corr}|.$$
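The edge weighting of step S1 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the text does not name the correlation measure, so the Pearson coefficient is assumed here, and the function names `pearson` and `weight_edges` and the toy expression profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles (assumed measure)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)

def weight_edges(edges, expr):
    """Map each interaction (u, v) to (corr, distance) with distance = 1 - |corr|."""
    out = {}
    for u, v in edges:
        c = pearson(expr[u], expr[v])
        out[(u, v)] = (c, 1.0 - abs(c))
    return out

# Toy expression profiles (hypothetical values, four samples per gene).
expr = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
    "D": [1.0, -1.0, -1.0, 1.0],
}
weighted = weight_edges([("A", "B"), ("A", "C"), ("A", "D")], expr)
for edge, (corr, dist) in weighted.items():
    print(edge, round(corr, 3), round(dist, 3))
```

Because the distance uses the absolute value of the correlation, strongly anti-correlated proteins are treated as being as close as strongly correlated ones.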

In step S2, the gene expression data are standardized by Z-score so that data of different magnitudes are brought to the same scale, and the transformed Z-score values are used to measure gene expression. A t test is then performed on the gene expression data: for each gene, the expression values in the disease samples and in the healthy samples are extracted to form two vectors, a t statistic is calculated for each pair of vectors to compare the gene's expression in the two classes of samples, and a significance p value is calculated from the t distribution to measure the significance of the difference more objectively. The t statistic is calculated as follows:

$$t_i = \frac{\bar{x}_i^{\mathrm{disease}} - \bar{x}_i^{\mathrm{control}}}{\mathrm{SE}_i}$$

where the numerator is the difference between the mean expression values of gene i in the two classes of samples, and the denominator $\mathrm{SE}_i$ is the standard error of gene i calculated over all samples. A gene is defined as a differentially expressed gene in CHD (congenital heart disease) cases if its normalized expression has a p value < 0.01 in CHD cases compared with all controls; a gene is defined as a target gene if the p value is < 0.01 in more than 75% of the cases.
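The Z-score standardization and the t statistic of step S2 can be illustrated with the short sketch below. It uses only the Python standard library, so it stops at the t statistic; the p value would come from the t distribution, for example via `scipy.stats.ttest_ind`. The exact form of the standard error is not given in the text, so Welch's unpooled form is assumed, and the function names and sample values are hypothetical.

```python
import math
import statistics

def z_scores(values):
    """Standardize one gene's expression values to mean 0 and unit standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

def t_statistic(disease, control):
    """Difference of group means divided by a standard error (Welch form, an assumption)."""
    se = math.sqrt(statistics.variance(disease) / len(disease)
                   + statistics.variance(control) / len(control))
    return (statistics.mean(disease) - statistics.mean(control)) / se

# Hypothetical values for one gene: first three samples disease, last three control.
raw = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]
z = z_scores(raw)
t = t_statistic(z[:3], z[3:])
print(round(t, 4))
```

The t statistic is unchanged by the Z-score step for a single gene, since standardization rescales the numerator and the denominator by the same factor; the standardization matters when expression values are compared across genes or platforms.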

In step S3, the igraph package for the R language is used to construct the specific protein interaction network, with the nodes numbered in input order. The shortest paths from each of the 14 CHD pathogenic genes Si to the 85 differentially expressed genes Tj are selected, and all paths of each CHD pathogenic gene are then merged into one sub-network (the model is shown in fig. 2), giving 14 sub-networks in total.
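The sub-network construction of step S3 can be sketched as below. The embodiment uses igraph in R; this illustration instead uses a plain Dijkstra search from the Python standard library, assumes the edge lengths are the distances defined in step S1, and uses hypothetical names (`shortest_path`, `build_subnetwork`) and a toy graph.

```python
import heapq

def shortest_path(adj, source, target):
    """Dijkstra on a weighted undirected graph; returns a node list or None."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def build_subnetwork(adj, source, targets):
    """Union of the edges on the shortest paths from one source gene to each target."""
    edges = set()
    for t in targets:
        path = shortest_path(adj, source, t)
        if path:
            edges.update(frozenset(e) for e in zip(path, path[1:]))
    return edges

# Toy network (hypothetical): one pathogenic gene S1, two differentially expressed genes.
adj = {
    "S1": {"a": 0.2, "b": 0.9},
    "a": {"S1": 0.2, "T1": 0.3, "b": 0.1},
    "b": {"S1": 0.9, "a": 0.1, "T2": 0.4},
    "T1": {"a": 0.3},
    "T2": {"b": 0.4},
}
sub = build_subnetwork(adj, "S1", ["T1", "T2"])
print(sorted(sorted(e) for e in sub))
```

Repeating `build_subnetwork` once per pathogenic gene would yield one sub-network per gene, as in the 14 sub-networks described above.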

In step S4, the present embodiment models the interaction network as a resistor network, in which proteins are represented as interconnected nodes and interactions as resistors (the model is shown in fig. 3). Kirchhoff's current law states that, at any node in a circuit and at any time, the sum of the currents flowing into the node equals the sum of the currents flowing out of it; the equations of the information flow model are defined on the basis of this law. In the model of this embodiment, if a node lies on the current paths of many node pairs, the sum of the absolute values of the currents obtained at that node is high, and the information flow score of the corresponding protein is correspondingly high. On this basis, candidate proteins that play an important role in biological information transfer within the protein sub-network can be screened.

Node i is connected to a unit current source, node j is grounded, and Kirchhoff's current law is used to calculate how much current flows through node k. The information flow score of node k is defined as the sum of the currents through node k over all pairwise combinations of target gene nodes and differentially expressed gene nodes. Exchanging the source node and the ground node does not change the current through any node, so the information flow score is calculated only for i > j. The number of pairwise combinations of nodes (i, j) is (N-2)(N-3)/2. For one pair (i, j), the current flowing through node k (with i ≠ k, j ≠ k, i > j) is:

$$I_k^{(i,j)} = \frac{1}{2} \sum_{m} \left| I_{km} \right|$$

where $I_{km}$ is the current between node k and node m, and the sum $\sum_m$ runs over all resistors connected to node k. For a given combination of one source node and one ground node, a convenient way to calculate the currents in the circuit is nodal analysis, which solves an (N-1)-dimensional system of linear equations for the node voltages. For each node m that is not the ground node, the following equation is satisfied:

$$\sum_{l} \frac{V_m - V_l}{R_{ml}} = 0$$

where $V_l$ is the voltage at node l, $R_{ml}$ is the resistance between nodes m and l, and the sum $\sum_l$ runs over all nodes directly connected to node m. When node m is the source node, the right side of the equation is the unit current value instead of zero. The node voltages can be calculated by solving the following system of linear equations:

$$G v = J$$

where G is the symmetric (N-1) × (N-1) conductance matrix, v is the vector of node voltages, and J is the vector of node currents.
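The nodal analysis of step S4 can be sketched as follows, under stated assumptions: the resistance of each interaction is taken as a given positive number (how it is derived from the edge weight is not specified here), the network is connected, and the current through a node is taken as half the sum of the absolute currents on its incident resistors. The function names and the four-node example are hypothetical, and a small Gaussian elimination stands in for a linear algebra library.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def node_voltages(nodes, res, source, ground):
    """Solve G v = J for a unit current injected at source, with ground held at 0 V."""
    free = [n for n in nodes if n != ground]
    idx = {n: i for i, n in enumerate(free)}
    g = [[0.0] * len(free) for _ in free]
    for (u, v), r in res.items():
        c = 1.0 / r  # conductance; resistances must be positive
        for a, b in ((u, v), (v, u)):
            if a != ground:
                g[idx[a]][idx[a]] += c
                if b != ground:
                    g[idx[a]][idx[b]] -= c
    j = [1.0 if n == source else 0.0 for n in free]
    volt = dict(zip(free, solve(g, j)))
    volt[ground] = 0.0
    return volt

def flow_scores(nodes, res, sources, targets):
    """Sum, over (source, target) pairs, of the current passing through each other node."""
    score = {n: 0.0 for n in nodes}
    for s in sources:
        for t in targets:
            if s == t:
                continue
            volt = node_voltages(nodes, res, s, t)
            for k in nodes:
                if k in (s, t):
                    continue
                score[k] += 0.5 * sum(abs(volt[a] - volt[b]) / r
                                      for (a, b), r in res.items() if k in (a, b))
    return score

# Two parallel routes from S to T: via a (1 + 1 ohm) and via b (3 + 3 ohm).
nodes = ["S", "a", "b", "T"]
res = {("S", "a"): 1.0, ("a", "T"): 1.0, ("S", "b"): 3.0, ("b", "T"): 3.0}
scores = flow_scores(nodes, res, ["S"], ["T"])
print(scores)
```

In this example the low-resistance route carries three quarters of the unit current, so node a scores 0.75 and node b scores 0.25, which illustrates how a node on a preferred route accumulates a higher information flow score.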

In step S5, for nodes that appear in several resistor networks, the present embodiment compares their information flow scores and assigns each such node to the network in which its information flow score is highest. After this screening, 14 candidate modules containing 423 nodes in total are obtained. Modules with few nodes are eliminated, and the 12 remaining modules with more nodes are selected for subsequent study.
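The screening of step S5 can be sketched as below. This is an illustration only: the function name `assign_modules`, the tie-breaking rule (the first network encountered wins) and the toy scores are assumptions, and the size threshold is left as a parameter because the set value is not stated.

```python
def assign_modules(scores_by_network, min_size):
    """Keep each node only in the network where its score is highest,
    then drop modules with fewer than min_size nodes."""
    best = {}
    for net, scores in scores_by_network.items():
        for node, s in scores.items():
            # Strict comparison: on a tie the earlier network keeps the node (assumption).
            if node not in best or s > best[node][1]:
                best[node] = (net, s)
    modules = {net: set() for net in scores_by_network}
    for node, (net, _) in best.items():
        modules[net].add(node)
    return {net: members for net, members in modules.items() if len(members) >= min_size}

# Hypothetical information flow scores for three sub-networks.
scores_by_network = {
    "S1": {"a": 3.0, "b": 1.0, "c": 2.0},
    "S2": {"b": 4.0, "d": 1.5, "e": 0.5},
    "S3": {"a": 1.0, "e": 0.1},
}
qualified = assign_modules(scores_by_network, min_size=2)
print(qualified)
```

Here the sub-network S3 loses both of its nodes to networks in which they score higher, so it falls below the size threshold and is eliminated, mirroring the reduction from candidate modules to qualified modules.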

In step S6, logistic model fitting is performed on the qualified modules to obtain the probability of disease given that a certain module is present; this probability is compared with the true label to evaluate the ability of each module obtained in this embodiment to distinguish phenotypes.
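The logistic fitting and ROC evaluation of step S6 can be sketched as follows. The sketch assumes that each sample is described by the standardized expression values of the genes in one module, fits the model by plain batch gradient descent, and summarizes the ROC curve by its area; the function names, learning rate, epoch count and data are all hypothetical.

```python
import math

def predict(w, b, x):
    """Probability of disease from the logistic model sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w = [0.0] * len(xs[0])
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for x, y in zip(xs, ys):
            err = predict(w, b, x) - y
            for i, xi in enumerate(x):
                gw[i] += err * xi
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical module with two genes; label 1 = disease, 0 = control.
xs = [[1.2, 0.8], [0.9, 1.1], [1.5, 0.7], [-1.0, -0.9], [-1.3, -0.6], [-0.8, -1.2]]
ys = [1, 1, 1, 0, 0, 0]
w, b = fit_logistic(xs, ys)
probs = [predict(w, b, x) for x in xs]
area = auc(probs, ys)
print(round(area, 3))
```

For a real evaluation the probabilities would be computed on samples held out from fitting, so that the area under the curve reflects the module's ability to distinguish unseen phenotypes.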

The embodiment comprises three layers of models: the shortest paths between the target genes and the differentially expressed genes are selected to form sub-networks; each protein interaction sub-network is modeled as a circuit network to calculate information flow scores; and logistic regression models are fitted to the resulting modules.

Example two

The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network, and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistor networks, keeping each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to finally obtain qualified modules; and performing logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

Example three

The present embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network, and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistor networks, keeping each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to finally obtain qualified modules; and performing logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A biomarker recognition system based on a network maximum information flow model, comprising:

a fitting prediction module configured to perform logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

2. The network maximum information flow model-based biomarker recognition system according to claim 1, wherein the gene expression data related to a preset disease comprises: gene expression data associated with the preset disease and gene expression data of a reconstructive surgery control group;

the construction of the protein interaction network includes weighting each interaction.

3. The network maximum information flow model-based biomarker recognition system according to claim 1, wherein the process of obtaining the target genes and the differentially expressed genes comprises: applying Z-score normalization to the gene expression data and performing a t test to obtain a t distribution result; calculating a significance p value from the t distribution result; defining a gene as a differentially expressed gene in CHD cases if its normalized expression has a p value < 0.01 in CHD cases compared with all controls; and defining a gene as a target gene if the p value is < 0.01 in more than 75% of the cases.

4. The network maximum information flow model-based biomarker recognition system according to claim 1, wherein the selecting the shortest path from the target gene to the differentially expressed gene specifically comprises: numbering each node in the specific protein interaction network according to an input sequence, selecting the shortest path from the selected target gene to the differentially expressed gene, and combining all paths of the target gene into a subnet.

5. The network maximum information flow model-based biomarker identification system according to claim 1, wherein modeling the protein interaction network as a resistor network specifically comprises: modeling the protein interaction network as a resistor network in which proteins are represented as interconnected nodes and interactions are represented as resistors.

6. The network maximum information flow model-based biomarker recognition system according to claim 5, wherein the information flow score of a node k is defined as the sum of the currents passing through node k over all pairwise combinations of source nodes and differentially expressed gene nodes, wherein a source node is a node in the sub-network of the target gene.

7. The network maximum information flow model-based biomarker identification system according to claim 1, wherein comparing the information flow scores of the same node in different resistor networks and keeping each node only in the network in which its information flow score is largest, thereby obtaining the candidate modules, specifically comprises: for nodes that appear in different sub-networks, comparing their information flow scores in the different resistor networks and placing each such node into the network in which its information flow score is highest, to obtain a candidate network.

8. The network maximum information flow model-based biomarker identification system according to claim 1, wherein the fitting prediction module is further configured to assess the ability of each module to distinguish phenotypes by comparing the predicted probability of disease with the actual value.

9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network, and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistor networks, keeping each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to finally obtain qualified modules; and performing logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.

10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of a biomarker identification method based on a network maximum information flow model, the method comprising: acquiring gene expression data related to a preset disease, and constructing a protein interaction network; processing the gene expression data to obtain target genes and differentially expressed genes; constructing a specific protein interaction network based on the protein interaction network, selecting the shortest paths from each target gene to the differentially expressed genes, and constructing a sub-network for each target gene; modeling each target-gene sub-network as a resistor network, and calculating the information flow score of each node in each resistor network using Kirchhoff's current law; comparing the information flow scores of the same node in different resistor networks, keeping each node only in the network in which its information flow score is largest so as to obtain candidate modules, and then eliminating modules whose number of nodes is less than a set value to finally obtain qualified modules; and performing logistic model fitting on the qualified modules to obtain the probability of disease given that a certain module is present.