CN111223523B - Gene regulation network construction method and system based on multi-time-lag causal entropy - Google Patents
- Publication number
- CN111223523B (application number CN202010013036.9A)
- Authority
- CN
- China
- Prior art keywords
- time
- gene
- edges
- genes
- entropy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
Abstract
The application discloses a method and a system for constructing a gene regulation network based on multi-time-lag causal entropy. Input time-series gene expression data are divided into time windows with different time lags. For gene expression data of t time slices, gene expression matrices are constructed under the t-τ time windows. For each pair of genes, the multi-time-lag transfer entropy between the target gene under time window t and the candidate regulator under the preceding time windows t-τ to t-1 is calculated, giving a multi-time-lag gene correlation matrix whose elements represent the probability of an edge between two genes. The edges of the matrix are clustered into two classes by k-means and the low-probability cluster is filtered out. For each remaining edge, the multi-time-lag conditional transfer entropy (causal entropy) given each conditional gene is calculated, and indirectly regulated edges whose maximum causal entropy is smaller than a threshold are filtered out to obtain the final network structure. The application effectively improves the accuracy of inference.
Description
Technical Field
The application relates to the field of bioinformatics, in particular to a construction method of a complex biological network.
Background
In organisms, cells are the fundamental unit of all tissue structure and function. All cells in a living body carry the same DNA information, yet cells of different tissues and organs differ from one another, because a complex gene regulatory mechanism within the cell gives the expression of different cells many specific characteristics. The mechanisms controlling gene expression are collectively called gene expression regulation. Different organisms also regulate their genes in different ways. In prokaryotes, environmental stimuli have a critical impact on gene expression: through contact with the external environment, prokaryotes adapt to different conditions by turning the expression of some genes on and off. Gene regulation in eukaryotes is more complex than in prokaryotes. It is mainly affected by hormones and the cell growth cycle, while the influence of environmental factors is greatly reduced. Specific characteristics of gene regulation include: (1) complex structure; (2) variable regulation modes: there is one-to-one regulation between genes as well as one-to-many and many-to-one multi-factor regulation; (3) diverse types: regulation may involve many types of molecules, such as DNA, mRNA, proteins and small molecules; and (4) dynamically changing regulatory relationships. Therefore, the gene regulation mechanism is one of the important foundations for studying the growth and development rules and the basic morphological structure of animals and plants.
Constructing gene regulation networks from different types of gene expression data by computational means has become one of the important challenges of systems biology. Common computational methods for constructing gene regulation networks draw on a number of theoretical fields, including correlation analysis methods, Bayesian networks, feature selection methods, and Boolean networks. These methods determine gene regulatory relationships by analyzing the correlation between genes, or by modeling the relationship between their expression levels, and finally construct a regulation network.
The construction of gene regulation networks by correlation analysis is one of the most intuitive methods. Researchers analyze the correlation between genes by means of the Pearson correlation coefficient, mutual information, and similar measures. Among the most popular approaches is the construction of gene regulation networks based on mutual information. Compared with the Pearson correlation coefficient, mutual information can reveal nonlinear regulatory relationships among genes. Margolin et al. propose the ARACNE algorithm, which uses the Data Processing Inequality (DPI): if, in a triplet $(X_1, X_2, X_3)$, genes $X_1$ and $X_3$ interact only through $X_2$, then $I(X_1;X_3) \le \min[I(X_1;X_2), I(X_2;X_3)]$. ARACNE calculates the mutual information $I$ for every pair of genes and applies a threshold $I_0$; Margolin et al. consider that a regulatory relationship exists only between gene pairs with $I \ge I_0$. Meyer et al. further propose the MRNET algorithm based on ARACNE. The algorithm uses a maximum relevance/minimum redundancy (MRMR) strategy and a greedy algorithm to pick out a node $X_j$ that has maximum relevance to the target node $Y$ and minimum redundancy with respect to the set $S$ of already selected nodes, that is, the node with the maximum score. For a pair of nodes $\{X_i, X_j\}$, the larger MRMR value is used as their weight. In further studies, Luo et al. considered that a gene expression regulatory relationship generally involves more than 3 genes, i.e., for a target gene T there are generally two or more regulatory genes. Based on this hypothesis, they propose a new algorithm, MI3, which scores the target gene T and two regulatory genes R1 and R2 by a correlation part and a coordination part in order to discover higher-order interactions. Zhang et al. combine conditional mutual information with the path consistency algorithm (PCA) and propose the network construction algorithm CMI-PCA, which uses multivariate conditional mutual information to test for and filter out indirect regulatory relationships. Zhao et al. propose a new mutual information estimation method, PMI, to address the problem that CMI-PCA underestimates regulatory relationships in its calculation. The key to mutual-information-based gene regulation network construction lies in filtering out indirect regulatory relationships.
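The threshold-plus-DPI idea described for ARACNE can be illustrated with a small sketch. This is not the ARACNE implementation; the function name `dpi_prune`, the list-of-lists input and the parameter `i0` are assumptions made for illustration. In every fully connected triplet, the edge with the smallest mutual information is treated as indirect and removed.

```python
from itertools import combinations

def dpi_prune(mi, i0=0.0):
    """Threshold a symmetric mutual-information matrix, then apply the DPI.

    mi : list of lists with mi[i][j] == mi[j][i] (hypothetical input format)
    i0 : threshold I_0; pairs with I < I_0 are discarded before DPI pruning
    Returns the set of undirected edges (i, j) with i < j that survive.
    """
    n = len(mi)
    edges = {(i, j) for i, j in combinations(range(n), 2) if mi[i][j] >= i0}
    to_remove = set()
    for a, b, c in combinations(range(n), 3):
        triplet = [(a, b), (b, c), (a, c)]
        if not all(e in edges for e in triplet):
            continue  # the DPI is only applied to fully connected triplets
        weakest = min(triplet, key=lambda e: mi[e[0]][e[1]])
        others = [mi[e[0]][e[1]] for e in triplet if e != weakest]
        # I(X1;X3) <= min[I(X1;X2), I(X2;X3)] marks the weakest edge as indirect
        if mi[weakest[0]][weakest[1]] < min(others):
            to_remove.add(weakest)
    return edges - to_remove

mi = [[0.0, 0.9, 0.3],
      [0.9, 0.0, 0.8],
      [0.3, 0.8, 0.0]]
print(dpi_prune(mi, i0=0.1))
```

Removals are collected against the thresholded edge set and applied at the end, so the result does not depend on the order in which triplets are visited.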
Bayesian networks are another common method of gene regulation network construction. A Bayesian network quantifies the attributes of a directed biological network by combining methods from graph theory and probability theory. The difficulty of applying Bayesian networks to gene regulation networks lies mainly in two parts: structure learning and parameter learning. Werhli and Husmeier integrate gene expression data with prior knowledge from multiple sources. By constructing an energy function E(G) combined with a Gibbs distribution for learning the Bayesian network structure, they improve the accuracy of the learned structure. They also use Markov chain Monte Carlo (MCMC) to estimate the hyperparameters associated with the different sources of prior knowledge. Qin et al. add the Ontology Fingerprint to the prior knowledge to assess similarity between genes and infer cell-type-specific signaling networks. In their algorithm, a Bayesian network is established on a given normalized signal transmission network, a heuristic search algorithm adds and deletes edges according to Ontology Fingerprint similarity, and the degree of agreement between a candidate model and the observed data is calculated as the selection criterion. During Bayesian network structure learning, Qin et al. adopt BIC for model selection and use a Monte Carlo EM algorithm to infer the hidden states of nodes in the network and then estimate the parameters of the candidate models. Hill et al., inspired by the idea of a "minimal number of upstream regulatory factors", limit the maximum in-degree of a gene in the network to $d_{max}=4$, which effectively reduces the uncertainty of the network. Hill's method is highly effective on breast cancer cell lines, and, given prior knowledge of the network structure, constructs a network with an AUC of up to 0.82 on the complex yeast network. Li et al. propose a dynamic Bayesian algorithm, MMHO-DBN, combining a high-order temporal model with Max-Min hill-climbing heuristic search. MMHO-DBN improves on local search by using Dynamic Max-Min Parent (DMMP) to obtain a set of highly probable parent nodes, which effectively reduces the space of candidate network structures.
At present, many gene regulation network construction methods have been proposed, but they are limited by the complexity of gene regulation, and their precision still leaves considerable room for improvement. The main open problems are: (1) how to design an effective algorithm to filter out indirect regulatory relationships among genes; and (2) how to incorporate other biological information to improve the precision of network construction.
Disclosure of Invention
In view of the shortcomings of the prior art, the technical problem to be solved by the application is to provide a gene regulation network construction method based on multi-time-lag causal entropy that improves the precision of network construction.
In order to solve the above technical problem, the application adopts the following technical scheme: a gene regulation network construction method based on multi-time-lag causal entropy, comprising the following steps:
1) Dividing the input time-series gene expression data into different time windows according to the time lag τ;
2) For the windowed gene expression data of t time slices, constructing the time-series gene expression matrices under the t-τ time windows, namely the matrices from t-τ to t-1;
3) For each pair of genes in the time-series gene expression matrices under the t-τ time windows, taking the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulatory factor, and calculating the multi-time-lag transfer entropy between the genes to obtain a gene correlation matrix;
4) For the fully connected network given by the gene correlation matrix, clustering the edges into two classes and filtering out the class of edges with low probability values; then, for each remaining edge, calculating the multi-time-lag causal entropy under different conditional genes, and filtering out indirectly regulated edges whose maximum causal entropy is lower than a threshold θ to obtain the final gene regulation network.
In step 1), regulatory relationships can be identified more accurately by using multiple time lags. The data are divided into different time windows Gτ according to the time lag τ, where each element of Gτ represents the expression value of gene N in the time-window expression matrix Gτ under time window T of sample M; T indicates the moving time window to which a gene expression vector belongs; N is the gene index, ranging over the number of genes; and M is the index of the sample cells, ranging over the number of samples.
In step 3), in order to identify regulatory relationships under multiple time lags more accurately, the transfer entropy is generalized to the multi-time-lag case. The multi-time-lag transfer entropy $T_{X\to Y}$ between genes is calculated as:

$$T_{X\to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau})$$

where $I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau})$ represents the conditional mutual information of $Y_t$ and $X_{t-1:t-\tau}$ given the condition variable $Y_{t-1:t-\tau}$. In the conditional mutual information, $p_{X,Y,Z}(x,y,z)$ represents the joint probability density, $p_Z(z)$ represents the marginal probability density, and $p_{X,Z}(x,z)$ represents the marginal probability density between the variables x and z. $X_{t-1:t-\tau}$ represents the expression values of gene X in the time windows from t-1 to t-τ, and $H(\cdot \mid \cdot)$ represents conditional entropy, in which $p(x,y)$ represents the joint probability and $p(x)$ represents the marginal probability density.
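For reference, the two quantities named above can be written in their standard textbook form. This is a sketch built from the usual definitions using the symbols listed in the text. The density $p_{Y,Z}(y,z)$ is not named in the text and is introduced here by analogy with $p_{X,Z}(x,z)$, and the sums become integrals when the variables are treated as continuous.

```latex
I(X;Y \mid Z) = \sum_{x,y,z} p_{X,Y,Z}(x,y,z)\,
  \log \frac{p_Z(z)\, p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\, p_{Y,Z}(y,z)}

H(Y \mid X) = -\sum_{x,y} p(x,y)\, \log \frac{p(x,y)}{p(x)}

I(Y;X \mid Z) = H(Y \mid Z) - H(Y \mid Z, X)
```

Setting $Y \to Y_t$, $X \to X_{t-1:t-\tau}$ and $Z \to Y_{t-1:t-\tau}$ in the last identity gives the two-term entropy form of $T_{X\to Y}$.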
In step 4), in order to filter out indirect regulation, a path consistency algorithm based on causal entropy is used. The specific implementation process of filtering out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ comprises the following steps:
1) For the gene correlation matrix $G_{\text{zero-order}}$, filtering out edges with low values: the edges are divided into two clusters by k-means, and the edges in the cluster with low probability values are filtered out; the elements of the gene correlation matrix represent the probability that a regulatory relationship exists between genes;
2) Filtering indirectly regulated edges based on the path consistency algorithm: for each edge (X, Y) in the filtered network, a neighbouring node Z is taken as a conditional gene if both the edge (Y, Z) and the edge (X, Z) exist, and the causal entropy under the conditional gene is calculated as $CE_{X\to Y\mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;
3) For the conditional gene set $K \in \{K_1, K_2, K_3, \dots, K_n\}$ of genes X and Y, filtering out the edges whose maximum causal entropy over the conditional genes, $\max_{Z\in K}\{CE_{X\to Y\mid Z}\}$, is less than the threshold θ.
The threshold of the present application may be set to θ=0.03.
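The two-cluster split of the edge scores can be sketched as a one-dimensional k-means with k = 2. This is an illustrative sketch rather than the implementation of the application: the function names, the dictionary input keyed by (regulator, target) and the min/max initialisation are assumptions.

```python
def two_means_split(scores, iters=100):
    """1-D k-means with k=2. Returns labels; True marks the high-value cluster."""
    lo, hi = min(scores), max(scores)  # assumed initial centres
    if lo == hi:
        return [True] * len(scores)    # nothing to separate
    labels = []
    for _ in range(iters):
        # assign each score to the nearer centre (ties go to the high cluster)
        labels = [abs(s - hi) <= abs(s - lo) for s in scores]
        high = [s for s, keep in zip(scores, labels) if keep]
        low = [s for s, keep in zip(scores, labels) if not keep]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:
            break                      # centres stopped moving
        lo, hi = new_lo, new_hi
    return labels

def filter_low_cluster(te):
    """te maps a directed edge (regulator, target) to its transfer entropy."""
    edges = list(te)
    labels = two_means_split([te[e] for e in edges])
    return {e: te[e] for e, keep in zip(edges, labels) if keep}

te = {("A", "B"): 0.9, ("B", "C"): 0.8, ("A", "C"): 0.05, ("C", "A"): 0.02}
print(filter_low_cluster(te))
```

Clustering avoids choosing a fixed cut-off for the first filtering pass; only the second pass relies on the explicit threshold θ.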
The application also provides a gene regulation network construction system based on multi-time-lag causal entropy, comprising:
an input unit, used for dividing the input time-series gene expression data into different time windows according to the time lag τ;
a gene expression matrix construction unit, used for constructing, for the windowed gene expression data of t time slices, the time-series gene expression matrices under the t-τ time windows, namely the matrices from t-τ to t-1;
a gene correlation matrix construction unit, used for taking, for each pair of genes in the time-series gene expression matrices under the t-τ time windows, the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulatory factor, and calculating the multi-time-lag transfer entropy between the genes to obtain a gene correlation matrix;
and a clustering unit, used for clustering the edges of the fully connected network given by the gene correlation matrix into two classes, filtering out the class of edges with low probability values, calculating for each remaining edge the multi-time-lag causal entropy under different conditional genes, and filtering out indirectly regulated edges whose maximum causal entropy is lower than a threshold θ to obtain the final gene regulation network.
Compared with the prior art, the application has the following beneficial effects: the method is suitable for real time-series gene expression data with very few time slices, can calculate the regulatory relationships of genes across multiple time slices, and filters out indirectly regulated edges through conditional transfer entropy, thereby effectively improving the precision of network construction.
Drawings
FIG. 1 is a flow chart of the NIMCE of the application;
FIG. 2 compares NIMCE with the GENIE3, Jump3 and fastBMA methods based on PR curves and the area under them (AUPR);
FIG. 3 compares NIMCE with the GENIE3, Jump3 and fastBMA methods based on Recall and Precision.
Detailed Description
1. Construction of time window Gene expression matrices
A time-series gene expression data file is read in, where G represents the gene expression matrix. In the expression matrix Gτ under a moving time window, each element represents the expression value of gene N in the time-window expression matrix Gτ under time window T of sample M; T indicates the moving time window to which a gene expression vector belongs; N is the gene index (ranging over the number of genes) and M is the index of the sample cells (ranging over the number of samples).
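The windowing step can be sketched as follows. The matrix expression for Gτ is not reproduced in this text, so the layout used here is an assumption: `expr[m][n]` holds the time series of gene n in sample m, and for each time point t the value at t is paired with the τ preceding values. The function name `lagged_windows` is hypothetical.

```python
def lagged_windows(expr, tau):
    """Pair each present value with its tau preceding values.

    expr[m][n] is the time series (length T) of gene n in sample m (assumed layout).
    Returns (present, past):
      present[n] : list of M * (T - tau) values x_t
      past[n]    : list of matching tuples (x_{t-1}, ..., x_{t-tau})
    """
    n_genes = len(expr[0])
    present = [[] for _ in range(n_genes)]
    past = [[] for _ in range(n_genes)]
    for sample in expr:
        n_time = len(sample[0])
        for t in range(tau, n_time):
            for n in range(n_genes):
                series = sample[n]
                present[n].append(series[t])
                past[n].append(tuple(series[t - k] for k in range(1, tau + 1)))
    return present, past

# one sample, two genes, four time points
expr = [[[1, 2, 3, 4], [10, 20, 30, 40]]]
present, past = lagged_windows(expr, tau=2)
print(present[0], past[0])
```

With M samples and T time points, each gene contributes M(T - τ) lagged observations, so short time series are pooled across samples.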
2. Construction of a Gene correlation matrix
For each pair of genes, the target gene takes the expression profile under window t, the regulatory factor takes the expression profiles under windows t-τ to t-1, and the multi-time-lag transfer entropy between the regulatory factor and the target gene is calculated:

$$T_{X\to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau})$$

where $I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau})$ represents the conditional mutual information of $Y_t$ and $X_{t-1:t-\tau}$ given the condition variable $Y_{t-1:t-\tau}$. In the conditional mutual information, $p_{X,Y,Z}(x,y,z)$ represents the joint probability density, $p_Z(z)$ represents the marginal probability density, and $p_{X,Z}(x,z)$ represents the marginal probability density between the variables x and z.
$X_{t-1:t-\tau}$ represents the expression values of gene X in the time windows from t-1 to t-τ, and gene Y is represented in the same way. $H(\cdot \mid \cdot)$ represents conditional entropy, in which $p(x,y)$ represents the joint probability and $p(x)$ represents the marginal probability density.
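To make the quantity concrete, the sketch below estimates the multi-time-lag transfer entropy for discrete (for example, binned) series with a simple plug-in frequency estimator. This estimator is swapped in for illustration and is not the method of the application. The function names are hypothetical, and the result is in nats.

```python
from collections import Counter
from math import log

def conditional_mutual_information(a, b, c):
    """Plug-in estimate of I(A;B|C) in nats from paired discrete observations."""
    n = len(a)
    abc = Counter(zip(a, b, c))
    ac = Counter(zip(a, c))
    bc = Counter(zip(b, c))
    cc = Counter(c)
    total = 0.0
    for (x, y, z), k in abc.items():
        # p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ], written with counts
        total += (k / n) * log(cc[z] * k / (ac[(x, z)] * bc[(y, z)]))
    return total

def transfer_entropy(x, y, tau):
    """T_{X->Y} = I(Y_t ; X_{t-1:t-tau} | Y_{t-1:t-tau}) for two discrete series."""
    y_now, x_past, y_past = [], [], []
    for t in range(tau, len(y)):
        y_now.append(y[t])
        x_past.append(tuple(x[t - k] for k in range(1, tau + 1)))
        y_past.append(tuple(y[t - k] for k in range(1, tau + 1)))
    return conditional_mutual_information(y_now, x_past, y_past)

x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0] + x[:-1]            # y copies x with a one-step delay
print(transfer_entropy(x, y, tau=1))
```

Because y is fully determined by the previous value of x in this toy series, the estimate is clearly positive; a constant x gives exactly zero.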
In NIMCE, to effectively calculate the probability distributions in equation (2), we estimate the probability density of continuous variables using Kernel Density Estimation (KDE). In the resulting estimate, $X_p$ represents $X_{t-1:t-\tau}$, $Y_p$ represents $Y_{t-1:t-\tau}$, $L = m(T-\tau)$ denotes the number of samples under the time-shifted matrix, and $f_h(x)$ is the kernel density function, in which h represents the bandwidth and K is a kernel function.
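A one-dimensional kernel density estimate in its standard form, $f_h(x) = \frac{1}{Lh}\sum_{i=1}^{L} K\left(\frac{x - x_i}{h}\right)$, can be sketched as below. The Gaussian kernel and the bandwidth value are assumptions, since the text does not state which kernel or bandwidth NIMCE uses; the function names are illustrative.

```python
from math import exp, pi, sqrt

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)

def kde(x, samples, h):
    """f_h(x) = 1/(L*h) * sum_i K((x - x_i) / h) with bandwidth h > 0."""
    L = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (L * h)

samples = [-1.0, 0.5, 2.0]
print(kde(0.0, samples, h=0.5))
```

A smaller bandwidth h follows the samples more closely, while a larger one smooths the estimate. Joint densities such as $p(x,y,z)$ would need a multivariate kernel, which this sketch does not cover.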
3. Filtering indirectly regulated edges
For the gene correlation matrix $G_{\text{zero-order}}$, whose elements represent the probability that a regulatory relationship exists between genes, the edges are divided into two clusters by k-means, and the edges in the cluster with low probability values are filtered out;
Indirectly regulated edges are then filtered based on the path consistency algorithm: for each edge (X, Y) in the filtered network, a neighbouring node Z is taken as a conditional gene if both the edge (Y, Z) and the edge (X, Z) exist, and the causal entropy under the conditional gene is calculated as $CE_{X\to Y\mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;
For the conditional gene set $K \in \{K_1, K_2, K_3, \dots, K_n\}$ of genes X and Y, the edges whose maximum causal entropy over the conditional genes, $\max_{Z\in K}\{CE_{X\to Y\mid Z}\}$, is less than the threshold θ (a manually set hyperparameter, 0.03 by default) are filtered out.
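The filtering rule above can be sketched as follows. Several details are assumptions made for illustration: `causal_entropy` is a hypothetical callable returning $CE_{X\to Y\mid Z}$, an edge to Z counts in either direction, edges with no common neighbour are kept, and all decisions are made against the edge set as it stood before this pass.

```python
def filter_indirect_edges(edges, causal_entropy, theta=0.03):
    """Drop edges whose maximum causal entropy over conditional genes is below theta.

    edges          : set of directed (X, Y) pairs left after the clustering step
    causal_entropy : hypothetical callable (x, y, z) -> CE_{X->Y|Z}
    theta          : threshold; 0.03 is the default stated in the text
    """
    def linked(a, b):
        # assumption: direction is ignored when looking for common neighbours
        return (a, b) in edges or (b, a) in edges

    nodes = {gene for edge in edges for gene in edge}
    kept = set()
    for x, y in edges:
        conditional = [z for z in nodes
                       if z not in (x, y) and linked(x, z) and linked(y, z)]
        if not conditional:
            kept.add((x, y))   # nothing to condition on, keep the edge
        elif max(causal_entropy(x, y, z) for z in conditional) >= theta:
            kept.add((x, y))
    return kept

edges = {("A", "B"), ("B", "C"), ("A", "C")}
ce = lambda x, y, z: 0.01 if (x, y) == ("A", "C") else 0.5
print(filter_indirect_edges(edges, ce))
```

In the toy example the edge A→C carries almost no information once B is conditioned on, so it is treated as indirect and removed.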
4. Experimental verification
To verify the effectiveness of the method of the present application, we tested it on 5 simulated datasets generated with GeneNetWeaver (GNW) and compared it with the random-forest-based GENIE3, the tree-based Jump3 and the dynamic Bayesian fastBMA. The GNW datasets are time-series perturbation data generated from subnetworks extracted from the E. coli or S. cerevisiae gene regulatory networks, as in the DREAM4 challenge. We generated 5 datasets using GNW, where each dataset contained 50 genes and 10 samples, and each sample contained time-series expression data at 21 time points.
To evaluate the continuity and accuracy of the inferred results, we used several metrics for comparison, namely the AUPR value and the Recall and Precision values. The AUPR value is the area under the PR curve, the Recall value is the ratio of the number of correctly predicted edges to the number of true directed edges, and the Precision value is the ratio of the number of correctly predicted edges to the number of predicted edges. The AUPR results and the Recall and Precision results are shown in FIG. 2 and FIG. 3 respectively.
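The metrics defined above can be computed as in the following sketch. Precision and Recall follow the ratios given in the text. For AUPR, a step-wise (average-precision style) summation over the ranked edges is assumed, since the text does not say how the area is integrated; the function names are illustrative.

```python
def precision_recall(predicted, truth):
    """Precision = correct / predicted edges, Recall = correct / true edges."""
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall

def aupr(scores, truth):
    """Area under the PR curve by step-wise summation (assumed integration rule).

    scores maps each directed edge to its confidence; truth is the set of true edges.
    """
    truth = set(truth)
    ranked = sorted(scores, key=scores.get, reverse=True)
    correct, area = 0, 0.0
    for rank, edge in enumerate(ranked, start=1):
        if edge in truth:
            correct += 1
            area += correct / rank   # precision at each newly recovered true edge
    return area / len(truth) if truth else 0.0

truth = {("A", "B"), ("B", "C")}
print(precision_recall({("A", "B"), ("A", "C")}, truth))
print(aupr({("A", "B"): 0.9, ("A", "C"): 0.5, ("B", "C"): 0.4}, truth))
```

Edges are treated as directed pairs, so predicting B→A when the true edge is A→B counts as an error.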
As can be seen from FIG. 2 and FIG. 3, our method is superior to the other methods on the different samples, whether measured by the AUPR value or by the Recall and Precision values. The proposed method NIMCE therefore shows good stability. Experiments also show that the time complexity of Jump3 increases exponentially as the network grows, to the point where the calculation becomes practically infeasible, whereas NIMCE obtains a result in a shorter time.
Claims (3)
1. The gene regulation network construction method based on the multi-time-lag causal entropy is characterized by comprising the following steps of:
1) Dividing the input time sequence gene expression data into different time windows according to time lag tau;
2) For the windowed gene expression data of t time slices, constructing the time-series gene expression matrices under the t-τ time windows, namely the matrices from t-τ to t-1;
3) For each pair of genes in the time-series gene expression matrices under the t-τ time windows, taking the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulatory factor, and calculating the multi-time-lag transfer entropy between the genes to obtain a gene correlation matrix;
4) For the fully connected network given by the gene correlation matrix, clustering the edges into two classes and filtering out the class of edges with low probability values; for each remaining edge, calculating the multi-time-lag causal entropy under different conditional genes, and filtering out indirectly regulated edges whose maximum causal entropy is lower than a threshold θ to obtain the final gene regulation network; the specific implementation process of filtering out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ comprises the following steps:
for the gene correlation matrix $G_{\text{zero-order}}$, filtering out edges with low values: the edges are divided into two clusters by k-means, and the edges in the cluster with low probability values are filtered out; the elements of the gene correlation matrix represent the probability that a regulatory relationship exists between genes;
filtering indirectly regulated edges based on a path consistency algorithm: for each edge (X, Y) in the filtered network, a neighbouring node Z is taken as a conditional gene if both the edge (Y, Z) and the edge (X, Z) exist, and the causal entropy under the conditional gene is calculated as $CE_{X\to Y\mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;
for the conditional gene set $K \in \{K_1, K_2, K_3, \dots, K_n\}$ of genes X and Y, filtering out the edges whose maximum causal entropy over the conditional genes, $\max_{Z\in K}\{CE_{X\to Y\mid Z}\}$, is less than the threshold θ;
in step 1), different time windows Gτ are divided according to the time lag τ, wherein each element of Gτ represents the expression value of gene N in the time-window expression matrix Gτ under time window T of sample M; N represents the total number of genes; M represents the total number of sample cells;
in step 3), the multi-time-lag transfer entropy $T_{X\to Y}$ between genes is calculated as:

$$T_{X\to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau});$$

wherein $I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau})$ represents the conditional mutual information of $Y_t$ and $X_{t-1:t-\tau}$ given the condition variable $Y_{t-1:t-\tau}$, in which $p_{X,Y,Z}(x,y,z)$ represents the joint probability density, $p_Z(z)$ represents the marginal probability density, and $p_{X,Z}(x,z)$ represents the marginal probability density between the variables x and z;
$X_{t-1:t-\tau}$ represents the expression values of gene X in the time windows from t-1 to t-τ, and $H(\cdot \mid \cdot)$ represents the conditional entropy, in which $p(x,y)$ represents the joint probability and $p(x)$ represents the marginal probability density.
2. The method for constructing a gene regulation network based on multi-time-lag causal entropy according to claim 1, wherein the threshold θ is 0.03.
3. The gene regulation network construction system based on the multi-time-lag causal entropy is characterized by comprising:
an input unit, used for dividing the input time-series gene expression data into different time windows according to the time lag τ;
a gene expression matrix construction unit, used for constructing, for the windowed gene expression data of t time slices, the time-series gene expression matrices under the t-τ time windows, namely the matrices from t-τ to t-1;
a gene correlation matrix construction unit, used for taking, for each pair of genes in the time-series gene expression matrices under the t-τ time windows, the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulatory factor, and calculating the multi-time-lag transfer entropy between the genes to obtain a gene correlation matrix;
a clustering unit, used for clustering the edges of the fully connected network given by the gene correlation matrix into two classes, filtering out the class of edges with low probability values, calculating for each remaining edge the multi-time-lag causal entropy under different conditional genes, and filtering out indirectly regulated edges whose maximum causal entropy is lower than a threshold θ to obtain the final gene regulation network;
the specific implementation process for filtering out the indirectly regulated edges with the maximum causal entropy lower than the threshold value theta comprises the following steps:
for the gene correlation matrix $G_{\text{zero-order}}$, filtering out edges with low values: the edges are divided into two clusters by k-means, and the edges in the cluster with low probability values are filtered out; the elements of the gene correlation matrix represent the probability that a regulatory relationship exists between genes;
filtering indirectly regulated edges based on a path consistency algorithm: for each edge (X, Y) in the filtered network, a neighbouring node Z is taken as a conditional gene if both the edge (Y, Z) and the edge (X, Z) exist, and the causal entropy under the conditional gene is calculated as $CE_{X\to Y\mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;
for the conditional gene set $K \in \{K_1, K_2, K_3, \dots, K_n\}$ of genes X and Y, filtering out the edges whose maximum causal entropy over the conditional genes, $\max_{Z\in K}\{CE_{X\to Y\mid Z}\}$, is less than the threshold θ;
the expression for dividing the data into different time windows $G_\tau$ according to the time lag τ is as follows:
wherein each element represents the expression value of gene N of sample M under time window T in the time-window expression matrix $G_\tau$; N represents the total number of genes; M represents the total number of sample cells;
the multi-time-lag transfer entropy $T_{X \to Y}$ between genes is calculated as follows:
$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau});$$
wherein $I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau})$ represents the conditional mutual information of $Y_t$ and $X_{t-1:t-\tau}$ given the condition variable $Y_{t-1:t-\tau}$:
wherein $p_{X,Y,Z}(x, y, z)$ represents the joint probability density, $p_Z(z)$ represents the marginal probability density, and $p_{X,Z}(x, z)$ represents the marginal probability density of the variables x and z;
$X_{t-1:t-\tau}$ represents the expression values of gene X in the time windows from t-1 to t-τ, and $H(\cdot \mid \cdot)$ represents the conditional entropy:
wherein $p(x, y)$ represents the joint probability and $p(x)$ represents the marginal probability density.
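The formulas for the conditional mutual information and the conditional entropy are not reproduced in this text. The sketch below therefore assumes the standard discrete forms, $H(Y \mid X) = -\sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)}$ and $I(Y; X \mid Z) = H(Y \mid Z) - H(Y \mid Z, X)$, and estimates them by simple counting on discretized (binned) expression series. The function names, the plug-in estimator, and the use of base-2 logarithms are assumptions made for illustration; the patent does not specify an estimator here.

```python
from collections import Counter
from math import log2


def conditional_entropy(pairs):
    """Plug-in estimate of H(A|B) in bits from (a, b) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal = Counter(b for _, b in pairs)
    # p(a, b) / p(b) reduces to count(a, b) / count(b).
    return -sum(c / n * log2(c / marginal[b]) for (_, b), c in joint.items())


def lagged(series, t, tau):
    """Return the past values (s[t-1], ..., s[t-tau]) as a tuple."""
    return tuple(series[t - k] for k in range(1, tau + 1))


def transfer_entropy(x, y, tau):
    """T_{X->Y} = H(Y_t | Y_past) - H(Y_t | Y_past, X_past)."""
    rows = [(y[t], lagged(y, t, tau), lagged(x, t, tau))
            for t in range(tau, len(y))]
    h_own = conditional_entropy([(yt, yp) for yt, yp, _ in rows])
    h_both = conditional_entropy([(yt, (yp, xp)) for yt, yp, xp in rows])
    return h_own - h_both


def causal_entropy(x, y, z, tau):
    """CE_{X->Y|Z} = H(Y_t | Z_past) - H(Y_t | Z_past, X_past)."""
    rows = [(y[t], lagged(z, t, tau), lagged(x, t, tau))
            for t in range(tau, len(y))]
    h_z = conditional_entropy([(yt, zp) for yt, zp, _ in rows])
    h_zx = conditional_entropy([(yt, (zp, xp)) for yt, zp, xp in rows])
    return h_z - h_zx


# Toy binary series: y copies x with a one-step delay.
x = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
y = [0] + x[:-1]
te_xy = transfer_entropy(x, y, tau=1)
```

Counting-based estimates like this are biased for short series and many lags, because the number of distinct past states grows exponentially with τ; a practical implementation would likely need a more careful density or entropy estimator.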
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010013036.9A CN111223523B (en) | 2020-01-06 | 2020-01-06 | Gene regulation network construction method and system based on multi-time-lag causal entropy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111223523A (en) | 2020-06-02 |
CN111223523B (en) | 2023-10-03 |
Family
ID=70811155
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010013036.9A Active CN111223523B (en) | 2020-01-06 | 2020-01-06 | Gene regulation network construction method and system based on multi-time-lag causal entropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111223523B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113610760B (en) * | 2021-07-05 | 2024-03-12 | Hohai University | Cell image segmentation tracing method based on U-shaped residual neural network |
CN113889180B (en) * | 2021-09-30 | 2024-05-24 | Shandong University | Biomarker identification method and system based on dynamic network entropy |
CN114925837B (en) * | 2022-03-23 | 2024-04-16 | Huazhong Agricultural University | Gene regulation network construction method based on mixed entropy optimization mutual information |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003058549A (en) * | 2001-08-21 | 2003-02-28 | Mamoru Kato | Computer readable recording medium with program recorded thereon for estimating control relation between genes from gene expression quantity data and gene arrangement data |
CN108491686A (en) * | 2018-03-30 | 2018-09-04 | Central South University | A gene regulatory network construction method based on two-way XGBoost |
KR20190054386A (en) * | 2017-11-13 | 2019-05-22 | Industry-University Cooperation Foundation Hanyang University | Genome analysis method based on modularization |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050256652A1 (en) * | 2004-05-16 | 2005-11-17 | Sai-Ping Li | Reconstruction of gene networks from time-series microarray data |
Non-Patent Citations (6)
Title |
---|
Gene regulatory networks on transfer entropy (GRNTE): a novel approach to reconstruct gene regulatory interactions applied to a case study for the plant pathogen Phytophthora infestans; Juan Camilo Castro et al.; Theoretical Biology and Medical Modelling; full text *
On the interplay between entropy and robustness of gene regulatory networks; Bor-Sen Chen et al.; Entropy in Genetics and Computational Biology; full text *
Xiang Chen et al. A novel method of gene regulatory network structure inference from gene knock-out expression data. Tsinghua Science and Technology. 2019, vol. 24, 446-455. *
Xiujun Zhang et al. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics. 2011, vol. 28, 98-104. *
Inferring gene regulatory networks with geometric-pattern dynamic Bayesian networks; Wang Kaijun; Zhang Junying; Zhao Feng; Zhang Hongyi; Journal of Xidian University (06); full text *
Wang Wenjie et al. Network construction and analysis methods for genomics data. Chinese Journal of Health Statistics. 2017, vol. 34, 177-180+184. *
Also Published As
Publication number | Publication date |
---|---|
CN111223523A (en) | 2020-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111223523B (en) | Gene regulation network construction method and system based on multi-time-lag causal entropy | |
Jarboui et al. | Combinatorial particle swarm optimization (CPSO) for partitional clustering problem | |
Yu et al. | Zinb-based graph embedding autoencoder for single-cell rna-seq interpretations | |
Hu et al. | Comprehensive learning particle swarm optimization based memetic algorithm for model selection in short-term load forecasting using support vector regression | |
Zhou et al. | A Bayesian connectivity-based approach to constructing probabilistic gene regulatory networks | |
Genovese et al. | False discovery control with p-value weighting | |
Gebert et al. | Modeling gene regulatory networks with piecewise linear differential equations | |
Maraziotis | A semi-supervised fuzzy clustering algorithm applied to gene expression data | |
Li et al. | A novel complex network community detection approach using discrete particle swarm optimization with particle diversity and mutation | |
CN114022693B (en) | Single-cell RNA-seq data clustering method based on double self-supervision | |
EP2354988A1 (en) | Gene clustering program, gene clustering method, and gene cluster analyzing device | |
Bahrepour et al. | An adaptive ordered fuzzy time series with application to FOREX | |
Wang et al. | Learning large-scale fuzzy cognitive maps using an evolutionary many-task algorithm | |
Zhang et al. | A novel power-driven fractional accumulated grey model and its application in forecasting wind energy consumption of China | |
Kahraman | A novel and powerful hybrid classifier method: Development and testing of heuristic k-nn algorithm with fuzzy distance metric | |
Zeng et al. | A novel HMM-based clustering algorithm for the analysis of gene expression time-course data | |
Zhu et al. | Deep-gknock: nonlinear group-feature selection with deep neural networks | |
Fu et al. | An improved multi-objective marine predator algorithm for gene selection in classification of cancer microarray data | |
Sartori | Penalized regression: Bootstrap confidence intervals and variable selection for high-dimensional data sets | |
Huang et al. | Identification of fuzzy inference systems using a multi-objective space search algorithm and information granulation | |
Deng et al. | EXAMINE: A computational approach to reconstructing gene regulatory networks | |
Örkçü et al. | A hybrid applied optimization algorithm for training multi-layer neural networks in data classification | |
CN113486952A (en) | Multi-factor model optimization method of gene regulation and control network | |
Aalto et al. | Continuous time Gaussian process dynamical models in gene regulatory network inference | |
Ergul et al. | DOPGA: A new fitness assignment scheme for multi-objective evolutionary algorithms |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||