CN111223523B

CN111223523B - Gene regulation network construction method and system based on multi-time-lag causal entropy

Info

Publication number: CN111223523B
Application number: CN202010013036.9A
Authority: CN
Inventors: 李敏; 冯浩楠; 郑瑞清
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2023-10-03
Anticipated expiration: 2040-01-06
Also published as: CN111223523A

Abstract

The application discloses a method and a system for constructing a gene regulation network based on multi-time-lag causal entropy. Input time-series gene expression data are divided into time windows with different time lags. For gene expression data with t time slices, gene expression matrices are constructed under the t-τ time windows. For each pair of genes, the multi-time-lag transfer entropy between the target gene under time window t and the regulator gene under the preceding time windows (t-τ to t-1) is calculated to obtain a multi-time-lag gene correlation matrix, whose elements represent the probability that an edge exists between two genes. The edges of the matrix are clustered into two classes by k-means, and the low-probability edge cluster is filtered out. For each remaining edge, the multi-time-lag conditional transfer entropy under conditional genes is calculated, and indirectly regulated edges whose maximum causal entropy is smaller than a threshold are filtered out to obtain the final network structure. The application effectively improves the accuracy of inference.

Description

Gene regulation network construction method and system based on multi-time-lag causal entropy

Technical Field

The application relates to the field of bioinformatics, in particular to a construction method of a complex biological network.

Background

In organisms, the cell is the fundamental unit of all tissue structure and function. All cells in a living body carry the same DNA information, yet cells of different tissues and organs differ from one another, because a complex gene regulatory mechanism inside the cell makes the expression of different cells highly specific. The mechanisms that control gene expression are collectively called gene expression regulation. Different organisms also regulate their genes in different ways. In prokaryotes, environmental stimuli have a critical impact on gene expression: through contact with the external environment, prokaryotes adapt to different conditions by switching the expression of some genes on and off. Gene regulation in eukaryotes is more complex than in prokaryotes. It is mainly affected by hormones and the cell growth cycle, while the influence of environmental factors is greatly reduced. Specific characteristics of gene regulation include: (1) complex structure; (2) varied regulation modes: there is both one-to-one regulation between genes and one-to-many or many-to-one multi-factor regulation; (3) type diversity: many types of molecules may be involved, such as DNA, mRNA, proteins, and small molecules; and (4) dynamically changing regulatory relationships. The gene regulation mechanism is therefore one of the important foundations for studying the growth and development and the basic morphological structure of animals and plants.

Based on different types of gene expression data, constructing gene regulation networks by computational means has become one of the important challenges of systems biology. Common computational methods for constructing gene regulation networks draw on several theoretical fields, including correlation analysis methods, Bayesian networks, feature selection methods, and Boolean networks. These methods determine gene regulation relationships by analyzing the correlation between genes or by modeling the relationship between their expression levels, and finally construct a regulation network.

Constructing gene regulation networks by correlation analysis is one of the most intuitive approaches. Researchers analyze the correlation between genes by means of the Pearson correlation coefficient, mutual information, and similar measures. Among the most popular are methods that construct gene regulation networks based on mutual information. Compared with the Pearson correlation coefficient, mutual information can reveal nonlinear regulatory relationships among genes. Margolin et al. proposed the ARACNE algorithm, which uses the Data Processing Inequality (DPI): if X_1 and X_3 in a triplet (X_1, X_2, X_3) interact only through X_2, then $I(X_1; X_3) \le \min[I(X_1; X_2), I(X_2; X_3)]$. ARACNE calculates the mutual information I for every pair of genes and applies a threshold I_0; Margolin et al. consider that a regulatory relationship exists between a gene pair only when I ≥ I_0. Meyer et al. further proposed the MRNET algorithm based on ARACNE. The algorithm uses a maximum relevance/minimum redundancy (MRMR) strategy and a greedy procedure to pick the node X_j that is most relevant to the target node Y while having the largest score difference, in terms of information, with respect to the set S of already selected nodes. For a pair of nodes {X_i, X_j}, the larger of the MRMR values is used as the weight. In further studies, Luo et al. considered that a gene expression regulation relationship generally involves three or more genes, i.e., a target gene T generally has two or more regulatory genes. Based on this hypothesis, they proposed a new algorithm, MI3, which scores the target gene T and two regulatory genes R1 and R2 by a correlation part and a coordination part in order to discover higher-order interactions. Zhang et al. combined conditional mutual information with the path consistency algorithm (PCA) to propose the network construction algorithm CMI-PCA, which uses multivariate conditional mutual information to test for and filter indirect regulatory relationships. Zhao et al. proposed a new mutual information estimation method, PMI, to address the problem that CMI-PCA underestimates regulatory relationships in its calculation. The key to improving mutual-information-based gene regulation network construction methods is filtering indirect regulatory relationships.

Bayesian networks are another common approach to gene regulation network construction. A Bayesian network quantifies the attributes of a directed biological network by combining graph theory and probability theory. The difficulty of applying Bayesian networks to gene regulation networks lies mainly in two parts: structure learning and parameter learning. Werhli and Husmeier integrated gene expression data with prior knowledge from multiple sources. By constructing an energy function E(G) combined with the Gibbs distribution for learning the Bayesian network structure, they improved the accuracy of the learned structure. They also used the Markov chain Monte Carlo (MCMC) method to estimate the hyperparameters associated with the different sources of prior knowledge. Qin et al. added Ontology Fingerprints to the prior knowledge to assess the similarity between genes and infer cell-type-specific signaling networks. In their algorithm, a Bayesian network is established on a given canonical signaling network, a heuristic search is adopted in which edges are added and deleted according to Ontology Fingerprint similarity, and the degree of fit between the candidate model and the observed data is calculated as the selection index. During Bayesian network structure learning, Qin et al. adopted BIC as the selection criterion and used a Monte Carlo EM algorithm to infer the hidden states of nodes in the network and then estimate the candidate model parameters. Hill et al., motivated by the idea that a gene has only a small number of upstream regulators, limited the in-degree of genes in the network to d_max = 4, thereby effectively reducing the uncertainty of the network. Hill's method is highly effective on breast cancer cell lines, and it constructs a network with an AUC of up to 0.82 on the yeast complex network when prior knowledge of the network structure is given. Li et al. proposed MMHO-DBN, a dynamic Bayesian algorithm that combines a high-order temporal model with Max-Min hill-climbing heuristic search. MMHO-DBN improves on local search by using Dynamic Max-Min Parent (DMMP) to obtain a set of highly probable parent nodes, thereby effectively reducing the space of candidate network structures.

Many gene regulation network construction methods have been proposed, but they are limited by the complexity of gene regulation, and their accuracy still leaves considerable room for improvement. The main open problems are: (1) how to design an effective algorithm to filter indirect regulatory relationships among genes; and (2) how to incorporate other biological information to improve the accuracy of network construction.

Disclosure of Invention

The technical problem to be solved by the application is to provide, in view of the shortcomings of the prior art, a gene regulation network construction method and system based on multi-time-lag causal entropy that improve the accuracy of network construction.

To solve the above technical problem, the application adopts the following technical scheme: a method for constructing a gene regulation network based on multi-time-lag causal entropy, comprising the following steps:

1) Dividing the input time-series gene expression data into different time windows according to the time lag τ;

2) For the gene expression data of t time slices after window division, constructing the time-series gene expression matrices under the t-τ time windows, i.e., the gene expression matrices from t-τ to t-1;

3) For each gene in the time-series gene expression matrices under the t-τ time windows, selecting the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulator, and calculating the multi-time-lag transfer entropy between genes to obtain a gene correlation matrix;

4) For the fully connected network of the gene correlation matrix, clustering the edges into two classes, filtering out the class of edges with low probability values, calculating the multi-time-lag causal entropy of each remaining edge under different conditional genes, and filtering out indirectly regulated edges whose maximum causal entropy is lower than a threshold θ to obtain the final gene regulation network.

In step 1), using multiple time lags allows regulatory relationships to be identified more accurately. The different time windows Gτ are divided according to the time lag τ. Each element of the time-window expression matrix Gτ represents the expression value of gene N for sample M under time window T, where T indicates which moving time window the gene expression vector belongs to, N is the gene index (ranging over the number of genes), and M is the sample cell index (ranging over the number of samples).
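As an illustration of the window division described above, the following minimal Python sketch builds lagged observations for one gene: each observation pairs the expression value at window t with the values at windows t-1 through t-τ, pooled over samples. The nested-list data layout (`expr[m][n]` is the time series of gene n in sample m) and the function names `lagged_windows` and `build_lagged_matrix` are assumptions made for this example and are not taken from the application.

```python
def lagged_windows(series, tau):
    """Split one time series into (present, past) observations.

    present[i] is the value at time t; past[i] is the tuple
    (x[t-1], ..., x[t-tau]) for the same t, with t = tau .. T-1.
    """
    if not 1 <= tau < len(series):
        raise ValueError("tau must satisfy 1 <= tau < len(series)")
    times = range(tau, len(series))
    present = [series[t] for t in times]
    past = [tuple(series[t - k] for k in range(1, tau + 1)) for t in times]
    return present, past


def build_lagged_matrix(expr, gene, tau):
    """Pool the lagged observations of one gene over all samples.

    expr[m][n] is assumed to be the time series of gene n in sample m.
    With M samples and T time points this yields M * (T - tau) rows.
    """
    present, past = [], []
    for sample in expr:
        p, q = lagged_windows(sample[gene], tau)
        present.extend(p)
        past.extend(q)
    return present, past
```

Pooling over samples is one plausible way to compensate for the small number of time slices in real data; other layouts (for example, one matrix per sample) would work equally well with the same windowing function.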

In step 3), to identify regulatory relationships under multiple time lags more accurately, transfer entropy is generalized to the multi-time-lag setting. The multi-time-lag transfer entropy $T_{X \to Y}$ between genes is calculated as:

$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau})$$

where $I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau})$ is the conditional mutual information of $Y_t$ and $X_{t-1:t-\tau}$ given the condition variable $Y_{t-1:t-\tau}$. For generic variables X and Y and a condition variable Z:

$$I(X; Y \mid Z) = \sum_{x,y,z} p_{X,Y,Z}(x,y,z) \log \frac{p_Z(z)\, p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\, p_{Y,Z}(y,z)}$$

where $p_{X,Y,Z}(x,y,z)$ is the joint probability density, $p_Z(z)$ is the marginal probability density, and $p_{X,Z}(x,z)$ and $p_{Y,Z}(y,z)$ are the marginal probability densities of the variable pairs (x, z) and (y, z). $X_{t-1:t-\tau}$ is the expression value of gene X in the time windows from t-1 to t-τ, and $H(\cdot \mid \cdot)$ is the conditional entropy:

$$H(Y \mid X) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}$$

where $p(x,y)$ is the joint probability and $p(x)$ is the marginal probability density.
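To make the entropy-difference form of the multi-time-lag transfer entropy concrete, the following Python sketch computes it for two already discretized series by counting frequencies (a plug-in estimate). This is a simplification for illustration only: it assumes discrete symbols and a single sample, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def conditional_entropy(targets, conditions):
    """Plug-in estimate of H(target | condition) in nats.

    Uses H = -sum p(c, y) * log(p(c, y) / p(c)) with probabilities
    replaced by relative frequencies.
    """
    n = len(targets)
    joint = Counter(zip(conditions, targets))
    marginal = Counter(conditions)
    return -sum(
        count / n * log(count / marginal[cond])
        for (cond, _), count in joint.items()
    )


def transfer_entropy(x, y, tau):
    """T_{X->Y} = H(Y_t | Y_past) - H(Y_t | Y_past, X_past).

    x and y are equally long sequences of discrete symbols; the past
    of each series covers the lags 1 .. tau.
    """
    times = range(tau, len(y))
    y_t = [y[t] for t in times]
    y_past = [tuple(y[t - k] for k in range(1, tau + 1)) for t in times]
    x_past = [tuple(x[t - k] for k in range(1, tau + 1)) for t in times]
    both = list(zip(y_past, x_past))
    return conditional_entropy(y_t, y_past) - conditional_entropy(y_t, both)
```

When y is simply a one-step delayed copy of x, the second entropy term vanishes because the past of x determines the present of y, so the transfer entropy equals H(Y_t | Y_past). When x is constant, conditioning on it changes nothing and the estimate is zero.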

In step 4), a path consistency algorithm based on causal entropy is used to filter out indirect regulatory effects. The specific process of filtering out indirectly regulated edges whose maximum causal entropy is lower than the threshold θ comprises:

1) For the gene correlation matrix $G_{\text{zero-order}}$, whose elements represent the probability that a regulatory relationship exists between genes, filtering edges with low values: the edges are divided into two clusters by k-means, and the edges in the cluster with low probability values are filtered out;

2) Filtering indirectly regulated edges based on the path consistency algorithm: for each edge (X, Y) in the filtered network, every adjacent node Z for which both edge (Y, Z) and edge (X, Z) exist is regarded as a conditional gene, and the causal entropy under that conditional gene is calculated as $CE_{X \to Y \mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;

3) For the conditional gene set $K \in \{K_1, K_2, K_3, \ldots, K_n\}$ of genes X and Y, filtering out the edges whose maximum causal entropy over the conditional genes, $\max_{Z \in K} \{CE_{X \to Y \mid Z}\}$, is less than the threshold θ.

The threshold of the present application may be set to θ=0.03.
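The path-consistency filtering of step 4) can be sketched as follows. This is a minimal illustration under stated assumptions: `causal_entropy` is a hypothetical callable supplied by the caller that returns CE for a regulator, a target, and a conditional gene; adjacency is treated as undirected when looking for common neighbours; and every edge is tested against the input edge set rather than an incrementally pruned one. The application does not spell out these details.

```python
def filter_indirect_edges(edges, causal_entropy, theta=0.03):
    """Keep edges whose maximum conditional causal entropy reaches theta.

    edges          : iterable of directed (regulator, target) pairs
    causal_entropy : callable (x, y, z) -> CE_{X->Y|Z} (assumed interface)
    theta          : threshold; 0.03 is the value suggested in the text
    """
    edges = set(edges)

    # Neighbour lookup that ignores edge direction (an assumption).
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    kept = set()
    for x, y in edges:
        # Conditional genes: nodes adjacent to both X and Y.
        conditional = (neighbours[x] & neighbours[y]) - {x, y}
        if not conditional:
            kept.add((x, y))  # nothing to condition on, keep the edge
            continue
        if max(causal_entropy(x, y, z) for z in conditional) >= theta:
            kept.add((x, y))
    return kept
```

Passing the estimator in as a callable keeps the graph logic independent of how the causal entropy is estimated, so the same filter can be used with a frequency-based or a kernel-based estimate.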

The application also provides a gene regulation network construction system based on multi-time-lag causal entropy, comprising:

an input unit for dividing the input time-series gene expression data into different time windows according to the time lag τ;

a gene expression matrix construction unit for constructing, for the gene expression data of t time slices after window division, the time-series gene expression matrices under the t-τ time windows, i.e., the gene expression matrices from t-τ to t-1;

a gene correlation matrix construction unit for selecting, for each gene in the time-series gene expression matrices under the t-τ time windows, the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulator, and for calculating the multi-time-lag transfer entropy between genes to obtain a gene correlation matrix;

and a clustering unit for clustering the edges of the fully connected network of the gene correlation matrix into two classes, filtering out the class of edges with low probability values, calculating the multi-time-lag causal entropy of each remaining edge under different conditional genes, and filtering out indirectly regulated edges whose maximum causal entropy is lower than a threshold θ to obtain the final gene regulation network.

Compared with the prior art, the application has the following beneficial effects: the method is suited to real time-series gene expression data with very few time slices, can calculate the regulatory relationships of genes across multiple time slices, and filters indirectly regulated edges through conditional transfer entropy, thereby effectively improving the accuracy of network construction.

Drawings

FIG. 1 is a flow chart of the NIMCE method of the application;

FIG. 2 compares NIMCE with the GENIE3, Jump3, and fastBMA methods based on PR curves and the area under them (AUPR);

FIG. 3 compares NIMCE with the GENIE3, Jump3, and fastBMA methods based on Recall and Precision.

Detailed Description

1. Construction of time window Gene expression matrices

A time-series gene expression data file is read in, where G denotes the gene expression matrix and Gτ denotes the expression matrix under a moving time window. Each element of Gτ represents the expression value of gene N for sample M under time window T, where T indicates which moving time window the gene expression vector belongs to, N is the gene index (ranging over the number of genes), and M is the sample cell index (ranging over the number of samples).

2. Construction of a Gene correlation matrix

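For each pair of genes, the expression profile under window t is selected for the target gene and the expression profiles under windows t-τ to t-1 are selected for the regulator, and the multi-time-lag transfer entropy between the regulator and the target gene is calculated:

$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau})$$

where the conditional mutual information of generic variables X and Y given Z is

$$I(X; Y \mid Z) = \sum_{x,y,z} p_{X,Y,Z}(x,y,z) \log \frac{p_Z(z)\, p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\, p_{Y,Z}(y,z)}$$

in which $p_{X,Y,Z}(x,y,z)$ is the joint probability density, $p_Z(z)$ is the marginal probability density, and $p_{X,Z}(x,z)$ and $p_{Y,Z}(y,z)$ are the marginal probability densities of the variable pairs (x, z) and (y, z).

$X_{t-1:t-\tau}$ represents the expression value of gene X in the time windows from t-1 to t-τ, gene Y is represented similarly, and $H(\cdot \mid \cdot)$ is the conditional entropy:

$$H(Y \mid X) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}$$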

In NIMCE, to calculate the probability distributions in equation (2) effectively, the probability density of the continuous variables is estimated using Kernel Density Estimation (KDE). Here $X^p$ denotes $X_{t-1:t-\tau}$, $Y^p$ denotes $Y_{t-1:t-\tau}$, $L = M(T-\tau)$ denotes the number of samples under the time-shifted matrix, and $f_h(x)$ is the kernel density function, in which h is the bandwidth and K is the kernel function.
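The following Python sketch shows one way a kernel density estimate could feed a conditional mutual information estimate. It is only an illustration: the Gaussian product kernel, the single shared bandwidth `h`, and the resubstitution estimator in `cmi_kde` are assumptions chosen for simplicity, and the function names are hypothetical. The application does not state which kernel or estimator NIMCE uses.

```python
from math import exp, log, pi, sqrt


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)


def kde(point, samples, h):
    """Product-kernel density estimate f_h(point).

    point   : tuple of d coordinates
    samples : list of d-dimensional tuples
    h       : bandwidth shared by all dimensions (an assumption)
    """
    d = len(point)
    total = 0.0
    for sample in samples:
        prod = 1.0
        for p_k, s_k in zip(point, sample):
            prod *= gaussian_kernel((p_k - s_k) / h)
        total += prod
    return total / (len(samples) * h ** d)


def cmi_kde(y, x, z, h):
    """Resubstitution estimate of I(Y; X | Z) in nats.

    Averages log[f(y,x,z) f(z) / (f(x,z) f(y,z))] over the observations;
    y, x, z are equally long lists of tuples.
    """
    yxz = [a + b + c for a, b, c in zip(y, x, z)]
    xz = [b + c for b, c in zip(x, z)]
    yz = [a + c for a, c in zip(y, z)]
    total = 0.0
    for i in range(len(y)):
        num = kde(yxz[i], yxz, h) * kde(z[i], z, h)
        den = kde(xz[i], xz, h) * kde(yz[i], yz, h)
        total += log(num / den)
    return total / len(y)
```

With y set to the target values at window t, x to the regulator's lagged values, and z to the target's own lagged values, `cmi_kde` gives a kernel-based counterpart of the transfer entropy. The double loop costs O(L²) density evaluations, which is acceptable for small L but would need a faster estimator for large datasets.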

3. Filtering indirectly regulated edges

For the gene correlation matrix $G_{\text{zero-order}}$, whose elements represent the probability that a regulatory relationship exists between genes, the edges are divided into two clusters by k-means, and the edges in the cluster with low probability values are filtered out;

Indirectly regulated edges are then filtered based on the path consistency algorithm: for each edge (X, Y) in the filtered network, every adjacent node Z for which both edge (Y, Z) and edge (X, Z) exist is regarded as a conditional gene, and the causal entropy under that conditional gene is calculated as $CE_{X \to Y \mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;

For the conditional gene set $K \in \{K_1, K_2, K_3, \ldots, K_n\}$ of genes X and Y, the edges whose maximum causal entropy over the conditional genes, $\max_{Z \in K} \{CE_{X \to Y \mid Z}\}$, is less than the threshold θ (a manually set hyperparameter, 0.03 by default) are filtered out.
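The k-means step at the start of this section operates on one-dimensional edge scores, where two-cluster k-means reduces to finding a threshold halfway between two centroids. The sketch below implements that special case directly; the centroid initialisation at the minimum and maximum score and the names `two_means_threshold` and `filter_low_cluster` are assumptions for illustration, not details from the application.

```python
def two_means_threshold(values, max_iter=100):
    """1-D k-means with k = 2; returns the boundary between the clusters.

    Centroids start at the minimum and maximum value (an assumption);
    each point is assigned to the nearer centroid until nothing changes.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are equal; two clusters are undefined")
    for _ in range(max_iter):
        boundary = (lo + hi) / 2
        low = [v for v in values if v <= boundary]
        high = [v for v in values if v > boundary]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2


def filter_low_cluster(weights):
    """Drop the edges that fall in the low-score cluster.

    weights : dict mapping (regulator, target) -> edge score
    """
    threshold = two_means_threshold(list(weights.values()))
    return {edge: w for edge, w in weights.items() if w > threshold}
```

Clustering instead of using a fixed cutoff lets the boundary adapt to the score distribution of each dataset, which is presumably why the method applies k-means before the threshold-based causal entropy filter.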

4. Experiment verification

To verify the effectiveness of the method of the application, we tested it on 5 simulated datasets generated by GeneNetWeaver (GNW) and compared it with the random-forest-based GENIE3, the tree-based Jump3, and the dynamic Bayesian fastBMA. The GeneNetWeaver datasets are time-series perturbation data generated for the DREAM4 challenge from subnetworks extracted from the E. coli or S. cerevisiae gene regulatory networks. We generated 5 datasets using GNW; each dataset contained 50 genes and 10 samples, with time-series expression data at a total of 21 time points.

To evaluate the continuity and accuracy of the inferred results, we used several metrics for comparison, namely AUPR, Recall, and Precision. AUPR is the area under the PR curve; Recall is the ratio of the number of correctly predicted edges to the number of true directed edges; and Precision is the ratio of the number of correctly predicted edges to the number of predicted edges. The experimental results for AUPR and for Recall and Precision are shown in FIG. 2 and FIG. 3, respectively.
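The metrics defined above can be computed as in the following Python sketch. Precision and Recall follow the definitions in the text directly. For AUPR, the sketch uses average precision, a common step-wise approximation of the area under the PR curve; whether the experiments used this or an interpolated area is not stated, so this choice is an assumption, as are the function names.

```python
def precision_recall(predicted, truth):
    """Precision and Recall for sets of directed edges."""
    predicted, truth = set(predicted), set(truth)
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall


def average_precision(scores, truth):
    """Step-wise area under the PR curve (average precision).

    scores : dict mapping (regulator, target) -> confidence score
    truth  : set of true directed edges
    Edges are ranked by descending score; the precision at each rank
    that recovers a true edge is averaged over all true edges.
    """
    truth = set(truth)
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits, area = 0, 0.0
    for rank, edge in enumerate(ranked, start=1):
        if edge in truth:
            hits += 1
            area += hits / rank
    return area / len(truth) if truth else 0.0
```

Because edges are directed, a predicted edge (A, B) does not count as correct when only (B, A) is in the reference network.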

As can be seen from FIG. 2 and FIG. 3, our method outperforms the other methods across the different samples, whether measured by AUPR or by Recall and Precision. The proposed NIMCE method therefore has good stability. The experiments also show that the time complexity of Jump3 grows exponentially as the network scale increases, to the point where the calculation is practically infeasible, whereas NIMCE obtains results in a shorter time.

Claims

1. A gene regulation network construction method based on multi-time-lag causal entropy, characterized by comprising the following steps:

1) Dividing the input time-series gene expression data into different time windows according to the time lag τ;

2) For the gene expression data of t time slices after window division, constructing the time-series gene expression matrices under the t-τ time windows, i.e., the gene expression matrices from t-τ to t-1;

3) For each gene in the time-series gene expression matrices under the t-τ time windows, selecting the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulator, and calculating the multi-time-lag transfer entropy between genes to obtain a gene correlation matrix;

4) For the fully connected network of the gene correlation matrix, clustering the edges into two classes, filtering out the class of edges with low probability values, calculating the multi-time-lag causal entropy of each remaining edge under different conditional genes, and filtering out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ to obtain the final gene regulation network; wherein the specific process of filtering out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ includes:

For the gene correlation matrix $G_{\text{zero-order}}$, filtering edges with low values: the edges are clustered into two clusters by k-means, and the edges in the cluster with low probability values are filtered out; the elements of the gene correlation matrix represent the probability that a regulatory relationship exists between genes;

Filtering the indirectly regulated edges based on the path consistency algorithm: for each edge (X, Y) in the filtered network, every adjacent node Z for which both edge (Y, Z) and edge (X, Z) exist is regarded as a conditional gene, and the causal entropy under that conditional gene is calculated as $CE_{X \to Y \mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;

For the conditional gene set $K \in \{K_1, K_2, K_3, \ldots, K_n\}$ of genes X and Y, filtering out the edges whose maximum causal entropy over the conditional genes, $\max_{Z \in K} \{CE_{X \to Y \mid Z}\}$, is less than the threshold θ;

In step 1), the different time windows Gτ are divided according to the time lag τ, wherein each element of the time-window expression matrix Gτ represents the expression value of gene N under time window T of sample M; N represents the total number of genes; M represents the total number of sample cells;

In step 3), the calculation formula of the multi-time-lag transfer entropy $T_{X \to Y}$ between genes is:

$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau});$$

wherein $I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau})$ represents the conditional mutual information of $Y_t$ and $X_{t-1:t-\tau}$ given the condition variable $Y_{t-1:t-\tau}$; for generic variables X and Y and a condition variable Z:

$$I(X; Y \mid Z) = \sum_{x,y,z} p_{X,Y,Z}(x,y,z) \log \frac{p_Z(z)\, p_{X,Y,Z}(x,y,z)}{p_{X,Z}(x,z)\, p_{Y,Z}(y,z)}$$

wherein $p_{X,Y,Z}(x,y,z)$ represents the joint probability density, $p_Z(z)$ represents the marginal probability density, and $p_{X,Z}(x,z)$ represents the marginal probability density between variables x and z;

$X_{t-1:t-\tau}$ represents the expression value of gene X in the time windows from t-1 to t-τ, and $H(\cdot \mid \cdot)$ represents the conditional entropy:

$$H(Y \mid X) = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}$$

where $p(x,y)$ represents the joint probability and $p(x)$ represents the marginal probability density.

2. The gene regulation network construction method based on multi-time-lag causal entropy according to claim 1, characterized in that the threshold θ is 0.03.

3. A gene regulation network construction system based on multi-time-lag causal entropy, characterized by comprising:

an input unit, used to divide the input time-series gene expression data into different time windows according to the time lag τ;

a gene expression matrix construction unit, used to construct, for the gene expression data of t time slices after window division, the time-series gene expression matrices under the t-τ time windows, i.e., the gene expression matrices from t-τ to t-1;

a gene correlation matrix construction unit, used to select, for each gene in the time-series gene expression matrices under the t-τ time windows, the expression profile under window t for the target gene and the expression profiles under windows t-τ to t-1 for the regulator, and to calculate the multi-time-lag transfer entropy between genes to obtain the gene correlation matrix;

a clustering unit, used to cluster the edges of the fully connected network of the gene correlation matrix into two classes, filter out the class of edges with low probability values, calculate the multi-time-lag causal entropy of each remaining edge under different conditional genes, and filter out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ to obtain the final gene regulation network;

The specific implementation process of filtering out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ includes:

The expression of dividing different time windows Gτ according to the time lag τ is:

The calculation formula of the multi-time-lag transfer entropy $T_{X \to Y}$ between genes is:

$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau});$$