CN111223523A - Gene regulation and control network construction method and system based on multi-time-lag causal entropy

Info

Publication number: CN111223523A
Application number: CN202010013036.9A
Authority: CN
Inventors: 李敏; 冯浩楠; 郑瑞清
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2020-06-02
Anticipated expiration: 2040-01-06
Also published as: CN111223523B

Abstract

The invention discloses a gene regulatory network construction method and system based on multi-time-lag causal entropy. The input time-series gene expression data are divided into time windows under different time lags. For gene expression data with t time slices, gene expression matrices under t-τ time windows are constructed. For each pair of genes, the multi-time-lag transfer entropy between the target gene under time window t and the genes under the preceding t-τ time windows is calculated, giving a multi-time-lag gene correlation matrix whose elements represent the probability of an edge between genes. The edges are clustered into two classes by k-means, and the low-probability cluster is filtered out. For each remaining edge, the multi-time-lag conditional transfer entropy under conditional genes is calculated, and indirectly regulated edges whose maximum causal entropy is smaller than a threshold are filtered out, yielding the final network structure. The invention effectively improves the accuracy of inference.

Description

Gene regulation and control network construction method and system based on multi-time-lag causal entropy

Technical Field

The invention relates to the field of bioinformatics, and in particular to a method for constructing a complex biological network.

Background

In an organism, cells are the fundamental structural and functional units of all tissues. All cells in a living body carry the same DNA, yet the cells of different tissues and organs differ, because complex gene regulation mechanisms within cells give the expression of different cells many specificities. The mechanisms controlling gene expression are collectively referred to as gene expression regulation. Different organisms also regulate genes in different ways. In prokaryotes, environmental stimuli play a crucial role in gene expression: through contact with the external environment, prokaryotes adapt to different environments by switching the expression of some genes on and off. Gene regulation in eukaryotes is more complex than in prokaryotes. It is mainly influenced by hormones and the cell growth cycle, and the influence of environmental factors is greatly reduced. Specific features of gene regulation include: (1) complex structure; (2) variable regulation modes: there is both one-to-one gene regulation and one-to-many or many-to-one multifactorial regulation; (3) diverse participants: many types of molecules, such as mRNA, proteins, and small molecules, can take part; (4) regulatory relationships that change dynamically. The gene regulation mechanism is therefore one of the important foundations for studying the growth and development and the basic morphological structure of animals and plants.

The computational construction of gene regulatory networks from different types of gene expression data has become one of the important challenges in systems biology. Common computational methods for constructing gene regulatory networks draw on a variety of theoretical fields, including correlation analysis, Bayesian networks, feature selection, and Boolean networks. These methods determine gene regulatory relationships either by analyzing the correlation between each pair of genes or by modeling the relationship between the expression levels of genes, and finally construct a regulatory network.

Correlation analysis is one of the most intuitive ways to construct gene regulatory networks. Researchers analyze the association between genes using Pearson's correlation coefficient, mutual information, and similar measures. Among these, the most popular approach is network construction based on mutual information, which reveals non-linear regulatory relationships between genes better than the Pearson correlation coefficient. Margolin et al. proposed the ARACNE algorithm, which uses the Data Processing Inequality (DPI): if, in a triplet (X₁, X₂, X₃), X₁ and X₃ interact only through X₂, then $I(X_1; X_3) \le \min[I(X_1; X_2), I(X_2; X_3)]$. ARACNE computes the mutual information I for every gene pair and applies a threshold I₀; Margolin et al. consider a regulatory relationship to exist only between gene pairs with I ≥ I₀. Meyer et al. further proposed the MRNET algorithm, building on ARACNE. It uses a maximum relevance/minimum redundancy (MRMR) strategy: a greedy algorithm selects the node X_j that maximizes the score given by its mutual information with the target node Y minus its mutual information with the set S of already selected nodes. For a pair of nodes {X_i, X_j}, Meyer et al. use the larger of the two MRMR scores as the edge weight. In further studies, Luo et al. argued that a gene regulatory relationship generally involves three or more genes, that is, a target gene T generally has two or more regulatory genes. Based on this hypothesis, they proposed a new algorithm, MI3, which scores the target gene T and two regulatory genes R1 and R2 through a correlation part and a coordination part in order to find higher-order interactions. Zhang et al. combined conditional mutual information with the Path Consistency Algorithm (PCA) and proposed the network construction algorithm CMI-PCA, which uses multivariate conditional mutual information tests to filter out indirect regulatory relationships. Zhao et al. proposed a new mutual information measure, PMI, to address the underestimation of regulatory relationships in CMI-PCA. Mutual-information-based construction methods have thus made headway on the problem of indirect regulatory relationships.
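The DPI step described above can be illustrated with a minimal Python sketch. It assumes a precomputed symmetric mutual information matrix; the function name `dpi_prune` and its arguments are hypothetical, and the tolerance parameter used by the published ARACNE implementation is omitted for brevity.

```python
import numpy as np

def dpi_prune(mi, i0=0.0):
    """Threshold a symmetric MI matrix, then drop the weakest edge of every
    fully connected gene triplet (data processing inequality)."""
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]
    keep = mi >= i0                      # threshold step: keep pairs with I >= I0
    np.fill_diagonal(keep, False)        # no self-edges
    remove = np.zeros_like(keep)         # edges flagged as indirect
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if keep[i, j] and keep[j, k] and keep[i, k]:
                    edges = [(i, j), (j, k), (i, k)]
                    a, b = min(edges, key=lambda e: mi[e])  # weakest edge of the triplet
                    remove[a, b] = remove[b, a] = True
    return keep & ~remove

# Illustrative values only: gene 0 and gene 2 are linked mainly through gene 1.
mi = np.array([[0.0, 0.8, 0.3],
               [0.8, 0.0, 0.7],
               [0.3, 0.7, 0.0]])
network = dpi_prune(mi, i0=0.1)
```

All triplets are checked against the thresholded network before any edge is removed, so the result does not depend on the order in which triplets are visited.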

Bayesian networks are another common approach to constructing gene regulatory networks. A Bayesian network quantifies the properties of a directed biological network, combining graph theory and probability theory. The difficulties of applying Bayesian networks to gene regulatory networks fall mainly into structure learning and parameter learning. Werhli and Husmeier integrated gene expression data with prior knowledge from multiple sources: by constructing an energy function E(G) and combining it with a Gibbs distribution for learning the Bayesian network structure, they improved the accuracy of Bayesian network construction. They also estimated the hyperparameters of the different sources of prior knowledge using Markov chain Monte Carlo (MCMC). Qin et al. added the similarity between oncogene-specific genes to the prior knowledge and inferred cell-type-specific signaling networks. In their algorithm, the Bayesian network is built on a given normalized signal transduction network; a heuristic search adds and deletes edges according to Ontology Fingerprint similarity, and the degree of fit between each candidate model and the observed data is used as the selection criterion. During structure learning, Qin et al. use BIC as the selection criterion and a Monte Carlo EM algorithm to infer the hidden states of nodes in the network and then estimate the parameters of the candidate models. Hill et al., motivated by the observation that genes have very few upstream regulatory factors, limited the in-degree of genes in the network to a maximum d_max of 4, which effectively reduces the uncertainty of the network. Given prior knowledge of the network structure, the method of Hill et al. reaches an AUC of 0.82 when constructing the network on the yeast integrated network, and is also highly effective on breast cancer cell lines. Li et al. proposed the dynamic Bayesian algorithm MMHO-DBN, which combines a high-order time-series model with Max-Min hill-climbing heuristic search. MMHO-DBN improves the local search method by using Dynamic Max-Min Parents (DMMP) to obtain a set of highly probable parent nodes, effectively reducing the space of candidate network structures.

Many gene regulatory network construction methods have been proposed, but they are limited by the complexity of gene regulation, and their precision still leaves considerable room for improvement. The main open problems are: (1) how to design an effective algorithm to filter out indirect regulatory relationships among genes; (2) how to incorporate other biological information to improve the precision of network construction.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a gene regulatory network construction method based on multi-time-lag causal entropy that improves the precision of network construction.

To solve this technical problem, the invention adopts the following technical scheme: a gene regulatory network construction method based on multi-time-lag causal entropy, comprising the following steps:

1) dividing the input time-series gene expression data into different time windows according to the time lag τ;

2) for the gene expression data of t time slices after window division, constructing the time-series gene expression matrices under t-τ time windows and the gene expression matrices from t-τ to t-1;

3) for each gene in the time-series gene expression matrix under t-τ time windows, selecting the expression profile under window t for the target gene and the expression profile from window t-τ to window t-1 for the regulatory factor, and calculating the multi-time-lag transfer entropy between the genes to obtain a gene correlation matrix;

4) for the fully connected network of the gene correlation matrix, clustering the edges into two classes, filtering out the class of edges with low probability values, calculating, for each remaining edge, the multi-time-lag causal entropy under different conditional genes, and filtering out the indirectly regulated edges whose maximum causal entropy is lower than a threshold θ, to obtain the final gene regulatory network.

In step 1), using multiple time lags allows regulatory relationships to be identified more accurately. The input data are divided into time-window expression matrices G_τ according to the time lag τ. Each entry of G_τ is the expression value of gene n in sample m under time window t, where t indexes the moving time window of the gene expression vector, n is the gene index (ranging over the genes), and m is the sample-cell index (ranging over the samples).

In step 3), in order to identify regulatory relationships under multiple time lags more accurately, the transfer entropy is generalized to multiple time lags. The multi-time-lag transfer entropy $T_{X \to Y}$ between genes is calculated as:

$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau})$$

where $I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau})$ is the conditional mutual information of $Y_t$ and $X_{t-1:t-\tau}$ given the condition variable $Y_{t-1:t-\tau}$. In its standard form, the conditional mutual information of x and y given z is:

$$I(X; Y \mid Z) = \iiint p_{x,y,z}(x, y, z) \log \frac{p_z(z)\, p_{x,y,z}(x, y, z)}{p_{x,z}(x, z)\, p_{y,z}(y, z)} \, dx \, dy \, dz$$

where $p_{x,y,z}(x, y, z)$ is the joint probability density, $p_z(z)$ is a marginal probability density, and $p_{x,z}(x, z)$ is the marginal probability density of the variables x and z. $X_{t-1:t-\tau}$ denotes the expression values of gene x under the time windows from t-1 to t-τ, and $H(\cdot \mid \cdot)$ denotes the conditional entropy:

$$H(Y \mid X) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)} \, dx \, dy$$

where $p(x, y)$ is the joint probability density and $p(x)$ is the marginal probability density.
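As a rough illustration of the entropy-difference form of the multi-time-lag transfer entropy, the following Python sketch estimates it for two single series. It is only a sketch under stated assumptions: the values are discretized into equal-width bins and a plug-in (count-based) entropy estimate is used, which is a simplification chosen for brevity and not the estimator described in this document. The function names and the `bins` parameter are hypothetical.

```python
import numpy as np

def _entropy(*cols):
    """Plug-in Shannon entropy (in nats) of the joint distribution of discrete columns."""
    joint = np.column_stack(cols)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def _discretize(v, bins):
    """Map continuous values to integer bin labels 0..bins-1 (equal-width bins)."""
    edges = np.histogram_bin_edges(v, bins=bins)
    return np.digitize(v, edges[1:-1])

def multilag_transfer_entropy(x, y, tau, bins=3):
    """Estimate T_{X->Y} = H(Y_t | Y_past) - H(Y_t | Y_past, X_past),
    where *_past collects the tau values at t-1, ..., t-tau."""
    x = _discretize(np.asarray(x, dtype=float), bins)
    y = _discretize(np.asarray(y, dtype=float), bins)
    n = len(y)
    if len(x) != n or not 0 < tau < n:
        raise ValueError("x and y must have equal length greater than tau > 0")
    y_t = y[tau:]
    y_past = [y[tau - k:n - k] for k in range(1, tau + 1)]
    x_past = [x[tau - k:n - k] for k in range(1, tau + 1)]
    # Conditional entropies written as differences of joint entropies.
    h_y_given_ypast = _entropy(y_t, *y_past) - _entropy(*y_past)
    h_y_given_both = _entropy(y_t, *y_past, *x_past) - _entropy(*y_past, *x_past)
    return h_y_given_ypast - h_y_given_both

# Toy data: y copies x with a delay of one step, so information flows from x to y.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 2000)
y = np.roll(x, 1)
te_x_to_y = multilag_transfer_entropy(x, y, tau=1, bins=2)
te_y_to_x = multilag_transfer_entropy(y, x, tau=1, bins=2)
```

In this toy case the estimate from x to y is close to log 2, while the estimate in the opposite direction is close to zero, which is the asymmetry that makes transfer entropy usable for directed edges.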

In step 4), a path consistency algorithm based on causal entropy is used to filter out indirect regulation. Filtering out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ comprises:

1) for the gene correlation matrix G_zero-order, whose elements represent the probability of a regulatory relationship between genes, filtering out the edges with low values: the edges are clustered into two clusters by k-means, and the edges in the cluster with low probability values are filtered out;

2) filtering the indirectly regulated edges based on the path consistency algorithm: for each edge (X, Y) in the filtered network, every neighbor Z for which the edges (Y, Z) and (X, Z) both exist is taken as a conditional gene, and the causal entropy under that conditional gene is calculated as $CE_{X \to Y \mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;

3) for the genes X and Y with conditional gene set $K = \{K_1, K_2, K_3, \ldots, K_n\}$, filtering out the edge if the maximum causal entropy over the conditional genes, $\max_{Z \in K} CE_{X \to Y \mid Z}$, is smaller than the threshold θ.

The threshold value of the present application may be set to 0.03.
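The causal-entropy filtering described above can be sketched in Python as follows. This is a minimal sketch under several assumptions that are not stated in the text: the network is given as a boolean adjacency matrix, a neighbor counts as adjacent regardless of edge direction, an edge with no conditional gene is kept, and all decisions are made against the unfiltered network. The names `filter_indirect_edges` and `causal_entropy` are hypothetical; `causal_entropy(x, y, z)` stands for any estimator of the causal entropy from X to Y given Z.

```python
import numpy as np

def filter_indirect_edges(adj, causal_entropy, theta=0.03):
    """Remove edge (x, y) when the maximum causal entropy over its
    conditional genes is below theta."""
    adj = np.asarray(adj, dtype=bool)
    linked = adj | adj.T                 # adjacency ignoring direction (assumption)
    kept = adj.copy()
    n = adj.shape[0]
    for x in range(n):
        for y in range(n):
            if not adj[x, y]:
                continue
            # Conditional gene set K: genes adjacent to both x and y.
            cond = [z for z in range(n)
                    if z not in (x, y) and linked[x, z] and linked[y, z]]
            if cond and max(causal_entropy(x, y, z) for z in cond) < theta:
                kept[x, y] = False       # treated as indirect regulation
    return kept

# Toy network 0 -> 1 -> 2 with a spurious shortcut 0 -> 2.
adj = np.zeros((3, 3), dtype=bool)
adj[0, 1] = adj[1, 2] = adj[0, 2] = True

def toy_ce(x, y, z):
    # Illustrative values only: the shortcut carries almost no information given gene 1.
    return 0.01 if (x, y, z) == (0, 2, 1) else 0.5

filtered = filter_indirect_edges(adj, toy_ce, theta=0.03)
```

Passing the estimator as a callable keeps the graph logic separate from the choice of entropy estimator, so the same loop works with a binned, kernel-based, or any other causal entropy estimate.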

The invention also provides a gene regulatory network construction system based on multi-time-lag causal entropy, comprising:

an input unit for dividing the input time-series gene expression data into different time windows according to the time lag τ;

a gene expression matrix construction unit for constructing, for the gene expression data of t time slices after window division, the time-series gene expression matrices under t-τ time windows and the gene expression matrices from t-τ to t-1;

a gene correlation matrix construction unit for, for each gene in the time-series gene expression matrix under t-τ time windows, selecting the expression profile under window t for the target gene and the expression profile from window t-τ to window t-1 for the regulatory factor, and calculating the multi-time-lag transfer entropy between the genes to obtain a gene correlation matrix;

a clustering unit for clustering the edges of the fully connected network of the gene correlation matrix into two classes, filtering out the class of edges with low probability values, calculating, for each remaining edge, the multi-time-lag causal entropy under different conditional genes, and filtering out the indirectly regulated edges whose maximum causal entropy is lower than the threshold θ, to obtain the final gene regulatory network.

Compared with the prior art, the invention has the following beneficial effects: the method is suited to real time-series gene expression data with few time slices, can calculate the regulatory relationships of genes across multiple time slices, filters out indirectly regulated edges through the conditional transfer entropy, and effectively improves the precision of network construction.

Drawings

FIG. 1 is a flow chart of the NIMCE method of the present invention;

FIG. 2 compares NIMCE with the GENIE3, Jump3, and fastBMA methods based on PR curves and the area under them (AUPR);

FIG. 3 compares NIMCE with the GENIE3, Jump3, and fastBMA methods based on Recall and Precision.

Detailed Description

First, construction of time window gene expression matrix

A time-series gene expression data file is read, where G denotes the gene expression matrix. The expression matrix under a moving time window is denoted G_τ. Each entry of G_τ is the expression value of gene n in sample m under time window t, where t indexes the moving time window of the gene expression vector, n is the gene index (ranging over the genes), and m is the sample-cell index (ranging over the samples).
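One possible way to build such time-window data is sketched below in Python. The array layout (samples × genes × time points), the function name `lagged_windows`, and the pooling of all samples and windows into one axis are assumptions made for illustration; the exact matrix layout used by the method is not recoverable from this text.

```python
import numpy as np

def lagged_windows(expr, tau):
    """expr: array of shape (M samples, N genes, T time points).
    Returns (present, past):
      present: shape (N, L)       expression at window t
      past:    shape (N, L, tau)  expression at windows t-1, ..., t-tau
    where L = M * (T - tau) pools all samples and all valid windows."""
    expr = np.asarray(expr, dtype=float)
    M, N, T = expr.shape
    if not 0 < tau < T:
        raise ValueError("tau must satisfy 0 < tau < T")
    present = expr[:, :, tau:]                                   # (M, N, T - tau)
    past = np.stack([expr[:, :, tau - k:T - k]                   # lag k = 1..tau
                     for k in range(1, tau + 1)], axis=-1)       # (M, N, T - tau, tau)
    present = present.transpose(1, 0, 2).reshape(N, -1)          # (N, L)
    past = past.transpose(1, 0, 2, 3).reshape(N, -1, tau)        # (N, L, tau)
    return present, past

# Small synthetic example: 2 samples, 3 genes, 5 time points, lag 2.
expr = np.arange(30).reshape(2, 3, 5)
present, past = lagged_windows(expr, tau=2)
```

With this layout, row n of `present` plays the role of the target profile at window t, and `past[n]` holds the matching lagged profile from t-1 down to t-τ for the same gene.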

Second, construct the gene correlation matrix

For each pair of genes, the expression profile under window t is selected for the target gene, the expression profile from window t-τ to window t-1 is selected for the regulatory factor, and the multi-time-lag transfer entropy between the regulatory factor and the target gene is calculated:

$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau})$$

where the conditional mutual information of x and y given z takes the standard form

$$I(X; Y \mid Z) = \iiint p_{x,y,z}(x, y, z) \log \frac{p_z(z)\, p_{x,y,z}(x, y, z)}{p_{x,z}(x, z)\, p_{y,z}(y, z)} \, dx \, dy \, dz$$

in which $p_{x,y,z}(x, y, z)$ is the joint probability density, $p_z(z)$ is a marginal probability density, and $p_{x,z}(x, z)$ is the marginal probability density of the variables x and z.

$X_{t-1:t-\tau}$ denotes the expression values of gene x under the time windows from t-1 to t-τ, and likewise for gene y; $H(\cdot \mid \cdot)$ denotes the conditional entropy:

$$H(Y \mid X) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)} \, dx \, dy$$

In NIMCE, to efficiently calculate the probability distributions in the formulas above, we estimate the probability density of continuous variables using kernel density estimation (KDE). Here $X^p$ denotes $X_{t-1:t-\tau}$, $Y^p$ denotes $Y_{t-1:t-\tau}$, and $L = M(T - \tau)$ is the number of samples in the time-shifted matrix. In its standard form, the kernel density function $f_h(x)$ is defined as:

$$f_h(x) = \frac{1}{L h} \sum_{i=1}^{L} K\!\left(\frac{x - x_i}{h}\right)$$

where $x_1, \ldots, x_L$ are the samples, h is the bandwidth, and K is a kernel function.
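A kernel density estimate with bandwidth h and kernel K can be sketched in a few lines of Python. The Gaussian kernel and the one-dimensional setting are assumptions made for illustration; the text does not state which kernel or bandwidth rule is used, and the function name `gaussian_kde_1d` is hypothetical.

```python
import numpy as np

def gaussian_kde_1d(samples, query, h):
    """Evaluate f_h(x) = 1/(L*h) * sum_i K((x - x_i)/h) at each query point,
    with K the standard normal density (an assumed kernel choice)."""
    samples = np.asarray(samples, dtype=float)
    query = np.atleast_1d(np.asarray(query, dtype=float))
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    u = (query[:, None] - samples[None, :]) / h          # scaled distances, shape (Q, L)
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)     # Gaussian kernel values
    return k.sum(axis=1) / (samples.size * h)

# Density of three expression values evaluated on a grid.
grid = np.linspace(-10.0, 10.0, 2001)
density = gaussian_kde_1d([0.0, 1.0, 2.0], grid, h=0.5)
total_mass = density.sum() * (grid[1] - grid[0])          # should be close to 1
```

A smaller h follows the samples more closely and a larger h smooths more, so the bandwidth directly affects the entropy estimates built on top of the density.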

Third, filtering the indirectly regulated and controlled edge

For the gene correlation matrix G_zero-order, whose elements represent the probability of a regulatory relationship between genes, the edges are clustered into two clusters by k-means, and the edges in the cluster with low probability values are filtered out.

The indirectly regulated edges are then filtered based on the path consistency algorithm: for each edge (X, Y) in the filtered network, every neighbor Z for which the edges (Y, Z) and (X, Z) both exist is taken as a conditional gene, and the causal entropy under that conditional gene is calculated as $CE_{X \to Y \mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$.

For the genes X and Y with conditional gene set $K = \{K_1, K_2, K_3, \ldots, K_n\}$, the edge is filtered out if the maximum causal entropy over the conditional genes, $\max_{Z \in K} CE_{X \to Y \mid Z}$, is smaller than the threshold θ (a hyperparameter set by the user, default 0.03).
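The two-cluster k-means step on edge weights can be illustrated with the Python sketch below. It assumes a square matrix of edge scores with self-loops ignored, and it uses a small hand-written one-dimensional k-means initialized at the minimum and maximum score; the initialization and the function name `kmeans_edge_filter` are illustrative choices, not details taken from the text.

```python
import numpy as np

def kmeans_edge_filter(weights, n_iter=100):
    """Cluster off-diagonal edge scores into two groups with 1-D k-means (k = 2)
    and return a boolean mask of the edges in the high-score cluster."""
    w = np.asarray(weights, dtype=float)
    mask = ~np.eye(w.shape[0], dtype=bool)        # ignore self-loops
    vals = w[mask]
    lo, hi = vals.min(), vals.max()               # initial centroids (assumption)
    for _ in range(n_iter):
        high = np.abs(vals - hi) < np.abs(vals - lo)
        if high.all() or not high.any():          # degenerate split, stop
            break
        new_lo, new_hi = vals[~high].mean(), vals[high].mean()
        if new_lo == lo and new_hi == hi:         # centroids converged
            break
        lo, hi = new_lo, new_hi
    keep = np.zeros_like(mask)
    keep[mask] = np.abs(vals - hi) < np.abs(vals - lo)
    return keep

# Illustrative scores only: three strong edges and three weak ones.
scores = np.array([[0.00, 0.90, 0.05],
                   [0.80, 0.00, 0.10],
                   [0.02, 0.85, 0.00]])
kept_edges = kmeans_edge_filter(scores)
```

Clustering avoids choosing a fixed cut-off for the first filtering pass: the boundary between kept and discarded edges adapts to the distribution of scores in each dataset.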

Fourth, experimental verification

To validate the method of the invention, we tested it on 5 simulated datasets generated by GeneNetWeaver (GNW) and compared it with GENIE3, which is based on random forests, Jump3, which is a tree-based method, and the dynamic Bayesian method fastBMA. The GeneNetWeaver datasets are time-series perturbation data generated from subnetworks extracted from the E. coli or S. cerevisiae gene regulatory networks, as used for the DREAM4 challenges. We generated 5 datasets with GNW; each dataset contains 50 genes and 10 samples, with expression data at 21 time points.

To evaluate the continuity and accuracy of the inference results, the AUPR, Recall, and Precision were used for comparison. AUPR is the area under the PR curve; Recall is the ratio of the number of correctly predicted edges to the number of true directed edges; Precision is the ratio of the number of correctly predicted edges to the number of predicted edges. The AUPR values and the Recall and Precision values are shown in FIG. 2 and FIG. 3, respectively.
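The Recall and Precision definitions above translate directly into code. The following Python sketch assumes that the predicted and true networks are given as boolean adjacency matrices of directed edges; the function name `precision_recall` is hypothetical.

```python
import numpy as np

def precision_recall(pred, truth):
    """Precision = correct predicted edges / predicted edges.
    Recall    = correct predicted edges / true directed edges."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)                         # correctly predicted edges
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / truth.sum() if truth.sum() else 0.0
    return float(precision), float(recall)

# True network: 0 -> 1 and 1 -> 2. Predicted network: 0 -> 1, 0 -> 2, 2 -> 0.
truth = np.zeros((3, 3), dtype=bool)
truth[0, 1] = truth[1, 2] = True
pred = np.zeros((3, 3), dtype=bool)
pred[0, 1] = pred[0, 2] = pred[2, 0] = True
precision, recall = precision_recall(pred, truth)     # 1/3 and 1/2
```

Because the matrices are directed, an edge predicted in the wrong direction counts as an incorrect prediction.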

As can be seen from FIG. 2 and FIG. 3, our method outperforms the other methods under different samples, whether measured by AUPR, Recall, or Precision. This shows that the proposed NIMCE method has good stability. Experiments also show that, for large networks, time complexity can grow exponentially to the point where computation becomes essentially infeasible, whereas NIMCE can still obtain results in a short time.

Claims

1. A gene regulation network construction method based on multi-time-lag causal entropy is characterized by comprising the following steps:

2. The method for constructing the gene regulatory network based on the multi-time-lag causal entropy as claimed in claim 1, wherein in step 1) the input data are divided into time-window expression matrices G_τ according to the time lag τ, each entry of G_τ representing the expression value of gene n in sample m under time window t; t indexes the moving time window of the gene expression vector; n is the gene index, ranging over the genes; and m is the sample-cell index, ranging over the samples.

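3. The method for constructing the gene regulatory network based on the multi-time-lag causal entropy as claimed in claim 1, wherein in step 3) the multi-time-lag transfer entropy $T_{X \to Y}$ between the genes is calculated as:

$$T_{X \to Y} = I(Y_t; X_{t-1:t-\tau} \mid Y_{t-1:t-\tau}) = H(Y_t \mid Y_{t-1:t-\tau}) - H(Y_t \mid Y_{t-1:t-\tau}, X_{t-1:t-\tau})$$

wherein the conditional mutual information of x and y given z takes the standard form

$$I(X; Y \mid Z) = \iiint p_{x,y,z}(x, y, z) \log \frac{p_z(z)\, p_{x,y,z}(x, y, z)}{p_{x,z}(x, z)\, p_{y,z}(y, z)} \, dx \, dy \, dz$$

in which $p_{x,y,z}(x, y, z)$ represents the joint probability density, $p_z(z)$ represents a marginal probability density, and $p_{x,z}(x, z)$ represents the marginal probability density of the variables x and z;

$X_{t-1:t-\tau}$ represents the expression values of gene x under the time windows from t-1 to t-τ, and $H(\cdot \mid \cdot)$ represents the conditional entropy:

$$H(Y \mid X) = -\iint p(x, y) \log \frac{p(x, y)}{p(x)} \, dx \, dy$$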

4. The method for constructing the gene regulatory network based on the multi-lag causal entropy as claimed in claim 1, wherein the specific implementation process of filtering out the indirectly regulated edges with the maximum causal entropy lower than the threshold θ in the step 4) comprises:

2) filtering the indirectly regulated edges based on a path consistency algorithm: for each edge (X, Y) in the filtered network, every neighbor Z for which the edges (Y, Z) and (X, Z) both exist is taken as a conditional gene, and the causal entropy under that conditional gene is calculated as $CE_{X \to Y \mid Z} = I(Y_t; X_{t-1:t-\tau} \mid Z_{t-1:t-\tau})$;

5. The method for constructing the gene regulatory network based on the multi-lag causal entropy as claimed in claim 1, wherein the threshold is 0.03.

6. A gene regulatory network construction system based on multi-time-lag causal entropy, characterized by comprising: