CN110517724B

CN110517724B - Method for deducing gene regulation network by using single cell transcription and gene knockout data

Info

Publication number: CN110517724B
Application number: CN201910636618.XA
Authority: CN
Inventors: 王会青; 董春林; 廉元元
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2019-07-15
Filing date: 2019-07-15
Publication date: 2020-05-22
Anticipated expiration: 2039-07-15
Also published as: CN110517724A

Abstract

The invention discloses a method for inferring a gene regulatory network from single-cell transcription data and gene knockout data. Genes are classified by analyzing steady-state expression data before and after gene knockout, and the classification is used as prior knowledge to reduce space and time complexity and improve inference accuracy. Distribution distances between genes are calculated from the single-cell transcription data, a multi-dimensional regression model over multiple time points is established for each gene in combination with the gene classification result, and the relation factors in the model are calculated by a mathematical method. A corresponding algorithmic analysis of the gene knockout data removes part of the false positive judgments and compensates for the limitation of analyzing dynamic data alone. The method effectively addresses the high computational complexity of analyzing time-series single-cell data and improves the accuracy of gene regulatory network inference.

Description

Method for deducing gene regulation network by using single cell transcription and gene knockout data

Technical Field

The invention relates to the field of gene regulation network research and analysis, in particular to a method for deducing a gene regulation network by using single cell transcription and gene knockout data.

Background

As a tool for interpreting and analyzing gene data, a gene regulatory network can reveal the regulatory relationships among genes, proteins and small molecules, and helps in understanding the physiological activities and functions within biological cells, the interactions within pathways, and how these change the organism. Studying gene regulatory networks from single-cell transcription data can reveal the detailed expression dynamics and functional relationships of cells and clarify the role of cell-to-cell variation in key physiological processes. However, single-cell expression data are time-course data with high temporal resolution, which imposes high space and time complexity on algorithms for inferring gene regulatory networks; at the same time, such data lack an analysis of steady-state gene expression levels, which reduces the accuracy of the final result.

Disclosure of Invention

To overcome the above-mentioned deficiencies of the prior art, the present invention provides a method for inferring gene regulatory networks using single cell transcription and gene knockout data.

In order to achieve the purpose, the technical scheme of the invention is as follows:

a method of inferring gene regulatory networks using single cell transcription and gene knockout data, comprising: respectively collecting steady-state gene transcription expression data in a plurality of cells by using a gene knockout technology, and collecting the transcription expression data of a single cell at a plurality of time points after stimulating each cell; analyzing steady-state expression data before and after gene knockout to classify the genes; calculating the distribution distance between genes by using single cell transcription data, establishing a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification, and calculating a relation factor in the model by adopting a mathematical method; performing corresponding algorithm analysis on the gene knockout data to remove part of false positive judgment; determining a gene regulatory network based on the inferred intergenic relationships.

Wherein, the step of collecting steady-state gene transcription expression data in a plurality of cells respectively by using a gene knockout technology, and collecting transcription expression data of a single cell at a plurality of time points after stimulating each cell comprises:

using a GNW-simulated gene knockout experiment, steady-state gene expression data are obtained for wild-type and knockout organisms, where the wild-type expression data come from the original strain of the organism, which has not undergone any mutation, and the knockout data come from strains derived from the original strain in which one or more genes have been knocked out or attenuated;

gene expression values collected from a plurality of single cells at a plurality of time points constitute time-series single cell expression data.

Wherein, analyzing steady-state expression data before and after gene knockout to classify the genes, comprising the following steps:

each gene is perturbed using the gene knockout technology, and the steady-state expression values of all genes at the corresponding moments are collected; the genes are then divided into four classes, namely URG, RRG, NRG and ISG, as follows:

(a) unregulated regulatory gene (URG): perturbing any of the other genes has no effect on the current gene, but perturbing the current gene affects the expression level of other genes;

(b) regulated regulatory gene (RRG): the expression level of the current gene fluctuates when at least one other gene is perturbed, and perturbing the current gene affects the expression level of other genes;

(c) non-regulatory gene (NRG): the expression level of the current gene fluctuates when at least one other gene is perturbed, but perturbing the current gene does not affect the expression level of any other gene;

(d) independent gene (ISG): perturbing any of the other genes has no effect on the current gene, and perturbing the current gene does not affect the expression level of any other gene.
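The four classes above can be sketched in code as follows. This is a minimal illustration, not the patented implementation: the function name `classify_genes`, the data layout (`knock[i][j]` is the steady-state level of gene j when gene i is knocked out), and the use of an absolute difference against a fixed `threshold` to decide whether a gene "fluctuates" are all assumptions made for the example.

```python
from typing import Dict, List


def classify_genes(wild: List[float],
                   knock: List[List[float]],
                   threshold: float) -> Dict[int, str]:
    """Label each gene as URG, RRG, NRG or ISG from knockout data.

    wild[j]     : steady-state expression of gene j in the wild type.
    knock[i][j] : steady-state expression of gene j when gene i is knocked out.
    threshold   : assumed cut-off for calling an expression change (hypothetical).
    """
    g = len(wild)
    # affects[i][j] is True when knocking out gene i changes the level of gene j.
    affects = [[i != j and abs(knock[i][j] - wild[j]) > threshold
                for j in range(g)]
               for i in range(g)]
    labels: Dict[int, str] = {}
    for k in range(g):
        is_regulated = any(affects[i][k] for i in range(g))
        is_regulator = any(affects[k][j] for j in range(g))
        if is_regulator and not is_regulated:
            labels[k] = "URG"
        elif is_regulator and is_regulated:
            labels[k] = "RRG"
        elif is_regulated:
            labels[k] = "NRG"
        else:
            labels[k] = "ISG"
    return labels


# Toy data: gene 0 -> gene 1 -> gene 2, gene 3 is isolated.
wild = [1.0, 1.0, 1.0, 1.0]
knock = [
    [0.0, 0.2, 0.5, 1.0],  # knock out gene 0
    [1.0, 0.0, 0.3, 1.0],  # knock out gene 1
    [1.0, 1.0, 0.0, 1.0],  # knock out gene 2
    [1.0, 1.0, 1.0, 0.0],  # knock out gene 3
]
labels = classify_genes(wild, knock, threshold=0.25)
```

In this toy chain the upstream gene comes out as URG, the middle gene as RRG, the downstream gene as NRG, and the isolated gene as ISG.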

Wherein calculating the distribution distance between genes by using single cell transcription data, establishing a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification, and calculating the relation factors in the model by a mathematical method comprises the following steps:

calculating the distribution distance between genes by using the time sequence single cell expression data;

using the expression distribution of other genes in a given time window to 'predict' the expression distribution of the target gene in the next time window, namely establishing a multi-dimensional regression model of each gene at multiple time points;

the model contains the relation factors between genes, namely the solution vector to be solved; the solution vector is obtained by least squares with a penalty term, and a larger factor indicates higher confidence in the corresponding regulatory inference.

Wherein, the distribution distance between genes is calculated by using the single cell transcription data, and the method comprises the following steps:

the time-series single-cell expression data contain multiple genes, multiple time points and multiple cells. Let g be the number of genes, n be the number of measured time points, and $C_t$ be the number of cells in the sample at the t-th time point (t = 1, 2, ..., n). The data then comprise n data matrices, one per time point; each matrix element is the transcriptional expression value of gene j, i.e., the number of mRNA molecules of gene j in the i-th cell at that time point.

The information contained in the single-cell gene expression data set, in particular the change in the gene expression distribution, is used for GRN inference. First, the expression distribution distance of each gene between two time points is calculated to quantify how the expression of each individual gene changes over time. The distance of gene j at time t is quantified as follows:

$$DD_{j,t} = \max \left| F_{t+1}(A_j) - F_t(A_j) \right|$$

where $F_t(A_j)$ denotes the cumulative distribution function of $A_j$ at time t, i.e., the cumulative gene expression level of gene j from time 0 to time t, and $DD_{j,t}$ indicates the level of expression change of gene j between times t and t+1.
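The distance above can be illustrated with a short sketch. It assumes that F_t is the empirical cumulative distribution of the expression values of gene j across the cells sampled at time t, so that the maximum absolute gap between two such functions is a two-sample Kolmogorov-Smirnov statistic. That reading, the function names, and the nested-list data layout `data[t][i][j]` are assumptions made for the example.

```python
from bisect import bisect_right
from typing import List, Sequence


def distribution_distance(x_t: Sequence[float], x_next: Sequence[float]) -> float:
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(x_t), sorted(x_next)
    best = 0.0
    # The gap between two step functions can only peak at an observed value.
    for v in set(a) | set(b):
        f_t = bisect_right(a, v) / len(a)
        f_next = bisect_right(b, v) / len(b)
        best = max(best, abs(f_next - f_t))
    return best


def dd_matrix(data: List[List[List[float]]]) -> List[List[float]]:
    """data[t][i][j] is the expression of gene j in cell i at time point t.

    Returns dd[t][j] for t = 0 .. n-2 (zero-based), the distance of gene j
    between consecutive time points t and t+1.
    """
    n, g = len(data), len(data[0][0])
    return [[distribution_distance([cell[j] for cell in data[t]],
                                   [cell[j] for cell in data[t + 1]])
             for j in range(g)]
            for t in range(n - 1)]


# Two time points, two cells, two genes: gene 0 shifts completely, gene 1 is unchanged.
toy = [[[0.0, 1.0], [0.0, 1.0]],
       [[5.0, 1.0], [5.0, 1.0]]]
dd = dd_matrix(toy)
```

Because the number of cells may differ between time points, each sample is normalized by its own length.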

Wherein establishing a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification comprises the following steps:

the expression profiles of other genes in a given time window are used to "predict" the expression profile of the target gene in the next time window to account for the regulation of a gene by other genes. The linear relationship of gene j at time t +1 is as follows:

$$DD_{j,t+1} = \alpha_{1,j} DD_{1,t} + \alpha_{2,j} DD_{2,t} + \cdots + \alpha_{g,j} DD_{g,t}$$

where $\alpha_{g,j}$ denotes the regulatory relation factor of gene g on gene j. Expressing the linear relationship of gene j at all time points gives a matrix equation, in which g denotes the number of genes in the network, n denotes the number of measured time points, and $DD_{j,n-1}$ denotes the expression change level of gene j at the (n-1)-th time point.
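Written out under the definitions above, stacking the linear relationship for t = 1, ..., n-2 would yield a system of the following form. This is a reconstruction from the single-time-point equation, with the row range chosen so that the last response is DD_{j,n-1}; the exact layout of the matrix is an assumption.

```latex
\begin{bmatrix}
DD_{j,2} \\ DD_{j,3} \\ \vdots \\ DD_{j,n-1}
\end{bmatrix}
=
\begin{bmatrix}
DD_{1,1}   & DD_{2,1}   & \cdots & DD_{g,1} \\
DD_{1,2}   & DD_{2,2}   & \cdots & DD_{g,2} \\
\vdots     & \vdots     & \ddots & \vdots   \\
DD_{1,n-2} & DD_{2,n-2} & \cdots & DD_{g,n-2}
\end{bmatrix}
\begin{bmatrix}
\alpha_{1,j} \\ \alpha_{2,j} \\ \vdots \\ \alpha_{g,j}
\end{bmatrix}
```

Each row is one instance of the linear relationship, so the system has n-2 equations in g unknowns for every target gene j.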

The α vector in the matrix holds the relation factors between genes to be solved. The values of part of the α vector are determined using the gene classification result, and all solution vectors are then obtained by mathematical calculation.
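One way to picture this step is ridge regression (least squares with an L2 penalty) in which the classification fixes some factors to zero in advance. The sketch below makes several assumptions that the text does not state: the penalty is L2 with weight `lam`; genes labeled NRG or ISG are never regulators; targets labeled URG or ISG have no regulators; and self-regulation is excluded. All function names are hypothetical.

```python
from typing import Dict, List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def allowed_regulators(labels: Dict[int, str], target: int) -> List[int]:
    """Candidate regulators of `target` under the assumed use of the classes."""
    if labels[target] in ("URG", "ISG"):
        return []  # these classes are not influenced by any other gene
    return [k for k, lab in sorted(labels.items())
            if k != target and lab in ("URG", "RRG")]


def ridge_alpha(dd: List[List[float]], target: int,
                allowed: List[int], lam: float = 0.1) -> List[float]:
    """Estimate alpha_{k,target}; dd[t][k] is the distance of gene k at step t."""
    g = len(dd[0])
    rows = dd[:-1]                        # predictors DD_{k,t}
    y = [row[target] for row in dd[1:]]   # responses  DD_{target,t+1}
    alpha = [0.0] * g                     # factors fixed to zero by the prior
    if not allowed:
        return alpha
    # Normal equations of the penalized problem: (X^T X + lam I) a = X^T y
    xtx = [[sum(r[p] * r[q] for r in rows) + (lam if p == q else 0.0)
            for q in allowed] for p in allowed]
    xty = [sum(r[p] * yv for r, yv in zip(rows, y)) for p in allowed]
    for k, v in zip(allowed, solve_linear(xtx, xty)):
        alpha[k] = v
    return alpha


# Toy series in which gene 2 follows half of gene 0 from the previous step.
dd = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.5], [4.0, 0.0, 1.0], [8.0, 0.0, 2.0]]
labels = {0: "URG", 1: "ISG", 2: "NRG"}
alpha = ridge_alpha(dd, target=2, allowed=allowed_regulators(labels, 2), lam=1e-9)
```

Restricting the columns before solving is what reduces the size of each regression; the penalty keeps the normal equations solvable when there are fewer time points than candidate regulators.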

Wherein performing corresponding algorithm analysis on the gene knockout data to remove part of the false positive judgments comprises the following steps:

analyzing the gene knockout data to remove direct regulatory relationships between two genes that were misjudged in the preliminary GRN inference;

analyzing the gene knockout data to remove inferences in which indirect regulation was misjudged as direct regulation in the preliminary GRN inference;

wherein, analyzing the gene knockout data to remove the direct regulation and control relationship between two misjudged genes in the preliminary GRN inference comprises the following steps:

steady-state gene expression values are collected in the wild-type strain and in the knockout strains. If the expression level of a gene other than the knocked-out gene changes substantially in the latter compared with the former, the knocked-out gene has a regulatory effect on that gene; otherwise it has no regulatory effect. Part of the preliminarily inferred false positive results can thus be removed, by the following specific steps:

(1) collecting the expression levels of all genes of a wild strain by using a gene chip technology;

(2) knocking out each gene by using a gene knocking-out technology, and simultaneously collecting the expression levels of all genes of the strain;

(3) comparing the expression levels of the genes before and after the knockout, and judging by a suitable measure whether each gene has changed, thereby judging whether the current gene is regulated by the knocked-out gene.

Taking the difference between the two as the measure, let GK be the expression value in the knockout strain and GW be the expression value in the wild-type strain. When $GK_{i,j} - GW_j > \alpha$, the expression level of gene j is considered to have changed and gene j is regulated by gene i, where $GK_{i,j}$ denotes the expression value of gene j when gene i is knocked out, 0 < i < 11, 0 < j < 11, and the value of α should be determined according to the overall profile of the collected gene expression levels. Inferred regulatory relationships that do not actually exist are set to 0, and the regulatory relationship matrix is output again and used as the input for further analysis and processing.
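The thresholding rule can be sketched as a filter over the preliminary confidence matrix. The function name, the matrix layout (`conf[i][j]` is the inferred strength of gene i regulating gene j), and the optional `use_abs` switch are assumptions for illustration. By default the sketch follows the criterion as written, a signed difference greater than α; `use_abs=True` is an added variant that also catches decreases in expression.

```python
from typing import List


def filter_by_knockout(conf: List[List[float]],
                       wild: List[float],
                       knock: List[List[float]],
                       alpha: float,
                       use_abs: bool = False) -> List[List[float]]:
    """Zero out inferred edges i -> j that the knockout data do not support.

    conf[i][j]  : preliminary relation factor for gene i regulating gene j.
    wild[j]     : wild-type steady-state expression of gene j (GW_j).
    knock[i][j] : expression of gene j when gene i is knocked out (GK_{i,j}).
    """
    g = len(wild)
    out = []
    for i in range(g):
        row = []
        for j in range(g):
            diff = knock[i][j] - wild[j]
            if use_abs:
                diff = abs(diff)  # assumption: treat decreases as changes too
            keep = i != j and diff > alpha
            row.append(conf[i][j] if keep else 0.0)
        out.append(row)
    return out


conf = [[0.0, 0.9, 0.4],
        [0.2, 0.0, 0.8],
        [0.1, 0.3, 0.0]]
wild = [1.0, 1.0, 1.0]
knock = [[0.0, 2.0, 1.05],
         [1.0, 0.0, 1.6],
         [1.0, 1.0, 0.0]]
filtered = filter_by_knockout(conf, wild, knock, alpha=0.5)
```

Only the two edges whose target gene shifts by more than α under the corresponding knockout survive; every other entry is reset to 0.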

Wherein, analyzing the knockout data to remove the inference of indirect regulation as direct regulation in the preliminary GRN inference, comprises the steps of:

first, the regulatory relationships to be judged must be determined, namely the direct regulatory relationships that may or may not exist between two genes that also have an indirect regulatory relationship in the regulatory network; these are called "uncertain regulation". This step is implemented by determining the upper-bound set GU and the lower-bound set GL of the gene regulatory network: the upper-bound set is obtained from the previously inferred GRN, and the lower-bound set is obtained by removing all "uncertain regulation" from it. The upper- and lower-bound sets are updated continuously through the subsequent steps until the two sets are identical.
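A minimal sketch of how the two bounds might be initialized is shown below. It assumes GU is stored as a mapping from each gene to the set of genes it is inferred to regulate, that GU is acyclic, and that an edge i -> j counts as "uncertain regulation" when j can also be reached from i through another child of i. The function names are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def descendants(graph: Graph, node: Hashable) -> Set[Hashable]:
    """All genes reachable from `node` along directed edges."""
    seen: Set[Hashable] = set()
    stack = list(graph.get(node, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen


def split_bounds(gu: Graph) -> Tuple[Set[Tuple[Hashable, Hashable]], Graph]:
    """Return the uncertain edges of GU and the lower bound GL without them."""
    uncertain = set()
    for i, children in gu.items():
        for j in children:
            # An indirect path i -> k -> ... -> j exists when j is reachable
            # from some other child k of i (assumes GU has no cycles).
            if any(k != j and j in descendants(gu, k) for k in children):
                uncertain.add((i, j))
    gl = {i: {j for j in ch if (i, j) not in uncertain}
          for i, ch in gu.items()}
    return uncertain, gl


# Toy upper bound: a -> b -> c plus a direct edge a -> c.
gu = {"a": {"b", "c"}, "b": {"c"}, "c": set()}
uncertain, gl = split_bounds(gu)
```

Here the direct edge from a to c is the only uncertain one, because c is also reachable through b; GL keeps the remaining edges.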

The regulated gene is perturbed after the indirect regulatory paths of an "uncertain regulation" have been disconnected, and the change in gene expression levels in the current GRN is observed to determine whether a direct regulatory relationship really exists between the two genes. To do so, the best gene set capable of breaking the indirect paths must be found.

Wherein finding the best gene set capable of breaking the indirect paths comprises the following steps:

the sets determined by the following rules are called "edge separation" sets:

1. S₁(i, j) = children of i in GU ∩ ancestors of j in GU

2. S₂(i, j) = descendants of i in GU ∩ parents of j in GU

3. S₃(i, j) = … of j in GU ∩ descendants of i in GU

When several indirect regulations exist in the network, the edge separation sets of each are found, and the set shared by the largest number of them is then identified; this is the best gene set to knock out.
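The first two rules can be sketched with plain graph traversal. The sketch assumes GU is a mapping from each gene to the set of its children and covers only S₁ and S₂, since the third rule is incomplete in the text. The final selection step is one possible reading of "the common set with the largest number", namely the S₁ set shared by the most uncertain edges. All names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def reach(graph: Graph, start: Hashable) -> Set[Hashable]:
    """All nodes reachable from `start` along the edges of `graph`."""
    seen: Set[Hashable] = set()
    stack = list(graph.get(start, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen


def invert(graph: Graph) -> Graph:
    """Map each gene to its parents instead of its children."""
    parents: Graph = {v: set() for v in graph}
    for u, children in graph.items():
        for v in children:
            parents.setdefault(v, set()).add(u)
    return parents


def edge_separation(gu: Graph, i: Hashable, j: Hashable):
    """Return (S1, S2) for the uncertain edge i -> j."""
    parents = invert(gu)
    s1 = set(gu.get(i, ())) & reach(parents, j)   # children of i, ancestors of j
    s2 = reach(gu, i) & set(parents.get(j, ()))   # descendants of i, parents of j
    return s1, s2


def most_shared_separation(gu: Graph,
                           edges: Iterable[Tuple[Hashable, Hashable]]) -> Set[Hashable]:
    """Assumed selection rule: the S1 set shared by the most uncertain edges."""
    counts = Counter(frozenset(edge_separation(gu, i, j)[0]) for i, j in edges)
    best, _ = counts.most_common(1)[0]
    return set(best)


gu = {"a": {"b", "c"}, "b": {"d"}, "d": {"c"}, "c": set()}
s1, s2 = edge_separation(gu, "a", "c")

gu2 = {"a": {"b", "c", "e"}, "b": {"c", "e"}, "c": set(), "e": set()}
target_set = most_shared_separation(gu2, [("a", "c"), ("a", "e")])
```

In the first toy graph the indirect path a -> b -> d -> c is cut either at b (S₁) or at d (S₂); in the second, knocking out b breaks the indirect path for both uncertain edges at once.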

Wherein determining a gene regulatory network based on the inferred intergenic relationships comprises the steps of:

determining a parameter factor according to the regulation quantity in the real network;

and determining a final gene regulation network according to the parameter factors and the probability value in the inferred relation factor table.
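As an illustration only, the sketch below reads the "parameter factor" as the number of regulatory edges k expected in the real network and keeps the k inferred relationships with the largest relation factors. That interpretation, the function name, and the tie-breaking by index are assumptions; the text does not spell out how the parameter factor is derived or applied.

```python
from typing import List, Tuple


def top_k_edges(conf: List[List[float]], k: int) -> List[Tuple[int, int]]:
    """Keep the k directed edges (i, j) with the largest relation factors."""
    g = len(conf)
    edges = [(conf[i][j], i, j)
             for i in range(g) for j in range(g)
             if i != j and conf[i][j] > 0]
    # Highest confidence first; ties broken by gene index for reproducibility.
    edges.sort(key=lambda e: (-e[0], e[1], e[2]))
    return [(i, j) for _, i, j in edges[:k]]


relation_factors = [[0.0, 0.9, 0.0],
                    [0.0, 0.0, 0.8],
                    [0.3, 0.0, 0.0]]
network = top_k_edges(relation_factors, k=2)
```

With k = 2 the weakest candidate edge is dropped and the final network keeps gene 0 regulating gene 1 and gene 1 regulating gene 2.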

Compared with the prior art, the invention has the following beneficial effects. The method for inferring a gene regulatory network from single-cell transcription and gene knockout data classifies genes by analyzing steady-state expression data before and after gene knockout, and uses the classification as prior knowledge to reduce space and time complexity and improve inference accuracy. It calculates the distribution distances between genes from the single-cell transcription data, establishes a multi-dimensional regression model over multiple time points for each gene in combination with the gene classification result, and calculates the relation factors in the model by a mathematical method. It performs a corresponding algorithmic analysis of the gene knockout data to remove part of the false positive judgments and to compensate for the limitation of analyzing dynamic data alone. The method effectively addresses the high computational complexity of analyzing time-series single-cell data and improves the accuracy of gene regulatory network inference.

Drawings

FIG. 1 is a schematic flow diagram of a method for inferring a gene regulatory network from single cell transcription and gene knockout data according to the present invention;

FIG. 2 is a schematic diagram of a time-series single cell expression profile, which is data used in the method for deducing a gene regulatory network by using single cell transcription and gene knockout data provided by the invention.

FIG. 3 is a single cell expression value of a gene at multiple time points in a method for inferring gene regulatory networks from single cell transcription and gene knockout data according to the present invention.

FIG. 4 is a line graph of the level of gene expression changes at several time points for several genes in a method for inferring gene regulatory networks using single cell transcription and gene knockout data as provided by the present invention.

FIG. 5 is a schematic diagram of a final inference result of a certain gene regulatory network in the method for inferring the gene regulatory network from single-cell transcription and gene knockout data according to the present invention.

Detailed Description

The technical solution of the present invention will be further described in more detail with reference to the following embodiments. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1-5, fig. 1 is a schematic flow chart of a method for inferring a gene regulatory network from single cell transcription and gene knockout data according to the present invention. The method comprises the following steps:

S110: collecting steady-state gene transcription expression data in a plurality of cells by using a gene knockout technology, and, after stimulating each cell, collecting the transcription expression data of single cells at a plurality of time points.

The step S110 includes:

1. using a GNW-simulated gene knockout experiment, steady-state gene expression data are obtained for wild-type and knockout organisms, where the wild-type expression data come from the original strain of the organism, which has not undergone any mutation, and the knockout data come from strains derived from the original strain in which one or more genes have been knocked out or attenuated;

2. gene expression values collected from a plurality of single cells at a plurality of time points constitute time-series single cell expression data.

S120: analyzing steady-state expression data before and after gene knockout to classify the genes.

S130: calculating the distribution distance between genes by using single cell transcription data, establishing a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification, and calculating a relation factor in the model by adopting a mathematical method.

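Calculating the distribution distance between genes by using single cell transcription data comprises the following steps:

The time-series single-cell expression data comprise n data matrices, one for each measured time point t (t = 1, 2, ..., n), and each matrix element is the transcriptional expression value of one gene in one cell at that time point.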

The information contained in the single-cell gene expression data set, in particular the change in the gene expression distribution, is used for GRN inference. First, the expression distribution distance of each gene between two time points is calculated to quantify how the expression of each individual gene changes over time. The distribution distance of gene j at time t is quantified by formula (1).

$$DD_{j,t} = \max \left| F_{t+1}(A_j) - F_t(A_j) \right| \tag{1}$$

The distribution distances obtained from formula (1) are used to "predict" the expression distribution of the target gene in the next time window from the expression distributions of the other genes in a given time window, so as to explain how a gene is regulated by other genes. The linear relationship for the distribution distance of gene j at time t+1 is given by formula (2).

$$DD_{j,t+1} = \alpha_{1,j} DD_{1,t} + \alpha_{2,j} DD_{2,t} + \cdots + \alpha_{g,j} DD_{g,t} \tag{2}$$

where $\alpha_{g,j}$ denotes the regulatory relation factor of gene g on gene j. Expressing the linear relationship of gene j at all time points gives the matrix in formula (3), in which the α vector holds the relation factors between genes to be solved. The values of part of the α vector are determined using the gene classification result, and all solution vectors are then obtained by mathematical calculation.

S140: performing corresponding algorithm analysis on the gene knockout data to remove part of the false positive judgments.

The step S140 includes:

1. analysis of gene knockout data removes the direct regulatory relationship between two genes that were misjudged in the preliminary GRN inference.

Taking the difference between the two as the measure, let GK be the expression value in the knockout strain and GW be the expression value in the wild-type strain. When $GK_{i,j} - GW_j > \alpha$, the expression level of gene j is considered to have changed and gene j is regulated by gene i, where $GK_{i,j}$ denotes the expression value of gene j when gene i is knocked out, 0 < i < 11, 0 < j < 11, and the value of α is determined according to the overall profile of the collected gene expression levels. Inferred regulatory relationships that do not actually exist are set to 0, and the regulatory relationship matrix is output again and used as the input for further analysis and processing.

2. Analysis of gene knockout data removes the inference that indirect regulation is misjudged as direct regulation in preliminary GRN inference.

The sets determined by the following rules are called "edge separation" sets:

1. S₁(i, j) = children of i in GU ∩ ancestors of j in GU

2. S₂(i, j) = descendants of i in GU ∩ parents of j in GU

3. S₃(i, j) = … of j in GU ∩ descendants of i in GU

S150: determining a gene regulatory network based on the inferred intergenic relationships, comprising the steps of:

Different from the prior art, the method for inferring a gene regulatory network from single-cell transcription and gene knockout data classifies genes by analyzing steady-state expression data before and after gene knockout, and uses the classification as prior knowledge to reduce space and time complexity and improve inference accuracy. It calculates the distribution distances between genes from the single-cell transcription data, establishes a multi-dimensional regression model over multiple time points for each gene in combination with the gene classification result, and calculates the relation factors in the model by a mathematical method. It performs a corresponding algorithmic analysis of the gene knockout data to remove part of the false positive judgments and to compensate for the limitation of analyzing dynamic data alone. The method effectively addresses the high computational complexity of analyzing time-series single-cell data and improves the accuracy of gene regulatory network inference.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for inferring gene regulatory networks from single cell transcription and gene knockout data comprising:

step one, collecting steady-state gene transcription expression data in a plurality of cells respectively by using a gene knockout technology, and collecting the transcription expression data of a single cell at a plurality of time points after stimulating each cell, wherein the specific operation is as follows:

using a GNW-simulated gene knockout experiment to obtain steady-state gene expression data for wild-type and knockout organisms, wherein the wild-type expression data is from an original strain of the organism that did not undergo any mutation; knock-out data are from strains of the organism from which one or more genes have been knocked out or attenuated by the original strain;

gene expression values collected from a plurality of single cells at a plurality of time points constitute time-series single cell expression data;

step two, analyzing steady-state expression data before and after gene knockout to classify the genes;

step three, calculating distribution distances among genes by using the single-cell transcription data, establishing a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification, and calculating a relation factor in the model by adopting a mathematical method;

step four, performing corresponding algorithm analysis on the gene knockout data to remove part of false positive judgment, and the method comprises the following steps:

analyzing the gene knockout data to remove direct regulatory relationships between two genes that were misjudged in the preliminary GRN inference, comprising the steps of:

collecting steady-state gene expression values in wild-type strains and knockout strains respectively; if the expression level of other genes except the knocked-out gene in the latter is changed greatly compared with that in the former, the knocked-out gene has a regulation effect on the gene, and otherwise, the knocked-out gene has no regulation effect; therefore, the method can remove part of the preliminarily inferred false positive results, and comprises the following specific steps:

(3) comparing the change of the expression level of each gene before and after knockout, and judging whether the gene is changed by adopting a certain measurement mode so as to judge whether the current gene is regulated and controlled by the knocked-out gene;

taking the difference between the two as the measure, let GK be the expression value in the knockout strain and GW be the expression value in the wild-type strain; when $GK_{i,j} - GW_j > \alpha$, the expression level of gene j is considered to have changed and gene j is regulated by gene i, wherein $GK_{i,j}$ denotes the expression value of gene j when gene i is knocked out, 0 < i < 11, 0 < j < 11; inferred regulatory relationships that do not actually exist are set to 0, and the regulatory relationship matrix is output again and used as the input for further analysis and processing;

analysis of gene knockout data to remove inferences from preliminary GRN inferences that misjudge indirect regulation as direct regulation, comprising the steps of:

first, determining the regulatory relationships to be judged, namely the direct regulatory relationships that may or may not exist between two genes that also have an indirect regulatory relationship in the regulatory network, which are called "uncertain regulation"; determining the upper-bound set GU and the lower-bound set GL of the gene regulatory network, wherein the upper-bound set is obtained from the previously inferred GRN and the lower-bound set is obtained by removing all "uncertain regulation"; continuously updating the upper- and lower-bound sets through the subsequent steps until the two sets are equal;

perturbing the regulated gene after disconnecting the indirect regulatory paths of an "uncertain regulation", and observing the change in gene expression levels in the current GRN to determine whether a direct regulatory relationship really exists between the two genes; to achieve this, the best gene set capable of breaking the indirect paths must be found, comprising the steps of:

the sets determined by the following rules are called "edge separation" sets:

1. S₁(i, j) = children of i in GU ∩ ancestors of j in GU

2. S₂(i, j) = descendants of i in GU ∩ parents of j in GU

3. S₃(i, j) = … of j in GU ∩ descendants of i in GU

when several indirect regulations exist in the network, the edge separation sets of each are found, and the set shared by the largest number of them is then identified, which is the best gene set to knock out;

determining a gene regulatory network based on the inferred intergenic relationships, comprising the steps of:

determining a final gene regulation network according to the parameter factor and the probability value in the inferred relation factor table;

and step five, determining a gene regulation network according to the inferred relationship between the genes.

2. The method for inferring a gene regulatory network from single cell transcription and gene knockout data as claimed in claim 1, wherein said step two, analyzing steady state expression data before and after gene knockout to classify genes specifically comprises:

3. The method for deducing a gene regulatory network by using single cell transcription and gene knockout data as claimed in claim 1, wherein the third step is to calculate the distribution distance between genes by using single cell transcription data, establish a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification, calculate the relation factor in the model by using a mathematical method, and comprises the steps of:

the model comprises an intergenic action relation factor, namely the solved solution vector, and the solution vector is solved by utilizing a least square method with a penalty term, wherein a larger factor represents that the corresponding regulation and control inference has higher confidence.

4. The method for inferring gene regulatory networks from single cell transcription and gene knockout data as claimed in claim 3, wherein said step three, calculating intergenic distribution distance using single cell transcription data, comprises the steps of:

the time-series single-cell expression data comprise a plurality of genes, a plurality of time points and a plurality of cells; let g be the number of genes, n be the number of measured time points, and $C_t$ be the number of cells in the sample at the t-th time point (t = 1, 2, ..., n); the data then comprise n data matrices, one per time point, in which each matrix element is the transcriptional expression value of gene j, i.e., the number of mRNA molecules of gene j in the i-th cell at that time point;

the information contained in the single-cell gene expression data set, in particular the change in the gene expression distribution, is used for GRN inference; first, the expression distribution distance of each gene between two time points is calculated to quantify how the expression of each individual gene changes over time; the distance $DD_{j,t}$ of gene j at time t is quantified as follows:

$$DD_{j,t} = \max \left| F_{t+1}(A_j) - F_t(A_j) \right|$$

5. The method for inferring gene regulatory networks from single-cell transcription and gene knockout data as claimed in claim 3, wherein in the third step, a multi-dimensional regression model of multiple time points is established for each gene in combination with prior knowledge of gene classification, comprising the steps of:

using the expression distribution of other genes in a given time window to 'predict' the expression distribution of a target gene in the next time window so as to explain the regulation and control condition of a certain gene by other genes; the linear relationship of gene j at time t +1 is as follows:

$$DD_{j,t+1} = \alpha_{1,j} DD_{1,t} + \alpha_{2,j} DD_{2,t} + \cdots + \alpha_{g,j} DD_{g,t}$$

where g denotes the number of genes in the network, n denotes the number of measured time points, and $DD_{j,n-1}$ denotes the expression change level of gene j at the (n-1)-th time point; the values of part of the α vector in the matrix are determined, and all solution vectors are then obtained by mathematical calculation.