CN110517724B - Method for deducing gene regulation network by using single cell transcription and gene knockout data - Google Patents


Info

Publication number
CN110517724B
CN110517724B CN201910636618.XA CN201910636618A CN110517724B CN 110517724 B CN110517724 B CN 110517724B CN 201910636618 A CN201910636618 A CN 201910636618A CN 110517724 B CN110517724 B CN 110517724B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910636618.XA
Other languages
Chinese (zh)
Other versions
CN110517724A (en)
Inventor
王会青
董春林
廉元元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Application filed by Taiyuan University of Technology
Priority to CN201910636618.XA
Publication of CN110517724A
Application granted
Publication of CN110517724B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B5/00: ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G16B5/20: Probabilistic models

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Physiology (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for inferring a gene regulatory network from single-cell transcription data and gene knockout data. Genes are classified by analyzing steady-state expression data before and after gene knockout, and the classification serves as prior knowledge that reduces time and space complexity and improves inference accuracy. Distribution distances between genes are calculated from the single-cell transcription data, a multi-dimensional regression model over multiple time points is established for each gene in combination with the gene classification result, and the relation factors in the model are solved mathematically. The gene knockout data are then analyzed algorithmically to remove part of the false-positive judgments, which compensates for the limitations of analyzing dynamic data alone. The method addresses the high computational complexity of analyzing time-series single-cell data and improves the accuracy of the inferred gene regulatory network.

Description

Method for deducing gene regulation network by using single cell transcription and gene knockout data
Technical Field
The invention relates to the field of gene regulation network research and analysis, in particular to a method for deducing a gene regulation network by using single cell transcription and gene knockout data.
Background
The gene regulatory network is a tool for interpreting and analyzing gene data. It can reveal the regulatory relationships among genes, proteins and small molecules, and helps in understanding the physiological activities and functions within biological cells, the interactions within pathways, and how these change the organism. Studying gene regulatory networks at the level of single-cell transcription can reveal the detailed expression dynamics and functional relationships of cells and clarify the role of cell-to-cell variation in key physiological processes. However, single-cell expression data are time-course data with high temporal resolution. This imposes high time and space complexity on algorithms that infer gene regulatory networks, and such algorithms also omit the analysis of steady-state gene expression levels, which reduces the accuracy of the final result.
Disclosure of Invention
To overcome the above-mentioned deficiencies of the prior art, the present invention provides a method for inferring gene regulatory networks using single cell transcription and gene knockout data.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A method for inferring a gene regulatory network from single-cell transcription and gene knockout data comprises: collecting steady-state gene transcription expression data in a plurality of cells by using a gene knockout technology, and collecting the transcription expression data of single cells at a plurality of time points after stimulating each cell; analyzing the steady-state expression data before and after gene knockout to classify the genes; calculating distribution distances between genes from the single-cell transcription data, establishing a multi-dimensional regression model over multiple time points for each gene in combination with the prior knowledge of gene classification, and solving the relation factors in the model mathematically; analyzing the gene knockout data algorithmically to remove part of the false-positive judgments; and determining the gene regulatory network from the inferred intergenic relationships.
Wherein, the step of collecting steady-state gene transcription expression data in a plurality of cells respectively by using a gene knockout technology, and collecting transcription expression data of a single cell at a plurality of time points after stimulating each cell comprises:
using GNW to simulate gene knockout experiments, steady-state gene expression data are obtained for wild-type and knockout organisms. The wild-type expression data come from the original strain of the organism, which has not undergone any mutation; the knockout data come from strains in which one or more genes of the original strain have been knocked out or attenuated;
gene expression values collected from a plurality of single cells at a plurality of time points constitute time-series single cell expression data.
Wherein, analyzing steady-state expression data before and after gene knockout to classify the genes, comprising the following steps:
each gene is perturbed by the gene knockout technology as described above, and the steady-state expression values of all genes at the corresponding moments are collected. The genes are then divided into four classes, URG, RRG, NRG and ISG, as follows:
(a) unregulated regulatory gene (URG): perturbing any other gene has no influence on the current gene, but perturbing the current gene influences the expression level of other genes;
(b) regulated regulatory gene (RRG): perturbing some other gene causes the expression level of the current gene to fluctuate, and perturbing the current gene influences the expression level of other genes;
(c) non-regulatory gene (NRG): perturbing some other gene causes the expression level of the current gene to fluctuate, but perturbing the current gene does not affect the expression level of any other gene;
(d) independent gene (ISG): perturbing any other gene has no effect on the current gene, and perturbing the current gene does not affect the expression level of any other gene.
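The four-way classification can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify_genes`, the argument layout, and the use of an absolute difference against a single `threshold` are all hypothetical choices made for the example.

```python
def classify_genes(wild_type, knockout, threshold):
    """Assign each gene to URG, RRG, NRG or ISG.

    wild_type[j]   : steady-state expression of gene j in the unperturbed strain
    knockout[i][j] : steady-state expression of gene j when gene i is knocked out
    threshold      : hypothetical cutoff for calling an expression change
    """
    g = len(wild_type)
    # affects[i][j] is True when knocking out gene i changes gene j (i != j)
    affects = [
        [i != j and abs(knockout[i][j] - wild_type[j]) > threshold for j in range(g)]
        for i in range(g)
    ]
    labels = []
    for k in range(g):
        is_regulated = any(affects[i][k] for i in range(g))  # some other gene moves k
        is_regulator = any(affects[k][j] for j in range(g))  # k moves some other gene
        if is_regulator and not is_regulated:
            labels.append("URG")
        elif is_regulator and is_regulated:
            labels.append("RRG")
        elif is_regulated:
            labels.append("NRG")
        else:
            labels.append("ISG")
    return labels


# Toy data (invented): gene 0 -> gene 1 -> gene 2, gene 3 isolated
wild = [1.0, 1.0, 1.0, 1.0]
ko = [
    [0.0, 0.4, 0.7, 1.0],  # knock out gene 0: genes 1 and 2 change
    [1.0, 0.0, 0.5, 1.0],  # knock out gene 1: gene 2 changes
    [1.0, 1.0, 0.0, 1.0],  # knock out gene 2: nothing else changes
    [1.0, 1.0, 1.0, 0.0],  # knock out gene 3: nothing else changes
]
labels = classify_genes(wild, ko, threshold=0.2)
```

On this toy data the four genes fall into the four classes in order, which makes the role of each class as prior knowledge easy to see: only URG and RRG genes can appear as regulators, and only RRG and NRG genes can appear as targets.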
Wherein calculating distribution distances between genes from the single-cell transcription data, establishing a multi-dimensional regression model over multiple time points for each gene in combination with the prior knowledge of gene classification, and solving the relation factors in the model mathematically comprises the following steps:
calculating the distribution distances between genes from the time-series single-cell expression data;
using the expression distributions of the other genes in a given time window to "predict" the expression distribution of the target gene in the next time window, i.e., establishing a multi-dimensional regression model over multiple time points for each gene;
the model contains the intergenic action relation factors, which form the solution vector to be solved; the solution vector is obtained by least squares with a penalty term, and a larger factor indicates higher confidence in the corresponding regulatory inference.
Wherein calculating the distribution distances between genes from the single-cell transcription data comprises the following steps:
the time-series single-cell expression data cover multiple genes, multiple time points and multiple cells. Let $g$ be the number of genes, $n$ the number of measured time points, and $C_t$ the number of cells in the sample at the $t$-th time point ($t = 1, 2, \ldots, n$). The data then comprise $n$ data matrices
Figure GDA0002418835090000031
with matrix elements
Figure GDA0002418835090000032
where each element is the transcriptional expression value of gene j, i.e., the number of mRNA molecules of gene j in the i-th cell at the k-th time point.
GRN inference uses the information contained in the single-cell gene expression data set, in particular the changes in gene expression distributions. The distance between the expression distributions of a gene at two consecutive time points is calculated first, to quantify how the expression of each individual gene changes over time. The distribution distance of gene $j$ at time $t$ is quantified as follows:
$$DD_{j,t} = \max \left| F_{t+1}(A_j) - F_t(A_j) \right|$$
where $F_t(A_j)$ denotes the cumulative distribution function of $A_j$ at time $t$, i.e., the cumulative gene expression level of gene $j$ from time 0 to time $t$, and $DD_{j,t}$ indicates the level of expression change of gene $j$ between times $t$ and $t+1$.
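The distribution distance has the same form as a two-sample Kolmogorov-Smirnov statistic. The sketch below illustrates that reading under one assumption: that the cumulative distribution function at each time point is taken as the empirical CDF of a gene's expression values across the cells sampled at that time point. The helper names `ecdf` and `distribution_distance` and the toy counts are hypothetical.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def distribution_distance(expr_t, expr_next):
    """Largest absolute gap between the two empirical CDFs (a KS-type distance)."""
    f_t, f_next = ecdf(expr_t), ecdf(expr_next)
    # Both CDFs are step functions, so the gap can only change at observed values
    points = set(expr_t) | set(expr_next)
    return max(abs(f_next(x) - f_t(x)) for x in points)


# Invented mRNA counts of one gene in single cells at two time points
cells_t0 = [0, 1, 1, 2, 3]
cells_t1 = [2, 3, 3, 4, 5]
dd = distribution_distance(cells_t0, cells_t1)
```

Because the distance compares whole distributions rather than means, it does not require the same number of cells at each time point, which matches the per-time-point cell count $C_t$ in the data description.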
Wherein establishing a multi-dimensional regression model over multiple time points for each gene, in combination with the prior knowledge of gene classification, comprises the following steps:
the expression distributions of the other genes in a given time window are used to "predict" the expression distribution of the target gene in the next time window, which accounts for the regulation of a gene by other genes. The linear relationship for gene $j$ at time $t+1$ is:
$$DD_{j,t+1} = \alpha_{1,j} DD_{1,t} + \alpha_{2,j} DD_{2,t} + \cdots + \alpha_{g,j} DD_{g,t}$$
where $\alpha_{g,j}$ denotes the factor describing the regulatory action of gene $g$ on gene $j$. Writing out the linear relationship for gene $j$ at all time points gives the following matrix:
Figure GDA0002418835090000033
where $g$ denotes the number of genes in the network, $n$ denotes the number of measured time points, and $DD_{j,n-1}$ denotes the expression change level of gene $j$ at the $(n-1)$-th time point.
The $\alpha$ vector in the matrix holds the action relation factors between genes that are to be solved. The values of part of the $\alpha$ vector are determined from the gene classification result, and all solution vectors are then obtained by mathematical calculation.
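The regression step can be illustrated as follows. The text specifies least squares with a penalty term but does not name the penalty, so this sketch assumes a ridge (squared L2) penalty; it also assumes that the gene classification is applied by fixing the factors of excluded genes to 0. The function names, the `dd[t][i]` layout, and the penalty weight `lam` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ridge_factors(dd, target, regulators, lam):
    """Estimate the relation factors of the allowed regulators of one target gene.

    dd[t][i]   : distribution distance of gene i in time window t
    regulators : gene indices allowed to regulate the target (from classification)
    lam        : ridge penalty weight (hypothetical tuning parameter, > 0)
    """
    rows = dd[:-1]                       # predictors: every window but the last
    y = [row[target] for row in dd[1:]]  # responses: the following window
    k = len(regulators)
    # Normal equations of ridge regression: (X^T X + lam I) alpha = X^T y
    xtx = [
        [sum(r[regulators[p]] * r[regulators[q]] for r in rows) + (lam if p == q else 0.0)
         for q in range(k)]
        for p in range(k)
    ]
    xty = [sum(r[regulators[p]] * y[t] for t, r in enumerate(rows)) for p in range(k)]
    alpha = [0.0] * len(dd[0])           # excluded genes keep a factor of 0
    for idx, value in zip(regulators, solve_linear(xtx, xty)):
        alpha[idx] = value
    return alpha


# Invented distances: gene 2 in the next window equals twice gene 0 in this window
dd = [
    [1.0, 0.5, 0.0],
    [2.0, 0.1, 2.0],
    [0.5, 0.3, 4.0],
    [1.5, 0.2, 1.0],
]
alpha = ridge_factors(dd, target=2, regulators=[0, 1], lam=1e-9)
```

Restricting `regulators` to genes classified as URG or RRG shrinks the linear system that has to be solved for each target, which is one way the classification could reduce computational cost.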
Wherein analyzing the gene knockout data algorithmically to remove part of the false-positive judgments comprises the following steps:
analyzing the gene knockout data to remove direct regulatory relationships between two genes that were misjudged in the preliminary GRN inference;
analyzing the gene knockout data to remove inferences in the preliminary GRN inference where indirect regulation was misjudged as direct regulation;
wherein analyzing the gene knockout data to remove direct regulatory relationships between two genes that were misjudged in the preliminary GRN inference comprises the following steps:
steady-state gene expression values are collected in the wild-type strain and in the knockout strains. If, in a knockout strain, the expression level of a gene other than the knocked-out gene changes substantially relative to the wild-type strain, the knocked-out gene regulates that gene; otherwise it does not. In this way part of the preliminarily inferred false-positive results can be removed. The specific steps are as follows:
(1) collecting the expression levels of all genes of a wild strain by using a gene chip technology;
(2) knocking out each gene by using a gene knocking-out technology, and simultaneously collecting the expression levels of all genes of the strain;
(3) comparing the expression levels of each gene before and after knockout, and judging by a chosen measure whether the gene has changed, thereby determining whether the current gene is regulated by the knocked-out gene.
Taking the difference between the two as the measure, let GK be the expression value in the knockout strain and GW the expression value in the wild-type strain. When $GK_{i,j} - GW_j > \alpha$, the expression level of gene $j$ is considered to have changed, and gene $j$ is regulated by gene $i$. Here $GK_{i,j}$ is the expression value of gene $j$ when gene $i$ is knocked out, with $0 < i < 11$ and $0 < j < 11$, and the value of $\alpha$ should be determined from the overall profile of the collected gene expression levels. Inferred regulatory relationships that do not actually exist are set to 0, the regulatory relationship matrix is output again, and this matrix serves as the input for further analysis and processing.
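A minimal sketch of this pruning step is given below. It applies the criterion as written (a signed difference GK - GW greater than the threshold); the optional `absolute` flag is an added assumption for cases where a knockout lowers rather than raises expression, and is not stated in the text. The function name and the toy matrices are hypothetical.

```python
def prune_by_knockout(relation, wild_type, knockout, alpha, absolute=False):
    """Zero out inferred edges i -> j that the knockout data do not support.

    relation[i][j] : preliminary relation factor for gene i regulating gene j
    wild_type[j]   : GW_j, expression of gene j in the wild-type strain
    knockout[i][j] : GK_{i,j}, expression of gene j when gene i is knocked out
    alpha          : change threshold chosen from the overall expression levels
    absolute       : assumption, not in the text: compare |GK - GW| instead
    """
    g = len(wild_type)
    pruned = [row[:] for row in relation]  # leave the input matrix untouched
    for i in range(g):
        for j in range(g):
            if i == j:
                continue
            diff = knockout[i][j] - wild_type[j]
            if absolute:
                diff = abs(diff)
            if diff <= alpha:  # no supported change: drop the inferred edge
                pruned[i][j] = 0.0
    return pruned


# Invented data for three genes
wild = [1.0, 1.0, 1.0]
ko = [
    [0.0, 1.8, 1.0],  # knocking out gene 0 raises gene 1
    [1.0, 0.0, 0.3],  # knocking out gene 1 lowers gene 2
    [1.0, 1.0, 0.0],  # knocking out gene 2 changes nothing else
]
relation = [
    [0.0, 0.9, 0.4],
    [0.2, 0.0, 0.7],
    [0.0, 0.0, 0.0],
]
signed = prune_by_knockout(relation, wild, ko, alpha=0.5)
unsigned = prune_by_knockout(relation, wild, ko, alpha=0.5, absolute=True)
```

With the signed criterion the edge from gene 1 to gene 2 is dropped because its knockout effect is a decrease; with the absolute variant it is retained.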
Wherein, analyzing the knockout data to remove the inference of indirect regulation as direct regulation in the preliminary GRN inference, comprises the steps of:
First, the regulatory relationships to be judged must be determined: a direct regulatory relationship that may or may not exist between two genes that also have an indirect regulatory relationship in the regulatory network is called an "uncertain regulation". This step is implemented by determining an upper-bound set GU and a lower-bound set GL of the gene regulatory network. The upper-bound set is obtained from the previously inferred GRN, and the lower-bound set is obtained by removing all "uncertain regulations" from it. The upper- and lower-bound sets are updated continuously in the subsequent steps until the two sets are identical.
After the indirect regulatory path of an uncertain regulation has been disconnected, the regulating gene is perturbed and the resulting change in gene expression level in the current GRN is observed, to determine whether a direct regulatory relationship really exists between the two genes. To accomplish this, the optimal set of genes whose knockout breaks the indirect pathways must be found.
Wherein finding the optimal gene set capable of breaking the indirect pathways comprises the following steps:
the sets determined by the following rules are called "edge separation" sets:
1. $S_1(i, j)$ = children of $i$ in GU ∩ ancestors of $j$ in GU
2. $S_2(i, j)$ = descendants of $i$ in GU ∩ parents of $j$ in GU
3. $S_3(i, j)$ = [...] of $j$ in GU ∩ descendants of $i$ in GU
When several indirect regulations exist in the network, the edge separation sets of each are found, and the set shared by the largest number of them is taken as the optimal gene set to be knocked out.
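The first two edge separation rules can be sketched on a small directed graph. This is an illustration only: the graph is stored as a hypothetical dict mapping each gene to the set of genes it regulates in GU, the third rule is left out because one of its operands is not legible in the text, and choosing the separation set that occurs most often is one possible reading of the selection step.

```python
from collections import Counter


def children(graph, node):
    """Genes directly regulated by `node`."""
    return set(graph.get(node, ()))


def descendants(graph, node):
    """All genes reachable from `node` along directed edges."""
    seen, stack = set(), list(graph.get(node, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:
            seen.add(cur)
            stack.extend(graph.get(cur, ()))
    return seen


def parents(graph, node):
    """Genes that directly regulate `node`."""
    return {u for u, targets in graph.items() if node in targets}


def ancestors(graph, node):
    """Genes from which `node` can be reached (the node itself is excluded)."""
    return {u for u in graph if u != node and node in descendants(graph, u)}


def s1(graph, i, j):
    return children(graph, i) & ancestors(graph, j)


def s2(graph, i, j):
    return descendants(graph, i) & parents(graph, j)


def most_shared_set(graph, uncertain_edges):
    """Pick the separation set that occurs most often across uncertain edges."""
    counts = Counter()
    for i, j in uncertain_edges:
        for sep in (s1(graph, i, j), s2(graph, i, j)):
            if sep:
                counts[frozenset(sep)] += 1
    return set(counts.most_common(1)[0][0]) if counts else set()


# Invented upper-bound network: A -> C and A -> D also run indirectly through B
gu = {"A": {"B", "C", "D"}, "B": {"C", "D"}, "C": set(), "D": set()}
best = most_shared_set(gu, [("A", "C"), ("A", "D")])
```

Here knocking out B alone interrupts both indirect paths, so a single knockout experiment can test both uncertain edges.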
Wherein determining a gene regulatory network based on the inferred intergenic relationships comprises the steps of:
determining a parameter factor according to the number of regulatory relationships in the real network;
and determining the final gene regulatory network according to the parameter factor and the probability values in the inferred relation factor table.
Compared with the prior art, the invention has the following beneficial effects. The method classifies genes by analyzing steady-state expression data before and after gene knockout, and uses the classification as prior knowledge to reduce time and space complexity and improve inference accuracy. Distribution distances between genes are calculated from the single-cell transcription data, a multi-dimensional regression model over multiple time points is established for each gene in combination with the gene classification result, and the relation factors in the model are solved mathematically. The gene knockout data are analyzed algorithmically to remove part of the false-positive judgments, which compensates for the limitations of analyzing dynamic data alone. The method addresses the high computational complexity of analyzing time-series single-cell data and improves the accuracy of the inferred gene regulatory network.
Drawings
FIG. 1 is a schematic flow diagram of a method for inferring a gene regulatory network from single cell transcription and gene knockout data according to the present invention;
FIG. 2 is a schematic diagram of a time-series single cell expression profile, which is data used in the method for deducing a gene regulatory network by using single cell transcription and gene knockout data provided by the invention.
FIG. 3 is a single cell expression value of a gene at multiple time points in a method for inferring gene regulatory networks from single cell transcription and gene knockout data according to the present invention.
FIG. 4 is a line graph of the level of gene expression changes at several time points for several genes in a method for inferring gene regulatory networks using single cell transcription and gene knockout data as provided by the present invention.
FIG. 5 is a schematic diagram of a final inference result of a certain gene regulatory network in the method for inferring the gene regulatory network from single-cell transcription and gene knockout data according to the present invention.
Detailed Description
The technical solution of the present invention will be further described in more detail with reference to the following embodiments. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, fig. 1 is a schematic flow chart of a method for inferring a gene regulatory network from single cell transcription and gene knockout data according to the present invention. The method comprises the following steps:
S110: steady-state gene transcription expression data are collected in a plurality of cells by using a gene knockout technology, and after each cell is stimulated, the transcription expression data of single cells are collected at a plurality of time points.
The step S110 includes:
1. using GNW to simulate gene knockout experiments, steady-state gene expression data are obtained for wild-type and knockout organisms. The wild-type expression data come from the original strain of the organism, which has not undergone any mutation; the knockout data come from strains in which one or more genes of the original strain have been knocked out or attenuated;
2. gene expression values collected from a plurality of single cells at a plurality of time points constitute time-series single cell expression data.
S120: analyzing steady-state expression data before and after gene knockout to classify the genes.
Each gene is perturbed by the gene knockout technology as described above, and the steady-state expression values of all genes at the corresponding moments are collected. The genes are then divided into four classes, URG, RRG, NRG and ISG, as follows:
(a) unregulated regulatory gene (URG): perturbing any other gene has no influence on the current gene, but perturbing the current gene influences the expression level of other genes;
(b) regulated regulatory gene (RRG): perturbing some other gene causes the expression level of the current gene to fluctuate, and perturbing the current gene influences the expression level of other genes;
(c) non-regulatory gene (NRG): perturbing some other gene causes the expression level of the current gene to fluctuate, but perturbing the current gene does not affect the expression level of any other gene;
(d) independent gene (ISG): perturbing any other gene has no effect on the current gene, and perturbing the current gene does not affect the expression level of any other gene.
S130: calculating the distribution distance between genes by using single cell transcription data, establishing a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification, and calculating a relation factor in the model by adopting a mathematical method.
Calculating the distribution distances between genes from the single-cell transcription data comprises the following steps:
the time-series single-cell expression data cover multiple genes, multiple time points and multiple cells. Let $g$ be the number of genes, $n$ the number of measured time points, and $C_t$ the number of cells in the sample at the $t$-th time point ($t = 1, 2, \ldots, n$). The data then comprise $n$ data matrices
Figure GDA0002418835090000071
with matrix elements
Figure GDA0002418835090000072
where each element is the transcriptional expression value of gene j, i.e., the number of mRNA molecules of gene j in the i-th cell at the k-th time point.
GRN inference uses the information contained in the single-cell gene expression data set, in particular the changes in gene expression distributions. The distance between the expression distributions of a gene at two consecutive time points is calculated first, to quantify how the expression of each individual gene changes over time. The distribution distance of gene $j$ at time $t$ is quantified by equation (1).
$$DD_{j,t} = \max \left| F_{t+1}(A_j) - F_t(A_j) \right| \tag{1}$$
where $F_t(A_j)$ denotes the cumulative distribution function of $A_j$ at time $t$, i.e., the cumulative gene expression level of gene $j$ from time 0 to time $t$, and $DD_{j,t}$ indicates the level of expression change of gene $j$ between times $t$ and $t+1$.
With the distribution distances obtained from equation (1), the expression distributions of the other genes in a given time window are used to predict the expression distribution of the target gene in the next time window, which accounts for the regulation of a gene by other genes. The distribution distance of gene $j$ at time $t+1$ is expressed by the linear relationship in equation (2).
$$DD_{j,t+1} = \alpha_{1,j} DD_{1,t} + \alpha_{2,j} DD_{2,t} + \cdots + \alpha_{g,j} DD_{g,t} \tag{2}$$
where $\alpha_{g,j}$ denotes the factor describing the regulatory action of gene $g$ on gene $j$. Writing out the linear relationship for gene $j$ at all time points gives the matrix in equation (3). The $\alpha$ vector in the matrix holds the action relation factors between genes that are to be solved. The values of part of the $\alpha$ vector are determined from the gene classification result, and all solution vectors are then obtained by mathematical calculation.
Figure GDA0002418835090000081
where $g$ denotes the number of genes in the network, $n$ denotes the number of measured time points, and $DD_{j,n-1}$ denotes the expression change level of gene $j$ at the $(n-1)$-th time point.
S140: analyzing the gene knockout data algorithmically to remove part of the false-positive judgments.
The step S140 includes:
1. Analyzing the gene knockout data to remove direct regulatory relationships between two genes that were misjudged in the preliminary GRN inference.
Steady-state gene expression values are collected in the wild-type strain and in the knockout strains. If, in a knockout strain, the expression level of a gene other than the knocked-out gene changes substantially relative to the wild-type strain, the knocked-out gene regulates that gene; otherwise it does not. In this way part of the preliminarily inferred false-positive results can be removed. The specific steps are as follows:
(1) collecting the expression levels of all genes of a wild strain by using a gene chip technology;
(2) knocking out each gene by using a gene knocking-out technology, and simultaneously collecting the expression levels of all genes of the strain;
(3) comparing the expression levels of each gene before and after knockout, and judging by a chosen measure whether the gene has changed, thereby determining whether the current gene is regulated by the knocked-out gene.
Taking the difference between the two as the measure, let GK be the expression value in the knockout strain and GW the expression value in the wild-type strain. When $GK_{i,j} - GW_j > \alpha$, the expression level of gene $j$ is considered to have changed, and gene $j$ is regulated by gene $i$. Here $GK_{i,j}$ is the expression value of gene $j$ when gene $i$ is knocked out, with $0 < i < 11$ and $0 < j < 11$, and the value of $\alpha$ is determined from the overall profile of the collected gene expression levels. Inferred regulatory relationships that do not actually exist are set to 0, the regulatory relationship matrix is output again, and this matrix serves as the input for further analysis and processing.
2. Analyzing the gene knockout data to remove inferences in the preliminary GRN inference where indirect regulation was misjudged as direct regulation.
First, the regulatory relationships to be judged must be determined: a direct regulatory relationship that may or may not exist between two genes that also have an indirect regulatory relationship in the regulatory network is called an "uncertain regulation". This step is implemented by determining an upper-bound set GU and a lower-bound set GL of the gene regulatory network. The upper-bound set is obtained from the previously inferred GRN, and the lower-bound set is obtained by removing all "uncertain regulations" from it. The upper- and lower-bound sets are updated continuously in the subsequent steps until the two sets are identical.
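The split into an upper-bound and a lower-bound network can be sketched as a reachability check. The code below is a minimal illustration, assuming the network is stored as a dict mapping each gene to the set of genes it regulates; the function names are hypothetical, and the iterative updating of the two sets from knockout experiments is not shown.

```python
def has_indirect_path(graph, src, dst):
    """True if dst is reachable from src without using the direct edge src -> dst."""
    stack = [n for n in graph.get(src, ()) if n != dst]
    seen = set(stack)
    while stack:
        cur = stack.pop()
        for nxt in graph.get(cur, ()):
            if nxt == dst:
                return True
            if nxt not in seen and nxt != src:
                seen.add(nxt)
                stack.append(nxt)
    return False


def split_bounds(upper):
    """Return the lower-bound network and the set of uncertain edges."""
    uncertain = {
        (i, j)
        for i, targets in upper.items()
        for j in targets
        if has_indirect_path(upper, i, j)
    }
    lower = {
        i: {j for j in targets if (i, j) not in uncertain}
        for i, targets in upper.items()
    }
    return lower, uncertain


# Invented upper-bound network: A -> C is also explained by A -> B -> C
gu = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
gl, uncertain = split_bounds(gu)
```

Each uncertain edge that a later knockout experiment confirms would be added back to the lower bound, and each one it refutes would be removed from the upper bound, until the two networks coincide.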
After the indirect regulatory path of an uncertain regulation has been disconnected, the regulating gene is perturbed and the resulting change in gene expression level in the current GRN is observed, to determine whether a direct regulatory relationship really exists between the two genes. To accomplish this, the optimal set of genes whose knockout breaks the indirect pathways must be found.
The sets determined by the following rules are called "edge separation" sets:
1. $S_1(i, j)$ = children of $i$ in GU ∩ ancestors of $j$ in GU
2. $S_2(i, j)$ = descendants of $i$ in GU ∩ parents of $j$ in GU
3. $S_3(i, j)$ = [...] of $j$ in GU ∩ descendants of $i$ in GU
When several indirect regulations exist in the network, the edge separation sets of each are found, and the set shared by the largest number of them is taken as the optimal gene set to be knocked out.
S150: determining a gene regulatory network based on the inferred intergenic relationships, comprising the steps of:
determining a parameter factor according to the number of regulatory relationships in the real network;
and determining the final gene regulatory network according to the parameter factor and the probability values in the inferred relation factor table.
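The text does not spell out how the parameter factor is applied. One plausible reading, sketched here purely as an assumption, is that the parameter fixes how many regulatory edges to keep, and the edges with the largest relation factors are selected. The function name `select_network`, the argument `n_edges`, and the toy matrix are hypothetical.

```python
def select_network(relation, n_edges):
    """Keep the n_edges strongest non-zero relation factors as the final network.

    relation[i][j] : relation factor for gene i regulating gene j
    n_edges        : assumed parameter, e.g. the number of regulations expected
    """
    g = len(relation)
    scored = [
        (abs(relation[i][j]), i, j)
        for i in range(g)
        for j in range(g)
        if i != j and relation[i][j] != 0.0
    ]
    # Strongest factors first; ties are broken by gene index for reproducibility
    scored.sort(key=lambda e: (-e[0], e[1], e[2]))
    return [(i, j) for _, i, j in scored[:n_edges]]


# Invented relation factor table for three genes
relation = [
    [0.0, 0.9, 0.1],
    [0.3, 0.0, 0.7],
    [0.0, 0.05, 0.0],
]
edges = select_network(relation, n_edges=2)
```

An equivalent formulation would threshold the factors directly; ranking is used here only because it ties the cutoff to the expected number of edges.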
In contrast to the prior art, the method classifies genes by analyzing steady-state expression data before and after gene knockout, and uses the classification as prior knowledge to reduce time and space complexity and improve inference accuracy. Distribution distances between genes are calculated from the single-cell transcription data, a multi-dimensional regression model over multiple time points is established for each gene in combination with the gene classification result, and the relation factors in the model are solved mathematically. The gene knockout data are analyzed algorithmically to remove part of the false-positive judgments, which compensates for the limitations of analyzing dynamic data alone. The method addresses the high computational complexity of analyzing time-series single-cell data and improves the accuracy of the inferred gene regulatory network.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (5)

1. A method for inferring gene regulatory networks from single cell transcription and gene knockout data comprising:
step one, collecting steady-state gene transcription expression data in a plurality of cells respectively by using a gene knockout technology, and collecting the transcription expression data of a single cell at a plurality of time points after stimulating each cell, wherein the specific operation is as follows:
using a GNW-simulated gene knockout experiment to obtain steady-state gene expression data for wild-type and knockout organisms, wherein the wild-type expression data is from an original strain of the organism that did not undergo any mutation; knock-out data are from strains of the organism from which one or more genes have been knocked out or attenuated by the original strain;
gene expression values collected from a plurality of single cells at a plurality of time points constitute time-series single cell expression data;
step two, analyzing steady-state expression data before and after gene knockout to classify the genes;
step three, calculating distribution distances among genes by using the single-cell transcription data, establishing a multi-dimensional regression model of multiple time points for each gene by combining the prior knowledge of gene classification, and calculating a relation factor in the model by adopting a mathematical method;
step four, performing corresponding algorithm analysis on the gene knockout data to remove part of false positive judgment, and the method comprises the following steps:
analyzing the direct regulation relationship between two genes misjudged in the gene knockout data removal preliminary GRN inference, and analyzing the direct regulation relationship between two genes misjudged in the gene knockout data removal preliminary GRN inference, comprising the steps of:
collecting steady-state gene expression values in wild-type strains and knockout strains respectively; if the expression level of other genes except the knocked-out gene in the latter is changed greatly compared with that in the former, the knocked-out gene has a regulation effect on the gene, and otherwise, the knocked-out gene has no regulation effect; therefore, the method can remove part of the preliminarily inferred false positive results, and comprises the following specific steps:
(1) collecting the expression levels of all genes of a wild strain by using a gene chip technology;
(2) knocking out each gene by using a gene knocking-out technology, and simultaneously collecting the expression levels of all genes of the strain;
(3) comparing the change of the expression level of each gene before and after knockout, and judging whether the gene is changed by adopting a certain measurement mode so as to judge whether the current gene is regulated and controlled by the knocked-out gene;
taking the difference between the two as a measure, assuming that GK is the expression value of the knockout strain and GW is the expression value of the wild strain, then: GKi,j-GWj>α when the expression level of gene j is changed, gene j is regulated by gene i, wherein GKi,jExpression value of Gene j at the time of knocking-out Gene i, 0<i<11,0<j<Setting the inferred regulation relation which does not exist actually as 0, and outputting a regulation relation matrix again, wherein the matrix is used as the input of further analysis and processing;
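The thresholding rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `filter_by_knockout`, the list-of-lists data layout, and the optional `use_abs` switch (which would also catch decreases in expression) are assumptions that do not appear in the claim.

```python
def filter_by_knockout(prelim, gk, gw, alpha, use_abs=False):
    """Zero out preliminary edges i -> j that the knockout data do not support.

    prelim[i][j]: confidence of the preliminarily inferred regulation of gene j by gene i
    gk[i][j]:     steady-state expression of gene j in the strain with gene i knocked out
    gw[j]:        steady-state expression of gene j in the wild-type strain
    alpha:        change threshold
    use_abs:      hypothetical option, compares |GK - GW| instead of GK - GW
    """
    g = len(gw)
    filtered = [row[:] for row in prelim]  # copy, so the input matrix is left untouched
    for i in range(g):
        for j in range(g):
            if i == j:
                continue  # the knocked-out gene itself is not tested
            diff = gk[i][j] - gw[j]
            if use_abs:
                diff = abs(diff)
            if diff <= alpha:
                filtered[i][j] = 0.0  # no supported regulation of j by i
    return filtered


# Toy data: knocking out gene 0 raises gene 1; nothing else changes noticeably.
gw = [1.0, 1.0, 1.0]
gk = [[0.0, 1.8, 1.05],
      [1.0, 0.0, 1.0],
      [1.0, 1.0, 0.0]]
prelim = [[0.0, 0.9, 0.7],
          [0.4, 0.0, 0.2],
          [0.0, 0.0, 0.0]]
print(filter_by_knockout(prelim, gk, gw, alpha=0.5))
```

In the toy data only the edge from gene 0 to gene 1 survives, because it is the only pair whose knockout difference exceeds the threshold.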
using the gene knockout data to remove inferences from the preliminary GRN that misjudge indirect regulation as direct regulation, comprising the steps of:
first, determining the regulatory relationships to be judged, namely direct regulatory relationships that may or may not exist between two genes that already have an indirect regulatory relationship in the regulatory network, referred to as 'uncertain regulation'; next, determining an upper-bound set GU and a lower-bound set GL of the gene regulatory network, wherein the upper-bound set is obtained from the previously inferred GRN and the lower-bound set is obtained by removing all 'uncertain regulation' from it; and continuously updating the upper- and lower-bound sets through the subsequent steps until the two sets are equal;
after disconnecting the indirect regulatory path of an 'uncertain regulation', perturbing a regulated gene and observing the change in gene expression levels in the current GRN to determine whether a direct regulatory relationship really exists between the two genes; to achieve this, it is necessary to find the optimal gene set capable of breaking the indirect path, comprising the steps of:
the sets determined by the following rules are called 'edge separations':
(1) S1(i, j) = children of i in GU ∩ ancestors of j in GU;
(2) S2(i, j) = descendants of i in GU ∩ parents of j in GU;
(3) S3(i, j) = … of j in GU ∩ descendants of i in GU;
when several indirect regulations exist in the network, the edge separations of each are found, and the set that occurs most often among them is the optimal gene set to be knocked out;
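As an illustration of the first two rules and of the counting step, the following Python sketch computes S1 and S2 on a small directed graph and picks the most frequent candidate set. The third rule is not reproduced because its definition is incomplete in the text. The graph representation (a dict mapping each gene to the set of its children) and all function names are assumptions made for this example.

```python
from collections import Counter


def descendants(children, node):
    """All nodes reachable from `node` by following edges in `children`."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children.get(n, ()))
    return seen


def invert(children):
    """Build the parent map (child -> set of parents) from the child map."""
    parents = {}
    for p, cs in children.items():
        for c in cs:
            parents.setdefault(c, set()).add(p)
    return parents


def edge_separations(children, i, j):
    """Return (S1, S2) for the uncertain edge i -> j in the upper-bound graph."""
    parents = invert(children)
    ancestors_j = descendants(parents, j)  # ancestors = descendants in the inverted graph
    s1 = set(children.get(i, ())) & ancestors_j
    s2 = descendants(children, i) & parents.get(j, set())
    return s1 - {i, j}, s2 - {i, j}


def best_knockout_set(children, uncertain_edges):
    """Most frequent non-empty separation set over all uncertain edges."""
    counts = Counter()
    for i, j in uncertain_edges:
        candidates = {frozenset(s) for s in edge_separations(children, i, j) if s}
        counts.update(candidates)  # each distinct set counts once per edge
    return counts.most_common(1)[0][0] if counts else frozenset()


# Upper-bound graph: 1 -> 2 -> 3 and 2 -> 4, plus uncertain direct edges 1 -> 3 and 1 -> 4.
gu = {1: {2, 3, 4}, 2: {3, 4}}
print(edge_separations(gu, 1, 3))
print(best_knockout_set(gu, [(1, 3), (1, 4)]))
```

In this toy graph gene 2 lies on every indirect path from gene 1 to genes 3 and 4, so knocking it out would separate both uncertain edges.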
determining the gene regulatory network from the inferred intergenic relationships, comprising the steps of:
determining a parameter factor according to the number of regulatory relationships in the real network;
determining the final gene regulatory network according to the parameter factor and the probability values in the inferred relation factor table;
and step five, determining the gene regulatory network according to the inferred relationships between the genes.
2. The method for inferring a gene regulatory network from single-cell transcription and gene knockout data according to claim 1, wherein step two, analyzing the steady-state expression data before and after gene knockout to classify the genes, specifically comprises:
perturbing each gene with the gene knockout technique as described above and collecting the steady-state expression values of all genes at the corresponding time, the genes being divided into four classes, URG, RRG, NRG and ISG, as follows:
(a) unregulated regulatory gene (URG): perturbing any of the other genes has no effect on the current gene, but perturbing the current gene affects the expression level of other genes;
(b) regulated regulatory gene (RRG): the expression level of the current gene fluctuates when some other gene is perturbed, and perturbing the current gene affects the expression level of other genes;
(c) non-regulatory gene (NRG): the expression level of the current gene fluctuates when some other gene is perturbed, but perturbing the current gene does not affect the expression level of any other gene;
(d) independent gene (ISG): perturbing any of the other genes has no effect on the current gene, and perturbing the current gene does not affect the expression level of any other gene.
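The four classes can be derived mechanically from a knockout matrix, as in the following Python sketch. The claim does not state how "has an effect" is measured, so the absolute-difference threshold `alpha`, the function name `classify_genes`, and the data layout are assumptions for illustration only.

```python
def classify_genes(gk, gw, alpha):
    """Assign each gene to URG, RRG, NRG or ISG.

    gk[i][j]: steady-state expression of gene j when gene i is knocked out
    gw[j]:    steady-state expression of gene j in the wild type
    alpha:    assumed threshold on |gk[i][j] - gw[j]| for calling a change
    """
    g = len(gw)

    def changed(i, j):
        return abs(gk[i][j] - gw[j]) > alpha

    classes = []
    for k in range(g):
        # gene k is regulated if knocking out some other gene changes it
        is_regulated = any(changed(i, k) for i in range(g) if i != k)
        # gene k is a regulator if knocking it out changes some other gene
        regulates = any(changed(k, j) for j in range(g) if j != k)
        if regulates:
            classes.append("RRG" if is_regulated else "URG")
        else:
            classes.append("NRG" if is_regulated else "ISG")
    return classes


# Toy chain 0 -> 1 -> 2, with gene 3 isolated.
gw = [1.0, 1.0, 1.0, 1.0]
gk = [[0.0, 2.0, 1.6, 1.0],
      [1.0, 0.0, 2.0, 1.0],
      [1.0, 1.0, 0.0, 1.0],
      [1.0, 1.0, 1.0, 0.0]]
print(classify_genes(gk, gw, alpha=0.5))
```

For the toy chain, gene 0 comes out as URG, gene 1 as RRG, gene 2 as NRG and gene 3 as ISG.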
3. The method for inferring a gene regulatory network from single-cell transcription and gene knockout data according to claim 1, wherein step three, calculating gene distribution distances from the single-cell transcription data, establishing a multi-dimensional regression model over multiple time points for each gene in combination with the prior knowledge of gene classification, and solving the relation factors in the model mathematically, comprises the steps of:
calculating the distribution distances of the genes from the time-series single-cell expression data;
using the expression distributions of the other genes in a given time window to 'predict' the expression distribution of the target gene in the next time window, i.e. establishing a multi-dimensional regression model over multiple time points for each gene;
the model contains the intergenic relation factors, which form the solution vector to be solved; the solution vector is obtained by least squares with a penalty term, a larger factor indicating higher confidence in the corresponding regulatory inference.
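A penalized least-squares solve of this kind might look like the following pure-Python sketch. The claim does not name the penalty, so the L2 (ridge) penalty used here is an assumption, as are the names `ridge_solve` and `fit_gene` and the layout `dd[t][j]` for the distribution distance of gene j in time window t.

```python
def ridge_solve(X, y, lam):
    """Solve min ||X a - y||^2 + lam * ||a||^2 via the normal equations.

    Assumes lam > 0 or X of full column rank, so that the system is non-singular.
    """
    m, p = len(X), len(X[0])
    # Augmented system [X^T X + lam * I | X^T y]
    M = [[sum(X[r][a] * X[r][b] for r in range(m)) + (lam if a == b else 0.0)
          for b in range(p)]
         + [sum(X[r][a] * y[r] for r in range(m))]
         for a in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for c in range(col, p + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution
    a = [0.0] * p
    for k in range(p - 1, -1, -1):
        a[k] = (M[k][p] - sum(M[k][c] * a[c] for c in range(k + 1, p))) / M[k][k]
    return a


def fit_gene(dd, j, lam):
    """Relation factors of all genes on target gene j.

    Row t of the design matrix holds the distances of all genes in window t;
    the response is the distance of gene j in window t + 1.
    """
    X = dd[:-1]
    y = [row[j] for row in dd[1:]]
    return ridge_solve(X, y, lam)


# Toy distances for two genes over three consecutive time windows.
dd = [[1.0, 0.0],
      [0.0, 1.0],
      [3.0, 5.0]]
print(fit_gene(dd, 0, lam=0.0))
print(fit_gene(dd, 0, lam=1.0))
```

Increasing `lam` shrinks the factors toward zero, which is the usual trade-off of a ridge penalty; a sparsity-inducing penalty would need a different solver.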
4. The method for inferring a gene regulatory network from single-cell transcription and gene knockout data according to claim 3, wherein calculating the gene distribution distances from the single-cell transcription data in step three comprises the steps of:
the time-series single-cell expression data comprise a plurality of genes, a plurality of time points and a plurality of cells; let g be the number of genes, n the number of measured time points, and $C_t$ the number of cells in the sample at the t-th time point ($t = 1, 2, \ldots, n$); the data then comprise n data matrices [equation image FDA0002418835080000041], whose elements [equation image FDA0002418835080000042] are the transcriptional expression values of gene j, i.e. the number of mRNA molecules of gene j in the i-th cell at the k-th time point;
GRN inference uses the information contained in single-cell gene expression data sets, in particular the change in the distribution of gene expression; first, the distribution distance of each gene's expression between two time points is calculated to quantify how the expression of each individual gene changes over time; the distribution distance $DD_{j,t}$ of gene j at time t is:
$DD_{j,t} = \max \left| F_{t+1}(A_j) - F_t(A_j) \right|$
wherein $F_t(A_j)$ denotes the cumulative distribution function of $A_j$ at time t, i.e. the cumulative gene expression level of gene j from time 0 to time t; $DD_{j,t}$ indicates the level of expression change of gene j between times t and t+1.
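A minimal Python sketch of this distance follows. It rests on one stated assumption: $F_t$ is taken to be the empirical cumulative distribution of the expression of gene j across the cells sampled at time point t, so that the distance is the maximum vertical gap between two empirical distribution functions. The names `distribution_distance` and `dd_table` and the layout `data[t][i][j]` are likewise illustrative.

```python
from bisect import bisect_right


def distribution_distance(sample_t, sample_next):
    """max |F_{t+1}(x) - F_t(x)| over all observed expression values x."""
    a, b = sorted(sample_t), sorted(sample_next)
    grid = sorted(set(a) | set(b))
    # bisect_right(s, x) / len(s) is the empirical CDF of sample s at x
    return max(abs(bisect_right(b, x) / len(b) - bisect_right(a, x) / len(a))
               for x in grid)


def dd_table(data):
    """data[t][i][j]: expression of gene j in cell i at time point t.

    Returns dd[t][j], the distance of gene j between time points t and t + 1.
    The number of cells may differ between time points.
    """
    n_genes = len(data[0][0])
    table = []
    for t in range(len(data) - 1):
        row = []
        for j in range(n_genes):
            now = [cell[j] for cell in data[t]]
            nxt = [cell[j] for cell in data[t + 1]]
            row.append(distribution_distance(now, nxt))
        table.append(row)
    return table


# Two time points, two cells each, two genes: gene 0 shifts completely, gene 1 does not move.
data = [[[0, 1], [0, 2]],
        [[5, 1], [5, 2]]]
print(dd_table(data))
```

With n time points the table has n − 1 rows, one per pair of consecutive time points.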
5. The method for inferring a gene regulatory network from single-cell transcription and gene knockout data according to claim 3, wherein establishing, in step three, a multi-dimensional regression model over multiple time points for each gene in combination with the prior knowledge of gene classification comprises the steps of:
using the expression distributions of the other genes in a given time window to 'predict' the expression distribution of the target gene in the next time window, so as to describe how a given gene is regulated by the other genes; the linear relationship for gene j at time t+1 is:
$DD_{j,t+1} = \alpha_{1,j} DD_{1,t} + \alpha_{2,j} DD_{2,t} + \cdots + \alpha_{g,j} DD_{g,t}$
wherein $\alpha_{g,j}$ denotes the regulatory action factor of gene g on gene j; writing out the linear relationship for gene j at all time points gives the following matrix:
[equation image FDA0002418835080000043]
where g denotes the number of genes in the network, n denotes the number of measured time points, and $DD_{j,n-1}$ denotes the expression change level of gene j at the (n-1)-th time point; the $\alpha$ vector in the matrix is the unknown to be solved; the values of part of the $\alpha$ vector are determined, and all solution vectors are then obtained by mathematical computation.
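One possible reading of "the values of part of the α vector are determined" is that the gene classes fix some factors at zero before the regression is solved. The Python sketch below illustrates that reading only; it is an interpretation, not something the claim spells out, and the helper names are hypothetical.

```python
REGULATORS = ("URG", "RRG")   # classes whose knockout affects other genes
UNREGULATED = ("URG", "ISG")  # classes that no other gene affects


def allowed_regulators(classes, target):
    """Indices p whose factor alpha[p][target] still has to be estimated."""
    if classes[target] in UNREGULATED:
        return []  # every factor for this target is fixed at zero
    return [p for p, c in enumerate(classes) if c in REGULATORS]


def restrict_columns(rows, cols):
    """Keep only the columns of the distance table that belong to allowed regulators."""
    return [[row[c] for c in cols] for row in rows]


def expand(alpha_sub, cols, g):
    """Place the estimated factors back into a length-g vector, zeros elsewhere."""
    alpha = [0.0] * g
    for value, c in zip(alpha_sub, cols):
        alpha[c] = value
    return alpha


classes = ["URG", "RRG", "NRG", "ISG"]
cols = allowed_regulators(classes, 2)
print(cols)
print(restrict_columns([[0.1, 0.2, 0.3, 0.4]], cols))
print(expand([0.7, 0.5], cols, len(classes)))
```

Under this reading, fewer unknowns remain per target gene, which is consistent with the stated aim of using the classification to reduce computational cost.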

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636618.XA CN110517724B (en) 2019-07-15 2019-07-15 Method for deducing gene regulation network by using single cell transcription and gene knockout data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910636618.XA CN110517724B (en) 2019-07-15 2019-07-15 Method for deducing gene regulation network by using single cell transcription and gene knockout data

Publications (2)

Publication Number Publication Date
CN110517724A CN110517724A (en) 2019-11-29
CN110517724B true CN110517724B (en) 2020-05-22

Family

ID=68623240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636618.XA Active CN110517724B (en) 2019-07-15 2019-07-15 Method for deducing gene regulation network by using single cell transcription and gene knockout data

Country Status (1)

Country Link
CN (1) CN110517724B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504314B (en) * 2023-06-27 2023-08-29 华东交通大学 Gene regulation network construction method based on cell dynamic differentiation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184049A (en) * 2015-08-10 2015-12-23 上海交通大学 Microbial growth phenotype predication method based on control-metabolic network integration model
CN108090326A (en) * 2018-02-09 2018-05-29 国家卫生计生委科学技术研究所 The construction method of unicellular network regulation relation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6633819B2 (en) * 1999-04-15 2003-10-14 The Trustees Of Columbia University In The City Of New York Gene discovery through comparisons of networks of structural and functional relationships among known genes and proteins

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Large Scale Modeling of Genetic Networks Using Gene Knockout Data; A.S.K. Youseph et al.; 《2018 Association for Computing Machinery》; 20181231; pp. 2-6 *
Optimal design of gene knockout experiments for gene regulatory network inference; S. M. Minhaz Ud-Dean et al.; 《Bioinformatics》; 20151114; pp. 2-4 *
SINCERITIES: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles; Nan Papili Gao et al.; 《Bioinformatics》; 20170914; vol. 34, no. 2; pp. 259-265 *

Also Published As

Publication number Publication date
CN110517724A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN111785328B (en) Coronavirus sequence identification method based on gated cyclic unit neural network
CN101719195B (en) Inference method of stepwise regression gene regulatory network
CN111638707B (en) Intermittent process fault monitoring method based on SOM clustering and MPCA
Guh Robustness of the neural network based control chart pattern recognition system to non‐normality
CN115580445B (en) Unknown attack intrusion detection method, unknown attack intrusion detection device and computer readable storage medium
CN110517724B (en) Method for deducing gene regulation network by using single cell transcription and gene knockout data
CN113364751A (en) Network attack prediction method, computer-readable storage medium, and electronic device
CN116522263A (en) Abnormal cell detection method and device
CN110555530B (en) Distributed large-scale gene regulation and control network construction method
CN115831263A (en) Method and system for optimizing qualification rate of electrolytic refining high-purity indium product
KR100686399B1 (en) Lightweight intrusion detection method through correlation based hybrid feature selection
Zhang et al. Dbiecm-an evolving clustering method for streaming data clustering
CN109215738A (en) The prediction technique of Alzheimer's disease related gene
CN116246705B (en) Analysis method and device for whole genome sequencing data
CN107766887A (en) A kind of local weighted deficiency of data mixes clustering method
CN114863994B (en) Pollution assessment method, device, electronic equipment and storage medium
CN111816259B (en) Incomplete multi-study data integration method based on network representation learning
CN111583990B (en) Gene regulation network inference method combining sparse regression and elimination rule
CN111863136A (en) Integrated system and method for correlation analysis among multiple sets of chemical data
CN112765219A (en) Stream data abnormity detection method for skipping steady region
Qu et al. Biogeographical Ancestry Inference from Genotype: A Comparison of Ancestral Informative SNPs and Genome-wide SNPs
CN111681706A (en) Method for detecting chronic disease risk gene
Zhang et al. Potentiality of risk SNPs identification based on GSP theory
CN115618921B (en) Knowledge distillation method, apparatus, electronic device, and storage medium
Cha et al. ROC analysis of an erythroblast morphologic scoring system to improve identification of fetal cells in maternal blood

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant