CN111091866B - Method for identifying long non-coding RNA-transcription factor-gene regulatory motifs


Info

Publication number
CN111091866B
Authority
CN
China
Prior art keywords
gene
expression
lncrna
data
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911147482.2A
Other languages
Chinese (zh)
Other versions
CN111091866A (en
Inventor
李爱民
刘雅君
刘光明
费蓉
周红芳
黑新宏
王磊
赵中明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN201911147482.2A priority Critical patent/CN111091866B/en
Publication of CN111091866A publication Critical patent/CN111091866A/en
Application granted granted Critical
Publication of CN111091866B publication Critical patent/CN111091866B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method for identifying long non-coding RNA (lncRNA)-transcription factor (TF)-gene regulatory motifs comprises the following steps: step 1, obtaining gene expression data; step 2, filtering the gene expression data; step 3, obtaining the regulatory relationships between biomolecules; step 4, obtaining high-expression and low-expression lncRNA groups; step 5, designing a multiple linear regression model; step 6, processing the results. The method can be used to identify lncRNA-TF-gene regulatory motifs in complex diseases. Strict filtering conditions yield reliable lncRNA, TF and gene expression data and reliable TF-gene regulatory relationship data, so that the output of the subsequent multiple linear regression model is more reliable and systematic error is reduced.

Description

Method for identifying long non-coding RNA-transcription factor-gene regulatory motifs
Technical Field
The invention belongs to the technical field of identification of lncRNA-TF-gene regulatory motifs, and particularly relates to an effective method for identifying lncRNA-TF-gene regulatory motifs based on a multiple linear regression model.
Background
Many serious diseases threaten human health and life. Among them, cancer is one of the most intensively studied complex diseases. Globally, about 15% of deaths are caused by cancer, and its diagnosis and treatment still face significant challenges. In January 2019, the National Cancer Center of China published its latest cancer report in the Chinese Journal of Oncology. According to the report, on average 7.5 people per minute in China are diagnosed with cancer. As China's population ages, cancer incidence is rising year by year, and the number of cancer cases and deaths keeps increasing. Cancer prevention and treatment has therefore attracted great attention from all parties. A challenging worldwide problem is to investigate the mechanisms of cancer occurrence and development in order to find effective methods for preventing, diagnosing, monitoring and treating cancer. Cancer is a chronic complex disease associated with genetic alterations, including epigenetic changes, DNA deletions and insertions, copy number variations, chromosomal translocations, and the like. Non-coding RNAs are a class of RNAs that are not translated into proteins. Common non-coding RNAs include miRNA, siRNA, piRNA, lncRNA, circRNA, and the like. Studies show that non-coding RNAs have important physiological functions in various cancers; in particular, long non-coding RNAs (lncRNAs) play important roles in the occurrence, development and metastasis of cancer. In recent years, high-throughput sequencing combined with bioinformatic analysis has revealed many lncRNAs that are aberrantly expressed or mutated in cancer. Current research indicates that some lncRNAs act as oncogenes and can serve as cancer markers to assist diagnosis and treatment.
Long non-coding RNA (lncRNA) generally refers to RNA that does not encode a protein and is more than 200 nucleotides long. lncRNAs regulate the expression of other coding or non-coding genes in several ways, including transcriptional regulation, post-transcriptional regulation and epigenetic regulation. Research has shown that an lncRNA can act as a competing endogenous RNA (ceRNA), binding miRNAs and acting as a miRNA sponge in the cell, thereby reducing miRNA activity and indirectly up-regulating the expression of the miRNA's target genes.
Many recent articles, published in journals such as Science, Cell and Molecular Cell, have reported that such non-coding RNAs play an important role in tumor regulation. Although lncRNAs are known to be important biomolecules in cancer, their contribution to cancer remains largely unclear. Several studies have shown that lncRNAs can mediate gene expression. However, few studies have explored how lncRNAs, for example through lncRNA-mediated miRNAs, modulate TF-gene interactions and thereby participate in cancer.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method for identifying lncRNA-TF-gene regulatory motifs based on a multiple linear regression model. Strict filtering conditions are adopted to obtain reliable lncRNA, TF and gene expression data and reliable TF-gene regulatory relationship data, so that the output of the subsequent multiple linear regression model is more reliable and systematic error is reduced. The samples are grouped by lncRNA expression level, and the high-expression group is then compared with the low-expression group. A multiple linear regression model is designed to fit the influence of the lncRNA on the TF-gene relationship. The resulting lncRNA-TF-gene regulatory motifs can be used to analyze regulatory mechanisms among biomolecules, to study the mechanisms of disease occurrence and development, and to discover new disease markers.
In order to achieve the purpose, the invention adopts the technical scheme that: a method for identifying lncRNA-TF-gene regulatory motifs based on a multiple linear regression model comprises the following steps:
step 1, obtaining Gene expression data
Genes associated with various cancers, their expression quantification data, and diagnostic data are downloaded from the XENA database. The Cancer Genome Atlas (TCGA) provides transcriptome sequencing data for 33 common cancer types and adjacent normal tissues, and these TCGA samples can be used to analyze expression profiles. The raw RNA-seq data provided by TCGA are used to calculate the expression levels of lncRNAs, TFs and genes, expressed as FPKM (the number of fragments per kilobase of exon per million mapped reads);
step 2, Filtering of Gene expression data
The lncRNAs (long non-coding RNAs), TFs (transcription factors) and genes are filtered. First, only lncRNAs, TFs and genes whose FPKM expression level is greater than 1 in at least 50% of the samples are retained; biomolecules that do not satisfy this condition are discarded from the subsequent analysis steps. Second, the whole gene expression data set is divided by biomolecule category into three data sets: filtered lncRNA expression data, filtered transcription factor expression data and filtered gene expression data;
step 3, obtaining the regulation relation between the biomolecules
TF-gene mutual regulation relation data are obtained from a TRANSFAC database and a TRRUST database, and the intersection of the data obtained from the two databases is selected, so that the data are more reliable; for TF-gene, further filtering, TF in TF-gene must be expressed, i.e. FPKM for TF must be greater than 1 in at least half of the samples, gene must also be expressed, gene FPKM must be greater than 1 in at least half of the samples;
step 4, obtaining high-expression and low-expression long-chain non-coding RNA
The existing cancer samples are ranked from low to high by the expression value of each lncRNA, each lncRNA being analyzed independently. For each lncRNA, all samples are ranked by the expression level of that lncRNA; the one third of samples with the lowest expression form the lncRNA low-expression group, the one third with the highest expression form the lncRNA high-expression group, and the remaining samples form the middle-expression group;
step 5, designing a multiple linear regression model
The multiple linear regression model is as follows:
$$E_g \sim E_t + G_l + E_t{:}G_l \qquad \text{(Equation 1)}$$
where E denotes expression level and G denotes group, and the subscripts g, t and l denote gene, TF and lncRNA respectively. $E_g$ is the expression level of the gene, $E_t$ is the expression level of the transcription factor, $G_l$ is the grouping of the samples (low group or high group), and $E_t{:}G_l$ is the interaction between the transcription factor and the lncRNA grouping. With this model, the lncRNAs that significantly influence a TF-gene relationship can be identified, which yields the lncRNA-TF-gene regulatory motifs;
Copy number variation (CNV) affects gene expression to a large extent, so Equation 1 needs to be modified. The modified formula is:
$$E_g \sim E_t + G_l + C + E_t{:}G_l \qquad \text{(Equation 2)}$$
where C represents the copy number variation of the transcription factor or gene. If the p-value corresponding to C is less than 0.05, the CNV has a significant effect on the expression level of the gene, and the TF-gene pair is excluded. As before, E denotes expression level, G denotes group, and the subscripts g, t and l denote gene, TF and lncRNA; $E_g$ is the expression level of the gene, $E_t$ is the expression level of the transcription factor, $G_l$ is the grouping of the samples (low group or high group), and $E_t{:}G_l$ is the interaction between the transcription factor and the lncRNA grouping;
step 6, result processing
For each lncRNA-TF-gene triplet obtained in step 5, the p-value of each parameter is analyzed. Regulatory motifs for which the p-value of $G_l$ is less than 0.05 and the p-value of C is greater than 0.05 are retained, multiple-testing correction is then applied, and motifs with FDR < 0.05 form the final result.
The present application provides an algorithm suitable for identifying lncRNA-TF-gene regulatory motifs. A transcription factor (TF) can regulate the transcription efficiency of a protein-coding gene (gene); this regulatory relationship is denoted TF-gene. A long non-coding RNA (lncRNA) can in turn modulate the efficiency with which the TF regulates the gene; this regulatory relationship is called an lncRNA-TF-gene regulatory motif.
The invention has the beneficial effects that:
the invention adopts the latest and most authoritative database as a reliable data source, adopts strict data screening standards to ensure that the data is accurate, and adopts a multiple linear regression model to identify the regulation and control relationship among lncRNA, TF and gene for the first time, so the invention has the advantages of novel scheme and accurate result.
In the present invention, pan-cancer data from The Cancer Genome Atlas (TCGA) are analyzed in depth to determine lncRNA-TF-gene regulatory motifs. Authoritative resources such as TCGA, NCBI (National Center for Biotechnology Information), EBI (European Bioinformatics Institute) and GTEx (Genotype-Tissue Expression) provide a large amount of high-quality gene expression data, which lays a foundation for studying how lncRNAs regulate TF-gene relationships. TF-gene regulatory relationships can be obtained from the TRANSFAC and TRRUST databases. Based on the expression profiles of lncRNAs, TFs and genes, linear regression is applied to fit the effect of an lncRNA on a TF-gene interaction. The regulatory relationships among these molecules are analyzed by examining the changes (up- or down-regulation) in the relative expression levels of the lncRNA, TF and gene.
According to the invention, the obvious influence of Copy Number Variation (CNV) on gene expression is considered, so that the TF-gene regulation relation change caused by CNV is excluded. The invention can be used for identifying lncRNA-TF-gene regulatory motifs in complex diseases, and the regulatory motifs can be used for revealing that lncRNA participates in the occurrence and development of cancers through a multi-level complex regulatory mechanism and can also provide a new target for diagnosis and treatment.
The invention designs a multiple linear regression model to systematically identify lncRNA-TF-gene regulatory motifs widely existing in various cancer types. The method and results are very useful for researchers exploring lncRNA function through cancer next generation sequencing applications. The methods and resources provided by the invention will help to study lncRNA function in various cancer types.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
A method for identifying lncRNA-TF-gene regulatory motifs based on a multiple linear regression model comprises the following steps:
step 1, obtaining gene expression data:
Genes associated with various cancers, their expression quantification data, and diagnostic data are downloaded from the XENA database developed by the University of California (website: https://xenabrowser.net/). The Cancer Genome Atlas (TCGA) provides transcriptome sequencing data for 33 common cancer types and adjacent normal tissues, and these samples can be used to analyze expression profiles. The raw RNA-seq (transcriptome sequencing) data provided by TCGA are used to calculate the expression levels of lncRNAs (long non-coding RNAs), TFs (transcription factors) and genes, expressed as the number of fragments per kilobase of exon per million mapped reads;
The Cancer Genome Atlas (TCGA) provides transcriptome sequencing data for 33 common cancer types and adjacent normal tissues. These TCGA samples can be used to analyze expression profiles, and the raw RNA-seq data provided by TCGA can be used to calculate the expression levels of lncRNAs, TFs and genes, expressed as FPKM (the number of fragments per kilobase of exon per million mapped reads);
genes and their quantitative data relating to various types of cancer can be downloaded from the XENA, and the concrete websites are: https:// xena browser. Tcatt — RSEM _ gene _ fpkm & host, https:// toil.xenahubs.net, clinical diagnostic basis data (phenotype) for these samples can also be downloaded from XENA (specific website: https:// xenahowser.net/datapages/;
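For reference, the FPKM unit used for the expression levels in step 1 is conventionally computed as follows. The symbols $C_i$, $L_i$ and $N$ are notation introduced here for illustration and do not appear in the patent text.

```latex
\mathrm{FPKM}_i \;=\; \frac{C_i}{\left(L_i / 10^{3}\right)\left(N / 10^{6}\right)}
\;=\; \frac{10^{9}\, C_i}{N \, L_i}
% C_i : number of fragments mapped to the exons of transcript (or gene) i
% L_i : total exon length of i, in bases
% N   : total number of mapped fragments in the sample
```

For example, a 2,000-base transcript with 400 mapped fragments in a sample of 20 million mapped fragments has FPKM = 10^9 × 400 / (2 × 10^7 × 2,000) = 10.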
step 2, Filtering of Gene expression data
To ensure accurate and reliable data, the lncRNAs (long non-coding RNAs), TFs (transcription factors) and genes are filtered. First, only lncRNAs, TFs and genes whose FPKM (Fragments Per Kilobase of transcript per Million mapped fragments, i.e. the number of fragments matched to each kilobase of exon per million matched reads) is greater than 1 in at least 50% of the samples are retained; biomolecules that do not meet this condition are discarded from the subsequent analysis steps. Second, the whole gene expression data set is divided by biomolecule category into three data sets: filtered lncRNA expression data, filtered transcription factor expression data and filtered gene expression data;
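The filtering rule of step 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `filter_expressed`, its parameters and the toy FPKM values are all hypothetical.

```python
def filter_expressed(fpkm, min_fpkm=1.0, min_fraction=0.5):
    """Keep molecules whose FPKM exceeds min_fpkm in at least min_fraction of samples.

    fpkm maps a molecule name to its list of FPKM values, one per sample.
    """
    kept = {}
    for name, values in fpkm.items():
        n_expressed = sum(1 for v in values if v > min_fpkm)
        if values and n_expressed / len(values) >= min_fraction:
            kept[name] = values
    return kept


# Hypothetical toy data: molecule -> FPKM across four samples.
toy_fpkm = {
    "lncRNA_A": [0.0, 0.2, 3.1, 0.4],  # FPKM > 1 in 1 of 4 samples: discarded
    "TF_B": [2.5, 1.8, 0.9, 4.0],      # FPKM > 1 in 3 of 4 samples: kept
    "gene_C": [1.2, 0.5, 2.2, 0.1],    # FPKM > 1 in 2 of 4 samples: kept
}
expressed = filter_expressed(toy_fpkm)
print(sorted(expressed))
```

The same function can be applied separately to the lncRNA, TF and gene tables to obtain the three filtered data sets.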
step 3, obtaining the regulation relation between the biomolecules
TF-gene (transcription factor-gene) regulatory relationship data are obtained from the TRANSFAC database (http://gene-regulation.com/pub/databases.html) and the TRRUST database (https://www.grnpedia.org/trrust/), and the intersection of the two is taken so that the data are more reliable. The TF-gene pairs are then filtered further: the TF in a TF-gene pair must be expressed, that is, its FPKM must be greater than 1 in at least half of the samples, and the gene must also be expressed, that is, its FPKM must be greater than 1 in at least half of the samples;
TF-gene regulatory relationship data can be obtained from the TRANSFAC database (TRANScription FACtor database, http:// genetic display. com/trans /) and the TRRUST database (Transcriptional Regulatory Relationships Unraveled by Sentence-based Text mining, https://www.grnpedia.org/trrust/). The intersection of the data from the two databases is taken so that the data are more reliable. The TF-gene pairs need further filtering, because under certain conditions a given TF or gene is not necessarily expressed; that is, a TF-gene regulatory relationship does not necessarily hold in every context. The TF in a TF-gene pair must be expressed, that is, its FPKM must be greater than 1 in at least half of the samples, and the gene must also be expressed, that is, its FPKM must be greater than 1 in at least half of the samples. Under these filtering conditions, the TF-gene regulatory relationship must be reliable and both the TF and the gene must be expressed; neither condition can be omitted, which makes the data more reliable;
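A minimal sketch of step 3 follows, assuming the TF-gene pairs from each database have already been parsed into (TF, gene) tuples. The function name and the toy identifiers are hypothetical, and no real TRANSFAC or TRRUST file format is assumed.

```python
def reliable_tf_gene_pairs(transfac_pairs, trrust_pairs, expressed):
    """Intersect two collections of (TF, gene) pairs, then keep only pairs
    in which both the TF and the gene passed the expression filter."""
    shared = set(transfac_pairs) & set(trrust_pairs)
    return sorted(
        (tf, gene) for tf, gene in shared
        if tf in expressed and gene in expressed
    )


# Hypothetical toy inputs.
transfac_toy = [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G3")]
trrust_toy = [("TF1", "G1"), ("TF2", "G3"), ("TF3", "G4")]
expressed_toy = {"TF1", "G1", "TF2"}  # G3 did not pass the FPKM filter

pairs = reliable_tf_gene_pairs(transfac_toy, trrust_toy, expressed_toy)
print(pairs)
```

Here ("TF2", "G3") is present in both databases but is dropped because G3 is not expressed, which mirrors the requirement that both conditions hold.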
step 4, obtaining high-expression and low-expression long-chain non-coding RNA
The existing cancer samples are ranked from low to high by the expression value of each lncRNA (long non-coding RNA), each lncRNA being analyzed independently. For each lncRNA, all samples are ranked by the expression level of that lncRNA; the one third of samples with the lowest expression form the lncRNA low-expression group, the one third with the highest expression form the lncRNA high-expression group, and the remaining samples form the middle-expression group;
To determine the effect of a long non-coding RNA (lncRNA) on a TF-gene relationship, changes in the lncRNA's expression level are examined: when the lncRNA is highly versus lowly expressed, is the TF-gene regulatory relationship affected, for example changed from positive to negative regulation, from negative to positive regulation, or from weak to strong regulation? The existing cancer samples are ranked from low to high by the expression level of each lncRNA; note that each lncRNA is analyzed independently. For each lncRNA, all samples are ranked by that lncRNA's expression level. The one third (33%) of samples with the lowest expression form the lncRNA low-expression group, the one third (33%) with the highest expression form the lncRNA high-expression group, and the rest form the middle-expression group. The threshold is set to one third for the following reason: if the threshold is too small, too few samples are obtained and the regression results become unreliable; if the threshold is too large, samples with intermediate expression levels are included and the high-expression and low-expression groups can no longer be clearly distinguished;
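The grouping in step 4 can be sketched as below for a single lncRNA. The function name and sample IDs are hypothetical, and rounding the group size down when the sample count is not divisible by three is an assumption; the patent does not say how remainders are handled.

```python
def tertile_groups(expression):
    """Split sample IDs into low / middle / high groups by one lncRNA's expression.

    expression maps a sample ID to that lncRNA's expression value.
    """
    ranked = sorted(expression, key=expression.get)  # low to high
    k = len(ranked) // 3                             # size of each outer group (assumed floor)
    return {
        "low": ranked[:k],
        "middle": ranked[k:len(ranked) - k],
        "high": ranked[len(ranked) - k:],
    }


# Hypothetical expression values of one lncRNA in seven samples.
toy_expression = {"s1": 5.0, "s2": 0.5, "s3": 2.0, "s4": 9.0,
                  "s5": 1.0, "s6": 3.0, "s7": 7.0}
groups = tertile_groups(toy_expression)
print(groups)
```

Because each lncRNA is analyzed independently, the function would be called once per lncRNA, and the same sample can fall into different groups for different lncRNAs.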
step 5, designing a multiple linear regression model
The effect of lncRNA on TF-gene can be considered in several forms, such as the effect of TF on gene expression, the effect of lncRNA on gene expression, and the effect of the interaction of TF and lncRNA on gene expression, and the multiple linear regression model is as follows:
$$E_g \sim E_t + G_l + E_t{:}G_l \qquad \text{(Equation 1)}$$
Here, E denotes expression level and G denotes group, and the subscripts g, t and l denote gene, TF and lncRNA respectively. $E_g$ is the expression level of the gene, $E_t$ is the expression level of the transcription factor, $G_l$ is the grouping of the samples (low group or high group), and $E_t{:}G_l$ is the interaction between the transcription factor and the lncRNA grouping. With this model, the lncRNAs that significantly influence a TF-gene relationship can be identified, which yields the lncRNA-TF-gene regulatory motifs;
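Equation 1 is written in the model-formula style in which `:` marks an interaction term. The sketch below fits that model by ordinary least squares for one lncRNA-TF-gene triplet. It is a dependency-free illustration under stated assumptions: the group is coded 0 for low and 1 for high, middle-group samples are left out, and the helper names and toy data are hypothetical. It returns coefficients only; the p-values used in step 6 would in practice come from a statistics package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_equation_1(e_t, g_l, e_g):
    """Least-squares fit of E_g ~ E_t + G_l + E_t:G_l.

    Returns [intercept, coef_E_t, coef_G_l, coef_interaction].
    """
    rows = [[1.0, t, g, t * g] for t, g in zip(e_t, g_l)]  # design matrix
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * y for r, y in zip(rows, e_g)) for i in range(4)]
    return solve(xtx, xty)  # normal equations (X'X) b = X'y


# Hypothetical toy data: in the low group E_g = 1 + 2*E_t, in the high group E_g = 1.5 + 0.5*E_t.
e_t = [1, 2, 3, 1, 2, 3]
g_l = [0, 0, 0, 1, 1, 1]  # 0 = lncRNA low-expression group, 1 = high-expression group
e_g = [3.0, 5.0, 7.0, 2.0, 2.5, 3.0]
coefs = fit_equation_1(e_t, g_l, e_g)
print([round(c, 6) for c in coefs])
```

On this toy data the interaction coefficient is negative, meaning the TF-gene slope is weaker in the high-expression group than in the low-expression group.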
Copy number variation (CNV) greatly affects gene expression, so Equation 1 needs to be modified as follows:
$$E_g \sim E_t + G_l + C + E_t{:}G_l \qquad \text{(Equation 2)}$$
where C represents the copy number variation of the transcription factor or gene. If the p-value corresponding to C is less than 0.05, the CNV has a significant influence on the expression level of the gene, and the TF-gene pair needs to be excluded. As before, E denotes expression level, G denotes group, and the subscripts g, t and l denote gene, TF and lncRNA; $E_g$ is the expression level of the gene, $E_t$ is the expression level of the transcription factor, $G_l$ is the grouping of the samples (low group or high group), and $E_t{:}G_l$ is the interaction between the transcription factor and the lncRNA grouping;
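Written out with explicit coefficients, Equation 2 corresponds to the linear model below. The coefficient symbols $\beta_0,\dots,\beta_4$, the error term $\varepsilon$ and the 0/1 coding of $G_l$ are assumptions introduced here for illustration; the patent gives only the formula notation.

```latex
E_g \;=\; \beta_0 + \beta_1 E_t + \beta_2 G_l + \beta_3 C + \beta_4 \,(E_t \cdot G_l) + \varepsilon,
\qquad G_l \in \{0 \text{ (low)},\; 1 \text{ (high)}\}

% Slope of gene expression on TF expression within each lncRNA group:
\left.\frac{\partial E_g}{\partial E_t}\right|_{G_l = 0} = \beta_1,
\qquad
\left.\frac{\partial E_g}{\partial E_t}\right|_{G_l = 1} = \beta_1 + \beta_4
```

Under this coding, $\beta_2$ shifts the baseline gene expression between the two groups, $\beta_4$ is the difference in the TF-gene slope between the high and low groups, and $\beta_3$ absorbs the part of the gene's expression explained by copy number.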
step 6, result processing
For each lncRNA-TF-gene (long non-coding RNA-transcription factor-gene) triplet obtained in the previous step, the p-value of each parameter is analyzed. Regulatory motifs for which the p-value of $G_l$ is less than 0.05 and the p-value of C is greater than 0.05 are retained. Multiple-testing correction is then required, and motifs with FDR < 0.05 form the final result.
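The patent requires FDR < 0.05 after multiple-testing correction but does not name the correction procedure. The sketch below uses the Benjamini-Hochberg procedure as an assumed choice; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR), in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):  # walk from the largest p-value down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min  # enforce monotonicity of the adjusted values
    return adjusted


# Hypothetical p-values for four candidate motifs that already passed the
# per-parameter thresholds of step 6.
toy_p = [0.001, 0.04, 0.03, 0.5]
fdr = benjamini_hochberg(toy_p)
final_motifs = [i for i, q in enumerate(fdr) if q < 0.05]
print(fdr, final_motifs)
```

In this toy case only the first candidate survives: raw p-values of 0.03 and 0.04 are below 0.05 individually but exceed it after adjustment.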
The data comprise samples from 33 common cancer types in TCGA. These samples have complete lncRNA, TF and gene expression data measured on the same sample, include both cancer samples and normal samples (adjacent normal tissue) as controls, and are large in number. Studying lncRNA-TF-gene regulatory motifs requires expression data for all three types of biomolecules (lncRNA, TF, gene). Having both cancer and normal samples allows subsequent comparison: one can observe whether an lncRNA is highly or lowly expressed (differentially expressed) in cancer samples and further analyze whether it is a cancer-related biomarker. In the model design, the samples are divided by lncRNA expression level into three classes (low, middle and high), each accounting for one third, so if the total number of samples is too small the accuracy of the result suffers. This analysis shows that selecting a suitable data set (samples) is crucial.
Sample data filtering also affects the accuracy of the final result. If expression levels from the raw data are used directly, part of the data is not informative. For example, lncRNAs are often expressed at lower levels than protein-coding genes, and in certain diseases some lncRNAs are not expressed, with levels of 0 or close to 0 in many samples. If samples are ranked by the expression level of such an lncRNA and included in the analysis, the results are unreliable. Therefore, only lncRNAs, TFs and genes with FPKM expression levels greater than 1 in at least 50% of the samples are retained.
The TF-gene interrelationship must be reliable. TRANSFAC and TRRUST are two authoritative TF regulatory databases. The intersection of the TF-gene regulatory relationships using the two databases is more reliable. In addition, TF or gene is not necessarily expressed in a particular cancer (or in a particular disease), and if one is not expressed, the effect of lncRNA on their regulatory relationship cannot be determined, and therefore such non-expressed TF-gene is not considered.
For each lncRNA, all samples were ranked according to the lncRNA expression level. The one third (33%) of samples with the lowest expression were considered the lncRNA low-expression group, the one third (33%) with the highest expression were considered the lncRNA high-expression group, and the remaining samples were considered the middle-expression group. The threshold is set to one third for the following reason: if the threshold is too small, too few samples are obtained and the regression results become unreliable; if the threshold is too large, samples with intermediate expression levels are included and the high-expression and low-expression groups can no longer be clearly distinguished. In addition, we examined the number of cancer samples in TCGA: 24 cancer types have more than 90 samples. With a threshold of one third, each of these cancer types has at least 30 samples in the high-expression group and at least 30 in the low-expression group, which is reasonable and allows effective statistical analysis.
This is the first use of a multiple linear regression model to analyze lncRNA-TF-gene regulatory motifs. Regression models can be used to analyze correlations in gene expression, but the existing literature mainly uses them to analyze the regulation of, or correlation between, two genes. In the invention, the regression model analyzes the regulatory relationships among three types of biomolecules, which is harder to design than the two-molecule case. The crucial point is that, to reflect the influence of the lncRNA, the model considers not only the influence of the TF on gene expression and the influence of the grouping (i.e., high versus low lncRNA expression) on gene expression, but also the influence of the interaction between the TF and the grouping on gene expression. In addition, the effect of CNV on gene expression is considered. The prior literature has demonstrated that this effect is not negligible, so it is reflected in Equation 2. If CNV were ignored, some gene expression changes caused by CNV could be mistakenly attributed to the lncRNA, and such results would clearly be unreliable.

Claims (1)

1. A method for identifying a long-chain non-coding ribonucleic acid-transcription factor-gene regulatory motif, comprising the steps of:
step 1, obtaining Gene expression data
Downloading genes and their expression quantification data and diagnostic data from the XENA database, wherein The Cancer Genome Atlas (TCGA) provides transcriptome sequencing data for cancer tissues and adjacent normal tissues, the samples in TCGA are used to analyze expression profiles, and the raw RNA-seq data provided by TCGA are used to calculate the expression levels of lncRNA, TF and gene, the expression levels being expressed as the number of fragments per kilobase of exon per million mapped reads;
step 2, Filtering of Gene expression data
lncRNA, TF and gene were filtered, first only lncRNA, TF and gene with FPKM expression level greater than 1 in at least 50% of the samples were retained, and biomolecules not satisfying the filtering conditions were discarded in the subsequent analysis step; secondly, the whole gene expression data is divided into a plurality of data sets according to the category of the biological molecules, wherein the data sets are respectively as follows: filtered lncRNA expression data, filtered transcription factor expression data and filtered gene expression data;
step 3, obtaining the regulation relation between the biomolecules
TF-gene mutual regulation relation data are obtained from a TRANSFAC database and a TRRUST database, and the intersection of the data obtained from the two databases is selected, so that the data are more reliable; for TF-gene, further filtering, TF in TF-gene must be expressed, that is, FPKM of TF must have a value greater than 1 in at least half of samples, gene must also be expressed, FPKM of gene must have a value greater than 1 in at least half of samples;
step 4, obtaining high-expression and low-expression long-chain non-coding RNA
Ranking the existing cancer samples from low to high by the expression value of each lncRNA, wherein each lncRNA is analyzed independently; for each lncRNA, all samples are ranked by the expression level of that lncRNA, the one third of samples with the lowest expression are considered the lncRNA low-expression group, the one third with the highest expression are considered the lncRNA high-expression group, and the remaining samples are considered the middle-expression group;
step 5, designing a multiple linear regression model
The multiple linear regression model is as follows:
$$E_g \sim E_t + G_l + E_t{:}G_l \qquad \text{(Equation 1)}$$
wherein E denotes expression level and G denotes group, and the subscripts g, t and l denote gene, TF and lncRNA respectively; $E_g$ is the expression level of the gene, $E_t$ is the expression level of the transcription factor, $G_l$ is the grouping of the samples (low group or high group), and $E_t{:}G_l$ is the interaction between the transcription factor and the lncRNA grouping; with this model, the lncRNAs that significantly influence a TF-gene relationship are identified, which yields the lncRNA-TF-gene regulatory motifs;
copy number variation affects gene expression, so Equation 1 is modified; the modified formula is:
$$E_g \sim E_t + G_l + C + E_t{:}G_l \qquad \text{(Equation 2)}$$
wherein C represents the copy number variation of the transcription factor or gene; if the p-value corresponding to C is less than 0.05, the CNV has a significant influence on the expression level of the gene and the TF-gene pair is excluded, which yields the lncRNA-TF-gene regulatory motifs after copy number variation correction;
step 6, result processing
For the copy-number-variation-corrected lncRNA-TF-gene regulatory motifs obtained in step 5, the p-value of each parameter is analyzed. Regulatory motifs for which the p-value of $G_l$ is < 0.05 and the p-value of C is > 0.05 are retained and subjected to multiple-testing correction, and those with FDR < 0.05 are the final result.
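The result processing can be sketched as below. The source does not name the multiple-testing procedure, so the Benjamini-Hochberg method is used here as an assumption; the record field names (`id`, `p_group`, `p_cnv`) are hypothetical, and applying the FDR correction to the $G_l$ p-values of the retained motifs is one possible reading of the text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


def select_motifs(motifs, alpha=0.05):
    """Keep motifs with p_group < alpha and p_cnv > alpha, then require FDR < alpha.

    motifs: list of dicts with hypothetical keys 'id', 'p_group', 'p_cnv'.
    """
    kept = [m for m in motifs if m["p_group"] < alpha and m["p_cnv"] > alpha]
    fdr = benjamini_hochberg([m["p_group"] for m in kept])
    return [m["id"] for m, q in zip(kept, fdr) if q < alpha]
```

Filtering before correction follows the order given in the text; correcting over all tested motifs first would be a stricter alternative.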
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911147482.2A CN111091866B (en) 2019-11-21 2019-11-21 Method for identifying long-chain non-coding ribonucleic acid-transcription factor-gene regulation and control die body

Publications (2)

Publication Number Publication Date
CN111091866A CN111091866A (en) 2020-05-01
CN111091866B CN111091866B (en) 2022-03-15

Family

ID=70394094

Country Status (1)

Country Link
CN (1) CN111091866B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102286464A (en) * 2011-06-30 2011-12-21 眭维国 Uremia long-chain non-coding ribonucleic acid difference expression spectrum model and construction method thereof
CN107679367A (en) * 2017-09-20 2018-02-09 湖南大学 A kind of common regulated and control network functional module recognition methods and system based on the network node degree of association
CN108710783A (en) * 2018-05-23 2018-10-26 湖南女子学院 A kind of complex function module recognition method and system based on node relationship pair
WO2019074292A2 (en) * 2017-10-11 2019-04-18 (주)셀트리온 Expression cassette for production of high-expression and high-functionality target protein and use thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
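Identification and Analysis of Long Non-coding RNAs in Ginkgo biloba Leaves; Xia Xiao; China Master's Theses Full-text Database, Agricultural Science and Technology; 20190115; D049-284 *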

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant