CN112562785A - Method for screening key gene of endometrial cancer based on ATAC sequencing data and application - Google Patents

Publication number
CN112562785A
Authority
CN
China
Prior art keywords
gene
screening
genes
endometrial cancer
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011452042.0A
Other languages
Chinese (zh)
Inventor
卢美松
陈河兵
汤小晗
王军婷
李�昊
陶欢
伯晓晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
First Affiliated Hospital Of Harbin Medical University
Institute of Pharmacology and Toxicology of AMMS
Academy of Military Medical Sciences AMMS of PLA
Original Assignee
First Affiliated Hospital Of Harbin Medical University
Institute of Pharmacology and Toxicology of AMMS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by First Affiliated Hospital Of Harbin Medical University, Institute of Pharmacology and Toxicology of AMMS
Priority to CN202011452042.0A
Publication of CN112562785A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/20 Screening of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention relates to the technical field of gene screening, and in particular to a method for screening key genes of endometrial cancer based on ATAC sequencing data and an application thereof. In the method for screening tumor key genes based on ATAC sequencing data, two or more indexes are first selected to rank the peaks obtained by ATAC sequencing. The ranked peaks for each index are then divided into groups by peak count, and at least one group is selected as a primary screening peak group suspected of containing tumor key genes. The peaks common to the primary screening peak groups are then selected to form a secondary screening peak group. Finally, the genes corresponding to the peaks in the secondary screening peak group are screened according to the gene expression values of the target tumor to obtain the tumor key genes. Because the method is based on ATAC sequencing data and screens the key genes of the target tumor with several judgment indexes, it avoids interference from human factors during data processing and is more reliable.

Description

Method for screening key gene of endometrial cancer based on ATAC sequencing data and application
Technical Field
The invention relates to the technical field of gene screening, in particular to a method for screening key genes of endometrial cancer based on ATAC sequencing data and application thereof.
Background
At present, tumor key genes are mainly screened by the following three methods: (1) forward genetic screening in mutant mice to identify candidate driver genes highly related to human cancers; (2) sorting out candidate tumor genes by studying the characteristics of genes in tumors on the basis of protein sequence, functional annotation, interaction networks and the like; (3) extracting differentially expressed genes from public databases and screening key genes in combination with a protein-protein interaction (PPI) network.
However, all of these methods have obvious shortcomings. First, because of the marked differences between species, a candidate driver gene obtained by screening in mutant mice may act quite differently in humans than in mice, so the result is not accurate enough. In addition, screening with mutant mice mainly targets recurrently mutated "significantly mutated genes", and rare mutated genes cannot be obtained.
Second, because a protein sequence reflects only the composition of the coding region of a gene, methods that sort out candidate tumor genes on the basis of protein sequence, functional annotation and interaction networks concentrate their functional annotation on coding regions. Non-coding regions, which account for about 97% of the sequence, also play an important role in tumor occurrence and development, but these methods cannot annotate them functionally, and they are biased toward genes that already have known functional annotations.
With the growing abundance of gene databases, extracting differentially expressed genes from public databases and screening key genes in combination with protein interaction networks has gradually become an important means of screening tumor key genes. However, because the data sources are constrained by experimental conditions, techniques and the selection criteria for control groups, one or a few screened data sets cannot represent the broad range of cases.
In view of this, the invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a method for screening key genes of endometrial cancer. The method is based on ATAC sequencing data and screens the key genes of the target tumor with several judgment indexes, so that interference from human factors during data processing can be avoided and the reliability is higher.
The invention also provides an application of the method for screening key genes of endometrial cancer in preparing products for screening key genes of endometrial cancer.
The invention is realized by the following steps:
The invention provides a method for screening key genes of endometrial cancer based on ATAC sequencing data, comprising: first selecting two or more indexes to rank the peaks obtained by ATAC sequencing; then dividing the ranked peaks for each index into groups by peak count, and selecting at least one group as a primary screening peak group suspected of containing tumor key genes; then selecting the peaks common to the primary screening peak groups obtained with the different indexes to form a secondary screening peak group; and finally screening the genes corresponding to the peaks in the secondary screening peak group according to the gene expression values of endometrial cancer to obtain the tumor key genes.
ATAC sequencing yields a very large number of peaks, so the peaks must first be classified before the peak data are analyzed.
In alternative embodiments, the indexes that may be used to rank the ATAC peaks include a combination of two or more of: the number of transcription factors, the chromatin opening signal intensity, the size of the peak, or the location of the peak on the genome (e.g., promoters, enhancers, or CpG islands).
Preferably, the indicators are transcription factor number and chromatin opening signal intensity.
Transcription initiation in eukaryotes is a complex process that usually requires the assistance of multiple protein factors; transcription factors and RNA polymerase form a transcription initiation complex and jointly take part in transcription initiation. Transcription factors are protein molecules that bind specifically to particular sequences upstream of the 5' end of a gene, thereby ensuring that the target gene is expressed at a specific intensity at a specific time and place. Many transcription factors act as master regulators and selector genes that control cell type determination, developmental patterning and specific pathways (e.g., immune responses). In the laboratory, transcription factors can drive cell differentiation, dedifferentiation and transdifferentiation. A gene sequence bound by a larger number of transcription factors is more open or active, and mutations in transcription factors and transcription factor binding sites are major factors in human disease (or carcinogenesis). The transcription product of a mutated gene may gain or lose certain functions, and this dysregulation may play an important role in tumor formation and malignant progression, so sequence regions containing a large number of transcription factors can be used as screening regions for tumor key genes.
In an alternative embodiment, the number of transcription factors is obtained by a method that includes identifying transcription factor binding sites.
Preferably, the transcription factor binding sites are identified by processing the transcription factor data of a TFBS database to obtain the corresponding transcription factor motifs, and then identifying the transcription factor binding sites in the ATAC-seq open chromatin regions with iForm.
Preferably, the TFBS database comprises one or a combination of the TRANSFAC, JASPAR and UniPROBE databases, each of which covers both coding and non-coding regions of genes and is suitable for identifying transcription factor binding sites.
Further preferably, the TFBS database is the TRANSFAC database, a comprehensive database of eukaryotic transcriptional regulation that contains information on transcription factors, their binding sites on the genome and the corresponding target genes. Its professional version contains in total 12795 transcription factors, 26589 transcription factor binding sites and 51325 regulated genes, and also includes miRNAs and their target sequences, ChIP experiment sequence fragments, promoter sequences and all relevant references for the collected data. The transcription factor information it contains is comprehensive, which makes it especially suitable for identifying transcription factor binding sites.
Further preferably, the TFBS database is the combination of the TRANSFAC, JASPAR and UniPROBE databases; when all three databases are selected, the coverage is wider and more valuable targets can be found.
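As an illustration of how a per-peak transcription factor number could be derived once binding sites have been identified, the following is a minimal sketch. It does not reproduce iForm or any database format: the tuple layouts, the function name count_tfs_per_peak, the peak identifiers and the TF names are all hypothetical, and counting distinct transcription factors (rather than individual binding sites) is an assumption of this sketch.

```python
from collections import defaultdict


def count_tfs_per_peak(peaks, binding_sites):
    """Return {peak_id: number of distinct TFs with a site overlapping the peak}.

    peaks:         iterable of (peak_id, chrom, start, end)   -- hypothetical layout
    binding_sites: iterable of (chrom, start, end, tf_name)   -- hypothetical layout
    Coordinates are treated as half-open intervals [start, end).
    """
    sites_by_chrom = defaultdict(list)
    for chrom, start, end, tf in binding_sites:
        sites_by_chrom[chrom].append((start, end, tf))

    counts = {}
    for peak_id, chrom, p_start, p_end in peaks:
        tfs = {
            tf
            for s_start, s_end, tf in sites_by_chrom[chrom]
            # Two half-open intervals overlap when each starts before the other ends.
            if s_start < p_end and p_start < s_end
        }
        counts[peak_id] = len(tfs)
    return counts


# Toy data, invented for illustration only.
peaks = [
    ("peak1", "chr1", 100, 600),
    ("peak2", "chr1", 1000, 1500),
    ("peak3", "chr2", 100, 600),
]
sites = [
    ("chr1", 120, 130, "TF_A"),
    ("chr1", 125, 135, "TF_A"),  # second site of the same TF, counted once
    ("chr1", 590, 610, "TF_B"),  # partially overlaps peak1
    ("chr2", 700, 710, "TF_C"),  # outside peak3
]
tc_values = count_tfs_per_peak(peaks, sites)
print(tc_values)
```

The nested scan is quadratic per chromosome; for the roughly 10^5 peaks described in the examples, a sorted-interval or interval-tree lookup would be the more practical choice.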
The chromatin opening signal intensity is a numerical value obtained during ATAC sequencing that characterizes the degree of chromatin openness of a specific region. The stronger the chromatin opening signal of a peak, the closer the peak lies to a gene promoter region and the more highly expressed genes are enriched in the region. A high mutation rate in the gene promoter regions within these transcriptionally active peaks directly affects the transcriptional regulation of the genes and may be a potential carcinogenic factor.
Because tumor formation is related to erroneous mutation of normal genes, the invention uses the mutation ratio of a gene as the criterion for judging whether the gene can serve as a key gene of the target tumor: a gene with a higher mutation ratio is more likely to cause a tumor than a gene with a lower mutation ratio. On this basis, whether an index can be used for screening tumor key genes can be judged from the influence of the index on the gene mutation ratio. For example, as the number of transcription factors increases, the base mutation ratio of the ATAC peaks located in gene promoter regions shows a descending trend, and as the chromatin opening signal intensity increases, the total mutation ratio in the ATAC-seq data gradually increases. The number of transcription factors and the chromatin opening signal intensity can therefore both be used as indexes for screening tumor key genes. Because the two are, respectively, negatively and positively correlated with the gene mutation ratio, they can correct each other when used together as screening indexes, which improves the screening accuracy.
After the ATAC peaks are ranked by the different indexes, the position of a peak in each ranking represents how likely the peak is to contain tumor key genes under that index. By setting a suitable threshold, a primary screening peak group suspected of containing tumor key genes can be selected from the ATAC peak ranking for each index, and selecting the peaks common to the different primary screening peak groups largely avoids the inaccuracy caused by screening key genes with a single kind of data.
However, because the amount of ATAC data to be processed is large, it is difficult to directly set, for each index, a threshold that both ensures screening accuracy and reduces the workload. In an optional embodiment of the invention, the ranked peaks are therefore divided into at least ten groups.
Because the number of ATAC peaks obtained by sequencing is very large, once the peaks ranked by each index have been divided into ten groups, the number of ATAC peaks in the primary screening peak group can be adjusted by choosing which groups, and how many groups, are carried forward for subsequent analysis, which keeps the workload moderate while preserving the accuracy of gene screening. For example, when the aim is to obtain a few (e.g., three or five) of the genes most critical for endometrial cancer, only the single group most likely to contain key genes needs to be selected, so that the number of ATAC peaks is minimized while accuracy is ensured. When the aim is a more comprehensive understanding of the key genes related to endometrial cancer, several groups can be selected to enlarge the number of ATAC peaks, which makes the screening more comprehensive and avoids missing tumor key genes that arise from rare mutations. Different numbers of key genes can thus be obtained by selecting different thresholds, and the importance to endometrial cancer of the key genes obtained with different thresholds can be clearly judged.
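The grouping step can be sketched as a rank-based split into equal-sized groups. This is only an illustration under stated assumptions: the function name assign_rank_groups is hypothetical, groups are numbered in ascending order of the index value (so group 0 holds the lowest values and group 9 the highest), and ties are broken by peak identifier. The specification itself does not fix these details.

```python
def assign_rank_groups(values, n_groups=10):
    """Map each peak id to a group index in 0..n_groups-1.

    values: {peak_id: index value}, e.g. transcription factor number or
            chromatin opening signal intensity.
    Peaks are sorted in ascending order and cut into groups of (nearly)
    equal size, so grouping depends on rank, not on the raw value range.
    """
    ordered = sorted(values, key=lambda pid: (values[pid], pid))
    n = len(ordered)
    return {pid: rank * n_groups // n for rank, pid in enumerate(ordered)}


# Toy example: 20 peaks with distinct, invented signal values.
signal = {f"peak{i:02d}": float(i) for i in range(20)}
groups = assign_rank_groups(signal)

# The two weakest peaks land in group 0 and the two strongest in group 9.
print(groups["peak00"], groups["peak01"], groups["peak18"], groups["peak19"])
```

Selecting one group or several groups from this mapping then plays the role of the threshold described above: fewer groups give a shorter, stricter candidate list, more groups a broader one.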
In alternative embodiments, the gene expression value of endometrial cancer comprises the gene expression level of endometrial cancer and/or the differential gene expression level of the target tumor.
In an alternative embodiment of the invention, the gene expression level data for endometrial cancer are obtained by searching the TCGA database. TCGA (The Cancer Genome Atlas) is a project jointly launched in 2006 by the US National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It currently covers 36 cancer types in total and includes clinical data, genomic variation, mRNA expression, miRNA expression, methylation and other data for a variety of human cancers (including tumor subtypes), making it an important data source for cancer researchers. It is currently the fastest and most comprehensive way to obtain tumor gene expression levels and can effectively reduce the false positive rate in screening.
Preferably, when searching the TCGA database, genes with FPKM > 10 are selected.
FPKM (fragments per kilobase of exon model per million mapped fragments) is the number of fragments mapped to the exons of a gene per 1,000 bases of exon length, per 1 million mapped fragments. A gene with FPKM greater than 1 is generally regarded as meaningfully expressed, and FPKM greater than 10 indicates that the gene is highly expressed in the tumor; such genes are therefore suitable for the invention, as they avoid false positives caused by detection error and the like.
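The FPKM definition above can be written out as a short worked calculation. The sketch below uses the standard formula FPKM = fragments x 10^9 / (exon length in bases x total mapped fragments); the function name fpkm and the numbers in the example are invented for illustration and do not come from TCGA.

```python
def fpkm(gene_fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and total mapped fragments must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return gene_fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


# Invented example: 500 fragments on a gene with 2,000 exonic bases,
# in a library of 20 million mapped fragments.
value = fpkm(gene_fragments=500, exon_length_bp=2000, total_mapped_fragments=20_000_000)
print(value, value > 10)
```

With these invented numbers the gene would pass the FPKM > 10 filter used in the examples.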
The differential gene expression of endometrial cancer refers to the variation in the expression level of a target gene between normal tissue and tumor tissue. A log2FC (log2 fold change) threshold is usually defined to reflect the fold change in gene expression. In an alternative embodiment of the invention, genes with log2FC ≥ 2 are selected as key genes of endometrial cancer.
Preferably, the differential expression level of endometrial cancer genes is obtained by searching the GEPIA database, and the log2FC cutoff is set to ≥ 2 when searching the GEPIA database;
Further preferably, the log2FC cutoff is set to 3 when searching the GEPIA database.
A log2FC cutoff of 3 means that the expression level of the same gene differs between normal tissue and endometrial cancer tissue by a log2 fold change of at least 3, that is, by a factor of at least 2^3 = 8.
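To make the log2FC cutoff concrete, the following is a minimal sketch of a fold-change test. The pseudocount of 1 and the use of the absolute value (so that both up-regulated and down-regulated genes can pass) are assumptions of this sketch, and the function names and expression values are invented; they are not taken from the GEPIA interface.

```python
import math


def log2_fold_change(tumor_expr, normal_expr, pseudocount=1.0):
    """log2 of the tumor/normal expression ratio, with a pseudocount to avoid log(0)."""
    return math.log2((tumor_expr + pseudocount) / (normal_expr + pseudocount))


def passes_cutoff(tumor_expr, normal_expr, cutoff=3.0):
    """True when expression differs in either direction by at least `cutoff` on the log2 scale."""
    return abs(log2_fold_change(tumor_expr, normal_expr)) >= cutoff


# Invented expression values.
up_8x = log2_fold_change(79.0, 9.0)  # (79 + 1) / (9 + 1) = 8, and log2(8) = 3
up_3x = log2_fold_change(29.0, 9.0)  # (29 + 1) / (9 + 1) = 3, and log2(3) is about 1.58
print(up_8x, passes_cutoff(79.0, 9.0))
print(round(up_3x, 2), passes_cutoff(29.0, 9.0))
```

Because the scale is logarithmic, raising the cutoff from 2 to 3 doubles the required expression ratio, which is why the higher cutoff yields a much shorter gene list.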
In an alternative embodiment, the screening method further comprises a step of validating the tumor key genes obtained.
In alternative embodiments, the validation step comprises a survival analysis and/or a literature comparison.
Survival analysis is a method for studying the relationship between influencing factors and survival time and outcome. Performing survival analysis on the screened tumor key genes verifies whether their expression levels play a key role in the occurrence and development of endometrial cancer, and this verification method is highly accurate.
Literature comparison means carrying out an extensive literature search for each screened key gene of endometrial cancer to check whether existing reports already identify it as a key gene of endometrial cancer; if so, a large amount of repeated verification work can be saved.
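As background for the survival analysis step, the sketch below implements a plain Kaplan-Meier estimator, the curve that is typically drawn for a high-expression and a low-expression patient group before the two are compared. It is an illustration only: the function name kaplan_meier and the follow-up data are invented, and the significance test that produces a P value (for example a log-rank test) is not implemented here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one death.

    times:  follow-up time of each patient
    events: True if the patient died at that time, False if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, died in zip(times, events) if ti == t and died)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Invented follow-up data (months) for five patients in one expression group.
times = [5, 8, 8, 12, 15]
events = [True, True, False, True, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

In practice the same function would be applied separately to the high-expression and low-expression groups of a candidate gene, and the two curves compared with a statistical test.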
The invention also provides an application of the screening method in preparing products for screening tumor key genes.
The invention has the following beneficial effects:
The invention provides a method for screening tumor key genes based on ATAC sequencing data, which uses several indexes to comprehensively evaluate and analyze the peaks obtained by ATAC sequencing in order to obtain the key genes of endometrial cancer.
The screening method of the tumor key genes provided by the invention can be applied to preparing products for screening the tumor key genes, and provides a new way for improving the efficiency and accuracy of gene screening work.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 shows the screening procedure for key genes in endometrial cancer in example 1;
FIG. 2 shows 210 primary screening genes obtained in example 1;
FIG. 3 shows 207 primary screening genes obtained in example 1;
FIG. 4 shows 148 genes obtained by screening the expression level of the genes in example 1;
FIG. 5 shows 16 genes obtained by screening genes differentially expressed in example 1;
FIG. 6 shows the survival analysis results of SCGB2A1 gene and SCGB1D2 gene in example 1;
FIG. 7 shows 268 prescreening genes obtained in example 2;
FIG. 8 shows 80 genes obtained by screening the gene expression levels in example 2;
FIG. 9 shows 5 genes selected in example 2 for differentially expressed genes;
FIG. 10 is part A of the 670 primary screening genes of example 3;
FIG. 11 is part B of the 670 primary screening genes of example 3;
FIG. 12 is part C of the 670 primary screening genes of example 3;
FIG. 13 shows part A of 220 genes obtained by screening the gene expression level in example 3;
FIG. 14 shows part B of 220 genes obtained by screening the gene expression level in example 3;
FIG. 15 shows 20 key genes obtained by the gene differential expression screening of example 3.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. Where specific conditions are not specified in the examples, conventional conditions or the conditions recommended by the manufacturer were used. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products.
The features and properties of the present invention are described in further detail below with reference to examples.
Example 1
The screening steps for the key genes of endometrial cancer in this example are shown in FIG. 1 and comprise:
S1: ATAC sequencing data acquisition
In this example, endometrial cancer ATAC sequencing data were obtained from the TCGA database, and erroneous sequencing results aligned to the Y chromosome were removed, leaving a total of 104723 peaks for further analysis.
The genomic coordinates of these peaks are in hg38. For the subsequent comparison with the TFBS data, the coordinates were converted from hg38 to hg19, yielding a total of 104400 peaks.
S2: transcription factor number and chromatin opening signal intensity ordering
The ATAC-seq peaks obtained in S1 were scanned against the binding site information of 488 transcription factors obtained from the three databases TRANSFAC, JASPAR and UniPROBE to obtain the number of transcription factors at each peak. The number of transcription factors at each peak is defined as its transcription factor complexity (TC) value.
The chromatin opening signal intensity score of each peak is defined as its SC value. The peaks were divided evenly into 10 groups by TC value and by SC value, namely TC0 to TC9 and SC0 to SC9, giving two indexes that represent different dimensions of information for each interval.
S3: prescreening gene acquisition
Combining the two indexes of different dimensions obtained in S2, screening yielded 588 ATAC-seq peaks that belong to both TC0 and SC9, and 417 prescreening genes located in these peaks were obtained: 210 prescreening genes are shown in FIG. 2 and the other 207 in FIG. 3.
S4: screening of Gene expression level
The prescreening genes were screened by FPKM > 10 using the tumor gene expression level data in the TCGA database, yielding 148 genes, as shown in FIG. 4.
S5: gene differential expression screening
The 148 genes obtained in step S4 were screened again with a log2FC cutoff of 3 using the differentially expressed gene data in the GEPIA database, yielding 16 key genes of endometrial cancer, as shown in FIG. 5: CDCA8, DUSP1, HSPB6, IGSF9, KRT7, PAX8, PRSS22, PRSS8, S100A14, SCGB1D2, SCGB2A1, SCNN1A, SYNE4, TPX2, TXNIP and UBE2C.
S6: key gene verification
A literature search found that the genes SCGB2A1 (10.1111/j.1525-1438.2007.01137.x), TPX2 (10.3892/or.2020.7648), UBE2C (10.1158/1541-7786.MCR-19-0561), DUSP1 (10.4103/0366-6999.181954), IGSF9 (10.1155/2018/2439527), PAX8 (10.1089/dna.2019.5148), S100A14 (10.1016/j.intimp.2020.106735) and TXNIP (10.21873/anticanres.13664) are significantly correlated with the proliferation and invasion of endometrial cancer, myometrial infiltration and degree of differentiation, FIGO stage progression, estrogen receptor (ER) expression, poor prognosis and the like, which shows that the screening method provided in this example can accurately obtain key genes of endometrial cancer.
High expression of the gene SCNN1A (10.1089/cbr.2019.2824) is related to poor overall survival and progression-free survival in ovarian cancer patients; high expression of the gene CDCA8 (10.7717/peerj.9078) is reported to be related to poor prognosis in bladder cancer patients and can promote tumor development; the gene HSPB6 (10.1371/journal.pone.0151907) can induce the migration and invasion of hepatoma cells; the gene KRT7 (10.1016/j.gene.2020.144947) is highly expressed in ovarian cancer and may be associated with reduced survival; PRSS22 (10.1186/1471-. SCGB1D2 (10.1186/1471-; the gene SYNE4 is reported to be related to hearing impairment (10.4274/balkanmedj.2017.0946). Although there is no clear evidence that these 7 genes are directly related to endometrial cancer, all of them are pathogenic genes, and most are related to tumors, especially gynecological tumors, so they may serve as key genes of endometrial cancer and provide new targets for subsequent research. For the PRSS8 gene (10.1038/s41388-018-0453-3; 10.1159/000453136), the literature indicates only that it may be a novel tumor suppressor gene and that reduced PRSS8 expression is related to malignant progression and EMT in various tumors, without indicating any relation to endometrial cancer. The tumor key gene screening method provided by the invention thus offers a new way to discover tumor key genes and achieves practical effects.
To further verify the effectiveness of the screening method provided in this example, survival analysis was performed through the GEPIA database on the SCGB2A1 gene, one of the 8 genes clearly reported to be key genes of endometrial cancer, and on the SCGB1D2 gene, one of the other 8 genes that may be key genes of endometrial cancer. The analysis gave P = 0.044 for the SCGB2A1 gene (A in FIG. 6) and P = 0.011 for the SCGB1D2 gene (B in FIG. 6), both less than 0.05, which shows that both genes are significantly correlated with the prognosis of endometrial cancer patients.
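Steps S3 to S5 of this example can be summarized as a chain of set filters. The sketch below is a hedged illustration, not the inventors' implementation: the function name screen_key_genes, the dictionary layouts, the peak and gene names and all numeric values are hypothetical, and applying the log2FC cutoff to the absolute value is an assumption.

```python
def screen_key_genes(tc_group, sc_group, peak_genes, fpkm, log2fc,
                     tc_keep=frozenset({0}), sc_keep=frozenset({9}),
                     fpkm_min=10.0, log2fc_cutoff=3.0):
    """Return (peaks, prescreening genes, expressed genes, key genes)."""
    # S3: keep peaks that fall in the selected TC group(s) AND SC group(s),
    # then collect the genes located in those peaks.
    peaks = [p for p in tc_group
             if tc_group[p] in tc_keep and sc_group.get(p) in sc_keep]
    prescreen = {g for p in peaks for g in peak_genes.get(p, ())}
    # S4: keep genes whose tumor expression exceeds the FPKM threshold.
    expressed = {g for g in prescreen if fpkm.get(g, 0.0) > fpkm_min}
    # S5: keep genes whose tumor/normal difference reaches the log2FC cutoff.
    key = {g for g in expressed if abs(log2fc.get(g, 0.0)) >= log2fc_cutoff}
    return peaks, prescreen, expressed, key


# Invented toy inputs; gene names are placeholders, not real results.
tc_group = {"p1": 0, "p2": 0, "p3": 4, "p4": 0}
sc_group = {"p1": 9, "p2": 9, "p3": 9, "p4": 2}
peak_genes = {"p1": ["GENE_A", "GENE_B"], "p2": ["GENE_C"],
              "p3": ["GENE_D"], "p4": ["GENE_E"]}
fpkm = {"GENE_A": 25.0, "GENE_B": 3.0, "GENE_C": 40.0, "GENE_D": 90.0, "GENE_E": 50.0}
log2fc = {"GENE_A": 3.4, "GENE_B": 5.0, "GENE_C": -1.2, "GENE_D": 4.0, "GENE_E": 3.5}

peaks, prescreen, expressed, key = screen_key_genes(
    tc_group, sc_group, peak_genes, fpkm, log2fc)
print(sorted(peaks), sorted(prescreen), sorted(expressed), sorted(key))
```

Keeping the group selection as parameters (tc_keep, sc_keep) means the variations in the other examples only change the arguments, not the filtering logic.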
Example 2
This example provides a screening method similar to that of Example 1, except that in step S3, TC0 and SC0 are selected for screening key genes. The specific steps are as follows:
S3: prescreening gene acquisition
With reference to the peak grouping results in Example 1, screening yielded 1240 ATAC-seq peaks that belong to both TC0 and SC0, and 268 prescreening genes located in these peaks were obtained, as shown in FIG. 7.
S4: screening of Gene expression level
The prescreening genes were screened by FPKM > 10 using the tumor gene expression level data in the TCGA database, yielding 80 genes, as shown in FIG. 8.
S5: gene differential expression screening
The 80 genes obtained in step S4 were screened again with a log2FC cutoff of 3 using the differentially expressed gene data in the GEPIA database, yielding 5 key genes of endometrial cancer, as shown in FIG. 9: ACTA2, CHMP4C, NID1, PRR15L and SAPCD2.
S6: key gene verification
A search of the published literature found only one report of a clear correlation between the gene NID1 and endometrial cancer (10.1007/s10585-015-9720-7); the genes ACTA2 (10.1002/cbin.11451; 10.1186/s12935-020-01471-w; 10.3390/ijms 2112409) and SAPCD2 (10.26355/eurrev_202004_20844; 10.1186/s12935-020-1121-6; 10.1002/cam4.2227) have been reported to be associated with various tumors, such as lung cancer, cervical cancer, breast cancer and gastric cancer; the gene PRR15L (10.1007/s00428-019-02604-x) has only been reported to be closely associated with sigmoid colon cancer; and no reports were found for the gene CHMP4C.
Survival analysis through the GEPIA database showed that none of these genes is significantly correlated with the survival time of endometrial cancer patients.
Example 3
This example provides a screening method similar to that of Example 1, except that in step S3, TC0, SC8 and SC9 are selected for screening key genes. The specific steps are as follows:
S3: prescreening gene acquisition
With reference to the peak grouping results in Example 1, screening yielded 1447 ATAC-seq peaks that belong to TC0 and also to either SC8 or SC9, and 670 prescreening genes located in these peaks were obtained; FIGS. 10 to 12 show parts A, B and C of the 670 genes, respectively.
S4: screening of Gene expression level
The prescreening genes were screened by FPKM > 10 using the tumor gene expression level data in the TCGA database, yielding 220 genes; FIGS. 13 and 14 show parts A and B of the 220 genes, respectively.
S5: gene differential expression screening
For the 220 genes obtained in step S4, the differential expression gene data in the GEPIA database were used, with log2FC cutoff = 3, to screen again, yielding 20 key genes of endometrial cancer, as shown in FIG. 15. In addition to the 16 key genes obtained in Example 1, this example yields 4 further key genes: CCNB2, MUC1, S100A1 and THBS.
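The differential expression screening in step S5 can be illustrated with a short sketch. It assumes that log2FC is computed from one tumor and one normal expression value per gene with a pseudocount of 1, and that only genes at or above the cutoff are kept; the text does not describe how the GEPIA database computes the value, so both choices, as well as the function names and the data, are assumptions made for illustration.

```python
import math


def log2_fold_change(tumor_expr, normal_expr, pseudocount=1.0):
    """log2 of the tumor/normal ratio; the pseudocount avoids division by zero."""
    return math.log2((tumor_expr + pseudocount) / (normal_expr + pseudocount))


def filter_by_log2fc(expr_pairs, cutoff=3.0):
    """Keep genes whose log2FC is greater than or equal to the cutoff."""
    kept = {}
    for gene, (tumor_expr, normal_expr) in expr_pairs.items():
        lfc = log2_fold_change(tumor_expr, normal_expr)
        if lfc >= cutoff:
            kept[gene] = lfc
    return kept


# Hypothetical (tumor, normal) expression values, not GEPIA data.
example_pairs = {
    "GENE_A": (79.0, 9.0),   # (79 + 1) / (9 + 1) = 8, so log2FC = 3
    "GENE_B": (31.0, 7.0),   # (31 + 1) / (7 + 1) = 4, so log2FC = 2
    "GENE_C": (15.0, 15.0),  # ratio 1, so log2FC = 0
}

print(sorted(filter_by_log2fc(example_pairs, cutoff=3.0)))
print(sorted(filter_by_log2fc(example_pairs, cutoff=2.0)))
```

Lowering the cutoff from 3 to 2 admits more genes, which is why the choice of cutoff directly controls how many key genes are returned.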
S6: key gene verification
For the MUC1 gene, loss of MUC1 expression in endometrial cancer has been reported to correlate well with prognosis (10.1046/j.1365-2559.2002.01316.x). The S100A1 gene has been reported (10.1309/AJCPTK87EMMIKPFS) to be a marker of poor prognosis for endometrioid cancer subtypes. For the CCNB2 and THBS genes, no reports relating them to endometrial cancer were found, although one study predicts that CCNB2 may be a key gene of endometrial cancer; both genes therefore merit further study.
The screening results of the key genes of endometrial cancer provided in examples 1-3 are shown in Table 1.
TABLE 1 Comparison of screening results obtained in Examples 1 to 3 (unit: count)
Item                                           Example 1   Example 2   Example 3
ATAC-seq peaks                                 588         1240        1447
Prescreened genes                              417         268         670
Genes after expression level screening         148         80          220
Genes after differential expression screening  16          5           20
Selecting different groups therefore yields different numbers of key genes of endometrial cancer, so the screening method provided by the invention can be adapted to various requirements in research in related fields.
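To make the comparison in Table 1 easier to read, the proportion of genes surviving each gene-level filter can be worked out directly from the reported counts. The sketch below uses only the numbers in Table 1; the variable and function names are chosen here for illustration.

```python
# Counts copied from Table 1, in pipeline order: ATAC-seq peaks,
# prescreened genes, genes after expression level screening,
# genes after differential expression screening.
TABLE_1 = {
    "Example 1": [588, 417, 148, 16],
    "Example 2": [1240, 268, 80, 5],
    "Example 3": [1447, 670, 220, 20],
}


def retention(counts):
    """Return (fraction passing expression screening,
    fraction of those passing differential expression screening)."""
    _, prescreened, expressed, key = counts
    return expressed / prescreened, key / expressed


for name, counts in TABLE_1.items():
    after_expression, after_differential = retention(counts)
    print(f"{name}: {after_expression:.1%} pass expression level screening, "
          f"{after_differential:.1%} of those pass differential expression screening")
```

In all three examples roughly 30 to 35 percent of the prescreened genes pass the expression level filter, and roughly 6 to 11 percent of those pass the differential expression filter, so the difference in the final counts comes mainly from how many peaks and genes the selected groups contain.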
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The method for screening key genes of endometrial cancer based on ATAC sequencing data is characterized by comprising the steps of firstly selecting more than two indexes to sort peak positions obtained by ATAC sequencing; then, dividing the peak position sorting results of different indexes into groups according to the number of the peak positions, and selecting at least one group as a primary screening peak position group suspected to contain tumor key genes; then screening out the peak positions which are commonly contained in the primary screening peak position groups obtained by different indexes to form secondary screening peak position groups; finally, according to the gene expression value of endometrial cancer, the genes corresponding to all peak positions in the obtained secondary screening peak position groups are finally screened, and the tumor key genes are obtained.
2. The method of claim 1, wherein the indices comprise a combination of two or more of transcription factor number, chromatin opening signal intensity, size of peaks, or position of peaks on the genome;
preferably, the indicators are transcription factor number and chromatin opening signal intensity.
3. The method according to claim 2, wherein the method for obtaining the number of transcription factors comprises a method for identifying a binding site of a transcription factor;
preferably, the recognition method of the transcription factor binding site comprises the steps of processing transcription factor data of a TFBS database to obtain the corresponding transcription factor motifs, and recognizing the transcription factor binding sites of the ATAC-seq open chromatin regions using iForm;
preferably, the TFBS database comprises one or a combination of two or more of the TRANSFAC, JASPAR or UniPROBE databases;
further preferably, the TFBS database is the TRANSFAC database;
further preferably, the TFBS databases are the TRANSFAC, JASPAR and UniPROBE databases.
4. The method of claim 1, wherein the peak positions are divided into at least ten groups.
5. The method of claim 1, wherein the gene expression value for endometrial cancer comprises a gene expression level for endometrial cancer and/or a differential gene expression level for endometrial cancer.
6. The method of claim 5, wherein the gene expression level data for endometrial cancer is from the TCGA database;
preferably, when searching through the TCGA database, genes with FPKM >10 are selected.
7. The method of claim 5, wherein the differential gene expression level for endometrial cancer comprises log2FC, and genes with log2FC ≥ 2 are selected as key genes of endometrial cancer;
preferably, the differential gene expression level for endometrial cancer is obtained by searching the GEPIA database, and log2FC cutoff ≥ 2 is set when searching the GEPIA database;
further preferably, log2FC cutoff = 3 is set when searching the GEPIA database.
8. The method of claim 1, further comprising the step of validating the key gene for endometrial cancer.
9. The method of claim 8, wherein the step of validating comprises survival analysis and/or literature comparison.
10. Use of the method of any one of claims 1 to 9 in the preparation of a product for screening key genes of endometrial cancer.
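The peak grouping and intersection described in claim 1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the examples: peaks are ranked in ascending order of each index, split into ten groups of near-equal size, groups chosen from the same index are pooled, and the pooled sets from the two indices are intersected. The ordering direction, the pooling rule, the function names and the data are all assumptions.

```python
def split_into_groups(peak_ids, scores, n_groups=10):
    """Rank peaks by score (ascending) and assign each a group index
    from 0 to n_groups - 1, with near-equal numbers of peaks per group."""
    ordered = sorted(peak_ids, key=lambda peak: scores[peak])
    n = len(ordered)
    return {peak: rank * n_groups // n for rank, peak in enumerate(ordered)}


def secondary_peak_group(tf_count, signal, tf_groups, signal_groups, n_groups=10):
    """Intersect the primary groups chosen for the two indices
    (transcription factor number and chromatin opening signal intensity)."""
    peaks = list(tf_count)
    tc = split_into_groups(peaks, tf_count, n_groups)
    sc = split_into_groups(peaks, signal, n_groups)
    primary_tc = {peak for peak in peaks if tc[peak] in tf_groups}
    primary_sc = {peak for peak in peaks if sc[peak] in signal_groups}
    return primary_tc & primary_sc


# Hypothetical data: 20 peaks, so each of the ten groups holds two peaks.
tf_count = {f"peak_{i}": i for i in range(20)}
signal = {f"peak_{i}": float(19 - i) for i in range(20)}

# Selection analogous to Example 3: group 0 of the transcription factor
# index together with groups 8 and 9 of the signal intensity index.
selected = secondary_peak_group(tf_count, signal, tf_groups={0}, signal_groups={8, 9})
print(sorted(selected))
```

The genes located in the resulting peaks would then go through the expression level and differential expression screening steps of claims 5 to 7.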
CN202011452042.0A 2020-12-10 2020-12-10 Method for screening key gene of endometrial cancer based on ATAC sequencing data and application Pending CN112562785A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011452042.0A CN112562785A (en) 2020-12-10 2020-12-10 Method for screening key gene of endometrial cancer based on ATAC sequencing data and application

Publications (1)

Publication Number Publication Date
CN112562785A true CN112562785A (en) 2021-03-26

Family

ID=75061259

Country Status (1)

Country Link
CN (1) CN112562785A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113284559A (en) * 2021-07-21 2021-08-20 暨南大学 Method, system and equipment for querying promoter of species genome

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011137302A1 (en) * 2010-04-29 2011-11-03 The General Hospital Corporation Methods for identifying aberrantly regulated intracellular signaling pathways in cancer cells
WO2014004724A1 (en) * 2012-06-26 2014-01-03 Board Of Regents, The University Of Texas System Efficient functional genomics platform
CN104059966A (en) * 2014-05-20 2014-09-24 吴松 STAG2 gene mutant sequence and detection method thereof as well as use of STAG2 gene mutation in detecting bladder cancer
WO2017220782A1 (en) * 2016-06-24 2017-12-28 Molecular Health Gmbh Screening method for endometrial cancer
US20180126354A1 (en) * 2016-11-04 2018-05-10 Washington University Automated exposition of known and novel multiple myeloma genomic variants using a single sequencing platform
CN109616198A (en) * 2018-12-28 2019-04-12 陈洪亮 It is only used for the choosing method of the special DNA methylation assay Sites Combination of the single cancer kind screening of liver cancer
CN109837335A (en) * 2019-03-20 2019-06-04 福建省农业科学院食用菌研究所(福建省蘑菇菌种研究推广站) A method of joint ATAC-seq and RNA-seq screens edible and medical fungi functional gene
CN110272985A (en) * 2019-06-26 2019-09-24 广州市雄基生物信息技术有限公司 Tumor screening kit and its System and method for based on peripheral blood plasma DNA high throughput sequencing technologies
WO2020031206A1 (en) * 2018-08-08 2020-02-13 Indian Institute Of Science Education And Research Combined expression pattern of satb family chromatin organizers as improved biomarker tool for cancer prognosis
CN111128299A (en) * 2019-12-16 2020-05-08 南京邮电大学 Construction method of ceRNA regulation and control network with significant correlation to colorectal cancer prognosis
US20200211674A1 (en) * 2018-12-31 2020-07-02 Nvidia Corporation Denoising ATAC-Seq Data With Deep Learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SUN, YY,等: "Detect accessible chromatin using ATAC-sequencing, from principle to applications", HEREDITAS, vol. 156, no. 01, pages 29 *
TANG, XH,等: "Regulatory patterns analysis of transcription factor binding site clustered regions and identification of key genes in endometrial cancer", COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, vol. 20, pages 812 - 823 *
伊梦杰,等: "联合应用RNA-seq和ATAC-seq寻找FOXQ1转录因子的下游靶基因", 生物信息学, vol. 17, no. 04, pages 227 - 236 *
杨晓,等: "西北地区751例新生儿耳聋基因突变筛查", 发育医学电子杂志, vol. 08, no. 02, pages 140 - 144 *
董云巧,等: "高通量筛选肿瘤转移相关基因的研究", 热带医学杂志, no. 08, pages 943 - 944 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Lu Meisong

Inventor after: Tang Xiaohan

Inventor after: Chen Hebing

Inventor after: Wang Junting

Inventor after: Li Hao

Inventor after: Tao Huan

Inventor after: Bo Xiaochen

Inventor before: Lu Meisong

Inventor before: Chen Hebing

Inventor before: Tang Xiaohan

Inventor before: Wang Junting

Inventor before: Li Hao

Inventor before: Tao Huan

Inventor before: Bo Xiaochen
