CN115083517A - Data processing method and system for identifying enhancer and super enhancer - Google Patents

Data processing method and system for identifying enhancer and super enhancer

Info

Publication number
CN115083517A
Authority
CN
China
Prior art keywords
sample data
enrichment
data
region
enhancer
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210802841.9A
Other languages
Chinese (zh)
Other versions
CN115083517B (en)
Inventor
李春权
王秋毓
宋超
钱凤翠
尚德思
刘佳琦
杨用三
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
First Affiliated Hospital of University of South China
Original Assignee
First Affiliated Hospital of University of South China
Application filed by First Affiliated Hospital of University of South China filed Critical First Affiliated Hospital of University of South China
Priority to CN202210802841.9A priority Critical patent/CN115083517B/en
Publication of CN115083517A publication Critical patent/CN115083517A/en
Application granted granted Critical
Publication of CN115083517B publication Critical patent/CN115083517B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a data processing method, system, device and computer-readable storage medium for identifying enhancers and super enhancers. The method comprises the following steps: acquiring sample data; preprocessing the sample data to obtain the enrichment regions of the sample data, together with the position information and signal intensity of the chromatin regions corresponding to those enrichment regions; extracting features from the position information and the signal intensity respectively to obtain position information feature data and signal intensity feature data; stitching adjacent enrichment regions of the sample data together according to the position information feature data of the chromatin regions to obtain stitched enrichment regions of the sample data; sorting the stitched enrichment regions and the unstitched enrichment regions of the sample data based on the signal intensity feature data of the chromatin regions to obtain a sorting result; and outputting the sorting result.

Description

Data processing method and system for identifying enhancer and super enhancer
Technical Field
The invention relates to the technical field of bioinformatics and genomics, and in particular to a data processing method and a data processing system for identifying enhancers and super enhancers.
Background
In the past few years, there has been great interest in understanding the regulatory role of genomic regions and research into regulatory DNA has made new and important advances. With the rapid development of genome high-throughput sequencing technology, functional genome region data is rapidly increasing. How to understand these unknown genomic regions is currently an urgent task. In human biological processes, functional genomic regions play a role in the development of disease, primarily by affecting gene expression levels. To exploit the function of genomic regions in transcriptional regulation, it is necessary to integrate existing, extensive genetic and epigenetic information.
Functional enrichment analysis of gene sets is a very successful bioinformatics analysis method. It compares a newly found gene set one by one with gene sets of known function in order to infer the functions of the new genes. This approach has a key limitation: the data must be gene-centered. Taking gene set enrichment analysis as a reference, researchers have proposed enrichment analysis of genome regions. Similarly, existing genome region data and newly found region data are compared to find the overlapping regions between the two sets, and an enrichment significance score is then calculated using a statistical method. Through enrichment analysis of genome regions, researchers can better explore the biological functions of those regions. A large body of human genetics and epigenetics research has rapidly accumulated diverse data sets such as ChIP-seq and ATAC-seq, and organizing and using these data sets effectively is very important for studying genome regions. Meanwhile, many published studies have shown that functional genomic regions such as enhancers and super enhancers play a role in human biological processes that is difficult to replace.
The identification of enhancers and their strength is one of the hotspots of biological research and has attracted a large number of researchers. Previously, researchers had no choice but to address this problem experimentally, for example by chromatin immunoprecipitation, deep sequencing, DNase I hypersensitivity assays and genome-wide mapping of histone modifications. However, these experimental methods are expensive, time consuming and inefficient. Computational methods are therefore needed to identify enhancers and their strength, and some work has already been done in this direction. For example, in 2016, Liu et al. established a two-layer predictor that can recognize not only enhancers but also their strength; Jia et al. established an identifier that discovers enhancers by combining and selecting various features; two years later, Liu et al. proposed a model based on ensemble learning to identify enhancers and their strength; in 2019, Nguyen et al. proposed using an ensemble of convolutional neural networks to identify enhancers and their strength. However, the overall recognition accuracy is not very high, and the prior art also lacks a method for directly and intelligently distinguishing common enhancers from super enhancers. A new computational method is therefore required that identifies enhancers and their strength and effectively distinguishes common enhancers from super enhancers. In addition, the identified result should be directly usable as one of the data types in enrichment analysis of genome regions, so that researchers can better explore the biological function of those regions.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention provides a data processing method for identifying enhancers and super enhancers, which can quickly and accurately identify enhancers and their strength and can effectively distinguish common enhancer regions from super enhancer regions. In addition, the processing result obtained by the method can be applied to genome region annotation enrichment analysis, which supports the accuracy, portability and systematic nature of the enrichment analysis results, helps uncover the biological principles hidden in the data, and helps address related life science problems.
The application discloses a data processing method for identifying an enhancer and a super enhancer, which comprises the following steps:
acquiring sample data;
preprocessing the sample data to obtain the enrichment regions of the sample data, and the position information and signal intensity of the chromatin regions corresponding to the enrichment regions of the sample data;
extracting features from the position information and the signal intensity respectively to obtain position information feature data and signal intensity feature data;
stitching adjacent enrichment regions of the sample data together according to the position information feature data of the chromatin regions to obtain the stitched enrichment regions of the sample data;
sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity feature data of the chromatin regions to obtain a sorting result;
and outputting the sorting result.
The stitched enrichment regions of the sample data are obtained as follows:
removing the enrichment regions of the sample data that lie within a first threshold of a gene transcription start site to obtain the enrichment regions of the sample data in non-promoter regions;
stitching together the non-promoter enrichment regions whose spacing is smaller than a second threshold to obtain the stitched enrichment regions of the sample data; each stitched enrichment region of the sample data is a stitched enhancer;
optionally, the position information feature data includes: the start position and end position of the enrichment region on the genome; the signal intensity feature data includes: the height of the signal peak of the enrichment region.
The preprocessing comprises sequence alignment and enrichment region identification;
the sequence alignment process comprises the following steps: acquiring a fastq file containing reads; aligning the reads in the fastq file to a reference genome with an alignment algorithm to obtain the position of each read in the reference genome; the position information of the reads in the reference genome is stored in a sam file;
the process of identifying enrichment regions comprises: obtaining the sam file and analyzing it with an algorithm to obtain the enrichment regions of the sample data; the enrichment regions of the sample data are stored in a bed file;
optionally, the preprocessing further includes: format conversion;
the format conversion process includes: acquiring the sam file and converting it into a bam file with an algorithm;
acquiring the bam file and analyzing it with an algorithm to obtain the enrichment regions of the sample data; the enrichment regions of the sample data are stored in the bed file.
The process of sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity feature data of the chromatin regions to obtain a sorting result includes:
normalizing the enrichment region signal of the sample data in the genome region to obtain the background-normalized density level of the enrichment region signal of the sample data; and sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data according to this background-normalized density level to obtain a sorting result.
The data processing method further comprises: plotting the sorting result to obtain a distribution curve based on the signal intensity feature data of the chromatin regions;
calculating the slope of the distribution curve and outputting the slope;
comparing the slope with a reference slope threshold, and outputting the enrichment regions whose slope is greater than, equal to or less than the reference slope threshold;
optionally, an enrichment region whose slope is greater than or equal to the reference slope threshold is determined to be a super enhancer, and an enrichment region whose slope is less than the reference slope threshold is determined to be a common enhancer.
The sample data comprises the sample data after screening processing; the sample data after screening processing comprises a plurality of groups of sample data, and each group of sample data comprises H3K27ac data and corresponding input data;
the method or the steps for acquiring the sample data after the screening processing comprise:
obtaining original sample data; the original sample data comprises: cell tissue type, treatment condition, sample number of the sample;
screening the original sample data to obtain the screened sample data; the sample data after screening processing is unique and non-redundant;
optionally, the screening process includes: checking the sample data according to the unique sample number; the verification adopts manual verification.
A data processing system for identifying an enhancer and a super enhancer, comprising:
an acquisition unit configured to acquire sample data;
a first processing unit, configured to preprocess the sample data to obtain the enrichment regions of the sample data, and the position information and signal intensity of the chromatin regions corresponding to the enrichment regions of the sample data, and further configured to extract features from the position information and the signal intensity respectively to obtain position information feature data and signal intensity feature data;
a second processing unit, configured to stitch adjacent enrichment regions of the sample data together according to the position information feature data of the chromatin regions to obtain the stitched enrichment regions of the sample data;
a third processing unit, configured to sort the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity feature data of the chromatin regions to obtain a sorting result;
and an output unit, configured to output the sorting result.
The application has the following beneficial effects:
1. The application discloses a data processing method for identifying enhancers and super enhancers which starts from raw data, obtains enrichment regions through steps such as preprocessing, and identifies enhancers and their strength; by sorting and plotting the enrichment regions, it effectively distinguishes common enhancer regions from super enhancer regions, thereby greatly improving the precision and depth of data analysis.
2. The processing result of the data processing method for identifying enhancers and super enhancers is applied to genome region annotation enrichment analysis as one of its data sets, which supports the accuracy, portability and systematic nature of the enrichment analysis results and makes full use of the sample data; the biological principles hidden in the data can be mined in depth and applied to genome enrichment analysis research on various diseases, so the method has great application value.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic flow chart illustrating an analysis of a data processing method for identifying an enhancer and a super enhancer according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a data processing apparatus for identifying an enhancer and a super enhancer according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a data processing system for identifying an enhancer and super enhancer provided by an embodiment of the present invention;
FIG. 4 is a schematic flow chart of the analysis of the enrichment analysis method based on the annotation of genome region provided by the embodiment of the invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
In some of the flows described in the present specification and claims and in the above figures, a number of operations are included that occur in a particular order, but it should be clearly understood that these operations may be performed out of order or in parallel as they occur herein, with the order of the operations being indicated as 101, 102, etc. merely to distinguish between the various operations, and the order of the operations by themselves does not represent any order of performance. Additionally, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first", "second", etc. in this document are used for distinguishing different messages, devices, modules, etc., and do not represent a sequential order, nor limit the types of "first" and "second" to be different.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a data processing method for identifying an enhancer and a super enhancer according to an embodiment of the present invention, specifically, the method includes the following steps:
101: acquiring sample data;
in one embodiment, the sample data comprises sample data after screening; the screened sample data comprises multiple groups of sample data, and each group comprises H3K27ac data and the corresponding input data. H3K27ac generally acts to promote gene expression and is regarded as a marker of gene activation. In a ChIP experiment, the input is total DNA that has not undergone antibody enrichment and serves as the background reference in the analysis. During the experiment, the input is set aside before antibody enrichment; the input and the enriched IP sample are then de-crosslinked and purified for library construction. The input can also be used to verify the fragmentation of the IP sample and to provide the background for the final analysis.
In one embodiment, the method or step for acquiring the sample data after the screening process includes:
obtaining original sample data; the 542 publicly available human samples used here were processed and collected from H3K27ac ChIP-seq data of NCBI GEO/SRA, ENCODE, Roadmap and GGR; in these public databases, the keywords "H3K27ac" and "ChIP-seq" were used to search for samples, yielding information on more than 2,000 original samples, including the cell or tissue type, treatment condition and sample number of each sample;
screening the original sample data to obtain screened sample data; the screened sample data is unique and non-redundant; the screening process comprises: checking the sample data against the unique sample number; this check is performed manually. Finally, 542 publicly available human samples were collected from these four public data sources.
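The screening step above can be illustrated with a minimal sketch. The record layout and the field names (`sample_id`, `tissue`, `treatment`) and the sample numbers are hypothetical and not taken from the patent; the sketch only groups records by sample number so that repeated entries can be passed on to the manual check described in the text.

```python
from collections import defaultdict


def screen_samples(records):
    """Keep one record per sample number and report repeated sample numbers.

    `records` is a list of dicts with a hypothetical "sample_id" key.
    Returns (unique_records, duplicates), where `duplicates` maps each
    repeated sample number to all records that share it, so they can be
    reviewed manually.
    """
    by_id = defaultdict(list)
    for rec in records:
        by_id[rec["sample_id"]].append(rec)
    unique = [recs[0] for recs in by_id.values()]
    duplicates = {sid: recs for sid, recs in by_id.items() if len(recs) > 1}
    return unique, duplicates


# Hypothetical records for illustration only.
records = [
    {"sample_id": "S001", "tissue": "liver", "treatment": "none"},
    {"sample_id": "S002", "tissue": "lung", "treatment": "none"},
    {"sample_id": "S001", "tissue": "liver", "treatment": "none"},
]
unique, duplicates = screen_samples(records)
```

Keeping the duplicate groups rather than silently discarding them fits the manual verification the text calls for.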
Bioinformatics combines biology with mathematics and computer science. It draws on methods and tools from mathematics, information science and other fields to acquire, process, store, analyze and interpret biological information, and thereby clarifies the biological significance contained in large amounts of biological data. Its research focuses mainly on genomics and proteomics.
102: preprocessing the sample data to obtain the enrichment regions of the sample data, and the position information and signal intensity of the chromatin regions corresponding to the enrichment regions of the sample data; extracting features from the position information and the signal intensity respectively to obtain position information feature data and signal intensity feature data;
in one embodiment, the preprocessing comprises sequence alignment and enrichment region identification;
the sequence alignment process comprises the following steps: acquiring a fastq file containing reads; aligning the reads in the fastq file to a reference genome with an alignment algorithm to obtain the position of each read in the reference genome; the position information of the reads in the reference genome is stored in a sam file; the reference genome is the hg19 reference genome downloaded from UCSC; the software used for the sequence alignment process includes, but is not limited to, Bowtie. The specific commands are as follows, where -n 2 sets the maximum number of mismatched bases allowed in the high-fidelity (seed) region to 2, and -e 70 specifies that the total of the Phred quality values at mismatched positions cannot exceed 70:
Single-end: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome H3K27ac.fastq H3K27ac.sam
Single-end: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome input.fastq input.sam
Paired-end: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome -1 H3K27ac_1.fastq -2 H3K27ac_2.fastq H3K27ac.sam
Paired-end: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome -1 input_1.fastq -2 input_2.fastq input.sam
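As an illustration, the four alignment commands above differ only in their read files, so they can be generated from one helper. The function name `bowtie_command` and its parameters are hypothetical; the option values are copied from the commands in the text, and the sketch only builds the argument list without running Bowtie.

```python
def bowtie_command(index, out_sam, reads=None, mate1=None, mate2=None):
    """Build the Bowtie argument list for single-end or paired-end reads.

    The fixed options (-e 70 -k 2 -n 2 -m 2 -S -q) are taken from the
    commands in the text; everything else is supplied by the caller.
    """
    cmd = ["bowtie", "-e", "70", "-k", "2", "-n", "2", "-m", "2", "-S", "-q", index]
    if mate1 and mate2:
        cmd += ["-1", mate1, "-2", mate2]   # paired-end reads
    elif reads:
        cmd.append(reads)                   # single-end reads
    else:
        raise ValueError("provide either reads or both mate1 and mate2")
    cmd.append(out_sam)
    return cmd


single = bowtie_command("genome", "H3K27ac.sam", reads="H3K27ac.fastq")
paired = bowtie_command("genome", "input.sam",
                        mate1="input_1.fastq", mate2="input_2.fastq")
print(" ".join(single))
print(" ".join(paired))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where Bowtie and the hg19 index are installed.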
The process of identifying enrichment regions comprises: obtaining the sam file and analyzing it with an algorithm to obtain the enrichment regions of the sample data; the enrichment regions of the sample data are stored in a bed file; the software used for identifying enrichment regions includes, but is not limited to, MACS. Regions enriched in H3K27ac were found using MACS14. In the command below, --keep-dup controls how duplicate reads are retained: with the auto setting, MACS uses a binomial distribution to estimate the number of duplicate reads to allow at each position (the default is 1, meaning that one read per position is the most probable case). The input is the bam files and the output is a bed file of the enrichment regions. The specific command is as follows:
macs14 -p 1e-9 -w -S --keep-dup=auto --wig --single-profile --space=50 -c input.sort.bam -t H3K27ac.sort.bam -g hs -n macs
optionally, the preprocessing process further includes: format conversion;
the format conversion process includes: acquiring the sam file and converting it into a bam file with an algorithm; the software used for format conversion includes, but is not limited to, SAMtools; the sam file is converted into a binary bam file using samtools, and the specific commands are as follows:
samtools view -b -S H3K27ac.sam > H3K27ac.bam
samtools sort H3K27ac.bam > H3K27ac.sort.bam
samtools index H3K27ac.sort.bam H3K27ac.sort.bam.bai
samtools view -b -S input.sam > input.bam
samtools sort input.bam > input.sort.bam
samtools index input.sort.bam input.sort.bam.bai
obtaining the bam file and analyzing it with an algorithm to obtain the enrichment regions of the sample data; the enrichment regions of the sample data are stored in the bed file. This step uses the MACS software.
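The conversion steps above are the same for the H3K27ac and input files, so they can be written once and parameterized by file prefix. This is a sketch only: the helper names `samtools_steps` and `run_steps` are hypothetical, the samtools arguments are copied from the commands in the text, and the shell redirection `>` is modeled by capturing standard output into a file.

```python
import subprocess


def samtools_steps(prefix):
    """Return (argv, stdout_path) pairs mirroring the commands in the text.

    `stdout_path` is the file that receives standard output (the `>` in
    the shell commands), or None when the command writes its own output.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sort.bam"
    return [
        (["samtools", "view", "-b", "-S", sam], bam),
        (["samtools", "sort", bam], sorted_bam),
        (["samtools", "index", sorted_bam, sorted_bam + ".bai"], None),
    ]


def run_steps(steps):
    """Run each step in order; requires samtools to be installed."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)


steps = samtools_steps("H3K27ac")
for argv, stdout_path in steps:
    print(" ".join(argv), "->", stdout_path)
```

Calling `run_steps(samtools_steps("H3K27ac"))` and `run_steps(samtools_steps("input"))` would reproduce the six commands, assuming the sam files exist in the working directory.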
103: stitching the enrichment regions of the adjacent sample data together according to the position information characteristic data of the chromatin region to obtain the stitched enrichment regions of the sample data;
in one embodiment, super enhancers are identified by stitching enhancers with ROSE; the stitched enrichment regions of the sample data are obtained as follows:
removing the enrichment regions of the sample data that lie within a first threshold of a gene transcription start site to obtain the enrichment regions of the sample data in non-promoter regions; the first threshold is ±2 kb, so that all regions used in the calculation are non-promoter regions;
stitching together the non-promoter enrichment regions whose spacing is smaller than a second threshold to obtain the stitched enrichment regions of the sample data; each stitched enrichment region of the sample data is a stitched enhancer; the second threshold is 12,500 bp; a stitched enhancer is defined as a single entity spanning a genomic region and is composed of multiple enhancer elements.
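The two steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the ROSE implementation: regions are `(chrom, start, end)` tuples, a region is treated as a promoter region only when it lies entirely inside the ±2 kb window around a TSS (the text does not specify partial overlaps), and the function names and example coordinates are hypothetical.

```python
def exclude_promoter_regions(regions, tss_sites, window=2000):
    """Drop regions lying entirely within `window` bp of a TSS.

    `tss_sites` maps a chromosome name to a list of TSS positions.
    The "entirely within" rule is an assumption of this sketch.
    """
    kept = []
    for chrom, start, end in regions:
        inside = any(tss - window <= start and end <= tss + window
                     for tss in tss_sites.get(chrom, []))
        if not inside:
            kept.append((chrom, start, end))
    return kept


def stitch_regions(regions, max_gap=12500):
    """Merge regions on the same chromosome separated by less than max_gap.

    Returns (chrom, start, end, n_constituents) tuples.
    """
    stitched = []
    for chrom, start, end in sorted(regions):
        if (stitched and stitched[-1][0] == chrom
                and start - stitched[-1][2] < max_gap):
            _, s, e, n = stitched[-1]
            stitched[-1] = (chrom, s, max(e, end), n + 1)
        else:
            stitched.append((chrom, start, end, 1))
    return stitched


# Hypothetical coordinates for illustration only.
peaks = [("chr1", 1000, 1500), ("chr1", 10000, 12000), ("chr1", 20000, 21000),
         ("chr1", 50000, 51000), ("chr2", 5000, 6000)]
tss = {"chr1": [1200]}
non_promoter = exclude_promoter_regions(peaks, tss)
stitched = stitch_regions(non_promoter)
print(stitched)
```

In this example the first peak is removed as a promoter region, the next two chr1 peaks are 8,000 bp apart and are stitched into one region with two constituents, and the remaining peaks stay unstitched.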
The software used for identifying common enhancers and super enhancers includes, but is not limited to, ROSE. ROSE is used to identify super enhancers and common enhancers from the H3K27ac enrichment regions found by MACS. The option -s 12500 stitches enrichment regions separated by less than 12,500 bp; the option -t 2000 sets the TSS exclusion size, excluding regions within 2,000 bp upstream and downstream of a TSS to remove promoter bias. The output is a txt file containing information such as the positions of the super enhancers and common enhancers, their ranks in the sample, and the number of constituent elements. The specific command is as follows:
python ROSE_main.py -g HG19 -i macs_peaks.gff -c input.sort.bam -r H3K27ac.sort.bam -o name -s 12500 -t 2000
optionally, the position information feature data includes: a starting position and an ending position of the enrichment region on the genome; the signal strength characteristic data comprises: height of signal peak of the enrichment region.
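To make the feature extraction concrete, the sketch below reads enrichment regions from BED-formatted text and keeps the start position, end position and a signal value for each. It assumes a tab-separated, five-column BED layout (chromosome, start, end, name, score) and uses the score column as a stand-in for peak height; both are assumptions of this sketch, and the peak names and values are invented for illustration.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int    # position feature: start of the enrichment region
    end: int      # position feature: end of the enrichment region
    name: str
    signal: float  # signal feature; here read from the BED score column


def parse_bed(lines):
    """Parse tab-separated BED lines into Peak records, skipping headers."""
    peaks = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else ""
        signal = float(fields[4]) if len(fields) > 4 else 0.0
        peaks.append(Peak(fields[0], int(fields[1]), int(fields[2]), name, signal))
    return peaks


# Hypothetical BED content for illustration only.
bed_text = [
    "track name=example",
    "chr1\t100\t600\tpeak_1\t85.3",
    "chr1\t2000\t2450\tpeak_2\t12.0",
]
peaks = parse_bed(bed_text)
for p in peaks:
    print(p.chrom, p.start, p.end, p.end - p.start, p.signal)
```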
104: sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity feature data of the chromatin regions to obtain a sorting result;
in one embodiment, sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity feature data of the chromatin regions to obtain a sorting result includes:
normalizing the enrichment region signal of the sample data in the genome region to obtain the background-normalized density level of the enrichment region signal of the sample data; and sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data according to this background-normalized density level to obtain a sorting result. The stitched enhancers and the remaining single enhancers are ordered by signal from largest to smallest.
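A minimal sketch of the normalization and sorting step follows. The text does not give the normalization formula, so this sketch assumes the background-normalized level is the H3K27ac density minus the input density, floored at zero; the function names and the density values are hypothetical.

```python
def background_normalized(signal, control):
    """Assumed normalization: subtract the input density and floor at zero."""
    return max(signal - control, 0.0)


def rank_regions(regions):
    """Sort regions by background-normalized signal, largest first.

    `regions` is a list of (name, h3k27ac_density, input_density) tuples.
    Returns a list of (name, normalized_signal) tuples.
    """
    scored = [(name, background_normalized(sig, ctl)) for name, sig, ctl in regions]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical densities for illustration only.
regions = [("E1", 120.0, 20.0), ("E2", 15.0, 18.0), ("E3", 900.0, 50.0)]
ranking = rank_regions(regions)
print(ranking)
```

Stitched and unstitched regions are treated identically here, so both kinds can be passed in one list.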
The data processing method further comprises: plotting the sorting result to obtain a distribution curve based on the signal intensity feature data of the chromatin regions;
calculating the slope of the distribution curve and outputting the slope;
comparing the slope with a reference slope threshold, and outputting the enrichment regions whose slope is greater than, equal to or less than the reference slope threshold;
optionally, an enrichment region whose slope is greater than or equal to the reference slope threshold is determined to be a super enhancer, and an enrichment region whose slope is less than the reference slope threshold is determined to be a common enhancer.
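The slope comparison can be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: regions are ordered by increasing signal, both axes are rescaled to the range 0 to 1 so that the slope is unit-free, the slope at each region is approximated by a finite difference to its lower-ranked neighbor, and the reference slope threshold defaults to 1.0. The function name and signal values are hypothetical.

```python
def classify_by_slope(signals, slope_threshold=1.0):
    """Label each region "super" or "common" from the local curve slope.

    `signals` maps a region name to its background-normalized signal.
    """
    order = sorted(signals, key=signals.get)   # ascending signal
    n = len(order)
    if n < 2:
        raise ValueError("need at least two regions")
    top = signals[order[-1]]
    if top <= 0:
        raise ValueError("largest signal must be positive")
    dx = 1.0 / (n - 1)                         # rank axis rescaled to [0, 1]
    labels = {}
    for i, name in enumerate(order):
        # Backward difference; the lowest-ranked region uses a forward one.
        lo, hi = (i - 1, i) if i > 0 else (0, 1)
        slope = (signals[order[hi]] - signals[order[lo]]) / top / dx
        labels[name] = "super" if slope >= slope_threshold else "common"
    return labels


# Hypothetical signals for illustration only.
values = [1, 2, 3, 4, 5, 6, 7, 8, 60, 100]
signals = {f"E{i}": v for i, v in enumerate(values, start=1)}
labels = classify_by_slope(signals)
supers = {name for name, label in labels.items() if label == "super"}
print(sorted(supers))
```

In this example only the two regions at the steep end of the curve (E9 and E10) reach the threshold; the flat part of the curve is labeled as common enhancers.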
105: outputting the sorting result.
FIG. 2 shows a data processing device for identifying enhancers and super enhancers provided by an embodiment of the present invention, which comprises: a memory and a processor;
the memory is to store program instructions;
the processor is configured to call program instructions, and when the program instructions are executed, the processor is configured to perform the above-mentioned data processing method for identifying the enhancer and the super enhancer.
FIG. 3 is a schematic view of a data processing system for identifying enhancers and super enhancers provided by an embodiment of the present invention, which comprises:
an obtaining unit 301, configured to obtain sample data;
a first processing unit 302, configured to preprocess the sample data to obtain the enrichment regions of the sample data, and the position information and signal intensity of the chromatin regions corresponding to the enrichment regions of the sample data, and further configured to extract features from the position information and the signal intensity respectively to obtain position information feature data and signal intensity feature data;
the second processing unit 303 is configured to stitch together the enrichment regions of adjacent sample data according to the position information characteristic data of the chromatin region to obtain a stitched enrichment region of the sample data;
a third processing unit 304, configured to sort, based on the signal intensity characteristic data of the chromatin region, the enriched regions of the stitched sample data and the enriched regions of the unstitched sample data to obtain a sorting result;
an output unit 305, configured to output the sorting result.
A computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program carries out the above-mentioned data processing method for identifying enhancers and super enhancers.
FIG. 4 is a schematic flow chart of an enrichment analysis method based on genome region annotation provided by an embodiment of the invention;
an enrichment analysis method based on genome region annotation, comprising:
acquiring a genome region set;
matching the genome region set with a sample data reference set, and calculating enrichment score values of the genome region set and the sample data reference set; the sample data reference set comprises an enhancer region and a super enhancer region which are identified based on the sample data reference set;
obtaining an enrichment significance ranking for the genome region set based on the enrichment score values.
An enrichment analysis system based on genome region annotation, comprising:
a collection unit for obtaining a set of genomic regions;
an analysis unit, configured to match the genome region set with a sample data reference set and to calculate the enrichment score values of the genome region set and the sample data reference set; the sample data reference set comprises an enhancer region and a super enhancer region which are identified based on the sample data reference set;
a ranking unit, configured to obtain an enrichment significance ranking of the genome region set based on the enrichment score values.
In one embodiment, the sample data reference set further comprises the following: chromatin state data based on a hidden Markov model algorithm, transcription factors and transcription cofactors, histone modifications, open chromatin regions, SNP data, methylation data, lncRNA and mRNA;
the method for acquiring the chromatin state data based on the hidden Markov model algorithm comprises the following steps: collecting the 15 core chromatin states, including enhancers, promoters, insulators and heterochromatin, that were calculated in the Roadmap database using ChromHMM from five chromatin marks: H3K4me3, H3K4me1, H3K36me3, H3K27me3 and H3K9me3.
The method for acquiring the transcription factors and transcription cofactors comprises the following steps: transcription factor and transcription cofactor ChIP-seq data were downloaded from Cistrome; the transcription factor binding region information covers over 6,000 samples, including 57 tissue types and 2,528 transcription factors; the transcription cofactor binding region information covers 3,000 samples, including 41 tissue types and 973 transcription cofactors.
The method for acquiring histone modifications comprises the following steps: histone modification ChIP-seq data were downloaded from ENCODE to provide researchers with user-friendly analysis. These ChIP-seq data contain more than 1,400 samples covering 33 histone modifications, such as H3K27ac, H3K27me3 and H3K4me1.
The method for obtaining open chromatin regions comprises the following steps: 1,493 ATAC-seq datasets covering a variety of cell and tissue types were downloaded from NCBI GEO/SRA. A unified pipeline of Bowtie2, SAMtools and MACS2 was used to identify open chromatin regions, and the ATACdb database was developed.
The method for acquiring the SNP data comprises the following steps: human eQTL datasets were downloaded from PancanQTL and merged. The PancanQTL data include eQTL-gene pairs for different cancers in TCGA. GWAS provide a large body of data linking genetic variation to common phenotypes. The present invention collected risk SNPs from the NHGRI GWAS Catalog and then filtered out risk SNPs whose "Variant ID" is not an rsID. Finally, 1,515,001 risk SNPs associated with diseases, traits and phenotypes were obtained. Considering the influence of SNPs on gene expression, the invention extends the SNP sites by 10kb/1kb, 15kb/1kb and 20kb/1kb respectively; the extension ranges are not limited to these.
The method for acquiring the methylation data comprises the following steps: 198,468,712 methylation sites were obtained from the 450K chip data of ENCODE. The present invention classifies these sites as hypermethylated or hypomethylated based on the beta value: sites with a beta value greater than 0.6 are considered hypermethylated, while sites with a beta value greater than 0.2 and less than 0.6 are considered hypomethylated. Finally, the methylation sites are extended by 10kb/1kb, 15kb/1kb and 20kb/1kb respectively; the extension ranges are not limited to these.
The method for obtaining lncRNA and mRNA comprises the following steps: multi-class lncRNA sets, including disease, drug, subcellular localization, cancer markers, smif, exosomes and cell markers, were collected from the LncSEA database. In addition, the present invention collects human cell marker information for mRNA from the CellMarker database. After the annotation file was downloaded from GENCODE, the transcription start sites of these mRNAs were extended to 2kb/1kb, 5kb/1kb and 10kb/1kb regions respectively and used as mRNA reference set subclasses (Cell_Marker_2kb, Cell_Marker_5kb and Cell_Marker_10kb). In addition, GO term gene set information containing protein-coding genes was collected. Similarly, the transcription start sites of the genes in each GO term were extended to 2kb/1kb, 5kb/1kb and 10kb/1kb regions respectively and used as mRNA reference set subclasses (Goterm_2kb, Goterm_5kb and Goterm_10kb); the extension ranges mentioned above are not limited to these.
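The extension of a transcription start site into a region can be sketched as follows. This assumes that "2kb/1kb" means 2 kb upstream and 1 kb downstream of the site, consistent with the "-2/+1 kb" notation used in the case studies, and that upstream is measured relative to the gene's strand; the function name `extend_tss` and its parameters are hypothetical.

```python
from typing import Tuple


def extend_tss(tss: int, strand: str, upstream: int, downstream: int) -> Tuple[int, int]:
    """Return (start, end) of the window around a TSS, measured along the gene's strand."""
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, upstream lies at higher genomic coordinates.
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp at the chromosome start so coordinates never go negative.
    return max(0, start), end


plus_window = extend_tss(10000, "+", upstream=2000, downstream=1000)
minus_window = extend_tss(10000, "-", upstream=2000, downstream=1000)
```

The same helper would cover the 5kb/1kb and 10kb/1kb subclasses by changing the `upstream` argument.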
In one embodiment, said calculating the enrichment score values of the genome region set and the sample data reference set comprises, but is not limited to, the following methods: the hypergeometric test and Locus Overlap Analysis (LOLA);
the analysis process of the hypergeometric test comprises the following steps: identifying the regions of the genome region set that overlap with the regions of the sample data reference set, to obtain the number of overlapping regions; applying the hypergeometric test to the number of overlapping regions to obtain the enrichment score value; two regions are regarded as overlapping if they share at least one base;
specifically, the hypergeometric test first finds the overlapping regions of the two sets using the Bedtools software, and then inputs the number of overlapping regions into the hypergeometric test to calculate an enrichment score P; the enrichment significance P value is calculated as:
Figure BDA0003734888510000131
the conventional method;
wherein M represents the number of reference set regions; N represents the number of background set regions, which consist of DNaseI signal regions; n represents the number of user input regions; k represents the number of user input regions that overlap the reference set.
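The formula image itself is not reproduced in this text. Reading the background count as N and the other symbols as defined around the formula, and assuming the standard upper-tail form of the hypergeometric test (an assumption, not a transcription of the original figure), the P value would read:

```latex
P = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}
```

That is, P is the probability of observing at least k overlapping regions when n regions are drawn from N background regions of which M belong to the reference set.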
After the genome region set is obtained, when the hypergeometric test is selected to calculate the enrichment score, the system generates a BED file for the genome region set and stores it on the server, and then uses the Bedtools software to find the overlapping regions of the two sets. In the present invention, two regions are regarded as overlapping if they share at least one base. The overlapping regions found by Bedtools may also be filtered by setting the "minimum overlap percentage" parameter. Finally, the system inputs the number of overlapping regions into the hypergeometric test to calculate an enrichment score.
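The overlap counting and the hypergeometric calculation can be sketched in a few lines. This is a stand-in for illustration only: the source performs the overlap step with Bedtools, whereas the pairwise loop below is a naive substitute, and the function names, the half-open BED-style coordinates, and the upper-tail form of the test are assumptions.

```python
from math import comb
from typing import List, Tuple

# Hypothetical interval layout: (chromosome, start, end), half-open as in BED
Interval = Tuple[str, int, int]


def count_overlapping(query: List[Interval], reference: List[Interval],
                      min_bases: int = 1) -> int:
    """Count query regions sharing at least min_bases bases with any reference region."""
    hits = 0
    for q_chrom, q_start, q_end in query:
        for r_chrom, r_start, r_end in reference:
            shared = min(q_end, r_end) - max(q_start, r_start)
            if q_chrom == r_chrom and shared >= min_bases:
                hits += 1
                break  # count each query region at most once
    return hits


def hypergeom_upper_tail(k: int, M: int, N: int, n: int) -> float:
    """P(X >= k) when n regions are drawn from N background regions, M of them in the reference set."""
    total = comb(N, n)
    upper = min(M, n)
    return sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1)) / total


query = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 0, 50)]
reference = [("chr1", 199, 250), ("chr2", 50, 100)]
k = count_overlapping(query, reference)
p_value = hypergeom_upper_tail(k, M=2, N=4, n=2)
```

Only the first query region shares a base with the reference set here; the chr2 regions merely touch at coordinate 50 and therefore do not count as overlapping under the one-base rule.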
Optionally, the analysis process of the locus overlap analysis includes: filtering the genome region set to obtain a filtered genome region set; comparing the filtered genome region set with the sample data reference set to obtain overlapping regions; and applying Fisher's exact test to the overlapping regions to obtain the enrichment score value.
Specifically, Locus Overlap Analysis (LOLA) uses the LOLA R package to filter the regions input by the user against a universe set, compares the filtered regions with a reference set to obtain overlapping regions, and calculates an enrichment score using Fisher's exact test. LOLA flexibly filters the regions input by the user by introducing a universe file, and then obtains four values a, b, c and d according to the overlap, wherein a represents the number of intersections of the set input by the user, the universe file and the background set; b represents the number of intersections of the set input by the user and the universe or the background set; c represents the number of the sets input by the user or the intersection of the universe and the background set; d represents the number of regions for which the set input by the user, the universe and the background set have no intersection. Then a P value is calculated using Fisher's exact test, with the following formula:
Figure BDA0003734888510000141
where n represents the number of regions in the user input set. When the LOLA method is selected after the genome region set is obtained, the system does not find the overlapping regions in the same way as the hypergeometric test, but uses the LOLA R package instead. The regions input by the user are filtered against a universe set and then compared with the reference set to obtain the overlapping regions. The enrichment score is calculated using Fisher's exact test.
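For orientation, a one-sided Fisher's exact test on a 2x2 table can be written with the standard library alone. The sketch below is not the LOLA R package and does not reproduce the formula image; the arrangement of a, b, c and d as the table [[a, b], [c, d]] and the function name `fisher_exact_greater` are assumptions.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Returns the probability of a top-left count of at least a,
    given fixed row and column totals.
    """
    row1 = a + b
    col1 = a + c
    total = a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    return sum(
        comb(row1, x) * comb(total - row1, col1 - x) for x in range(a, upper + 1)
    ) / denom


p_value = fisher_exact_greater(3, 1, 1, 3)
```

With the table [[3, 1], [1, 3]] the tail sums the two most extreme tables and gives 17/70, roughly 0.243.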
To reduce false positives, the system also corrects the calculated P values for multiple hypothesis testing, for example with the Bonferroni method and the FDR method. The Bonferroni method strictly controls false positives by dividing the significance threshold by the number of tests (equivalently, multiplying each P value by the number of tests). However, this may cause false negatives and leave no enrichment results, so the system also provides FDR correction of the P values. The FDR method corrects the P value by controlling the false discovery rate, that is, the expected value of FP/(TP + FP), and a test result is considered biologically significant if the corrected P value is less than 0.05.
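The two corrections can be sketched as follows. The source names only "the FDR method"; the Benjamini-Hochberg procedure used here is one common way to control the false discovery rate and is an assumption, as are the function names.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(raw)
fdr = benjamini_hochberg(raw)
significant = [p < 0.05 for p in fdr]
```

On these four values Bonferroni leaves only the smallest P value below 0.05, and so does the FDR adjustment in this small case; the two typically diverge as the number of tests grows, which is why the less conservative FDR option is offered.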
In one embodiment, the enrichment analysis system further comprises a region annotation unit comprising genetic annotations for the genomic region and epigenetic annotations for the genomic region; genetic annotations of genomic regions include: common SNP, risk SNP, eQTL;
epigenetic annotations of genomic regions include: transcription factors and transcription cofactors, enhancers and super enhancers, DNA methylation, histone modification, chromatin openness.
In one embodiment, the enrichment analysis system provides 3 query approaches to search a set of genomic regions, specifically:
"search by data type": select the data class and subclass to query;
"search by genome region": select the data class and subclass and input a genome region to query;
"search by gene name": select the data class and subclass and input a gene name to query.
The enrichment analysis system provides several enrichment analysis strategies and can perform enrichment analysis on the obtained region set from different angles, thereby improving the accuracy of the enrichment scores. It also collects and processes 11 different data types, classified according to the characteristics of each data type, provides broad annotation information for each region, and supports data visualization. The system focuses on constructing a comprehensive collection of human genome region sets; it is the software that collects the most human genome region sets so far, and it provides an accurate enrichment analysis method and broad annotation information. It can be used not only to explore the biological significance of genome regions, but also to mine the function of genome regions in transcriptional regulation.
In addition, using the enrichment analysis system, the inventors have also made clinical studies on the following diseases: breast cancer, colon cancer, cardiovascular and cerebrovascular diseases.
In the breast cancer case study, the -2/+1 kb regions around the transcription start sites of breast cancer-associated lncRNAs (log2FC > 1, Padj < 0.05) downloaded from the TCGA project and the circlnneat database were used as input. The "breast cancer" set was found to be significantly enriched, and important regulatory information was found by further studying the promoter regions, enhancer regions and open chromatin regions of the lncRNAs (such as HOTAIR) in the set. In the promoter region of HOTAIR, binding of 25 relevant transcription factors was found by ChIP-seq analysis. These transcription factors were confirmed to be associated with breast cancer, demonstrating the reliability of the method.
In the colon cancer case study, the binding site data of the transcription factor TCF7L2, which is related to colon cancer regulation, were taken as input to investigate how TCF7L2 influences the occurrence and development of colon cancer through distal regulation. The study clearly shows that the TCF7L2 binding sites overlap with multiple super enhancer regions in the colon cancer cell line HCT116. The study also provides the related genes (MYC, etc.), the related transcription factors (CTCF, etc.) and the signaling pathways (TGF_beta_Receptor, etc.) of these super enhancers, all of which indirectly indicate that, in colon cancer, the transcription factor TCF7L2 can influence the transcription of colon cancer oncogenes by regulating super enhancers. Similar conclusions can be drawn when the enhancer regions associated with colon cancer are taken as input.
In the case study of cardiovascular and cerebrovascular diseases, the -2/+1 kb regions around the promoter regions of a batch of genes differentially expressed in heart failure were taken as input, and the binding sites of the transcription factor GATA4 in cardiomyocytes were significantly enriched. GATA4 is a key transcription factor in cardiac gene regulation and responds to hypertrophic agonists.
It is noted that in the enrichment analysis of the present invention, a plurality of oncogenes and transcription factors related to diseases are annotated, which indicates that the enrichment analysis system is important for studying the transcriptional regulation of genomic regions in the development and progression of diseases.
Through case research on breast cancer, colon cancer and cardiovascular and cerebrovascular diseases, the data background set collected and collated in the research and an enrichment analysis method are found to play an important role in researching a genome region in a disease occurrence mechanism.
The validation results of this validation example show that assigning an intrinsic weight to an indication can moderately improve the performance of the method relative to the default settings.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by hardware that is instructed to implement by a program, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
While the invention has been described in detail with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A data processing method for identifying an enhancer and a super enhancer, comprising:
acquiring sample data;
preprocessing the sample data to obtain an enrichment area of the sample data, and position information and signal intensity of a chromatin area corresponding to the enrichment area of the sample data;
respectively extracting the characteristics of the position information and the signal intensity to obtain position information characteristic data and signal intensity characteristic data after characteristic extraction;
stitching the enrichment regions of the adjacent sample data together according to the position information characteristic data of the chromatin region to obtain the stitched enrichment regions of the sample data;
sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity characteristic data of the chromatin region to obtain a sorting result;
and outputting the sorting result.
2. The method of claim 1, wherein the enriched region of the stitched sample data is obtained by:
removing the enrichment region of the sample data within a first threshold from the gene transcription start site to obtain the enrichment region of the stitched sample data of the non-promoter region;
stitching together enriched regions having an interval smaller than a second threshold value based on the stitched enriched regions of the sample data of the non-promoter region to obtain the stitched enriched regions of the sample data; the enrichment region of the stitched sample data is a stitching enhancer;
optionally, the position information feature data includes: a starting position and an ending position of the enrichment region on the genome; the signal strength characteristic data comprises: height of signal peak of the enrichment region.
3. The method of claim 1, wherein the preprocessing comprises sequence alignment, identification of enriched regions;
the sequence alignment process comprises the following steps: acquiring a fastq file containing read; comparing the fastq file to a reference genome by adopting an algorithm to obtain the position information of the read in the reference genome; the position information of the read in the reference genome is stored in a sam file;
the process of identifying an enrichment region comprises: obtaining the sam file, and analyzing the sam file by using an algorithm to obtain an enrichment area of the sample data; the enrichment area of the sample data is stored in the bed file;
optionally, the preprocessing process further includes: format conversion;
the format conversion process comprises: obtaining the sam file, and converting the sam format file into the bam file by using an algorithm;
obtaining the bam file, and analyzing the bam file by using an algorithm to obtain an enrichment area of the sample data; the enrichment area of the sample data is stored in the bed file.
4. The method of claim 1, wherein the step of sorting the enriched regions of the stitched sample data and the enriched regions of the unstitched sample data based on the signal intensity characteristic data of the chromatin region to obtain the sorting result comprises:
normalizing the enrichment region signal of the sample data in the genome region to obtain the background normalized density level of the enrichment region signal of the sample data; and sorting the stitched enriched regions of the sample data and the unstitched enriched regions of the sample data according to the background normalized density level of the enriched region signal of the sample data to obtain the sorting result.
5. The method of data processing for identifying an enhancer and super enhancer as claimed in any one of claims 1 to 4, further comprising: plotting the sorting result to obtain a distribution curve based on the signal intensity characteristic data of the chromatin region;
calculating the curve slope of the distribution curve and outputting the curve slope;
comparing the slope of the curve with a reference slope threshold, and outputting an enrichment area with the slope of the curve being greater than, equal to or less than the reference slope threshold;
optionally, the enrichment region with the curve slope greater than or equal to the reference slope threshold is determined as a super enhancer, and the enrichment region with the curve slope less than the reference slope threshold is determined as a normal enhancer.
6. The method of claim 1, wherein the sample data comprises the sample data after the screening process; the sample data after screening processing comprises a plurality of groups of sample data, and each group of sample data comprises H3K27ac data and corresponding input data;
the method or the steps for acquiring the sample data after the screening processing comprise:
obtaining original sample data; the original sample data comprises: cell tissue type, treatment condition, sample number of the sample;
screening the original sample data to obtain screened sample data; the sample data after screening processing is unique and non-redundant;
optionally, the screening process includes: checking the sample data according to the unique sample number; the verification adopts manual verification.
7. An enrichment analysis method based on genome region annotation, comprising:
acquiring a genome region set;
matching the genome region set with a sample data reference set, and calculating the enrichment score values of the genome region set and the sample data reference set; the sample data reference set comprising an enhancer region and a super enhancer region identified based on the sorting result of claim 5;
obtaining an enrichment significance ranking for the genome region set based on the enrichment score values;
optionally, the sample data reference set further includes the following: chromatin state data based on a hidden Markov model algorithm, transcription factors and transcription cofactors, histone modifications, open chromatin regions, SNP data, methylation data, lncRNA and mRNA;
optionally, the calculating the enrichment score values of the genome region set and the sample data reference set comprises, but is not limited to, the following methods: the hypergeometric test and locus overlap analysis;
optionally, the analysis process of the hypergeometric test includes: identifying the regions of the genome region set that overlap with the regions of the sample data reference set, to obtain the number of overlapping regions; applying the hypergeometric test to the number of overlapping regions to obtain the enrichment score value;
optionally, the overlapping portion of the set of genomic regions and the region corresponding to the reference set of sample data comprises at least one base intersection.
8. A data processing apparatus for identifying an enhancer and a super enhancer, the apparatus comprising: a memory and a processor;
the memory is to store program instructions;
the processor is configured to invoke the program instructions which, when executed, perform the data processing method for identifying an enhancer and a super enhancer of any of claims 1 to 6.
9. A data processing system for identifying an enhancer and a super enhancer, comprising:
an acquisition unit configured to acquire sample data;
a first processing unit, configured to preprocess the sample data to obtain an enrichment region of the sample data, and position information and signal intensity of a chromatin region corresponding to the enrichment region of the sample data; and further configured to extract features from the position information and the signal intensity respectively, to obtain position information characteristic data and signal intensity characteristic data;
a second processing unit, configured to stitch together adjacent enrichment regions of the sample data according to the position information characteristic data of the chromatin region, to obtain stitched enrichment regions of the sample data;
a third processing unit, configured to sort the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity characteristic data of the chromatin region, to obtain a sorting result;
an output unit, configured to output the sorting result.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of data processing for identifying an enhancer and a super enhancer of any of the preceding claims 1-6.
CN202210802841.9A 2022-07-07 2022-07-07 Data processing method and system for identifying enhancer and super enhancer Active CN115083517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210802841.9A CN115083517B (en) 2022-07-07 2022-07-07 Data processing method and system for identifying enhancer and super enhancer

Publications (2)

Publication Number Publication Date
CN115083517A true CN115083517A (en) 2022-09-20
CN115083517B CN115083517B (en) 2023-04-18

Family

ID=83256887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210802841.9A Active CN115083517B (en) 2022-07-07 2022-07-07 Data processing method and system for identifying enhancer and super enhancer

Country Status (1)

Country Link
CN (1) CN115083517B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646192A (en) * 2013-11-14 2014-03-19 漯河医学高等专科学校 Research method for interaction between enhancers in whole genome
CN103853936A (en) * 2013-11-27 2014-06-11 上海丰核信息科技有限公司 Data processing method for chromatin immunoprecipitation high-throughput sequencing
CN106682456A (en) * 2016-12-30 2017-05-17 西安交通大学 Method for exploring complex disease susceptibility genes based on characteristics of genome epigenetic regulation elements
CN110544509A (en) * 2019-08-20 2019-12-06 广州基迪奥生物科技有限公司 single-cell ATAC-seq data analysis method
CN110838341A (en) * 2019-11-05 2020-02-25 广州基迪奥生物科技有限公司 Biological information analysis method of ATAC-seq sequencing data
CN111951896A (en) * 2020-08-20 2020-11-17 杭州瀚因生命科技有限公司 Chromatin accessibility data analysis method based on clinical samples
TW202126805A (en) * 2019-08-26 2021-07-16 臺北榮民總醫院 A super enhancer for driving pluripotency network and stemness circuitry
CN113801881A (en) * 2021-08-27 2021-12-17 浙江大学 Use of super enhancer gene sequence in promoting human B2M gene expression

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
JIAXIN CHEN et al.: "LncSEA: a platform for long non-coding RNA related sets and enrichment analysis" *
RONGPU JIA et al.: "Super Enhancer Profiles Identify Key Cell Identity Genes During Differentiation From Embryonic Stem Cells to Trophoblast Stem Cells Super Enhencers in Trophoblast Differentiation" *
YUEXIN ZHANG et al.: "TcoFBase: a comprehensive database for decoding the regulatory transcription co-factors in human and mouse" *
崔爽: "Identification of the EphA2 super enhancer in tumors and study of its function and mechanism" *
张思佳: "Study on sequence identification and transcriptional regulation of conserved and specific human enhancers" *
张茵 et al.: "Characterization and functional analysis of p53-regulated enhancers in breast cancer" *

Also Published As

Publication number Publication date
CN115083517B (en) 2023-04-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant