CN115083517A

CN115083517A - Data processing method and system for identifying enhancer and super enhancer

Info

Publication number: CN115083517A
Application number: CN202210802841.9A
Authority: CN
Inventors: 李春权; 王秋毓; 宋超; 钱凤翠; 尚德思; 刘佳琦; 杨用三
Original assignee: First Affiliated Hospital of University of South China
Current assignee: First Affiliated Hospital of University of South China
Priority date: 2022-07-07
Filing date: 2022-07-07
Publication date: 2022-09-20
Anticipated expiration: 2042-07-07
Also published as: CN115083517B

Abstract

The invention discloses a data processing method, system, device and computer-readable storage medium for identifying enhancers and super enhancers. The method comprises the following steps: acquiring sample data; preprocessing the sample data to obtain the enrichment regions of the sample data, together with the position information and signal intensity of the chromatin regions corresponding to those enrichment regions; extracting the characteristics of the position information and of the signal intensity, respectively, to obtain position information characteristic data and signal intensity characteristic data; stitching adjacent enrichment regions of the sample data together according to the position information characteristic data of the chromatin regions to obtain the stitched enrichment regions of the sample data; sorting the stitched enrichment regions and the unstitched enrichment regions of the sample data based on the signal intensity characteristic data of the chromatin regions to obtain a sorting result; and outputting the sorting result.

Description

Data processing method and system for identifying enhancer and super enhancer

Technical Field

The invention relates to the technical field of bioinformatics and genomics, and in particular to a data processing method and a data processing system for identifying enhancers and super enhancers.

Background

In the past few years, there has been great interest in understanding the regulatory role of genomic regions and research into regulatory DNA has made new and important advances. With the rapid development of genome high-throughput sequencing technology, functional genome region data is rapidly increasing. How to understand these unknown genomic regions is currently an urgent task. In human biological processes, functional genomic regions play a role in the development of disease, primarily by affecting gene expression levels. To exploit the function of genomic regions in transcriptional regulation, it is necessary to integrate existing, extensive genetic and epigenetic information.

Functional enrichment analysis of gene sets is a well-established bioinformatics method. It compares a newly identified gene set, one by one, against gene sets of known function in order to infer the functions of the new genes. This approach has a key limitation: the data must be gene-centered. Taking gene set enrichment analysis as a reference, researchers proposed enrichment analysis of genome regions: existing genome region data and newly identified region data are compared to find the overlapping regions between the two sets, and an enrichment significance score is then calculated using a statistical method. Through enrichment analysis of genome regions, researchers can better explore the biological functions of those regions. Human genetics and epigenetics research is rapidly accumulating diverse data sets such as ChIP-seq and ATAC-seq, and organizing and using these data sets effectively is very important for the study of genome regions. Meanwhile, many published studies have shown that functional genomic regions such as enhancers and super enhancers play an irreplaceable role in human biological processes.

The identification of enhancers and their strength is a hotspot of biological research and has attracted a large number of researchers. Previously, researchers had no choice but to address this problem experimentally, for example by chromatin immunoprecipitation, deep sequencing, DNase I hypersensitivity assays and genome-wide mapping of histone modifications. However, these experimental methods are expensive, time-consuming and inefficient, so computational methods are needed to identify enhancers and their strength. Some research has already been done in this direction. For example, in 2016, Liu et al. established a two-layer predictor that can recognize not only enhancers but also their strength; Jia et al. established an identifier that discovers enhancers by combining and selecting various features; two years later, Liu et al. proposed a model based on ensemble learning to identify enhancers and their strength; and in 2019, Nguyen et al. proposed using an ensemble of convolutional neural networks to identify enhancers and their strength. However, the overall recognition accuracy is not very high, and the prior art also lacks a method for directly and intelligently distinguishing common enhancers from super enhancers. A new computational method is therefore needed that identifies enhancers and their strength and effectively distinguishes common enhancers from super enhancers. In addition, the identified result should be directly usable as one of the data types in genome region enrichment analysis, so that researchers can better explore the biological functions of genome regions.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art. To this end, the invention provides a data processing method for identifying enhancers and super enhancers, which can quickly and accurately identify enhancers and their strength and can effectively distinguish common enhancer regions from super enhancer regions. In addition, the processing result obtained by the method can be applied to genome region annotation enrichment analysis, which ensures the accuracy, portability and systematic nature of the enrichment analysis results, helps uncover the biological principles hidden in biological data, and helps solve related life science problems.

The application discloses a data processing method for identifying an enhancer and a super enhancer, which comprises the following steps:

acquiring sample data;

preprocessing the sample data to obtain an enrichment area of the sample data, and position information and signal intensity of a chromatin area corresponding to the enrichment area of the sample data;

respectively extracting the characteristics of the position information and the signal intensity to obtain position information characteristic data and signal intensity characteristic data after characteristic extraction;

stitching the enrichment regions of the adjacent sample data together according to the position information characteristic data of the chromatin region to obtain the stitched enrichment regions of the sample data;

sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity characteristic data of the chromatin regions to obtain a sorting result;

and outputting the sorting result.

The stitched enriched region of the sample data is obtained by the following method:

removing the enrichment regions of the sample data that lie within a first threshold of a gene transcription start site to obtain the enrichment regions of the sample data in non-promoter regions;

stitching together the enrichment regions in non-promoter regions whose spacing is smaller than a second threshold to obtain the stitched enrichment regions of the sample data; a stitched enrichment region of the sample data is a stitched enhancer;

optionally, the position information feature data includes: a starting position and an ending position of the enrichment region on the genome; the signal strength characteristic data comprises: height of signal peak of the enrichment region.

The preprocessing process comprises sequence alignment and enrichment region identification;

the sequence alignment process comprises the following steps: acquiring a fastq file containing reads; aligning the fastq file to a reference genome by using an algorithm to obtain the position information of the reads in the reference genome; the position information of the reads in the reference genome is stored in a sam file;

the process of identifying an enrichment region comprises: obtaining the sam file, and analyzing the sam file by using an algorithm to obtain the enrichment regions of the sample data; the enrichment regions of the sample data are stored in a bed file;

optionally, the preprocessing process further includes: format conversion;

the format conversion process includes: acquiring the sam file, and converting the sam-format file into the bam file by using an algorithm;

acquiring the bam file, and analyzing the bam file by using an algorithm to obtain an enrichment area of the sample data; the enrichment area of the sample data is stored in the bed file.

The process of sorting the stitched enriched regions of the sample data and the unstitched enriched regions of the sample data based on the signal intensity characteristic data of the chromatin regions to obtain a sorting result includes:

normalizing the enrichment region signal of the sample data in the genome region to obtain the background normalized density level of the enrichment region signal of the sample data; and sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data according to the background normalized density level of the enrichment region signal of the sample data to obtain a sorting result.

The data processing method further comprises: plotting the sorting result to obtain a distribution curve based on the signal intensity characteristic data of the chromatin regions;

calculating the curve slope of the distribution curve and outputting the curve slope;

comparing the curve slope with a reference slope threshold, and outputting the enrichment regions whose curve slope is greater than, equal to, or less than the reference slope threshold;

optionally, the enrichment region with the curve slope greater than or equal to the reference slope threshold is determined as a super enhancer, and the enrichment region with the curve slope less than the reference slope threshold is determined as a normal enhancer.

The sample data comprises the sample data after screening processing; the sample data after screening processing comprises a plurality of groups of sample data, and each group of sample data comprises H3K27ac data and corresponding input data;

the method or the steps for acquiring the sample data after the screening processing comprise:

obtaining original sample data; the original sample data comprises: cell tissue type, treatment condition, sample number of the sample;

screening the original sample data to obtain the screened sample data; the sample data after screening processing is unique and non-redundant;

optionally, the screening process includes: checking the sample data according to the unique sample number; the verification adopts manual verification.

A data processing system for identifying an enhancer and a super enhancer, comprising:

an acquisition unit configured to acquire sample data;

the first processing unit is used for preprocessing the sample data to obtain the enrichment regions of the sample data, and the position information and signal intensity of the chromatin regions corresponding to the enrichment regions of the sample data; and is further used for extracting the characteristics of the position information and of the signal intensity, respectively, to obtain position information characteristic data and signal intensity characteristic data;

the second processing unit is used for stitching the enrichment regions adjacent to the sample data together according to the position information characteristic data of the chromatin region to obtain the stitched enrichment regions of the sample data;

the third processing unit is used for sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity characteristic data of the chromatin regions to obtain a sorting result;

and the output unit is used for outputting the sorting result.

The application has the following beneficial effects:

1. The application discloses a data processing method for identifying enhancers and super enhancers which starts from raw data, obtains enrichment regions through steps such as preprocessing, and identifies enhancers and their strength; by sorting and plotting the enrichment regions, it effectively distinguishes common enhancer regions from super enhancer regions, thereby greatly improving the precision and depth of data analysis.

2. The processing result of the data processing method for identifying enhancers and super enhancers is applied to genome region annotation enrichment analysis as one of its data sets, which ensures the accuracy, portability and systematic nature of the enrichment analysis results and makes full use of the sample data; the biological principles hidden in biological data can thus be mined in depth and applied to genome enrichment analysis research on various diseases, giving the method great application value.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart illustrating an analysis of a data processing method for identifying an enhancer and a super enhancer according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a data processing apparatus for identifying an enhancer and a super enhancer according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a data processing system for identifying an enhancer and super enhancer provided by an embodiment of the present invention;

FIG. 4 is a schematic flow chart of the analysis of the enrichment analysis method based on the annotation of genome region provided by the embodiment of the invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.

Some of the flows described in the specification, the claims and the above figures include a number of operations that appear in a particular order. It should be clearly understood that these operations may be performed out of the order in which they appear herein, or in parallel; operation numbers such as 101 and 102 merely distinguish the operations from one another and do not by themselves represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that the terms "first", "second", etc. herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequential order, nor do they require that the "first" and "second" items be of different types.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

Fig. 1 is a schematic flow chart of a data processing method for identifying an enhancer and a super enhancer according to an embodiment of the present invention, specifically, the method includes the following steps:

101: acquiring sample data;

in one embodiment, the sample data comprises sample data after screening processing; the sample data after screening processing comprises a plurality of groups of sample data, and each group of sample data comprises H3K27ac data and corresponding input data. H3K27ac is a histone modification that generally promotes gene expression and is considered a marker of gene activation. In a ChIP experiment, the input is total DNA that has not been enriched by an antibody and serves as the background reference in the analysis. During the experiment, the input must be set aside before antibody enrichment; the input and the enriched IP sample are then de-crosslinked and purified for library construction. The input can also be used to verify the fragmentation effect of the IP sample and the background of the final analysis.

In one embodiment, the method or step for acquiring the sample data after the screening process includes:

obtaining original sample data; the 542 publicly available human samples used here were processed and collected from the H3K27ac ChIP-seq data of NCBI GEO/SRA, ENCODE, Roadmap and GGR; the keywords "H3K27ac" and "ChIP-seq" were entered in the public databases to search for samples, yielding information on more than 2,000 original samples, including the cell or tissue type, treatment condition and sample number of each sample;

screening the original sample data to obtain screened sample data; the sample data after screening processing is unique and non-redundant; the screening process comprises: checking the sample data according to the unique sample number; the verification adopts manual verification. Finally, 542 publicly available human samples were collected from these four public data sources.

Bioinformatics combines biology with mathematics and computer science. It uses methods and tools from mathematics, information science and other fields to acquire, process, store, analyze and interpret biological information, thereby clarifying the biological significance contained in large amounts of biological data; its research focuses mainly on genomics and proteomics.

102: preprocessing the sample data to obtain an enrichment area of the sample data, and position information and signal intensity of a chromatin area corresponding to the enrichment area of the sample data; respectively extracting the characteristics of the position information and the signal intensity to obtain position information characteristic data and signal intensity characteristic data after characteristic extraction;

in one embodiment, the preprocessing process comprises sequence alignment and identification of enrichment regions;

the sequence alignment process comprises the following steps: acquiring a fastq file containing reads; aligning the fastq file to a reference genome by using an algorithm to obtain the position information of the reads in the reference genome; the position information of the reads in the reference genome is stored in a sam file; the reference genome is the hg19 reference genome downloaded from UCSC; the software used for the sequence alignment process includes, but is not limited to, Bowtie, and the specific code is as follows:

Single-ended: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome H3K27ac.fastq H3K27ac.sam

Single-ended: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome input.fastq input.sam

Double-ended: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome -1 H3K27ac_1.fastq -2 H3K27ac_2.fastq H3K27ac.sam

Double-ended: bowtie -e 70 -k 2 -n 2 -m 2 -S -q genome -1 input_1.fastq -2 input_2.fastq input.sam

where -n 2 means that at most 2 mismatched bases are allowed in the high-fidelity (seed) region; -e 70 means that the sum of the Phred quality values at the mismatched positions cannot exceed 70.

The process of identifying an enrichment region comprises: obtaining the sam file, and analyzing the sam file by using an algorithm to obtain the enrichment regions of the sample data; the enrichment regions of the sample data are stored in a bed file; the software employed for the process of identifying enrichment regions includes, but is not limited to, MACS; the regions enriched in H3K27ac are found using MACS14, which takes a bam file as input and outputs a bed file of the enrichment regions. The specific code is as follows:

macs14 -p 1e-9 -w -S --keep-dup=auto --wig --single-profile --space=50 -c input.sort.bam -t H3K27ac.sort.bam -g hs -n macs

where --keep-dup controls how duplicate reads are retained. With the auto option, MACS uses a binomial distribution to estimate the maximum number of duplicate reads to keep at each position (the default value is 1, i.e., one read is kept at each position).

optionally, the preprocessing process further includes: format conversion;

the format conversion process includes: acquiring the sam file, and converting the sam-format file into a bam file by using an algorithm; the software employed by the format conversion process includes, but is not limited to, SAMtools; the sam file is converted into a binary bam file, sorted and indexed by using samtools, and the specific code is as follows:

samtools view -b -S H3K27ac.sam > H3K27ac.bam

samtools sort H3K27ac.bam > H3K27ac.sort.bam

samtools index H3K27ac.sort.bam H3K27ac.sort.bam.bai

samtools view -b -S input.sam > input.bam

samtools sort input.bam > input.sort.bam

samtools index input.sort.bam input.sort.bam.bai

obtaining the bam file, and analyzing the bam file by using an algorithm to obtain an enrichment area of the sample data; the enrichment area of the sample data is stored in the bed file. This step employs MACS software.
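As a non-limiting illustration of the feature extraction in step 102, the following Python sketch reads enrichment regions from bed-format lines and keeps the start position, end position and a signal value for each region. The five-column layout, the use of the fifth column as a stand-in for peak height, and the names Peak and parse_bed are assumptions made for this example only; they are not taken from the MACS output specification or from the described method.

```python
from typing import Iterable, List, NamedTuple


class Peak(NamedTuple):
    chrom: str
    start: int     # start position on the genome (position feature)
    end: int       # end position on the genome (position feature)
    signal: float  # assumed proxy for the height of the signal peak


def parse_bed(lines: Iterable[str]) -> List[Peak]:
    """Parse tab-separated bed lines into Peak records (hypothetical helper)."""
    peaks = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and common bed header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        # Assumption: column 5, when present, carries the signal value.
        signal = float(fields[4]) if len(fields) > 4 else 0.0
        peaks.append(Peak(chrom, start, end, signal))
    return peaks


example_lines = [
    "track name=example",
    "chr1\t1000\t1500\tpeak_1\t120.5",
    "chr1\t9000\t9400\tpeak_2\t87.0",
]
peaks = parse_bed(example_lines)
```

In practice the lines would be read from the bed file produced in the previous step rather than from an in-memory list.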

103: stitching the enrichment regions of the adjacent sample data together according to the position information characteristic data of the chromatin region to obtain the stitched enrichment regions of the sample data;

in one embodiment, super enhancers are identified by stitching enhancers with ROSE; the stitched enrichment regions of the sample data are obtained by the following method:

removing the enrichment regions of the sample data that lie within a first threshold of a gene transcription start site (TSS) to obtain the enrichment regions of the sample data in non-promoter regions; the first threshold is ±2 kb, so that all regions used in the calculation are non-promoter regions;

stitching together the enrichment regions in non-promoter regions whose spacing is smaller than a second threshold to obtain the stitched enrichment regions of the sample data; a stitched enrichment region of the sample data is a stitched enhancer; the second threshold is 12,500 bp; a stitched enhancer is defined as a single entity spanning a genomic region and is composed of multiple enhancer elements.

The software used for identifying common enhancers and super enhancers includes, but is not limited to, ROSE; ROSE identifies super enhancers and common enhancers based on the H3K27ac enrichment regions found by MACS. The option -s 12500 stitches together enrichment regions whose spacing is less than 12,500 bp; the option -t 2000 sets the size of the TSS exclusion zone, excluding the region 2,000 bp upstream and downstream of each TSS in order to remove promoter bias. The output is a txt-format file containing information such as the positions of the super enhancers and common enhancers, their ranks in the sample, and the number of constituent elements. The specific code is as follows:

python ROSE_main.py -g HG19 -i macs_peaks.gff -c input.sort.bam -r H3K27ac.sort.bam -o name -s 12500 -t 2000
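The two operations of step 103, promoter exclusion and stitching, can be sketched in Python as follows. This is a simplified illustration and not the ROSE source code: the function names are hypothetical, regions are plain (chrom, start, end) tuples, and a region is treated as promoter-proximal whenever it overlaps the ±2,000 bp window around a TSS, which is an assumption (an implementation might instead require the region to be fully contained in that window).

```python
from typing import List, Tuple

Region = Tuple[str, int, int]          # (chrom, start, end)
Stitched = Tuple[str, int, int, int]   # (chrom, start, end, number of elements)


def drop_promoter_regions(regions: List[Region],
                          tss: List[Tuple[str, int]],
                          window: int = 2000) -> List[Region]:
    """Keep only regions that do not overlap [TSS - window, TSS + window]."""
    kept = []
    for chrom, start, end in regions:
        near_tss = any(
            c == chrom and start <= pos + window and end >= pos - window
            for c, pos in tss
        )
        if not near_tss:
            kept.append((chrom, start, end))
    return kept


def stitch(regions: List[Region], max_gap: int = 12500) -> List[Stitched]:
    """Merge regions on the same chromosome whose gap is below max_gap."""
    stitched: List[Stitched] = []
    for chrom, start, end in sorted(regions):
        if (stitched and stitched[-1][0] == chrom
                and start - stitched[-1][2] < max_gap):
            c, s, e, n = stitched[-1]
            stitched[-1] = (c, s, max(e, end), n + 1)
        else:
            stitched.append((chrom, start, end, 1))
    return stitched


regions = [
    ("chr1", 10000, 11000),
    ("chr1", 15000, 16000),
    ("chr1", 50000, 51000),
    ("chr1", 99500, 100500),   # overlaps the TSS window below
    ("chr2", 5000, 6000),
]
tss_sites = [("chr1", 100000)]

non_promoter = drop_promoter_regions(regions, tss_sites)
stitched_regions = stitch(non_promoter)
```

Here the first two chr1 regions are 4,000 bp apart and are merged into one stitched enhancer with two elements, while regions that remain alone keep an element count of 1 and correspond to the unstitched enrichment regions.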

104: sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity characteristic data of the chromatin regions to obtain a sorting result;

in one embodiment, the sorting of the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data based on the signal intensity characteristic data of the chromatin regions to obtain a sorting result includes:

normalizing the enrichment region signal of the sample data in the genome region to obtain the background normalized density level of the enrichment region signal of the sample data; and sorting the stitched enrichment regions of the sample data and the unstitched enrichment regions of the sample data according to the background normalized density level of the enrichment region signal of the sample data to obtain a sorting result. The stitched enhancers and the remaining single enhancers are ranked by signal from largest to smallest.
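A minimal Python sketch of this ranking is given below. It assumes that background normalization means subtracting the input density from the H3K27ac density and flooring the result at zero; that formula, the dictionary keys and the function name rank_by_signal are illustrative assumptions rather than the exact computation of the described method.

```python
from typing import Dict, List


def rank_by_signal(regions: List[Dict]) -> List[Dict]:
    """Rank regions by background-normalized signal, largest first.

    Each region is a dict with 'name', 'chip' (H3K27ac density) and
    'input' (input control density); these keys are hypothetical.
    """
    scored = []
    for region in regions:
        # Assumed normalization: ChIP density minus input density, >= 0.
        normalized = max(region["chip"] - region["input"], 0.0)
        scored.append({**region, "signal": normalized})
    scored.sort(key=lambda r: r["signal"], reverse=True)
    for rank, region in enumerate(scored, start=1):
        region["rank"] = rank
    return scored


example = [
    {"name": "A", "chip": 5.0, "input": 1.0},
    {"name": "B", "chip": 40.0, "input": 2.0},
    {"name": "C", "chip": 1.0, "input": 3.0},
]
ranked = rank_by_signal(example)
```

Stitched and unstitched regions are passed through the same function, so both kinds of region end up in a single ranked list.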

comparing the slope of the distribution curve plotted from the sorting result with a reference slope threshold, and outputting the enrichment regions whose curve slope is greater than, equal to, or less than the reference slope threshold;
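The slope comparison can be illustrated with the following Python sketch. It assumes that the regions are sorted by signal in ascending order, that rank is scaled to the interval from 0 to 1 and signal is divided by the maximum signal, and that every region at or beyond the first point whose local slope reaches the reference threshold is labeled a super enhancer. The threshold value of 1.0, the scaling and the function name are assumptions made for this example.

```python
from typing import List


def classify_by_slope(signals: List[float],
                      slope_threshold: float = 1.0) -> List[str]:
    """Label each region 'super-enhancer' or 'enhancer' (illustrative only)."""
    n = len(signals)
    labels = ["enhancer"] * n
    # Indices of the regions ordered from weakest to strongest signal.
    order = sorted(range(n), key=lambda i: signals[i])
    top = signals[order[-1]] if n else 0.0
    if n < 2 or top <= 0:
        return labels

    dx = 1.0 / (n - 1)  # rank step after scaling ranks to [0, 1]
    cutoff = None
    for k in range(1, n):
        dy = (signals[order[k]] - signals[order[k - 1]]) / top
        if dy / dx >= slope_threshold:
            cutoff = k
            break

    if cutoff is not None:
        for k in range(cutoff, n):
            labels[order[k]] = "super-enhancer"
    return labels


signals = [50.0, 1.0, 100.0, 3.0, 2.0, 4.0]
labels = classify_by_slope(signals)
```

With these example values the curve is nearly flat for the four weakest regions and rises steeply at the fifth, so only the two strongest regions are labeled as super enhancers.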

105: outputting the sorting result.

FIG. 2 shows a data processing device for identifying an enhancer and a super enhancer provided by an embodiment of the invention, which comprises: a memory and a processor;

the memory is to store program instructions;

the processor is configured to call program instructions, and when the program instructions are executed, the processor is configured to perform the above-mentioned data processing method for identifying the enhancer and the super enhancer.

FIG. 3 shows a data processing system for identifying an enhancer and a super enhancer provided by an embodiment of the invention, which comprises:

an obtaining unit 301, configured to obtain sample data;

a first processing unit 302, configured to preprocess the sample data to obtain the enrichment regions of the sample data, and the position information and signal intensity of the chromatin regions corresponding to the enrichment regions of the sample data; and further configured to extract the characteristics of the position information and of the signal intensity, respectively, to obtain position information characteristic data and signal intensity characteristic data;

the second processing unit 303 is configured to stitch together the enrichment regions of adjacent sample data according to the position information characteristic data of the chromatin region to obtain a stitched enrichment region of the sample data;

a third processing unit 304, configured to sort, based on the signal intensity characteristic data of the chromatin region, the enriched regions of the stitched sample data and the enriched regions of the unstitched sample data to obtain a sorting result;

an output unit 305, configured to output the sorting result.

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the above-mentioned data processing method for identifying an enhancer and a super enhancer.

FIG. 4 is a schematic flow chart of the enrichment analysis method based on genome region annotation provided by an embodiment of the invention;

an enrichment analysis method based on genome region annotation, comprising:

acquiring a genome region set;

matching the genome region set with a sample data reference set, and calculating enrichment score values of the genome region set and the sample data reference set; the sample data reference set comprises an enhancer region and a super enhancer region which are identified based on the sample data reference set;

obtaining an enrichment significance ranking of the genome region set based on the enrichment score values.

An enrichment analysis system based on genome region annotation, comprising:

a collection unit for obtaining a set of genomic regions;

the analysis unit is used for matching the genome region set with a sample data reference set and calculating the enrichment score values of the genome region set and the sample data reference set; the sample data reference set comprises an enhancer region and a super enhancer region which are identified based on the sample data reference set;

and the ranking unit is used for obtaining an enrichment significance ranking of the genome region set based on the enrichment score values.

In one embodiment, the sample data reference set further comprises the following: chromatin state data based on a hidden Markov model algorithm, transcription factors and transcription cofactors, histone modifications, open chromatin regions, SNP data, methylation data, lncRNA and mRNA;

the chromatin state data based on the hidden Markov model algorithm are acquired as follows: 15 core chromatin states, including enhancers, promoters, insulators and heterochromatin, calculated with ChromHMM from five chromatin marks (H3K4me3, H3K4me1, H3K36me3, H3K27me3 and H3K9me3), were collected from the Roadmap database.

The transcription factors and transcription cofactors are acquired as follows: transcription factor and transcription cofactor ChIP-seq data were downloaded from Cistrome; the transcription factor binding region information covers over 6,000 samples, including 57 tissue types and 2,528 transcription factors; the transcription cofactor binding region information covers 3,000 samples, including 41 tissue types and 973 transcription cofactors.

The histone modification data are acquired as follows: histone modification ChIP-seq data were downloaded from ENCODE to provide researchers with a user-friendly analysis. These ChIP-seq data contain more than 1,400 samples covering 33 histone modifications, such as H3K27ac, H3K27me3 and H3K4me1.

The open chromatin regions are acquired as follows: 1,493 ATAC-seq data sets covering a variety of cell/tissue types were downloaded from NCBI GEO/SRA. A unified pipeline of Bowtie2, SAMtools and MACS2 was used to identify open chromatin regions, and the ATACdb database was developed.

The SNP data are acquired as follows: human eQTL data sets are downloaded from PancanQTL and merged. PancanQTL data include eQTL-gene pairs for different cancers in TCGA. GWAS provide a large body of data linking genetic variation to common phenotypes. The present invention collects risk SNPs from the NHGRI GWAS Catalog and then filters out the risk SNPs whose "Variant ID" is not an "rsID". Finally, 1,515,001 risk SNPs associated with diseases, traits and phenotypes were obtained. Considering the influence of SNPs on gene expression, the invention extends the SNP sites by 10kb/1kb, 15kb/1kb and 20kb/1kb, respectively; the extension ranges are not limited to these values.

The methylation data are acquired as follows: 198,468,712 methylation sites were obtained from the 450K array data of ENCODE. The present invention classifies these sites as hypermethylated or hypomethylated based on the beta value. Sites with a beta value greater than 0.6 are considered hypermethylated, while sites with a beta value greater than 0.2 and less than 0.6 are considered hypomethylated. Finally, the methylation sites are extended by 10kb/1kb, 15kb/1kb and 20kb/1kb, respectively; the extension ranges are not limited to these values.
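The beta-value rule stated above can be written as a short Python sketch. The function name and the "unclassified" label for values outside both ranges are assumptions added for the example; the two thresholds are those given in the text.

```python
def classify_methylation(beta: float) -> str:
    """Classify a methylation site by its beta value (illustrative helper)."""
    if beta > 0.6:
        return "hypermethylated"
    if 0.2 < beta < 0.6:
        return "hypomethylated"
    # Values at or below 0.2, or exactly 0.6, fall outside both stated ranges.
    return "unclassified"


site_betas = {"site_1": 0.85, "site_2": 0.40, "site_3": 0.10}
site_classes = {name: classify_methylation(b) for name, b in site_betas.items()}
```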

The lncRNA and mRNA data are obtained as follows: multiple classes of lncRNA sets, including disease, drug, subcellular localization, cancer markers, smif, exosomes and cell markers, were collected from the LncSEA database. In addition, the present invention collects human cell marker information for mRNAs from the CellMarker database. After the annotation file was downloaded from GENCODE, the transcription start sites of these mRNAs were extended to 2kb/1kb, 5kb/1kb and 10kb/1kb regions, respectively, and used as mRNA reference set subclasses (Cell_Marker_2kb, Cell_Marker_5kb and Cell_Marker_10kb). In addition, information on the composition of GO terms containing protein-coding genes was collected. Similarly, the transcription start sites of the genes in the GO terms were extended to 2kb/1kb, 5kb/1kb and 10kb/1kb regions, respectively, and used as mRNA reference set subclasses (Goterm_2kb, Goterm_5kb and Goterm_10kb). The extension ranges mentioned above are not limited to these values.
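The extension of a transcription start site into a region can be sketched as below. The sketch assumes that "2kb/1kb" means 2 kb upstream and 1 kb downstream of the transcription start site relative to the gene's strand; the text does not spell this out. The function name `tss_window` and the 0-based, half-open coordinate convention are also assumptions made for illustration.

```python
def tss_window(chrom, tss, strand, upstream, downstream):
    """Return a (chrom, start, end) window around a transcription start site.

    Assumption: `upstream` and `downstream` are measured relative to the
    gene's strand, so the window is mirrored for genes on the "-" strand.
    The start coordinate is clamped at 0.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return chrom, max(0, start), end


# The three window sizes named in the text, for a hypothetical TSS.
for up in (2000, 5000, 10000):
    print(tss_window("chr1", 50000, "+", up, 1000))
```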

In one embodiment, calculating the enrichment score values of the genomic region set and the sample data reference set includes, but is not limited to, the following methods: the hypergeometric test method and locus overlap analysis (LOLA).

The analysis process of the hypergeometric test method comprises: identifying the regions in which the genomic region set overlaps the regions of the sample data reference set to obtain the number of overlapping regions; and applying the hypergeometric test to the number of overlapping regions to obtain the enrichment score value. Two regions are regarded as overlapping if they share at least one base.

Specifically, the hypergeometric test method first finds the overlapping regions of the two sets using the Bedtools software, and then inputs the number of overlapping regions into the hypergeometric test to calculate the enrichment score P. The enrichment significance P value is calculated as:

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$

wherein M represents the number of reference set regions; N represents the number of background set regions, which consist of DNaseI signal regions; n represents the number of user input regions; and k represents the number of user input regions that overlap the reference set.
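A hypergeometric enrichment P value can be computed from the four counts defined above. The sketch below assumes the usual upper-tail form, that is, the probability of observing at least k overlapping regions by chance; it writes the background size as `N`, and the function name is hypothetical. It uses only the Python standard library and is not necessarily the exact computation performed by the system.

```python
from math import comb


def hypergeom_enrichment_p(k, M, n, N):
    """Upper-tail hypergeometric P value, P(X >= k).

    M: number of reference set regions
    N: number of background regions (assumed to contain the reference set)
    n: number of user input regions
    k: number of user input regions overlapping the reference set
    """
    if not (0 <= M <= N and 0 <= n <= N):
        raise ValueError("require 0 <= M <= N and 0 <= n <= N")
    largest_overlap = min(M, n)
    # Sum the probabilities of every overlap count from k up to the maximum.
    tail = sum(
        comb(M, i) * comb(N - M, n - i)
        for i in range(max(k, 0), largest_overlap + 1)
    )
    return tail / comb(N, n)


print(hypergeom_enrichment_p(k=2, M=4, n=3, N=10))
```

Summing the upper tail directly avoids the loss of precision that subtracting a value close to 1 can cause for very small P values.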

After the genomic region set is obtained and the hypergeometric test method is selected to calculate the enrichment score, the system generates a BED file for the genomic region set and stores it on the server, and then uses the Bedtools software to find the overlapping regions of the two sets. In the present invention, two regions are regarded as overlapping if they share at least one base. The overlapping regions found by Bedtools may also be filtered by setting the "minimum overlap percentage" parameter. Finally, the system inputs the number of overlapping regions into the hypergeometric test to calculate the enrichment score.
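The overlap rule described above can be illustrated without Bedtools. The following pure-Python sketch counts the user regions that overlap at least one reference region, either by a single base or by a minimum fraction of the user region's length. Measuring the "minimum overlap percentage" relative to the user region is an assumption of this sketch, as are the function names and the 0-based, half-open coordinates; the quadratic scan is for clarity only.

```python
def overlap_bases(a, b):
    """Number of shared bases between two (chrom, start, end) regions."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def count_overlapping(user_regions, reference_regions, min_fraction=0.0):
    """Count user regions that overlap at least one reference region.

    With min_fraction=0.0 a single shared base is enough; otherwise the
    overlap must cover at least that fraction of the user region.
    """
    count = 0
    for u in user_regions:
        needed = max(1, min_fraction * (u[2] - u[1]))
        if any(overlap_bases(u, r) >= needed for r in reference_regions):
            count += 1
    return count


user = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
ref = [("chr1", 199, 250), ("chr1", 400, 500), ("chr2", 0, 150)]
print(count_overlapping(user, ref))
print(count_overlapping(user, ref, min_fraction=0.5))
```

The resulting count plays the role of the number of overlapping regions that is passed to the hypergeometric test.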

Optionally, the analysis process of the locus overlap analysis includes: filtering the genomic region set to obtain a filtered genomic region set; comparing the filtered genomic region set with the sample data reference set to obtain the overlapping regions; and applying Fisher's exact test to the overlapping regions to obtain the enrichment score value.

Specifically, the locus overlap analysis (LOLA) uses the LOLA R package: the regions input by the user are filtered by introducing a universe set, the filtered regions are compared with a reference set to obtain the overlapping regions, and the enrichment score is calculated using Fisher's exact test. LOLA flexibly filters the regions input by the user by introducing a universe file, and then obtains four values a, b, c and d according to the overlap, wherein a represents the number of regions in the intersection of the user input set, the universe file and the background set; b represents the number of regions in the intersection of the user input set with either the universe or the background set; c represents the number of regions in the user input set or in the intersection of the universe and the background set; and d represents the number of regions for which the user input set, the universe and the background set have no direct intersection. The P value is then calculated using Fisher's exact test, and the specific formula is as follows:

where n represents the number of regions in the user input set. When the LOLA method is selected after the genomic region set is obtained, the system finds the overlapping regions with the LOLA R package rather than with Bedtools, as in the hypergeometric test method. The regions input by the user are filtered by introducing a universe set, and the filtered regions are then compared with a reference set to obtain the overlapping regions. The enrichment score is calculated using Fisher's exact test.
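For illustration, a one-sided Fisher's exact test on four counts can be sketched as follows. The sketch assumes that a, b, c and d fill a 2x2 contingency table [[a, b], [c, d]] with a as the overlap cell, and that enrichment corresponds to the upper tail; both points are assumptions, not statements taken from the text. The table total is named `grand_total` to keep it distinct from the n defined above, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns the probability, with all row and column totals fixed, of a
    top-left cell at least as large as the observed value a.
    """
    row1 = a + b
    col1 = a + c
    grand_total = a + b + c + d
    largest_a = min(row1, col1)
    # Hypergeometric probability of each table with top-left cell x >= a.
    tail = sum(
        comb(row1, x) * comb(grand_total - row1, col1 - x)
        for x in range(a, largest_a + 1)
    )
    return tail / comb(grand_total, col1)


print(fisher_exact_greater(3, 1, 1, 3))
```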

To reduce false positives, the system also applies multiple hypothesis testing correction to the calculated P values, using methods such as the Bonferroni method and the FDR method. The Bonferroni method strictly limits false positive results by dividing the significance threshold by the number of tests, which is equivalent to multiplying each P value by the number of tests. However, this may cause false negatives and leave no enrichment results, so the system also provides FDR correction for the P values. The FDR method corrects the P values by controlling the false discovery rate, that is, the expected value of FP/(TP + FP); a test result is considered biologically significant if its corrected P value is less than 0.05.
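Both corrections can be sketched in a few lines. The text only names "the FDR method"; the Benjamini-Hochberg procedure used below is an assumption, chosen because it is the most common way of controlling the false discovery rate. The function names are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

In this example every Benjamini-Hochberg adjusted value stays below 0.05, while two Bonferroni adjusted values exceed it, which illustrates why the less conservative correction is offered as an alternative.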

In one embodiment, the enrichment analysis system further comprises a region annotation unit comprising genetic annotations for the genomic region and epigenetic annotations for the genomic region; genetic annotations of genomic regions include: common SNP, risk SNP, eQTL;

epigenetic annotations of genomic regions include: transcription factors and transcription cofactors, enhancers and super enhancers, DNA methylation, histone modification, chromatin openness.

In one embodiment, the enrichment analysis system provides 3 query approaches to search a set of genomic regions, specifically:

"search by data type": select a data class and subclass to query;

"search by genomic region": select a data class and subclass and input a genomic region to query;

"search by gene name": select a data class and subclass and input a gene name to query.

The enrichment analysis system provides multiple enrichment analysis strategies and can analyze the obtained region set from different angles, which improves the accuracy of the enrichment scores. It also collects and processes 11 different data types and classifies them according to the characteristics of each data type, provides extensive annotation information for each region, and supports data visualization. The system focuses on constructing a comprehensive collection of human genomic region sets; it is the software that collects the most human genomic region sets to date, and it provides accurate enrichment analysis methods and extensive annotation information. It can be used not only to explore the biological significance of genomic regions but also to uncover their functions in transcriptional regulation.

In addition, using the enrichment analysis system, the inventors have carried out case studies on the following diseases: breast cancer, colon cancer, and cardiovascular and cerebrovascular diseases.

In the breast cancer case study, the -2/+1 kb regions around the transcription start sites of breast-cancer-associated lncRNAs (log2FC > 1, Padj < 0.05), downloaded from the TCGA project and the circlnneat database, were used as input. The "breast cancer" set was found to be significantly enriched, and important regulatory information was found by further examining the promoter regions, enhancer regions and open chromatin regions of the lncRNAs in the set (such as HOTAIR). In the promoter region of HOTAIR, binding of 25 relevant transcription factors was found by ChIP-seq analysis. These transcription factors have been confirmed to be associated with breast cancer, which demonstrates the reliability of the method.

In the colon cancer case study, the binding site data of TCF7L2, a transcription factor involved in colon cancer regulation, were used as input to examine how TCF7L2 influences the occurrence and development of colon cancer through long-range regulation. The study clearly shows that the TCF7L2 binding sites overlap with multiple super enhancer regions in the colon cancer cell line HCT116. The study also provides the related genes (MYC, etc.), the related transcription factors (CTCF, etc.) and the signaling pathways (TGF_beta_Receptor, etc.) of these super enhancers, all of which indirectly demonstrate that, in colon cancer, the transcription factor TCF7L2 can influence the transcription of colon cancer oncogenes by regulating super enhancers. Similar conclusions can be drawn when the enhancer regions associated with colon cancer are used as input.

In the case study of cardiovascular and cerebrovascular diseases, the -2/+1 kb regions around the promoter regions of a set of genes differentially expressed in heart failure were used as input, and the binding sites of the transcription factor GATA4 in cardiomyocytes were significantly enriched. GATA4 is a key transcription factor in cardiac gene regulation and responds to hypertrophic agonists.

It is noted that, in the enrichment analysis of the present invention, a plurality of disease-related oncogenes and transcription factors are annotated, which indicates that the enrichment analysis system is important for studying the transcriptional regulation of genomic regions in the occurrence and development of diseases.

The case studies on breast cancer, colon cancer, and cardiovascular and cerebrovascular diseases show that the background data sets collected and curated in this research, together with the enrichment analysis methods, play an important role in studying the function of genomic regions in disease mechanisms.

The validation results of this validation example show that assigning an intrinsic weight to an indication can moderately improve the performance of the method relative to the default settings.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only a logical division, and other divisions may be used in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the associated hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like.

It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by hardware that is instructed to implement by a program, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

While the invention has been described in detail with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

1. A data processing method for identifying an enhancer and a super enhancer, comprising:

acquiring sample data;

and outputting the sorting result.

2. The method of claim 1, wherein the enriched region of the stitched sample data is obtained by:

3. The method of claim 1, wherein the preprocessing comprises sequence alignment and identification of enriched regions;

optionally, the preprocessing process further includes: format conversion;

the format conversion process comprises: obtaining the SAM file, and converting the SAM format file into the BAM file by using an algorithm;

obtaining the BAM file, and analyzing the BAM file by using an algorithm to obtain an enrichment area of the sample data; the enrichment area of the sample data is stored in the BED file.

4. The method of claim 1, wherein the step of sorting the enriched regions of the stitched sample data and the enriched regions of the unstitched sample data based on the signal intensity characteristic data of the chromatin region to obtain the sorting result comprises:

normalizing the enrichment region signal of the sample data in the genome region to obtain the background normalized density level of the enrichment region signal of the sample data; and sorting the stitched enriched regions of the sample data and the unstitched enriched regions of the sample data according to the background normalized density level of the enriched region signal of the sample data to obtain the sorting result.

5. The data processing method for identifying an enhancer and a super enhancer as claimed in any one of claims 1 to 4, further comprising: plotting according to the sorting result to obtain a distribution curve based on the signal intensity characteristic data of the chromatin region;

6. The method of claim 1, wherein the sample data comprises the sample data after the screening process; the sample data after screening processing comprises a plurality of groups of sample data, and each group of sample data comprises H3K27ac data and corresponding input data;

screening the original sample data to obtain screened sample data; the sample data after screening processing is unique and non-redundant;

7. An enrichment analysis method based on genome region annotation, comprising:

acquiring a genome region set;

matching the genome region set with a sample data reference set, and calculating the enrichment score values of the genome region set and the sample data reference set; the reference set of sample data comprising an enhancer region and a super enhancer region identified based on the ordering result of claim 5;

obtaining an enrichment significance ranking for the set of genomic regions based on the enrichment score value;

optionally, the sample data reference set further includes the following: chromatin state data based on a hidden Markov model algorithm, transcription factors and transcription cofactors, histone modification, open chromatin regions, SNP data, methylation data, lncRNA and mRNA;

optionally, the calculating the enrichment score values of the set of genomic regions and the reference set of sample data comprises, but is not limited to, the following methods: the hypergeometric test method and locus overlap analysis;

optionally, the analysis process of the hypergeometric test method includes: identifying the regions in which the genomic region set overlaps the regions of the sample data reference set to obtain the number of overlapping regions; and applying the hypergeometric test to the number of overlapping regions to obtain the enrichment score value;

optionally, the overlapping portion of the set of genomic regions and the region corresponding to the reference set of sample data comprises at least one base intersection.

8. A data processing apparatus for identifying an enhancer and a super enhancer, the apparatus comprising: a memory and a processor;

the memory is to store program instructions;

the processor is configured to invoke the program instructions which, when executed, perform the data processing method for identifying an enhancer and a super enhancer of any one of claims 1 to 6.

9. A data processing system for identifying an enhancer and a super enhancer, comprising:

an acquisition unit configured to acquire sample data;

and an output unit configured to output the sorting result.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of data processing for identifying an enhancer and a super enhancer of any of the preceding claims 1-6.