CN115455920A

CN115455920A - Genome sequencing data quick annotation method and system based on site mapping

Info

Publication number: CN115455920A
Application number: CN202211165115.7A
Authority: CN
Inventors: 方超; 郎秋蕾; 陈志锋
Original assignee: Hangzhou Lianchuan Biotechnology Co ltd
Current assignee: Hangzhou Lianchuan Biotechnology Co ltd
Priority date: 2022-09-23
Filing date: 2022-09-23
Publication date: 2022-12-09

Abstract

The invention discloses a method and a system for rapidly annotating genome sequencing data based on site mapping, and belongs to the technical field of bioinformatics. Mapping values are first constructed for the start sites and end sites of all functional components, and an index file is built from these mapping values. A mapping value is obtained in the same way for each site to be annotated and is searched for in the index file. If the mapping value of a site to be annotated falls between the start-site and end-site mapping values of a functional component, it is further determined whether the site itself falls between the start site and end site of that functional component, and the site is annotated accordingly. The invention can greatly improve the efficiency of annotation searches and reduce the time and computational cost of annotation.

Description

Genome sequencing data quick annotation method and system based on site mapping

Technical Field

The invention belongs to the technical field of bioinformatics, and particularly relates to a method and a system for rapidly annotating genome sequencing data based on site mapping.

Background

Next-generation sequencing (NGS), also known as high-throughput sequencing, is a sequencing-by-synthesis technology developed on the basis of PCR and gene chips. Its main characteristics are short read length, high throughput, and high accuracy. Compared with first-generation sequencing, high-throughput sequencing greatly reduces cost and sequencing time while maintaining high accuracy, and it is now widely applied across omics research, for example in reference-based transcriptome sequencing, resequencing, DNA methylation sequencing, m6A methylation sequencing, and single-cell sequencing.

DNA methylation is a major form of epigenetic modification. It can change heritable gene expression without changing the DNA sequence, and it plays an important role in regulating gene expression, chromatin conformation and the like. DNA methylation mainly produces 5-methylcytosine (5-mC), together with small amounts of N6-methyladenine (N6-mA), 7-methylguanine (7-mG) and the like. Typically, methylated DNA refers primarily to 5-methylcytosine (5-mC). Methylation in mammalian cells occurs mainly at the cytosine of CG dinucleotides, whereas plant cells also show a large proportion of non-CG methylation (CHH and CHG, where H stands for A, C or T). 5-Methylcytosine is formed when DNA methyltransferase (DNMT) catalyzes the conversion of cytosine to 5-methylcytosine, with S-adenosylmethionine (SAM) as the methyl donor.

Whole-genome bisulfite sequencing (WGBS) combines bisulfite conversion with next-generation sequencing and can efficiently detect the methylation state of whole-genome DNA at single-base resolution. Bisulfite treatment deaminates unmethylated cytosines in DNA to uracil, while methylated cytosines remain unchanged; when the target fragment is amplified by PCR, all uracils are converted to thymine. The PCR products are then subjected to high-throughput sequencing and compared with a reference sequence to determine whether each CpG/CHG/CHH site is methylated. Whole-genome methylation sequencing can comprehensively and accurately detect the methylation state of whole-genome DNA, laying a foundation for deeper analysis of epigenetic regulation.

CpG islands in the promoter region of a gene are usually unmethylated, which promotes gene transcription, and abnormal methylation can lead to transcriptional inactivation. In general, CpG island methylation leads to gene silencing. DNA methylation plays an important role in genomic imprinting, where hypermethylation of one of the two alleles results in expression of a single allele.

Current bioinformatics software offers no consistent and rapid method for annotating DNA methylation sequencing data against gene structure regions, such as promoter, exon, intron and intergenic regions, or against CpG island regions.

Disclosure of Invention

In order to solve at least one of the above technical problems, the technical solution adopted by the present invention is as follows:

the invention provides a method for quickly annotating genome sequencing data based on site mapping, which comprises the following steps:

S1, establishing an index file:

obtaining the start site and the end site of each functional component region of the species from which the sequencing sample is derived, and obtaining a mapping value for each site using formula (1):

$$G_i = \mathrm{INT}\left(\frac{S_i}{10^{N}}\right) + L_i \times 10^{L_i - N} \qquad (1)$$

wherein G_i represents the mapping value of the i-th site, INT represents rounding down to an integer, S_i represents the value of the i-th site, N is a value determined according to the chromosome length of the source species, and L_i represents the number of digits of the i-th site; if L_i ≤ N, then L_i − N is taken as 1,

Thereby obtaining mapping values of the start sites and the end sites of all the functional component regions, and constructing an index file according to the following format:

Chr S E s e function

wherein Chr represents the chromosome position information of the functional component region, S represents the mapping value of the start site of the functional component region, E represents the mapping value of the end site of the functional component region, s represents the start site of the functional component region, e represents the end site of the functional component region, and function represents the category of the functional component region;

S2, obtaining a mapping value of a site to be annotated: the value of the site is Q, and the mapping value G of the site to be annotated is obtained using formula (1);

S3, searching for the mapping value G obtained in step S2 in the 2nd and 3rd columns of the index file; if, for some functional component region j, S_j ≤ G ≤ E_j, further judging whether Q satisfies s_j ≤ Q ≤ e_j; if so, the site to be annotated is annotated as located in the j-th functional component region.
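Steps S1 to S3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: formula (1) is assumed to be G = INT(S / 10^N) + L × 10^(L − N), with the exponent L − N set to 1 when L ≤ N, as inferred from the worked numbers in Example 2. The names `map_site`, `IndexRow`, `build_index` and `annotate` are hypothetical, and matching on the Chr column is an added assumption.

```python
from typing import List, NamedTuple, Tuple


def map_site(site: int, n: int) -> int:
    """Mapping value of a site under the assumed form of formula (1)."""
    digits = len(str(site))                      # L: number of decimal digits of the site
    exponent = digits - n if digits > n else 1   # L - N, set to 1 when L <= N
    return site // 10 ** n + digits * 10 ** exponent


class IndexRow(NamedTuple):
    chrom: str      # Chr
    S: int          # mapping value of the start site
    E: int          # mapping value of the end site
    s: int          # start site
    e: int          # end site
    function: str   # category of the functional component region


def build_index(regions: List[Tuple[str, int, int, str]], n: int) -> List[IndexRow]:
    """S1: regions are (chrom, start, end, function) tuples."""
    return [IndexRow(c, map_site(s, n), map_site(e, n), s, e, f)
            for c, s, e, f in regions]


def annotate(index: List[IndexRow], chrom: str, q: int, n: int) -> List[str]:
    """S2 and S3: coarse test on mapping values, then exact test on the site."""
    g = map_site(q, n)
    return [row.function for row in index
            if row.chrom == chrom and row.S <= g <= row.E and row.s <= q <= row.e]


index = build_index([("Chr1", 10300, 13000, "promoter")], n=3)
print(annotate(index, "Chr1", 10540, n=3))  # ['promoter']
```

The comparison on S and E only narrows the candidates; the final decision always rests on the exact comparison s ≤ Q ≤ e.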

In the present invention, the functional components and functional elements have equivalent meanings.

In the present invention, the index value and the mapping value have an equivalent meaning.

In some embodiments of the present invention, the method for determining N specifically comprises:

(1) Obtaining the length CL and the gene number GN of each chromosome, and calculating CL/GN;

(2) Obtaining a representative number MN of CL/GN over all chromosomes and dividing it by a value q; the number of digits of the integer part of MN/q is the value N, where q = 1 to 100.

This way of obtaining N is a selection method that the inventors unexpectedly found to make subsequent annotation more efficient. Those skilled in the art may also select N in other ways; as long as the core idea of the present invention is not departed from, such selections should be considered to fall within the protection scope of the present invention.

For example, a representative number of all chromosome lengths can be obtained, its square root taken, and the number of digits of the integer part of the result used as the value of N.

In some embodiments of the invention, the representative number is selected from one of a median, a mode, and an average.
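The square-root alternative can be sketched as below. This is only an illustration: the chromosome lengths are the rounded CL (Mbp) values listed in Table 3 of Example 1, converted to bp, and the helper name `n_from_lengths` and the `REPRESENTATIVE` lookup are hypothetical.

```python
import math
import statistics

# Rounded human chromosome lengths in Mbp (chromosomes 1-22, X, Y), as listed in Table 3.
CL_MBP = [250, 242, 198, 190, 182, 171, 159, 145, 138, 134, 135, 133,
          114, 107, 102, 90, 83, 80, 59, 64, 47, 51, 156, 57]

# The representative number may be the median, the mode, or the average.
REPRESENTATIVE = {
    "median": statistics.median,
    "mode": statistics.mode,
    "mean": statistics.mean,
}


def n_from_lengths(lengths_bp, representative="median"):
    """N = number of digits of the integer part of sqrt(representative length)."""
    rep = REPRESENTATIVE[representative](lengths_bp)
    return len(str(math.isqrt(int(rep))))


lengths_bp = [mbp * 1_000_000 for mbp in CL_MBP]
print(n_from_lengths(lengths_bp))  # 5
```

With these rounded lengths the median is 133.5 Mbp, whose square root has a five-digit integer part, so this alternative would give N = 5.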

In some embodiments of the invention, the source species is a mammal. Preferably, the source species is human.

In some embodiments of the invention, the gene sequencing data is DNA methylation sequencing data.

In some embodiments of the invention, the functional component regions comprise promoter regions, exon regions, intron regions, promoter CGIs, intragenic CGIs, 3' transcript CGIs, intergenic CGIs, repeat regions and miRNA regions.

Wherein promoter CGIs, intragenic CGIs, 3' transcript CGIs and intergenic CGIs are defined according to the gene position of the CGI:

promoter CGIs	-1000bp TSS to +300bp TSS
intragenic CGIs	+300bp TSS to +300bp TES
3' transcript CGIs	-300bp TES to +300bp TES
intergenic CGIs	-300bp TES to -1000bp next gene's promoter

the invention provides a genome sequencing data quick annotation system based on site mapping, which comprises the following modules:

the index library module is used for storing index files, wherein the index files are constructed by the following method:

obtaining the start site and the end site of the functional module region of the species from which the sequencing sample is derived, and obtaining a mapping value using equation (1) for each site:

wherein G_i represents the mapping value of the i-th site, INT represents rounding down to an integer, S_i represents the value of the i-th site, N is a value determined according to the chromosome length of the source species, and L_i represents the number of digits of the i-th site; if L_i ≤ N, then L_i − N is taken as 1,

Thus obtaining the mapping values of the starting positions and the ending positions of all the functional component areas, and constructing an index file according to the following format:

Chr S E s e function

wherein Chr represents the chromosome position information of the functional component region, S represents the mapping value of the start site of the functional component region, E represents the mapping value of the end site of the functional component region, s represents the start site of the functional component region, e represents the end site of the functional component region, and function represents the category of the functional component region,

the input module is used for receiving the sequencing data, obtaining the site Q to be annotated, and calculating the index value G of the site to be annotated using formula (1),

the search module is connected to the input module and the index library module respectively, and is used for searching for the index value G of the site to be annotated, obtained by the input module, in the 2nd and 3rd columns of the index file; if, for some functional component region j, S_j ≤ G ≤ E_j, it further judges whether Q satisfies s_j ≤ Q ≤ e_j; if so, the site to be annotated is annotated as located in the j-th functional component region,

and the result output module is used for outputting the annotation result.

(2) Obtaining a representative number MN of CL/GN over all chromosomes and dividing it by a value q; the number of digits of the integer part of MN/q is the value N, where q = 1 to 100.

As above, this way of obtaining N is a selection method that the inventors unexpectedly found to make subsequent annotation more efficient. Those skilled in the art may also select N in other ways; as long as the core idea of the present invention is not departed from, such selections should be considered to fall within the protection scope of the present invention.

Wherein promoter CGIs, intragenic CGIs, 3' transcript CGIs and intergenic CGIs are defined, as in the method above, according to the gene position of the CGI.

Advantageous Effects

Compared with the prior art, the invention has the following beneficial effects:

With the method and system of the invention, the sites of functional components are mapped; the method is simple to operate and can greatly improve the efficiency of annotation searches. Taking human chromosome 1 as an example, search efficiency can be improved by a factor of 26804. The effect is even more pronounced when annotating multiple chromosomes and multiple samples.

Drawings

FIG. 1 shows the gene locations to which CGIs belong.

FIG. 2 shows a schematic representation of a site (10540) located in a promoter region ([ 10300,13000 ]).

FIG. 3 shows a schematic diagram of a rapid annotation system for genome sequencing data based on site mapping in an embodiment of the invention.

Detailed Description

Unless otherwise indicated, implicit from the context, or customary in the art, all parts and percentages herein are based on weight and the testing and characterization methods used are in step with the filing date of the present application. Where applicable, the contents of any patent, patent application, or publication referred to in this application are incorporated herein by reference in their entirety and their equivalent family patents are also incorporated by reference, especially as they disclose definitions relating to synthetic techniques, products and process designs, polymers, comonomers, initiators or catalysts, and the like, in the art. To the extent that a definition of a particular term disclosed in the prior art is inconsistent with any definitions provided herein, the definition of the term provided herein controls.

The numerical ranges in this application are approximations, and thus may include values outside of the ranges unless otherwise specified. A numerical range includes all numbers from the lower value to the upper value, in increments of 1 unit, provided that there is a separation of at least 2 units between any lower value and any higher value. For example, if 100 to 1000 is recited, this is meant to explicitly recite all individual values, e.g., 100, 101, 102, etc., as well as all sub-ranges, e.g., 100 to 166, 155 to 170, 198 to 200, etc. For ranges containing a numerical value less than 1 or containing a fraction greater than 1 (e.g., 1.1,1.5, etc.), then 1 unit is considered to be 0.0001,0.001,0.01, or 0.1, as appropriate. For ranges containing single digit numbers less than 10 (e.g., 1 to 5), 1 unit is typically considered 0.1. These are merely specific examples of what is intended to be expressed and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application.

The terms "comprising," "including," "having," and derivatives thereof do not exclude the presence of any other component, step or procedure, whether or not expressly disclosed herein. For the avoidance of doubt, all compositions described herein using the terms "comprising," "including," or "having" may contain any additional additive, adjuvant, or compound, unless expressly stated otherwise. In contrast, the term "consisting essentially of" excludes from the scope of any subsequent recitation any other component, step or procedure, except those that are not essential to operability. The term "consisting of" excludes any component, step or procedure not specifically described or listed. Unless explicitly stated otherwise, the term "or" refers to the listed members individually as well as in any combination.

In order to make the technical problems, technical solutions and advantageous effects solved by the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments.

Examples

The following examples are used herein to demonstrate preferred embodiments of the invention. It will be appreciated by those of skill in the art that the techniques disclosed in the examples which follow represent techniques discovered by the inventor to function in the invention, and thus can be considered to constitute preferred modes for its practice. Those of skill in the art should, in light of the present disclosure, appreciate that many changes can be made in the specific embodiments which are disclosed and still obtain a like or similar result without departing from the spirit or scope of the invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and the disclosures and references cited herein and the materials to which they refer are incorporated by reference.

Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.

The experimental procedures in the following examples are conventional unless otherwise specified. The instruments used in the following examples are, unless otherwise specified, laboratory-standard instruments; the test materials used in the following examples were purchased from a conventional biochemical reagent store unless otherwise specified.

Example 1 method for fast annotation of functional Components based on DNA methylation sequencing

1. Genomic functional component classification and definition

A promoter is a DNA sequence to which proteins bind to initiate transcription of a single RNA transcript from the DNA downstream of the promoter. The RNA transcript may encode a protein (mRNA) or may have a function of its own (e.g., tRNA or rRNA). The promoter is located near the transcription start site of the gene, upstream on the DNA (toward the 5' region of the sense strand). Promoters are about 100-1000 base pairs in length, and their sequence depends strongly on the gene and its transcription product, on the type or class of RNA polymerase recruited to the site, and on the species. The promoter region is the binding region for RNA polymerase, and its structure is directly related to the efficiency of transcription.

The transcription start site (TSS) is the base on the DNA strand, usually a purine, that corresponds to the first nucleotide of the nascent RNA strand. Sequences preceding the start site, i.e., toward the 5' end, are referred to as upstream, and sequences following it, i.e., toward the 3' end, are referred to as downstream. Base positions are generally indicated by numbers: the TSS is +1, downstream positions are +2, +3, and so on, and upstream positions are -1, -2, -3, and so on.

In this example, the inventors defined promoter regions uniformly on the basis of the TSS, as a uniform annotation standard for subsequent DNA methylation sequencing analysis, as shown in Table 1:

TABLE 1 promoter region definitions

Region	Label	Range relative to TSS
Promoter region	Promoter	-2,200 to +500bp
Proximal	Proximal(P)	-200 to +500bp
Intermediate	Intermediate(I)	-200 to -1,000bp
Distal	Distal(D)	-1,000 to -2,200bp
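As an illustration of how the offsets in Table 1 translate into genomic intervals, the sketch below derives the three sub-regions from a TSS coordinate. It is a simplified sketch under assumptions not stated in the source: offsets are added directly to the TSS coordinate (the convention that the TSS itself is +1 is ignored), minus-strand genes are mirrored, and the names `PROMOTER_SUBREGIONS` and `promoter_intervals` are hypothetical.

```python
# Offsets relative to the TSS, taken from Table 1 (upstream is negative).
PROMOTER_SUBREGIONS = {
    "Proximal(P)": (-200, 500),
    "Intermediate(I)": (-1000, -200),
    "Distal(D)": (-2200, -1000),
}


def promoter_intervals(tss: int, strand: str = "+"):
    """Return {label: (start, end)} in genomic coordinates with start <= end."""
    intervals = {}
    for label, (up, down) in PROMOTER_SUBREGIONS.items():
        if strand == "+":
            start, end = tss + up, tss + down
        else:  # on the minus strand, upstream lies at larger coordinates
            start, end = tss - down, tss - up
        # Clamp at position 1 so intervals never run off the chromosome start.
        intervals[label] = (max(start, 1), max(end, 1))
    return intervals


print(promoter_intervals(12500))
```

For a hypothetical plus-strand TSS at 12500, the three sub-regions together span 10300 to 13000, a promoter of the full -2,200 to +500bp width.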

The relative position and CpG content of a promoter sequence are important factors in its degree of methylation. CpG content is measured by the O/E ratio (observed-to-expected CpG ratio), according to which the inventors classified promoter regions into three classes with low, intermediate and high CpG content (LCP, ICP and HCP). The calculation formula is as follows:

$$\mathrm{O/E} = \frac{\text{Number of CpG}}{\text{Number of C} \times \text{Number of G}} \times \text{Total number of nucleotides in the sequence}$$

wherein Number of CpG is the number of CpG dinucleotides in the sequence, Number of C is the number of C bases in the sequence, Number of G is the number of G bases in the sequence, and Total number of nucleotides in the sequence is the total number of bases in the sequence.
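The O/E ratio can be computed directly from a sequence string. The sketch below assumes the usual observed-to-expected definition, (number of CpG × sequence length) / (number of C × number of G); the function name `cpg_oe_ratio` is hypothetical, and the cut-offs separating LCP, ICP and HCP are not given in the text, so none are applied here.

```python
def cpg_oe_ratio(seq: str) -> float:
    """Observed-to-expected CpG ratio of a DNA sequence."""
    seq = seq.upper()
    n_c = seq.count("C")       # number of C bases
    n_g = seq.count("G")       # number of G bases
    n_cpg = seq.count("CG")    # number of CpG dinucleotides
    if n_c == 0 or n_g == 0:
        return 0.0             # ratio is undefined without both bases; report 0
    return n_cpg * len(seq) / (n_c * n_g)


print(round(cpg_oe_ratio("ACGTCGGC"), 3))  # 1.778
```

In the example string there are 2 CpG dinucleotides, 3 C bases, 3 G bases and 8 bases in total, giving 2 × 8 / 9.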

The inventors defined the CpG island (CGI) as described in table 2 according to the gene location to which the CGI belongs:

TABLE 2 CGI definitions

promoter CGIs	-1000bp TSS to +300bp TSS
intragenic CGIs	+300bp TSS to +300bp TES
3' transcript CGIs	-300bp TES to +300bp TES
intergenic CGIs	-300bp TES to -1000bp of the next gene's promoter

The gene location definition to which CGI belongs is shown in FIG. 1.

2. Annotating functional component regions to which sequences belong by using a site traversal method

The inventors aligned sequencing reads to the genome based on the results of DNA methylation sequencing; alignment results are typically output in SAM format. The SAM format includes the position field POS, the leftmost position of the alignment, i.e., the genomic position to which the first aligned base of the read maps. Based on the alignment position, one needs to annotate quickly which functional interval of a gene (e.g., promoter region, exon region, intron region) the site is located in, and also to confirm whether the site lies in a CGI or CGI shore region.
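Extracting the alignment position from SAM records needs only the standard library. The sketch below relies on the SAM column layout (RNAME is column 3, POS is column 4, 1-based); the function name `sam_positions` and the example records are hypothetical.

```python
def sam_positions(lines):
    """Yield (RNAME, POS) for each mapped alignment record."""
    for line in lines:
        if line.startswith("@"):                 # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos = fields[2], int(fields[3])   # columns 3 and 4 of the record
        if rname == "*" or pos == 0:             # unmapped read, no position
            continue
        yield rname, pos


example = [
    "@SQ\tSN:chr1\tLN:249250621",
    "read1\t0\tchr1\t10540\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
print(list(sam_positions(example)))  # [('chr1', 10540)]
```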

Here, taking the human genome as an example, human chromosome 1 is 249250621 bases long. If several reads align to base position 10540 of this chromosome, how can one quickly determine by site search whether that position (10540) belongs to a promoter region?

Assume the site is located in the promoter region ([10300, 13000]) of a gene (as shown in FIG. 2). If the annotation is searched by traversing single sites one at a time, the site is found only after 10540 iterations (traversing from the 1st base), which is very inefficient.

3. Annotating functional component regions to which sequences belong by site mapping

In order to improve the search efficiency of functional annotation, the inventors devised a site-mapping search method for rapid annotation. The detailed process is as follows:

(1) First, the inventors created mappings for the functional components of a genome using the following formula:

$$G_i = \mathrm{INT}\left(\frac{S_i}{10^{N}}\right) + L_i \times 10^{L_i - N}$$

wherein G_i represents the mapping value of the i-th site, INT represents rounding down to an integer, S_i represents the value of the i-th site, N is a value determined according to the chromosome length of the source species, and L_i represents the number of digits of the i-th site; if L_i ≤ N, then L_i − N is taken as 1.

For the value of N, the selection is carried out according to the following method:

1. obtaining the length (namely bp number) CL and the gene number GN of each chromosome, and calculating CL/GN;

2. obtaining the median MN of CL/GN over all chromosomes and dividing it by a value q; the number of digits of the integer part of MN/q is the value N (for example, N = 4, i.e., 10^N = 10000), where q = 1 to 100.

For humans, the calculation results are shown in table 3:

TABLE 3 chromosome length information

Chromosome	CL (Mbp)	GN	CL/GN
1	250	4778	5232.31
2	242	3600	6722.22
3	198	2780	7122.30
4	190	2251	8440.69
5	182	2393	7605.52
6	171	2828	6046.68
7	159	2598	6120.09
8	145	2003	7239.14
9	138	2132	6472.80
10	134	2042	6562.19
11	135	2806	4811.12
12	133	2398	5546.29
13	114	1327	8590.81
14	107	1951	5484.37
15	102	1721	5926.79
16	90	1845	4878.05
17	83	2326	3568.36
18	80	938	8528.78
19	59	2420	2438.02
20	64	1243	5148.83
21	47	726	6473.83
22	51	1129	4517.27
X	156	1973	7906.74
Y	57	496	11491.94

The median MN of CL/GN is 6296.44. If q = 1, MN/q = 6296.44 and N = 4; if q = 10, MN/q = 629.6 and N = 3; if q = 100, MN/q = 62.96 and N = 2.
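The selection of N from Table 3 can be checked with a few lines of Python. The sketch takes the CL/GN column exactly as tabulated; the name `n_value` is hypothetical.

```python
import statistics

# CL/GN values for chromosomes 1-22, X and Y, copied from Table 3.
CL_OVER_GN = [
    5232.31, 6722.22, 7122.30, 8440.69, 7605.52, 6046.68, 6120.09, 7239.14,
    6472.80, 6562.19, 4811.12, 5546.29, 8590.81, 5484.37, 5926.79, 4878.05,
    3568.36, 8528.78, 2438.02, 5148.83, 6473.83, 4517.27, 7906.74, 11491.94,
]


def n_value(ratios, q):
    """N = number of digits of the integer part of median(CL/GN) / q."""
    mn = statistics.median(ratios)
    return len(str(int(mn / q)))


for q in (1, 10, 100):
    print(q, n_value(CL_OVER_GN, q))
# 1 4
# 10 3
# 100 2
```

With 24 chromosomes the median is the mean of the 12th and 13th sorted values, 6120.09 and 6472.80, which reproduces the value of about 6296.44 quoted above.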

(2) The mapping values of the start sites and end sites of all functional component regions are thus obtained, and an index file is constructed in the following format:

Chr S E s e function

wherein Chr represents the chromosome position information of the functional component region, S represents the mapping value of the start site of the functional component region, E represents the mapping value of the end site of the functional component region, s represents the start site of the functional component region, e represents the end site of the functional component region, and function represents the category of the functional component region.

At annotation time, the mapping value of the site to be annotated is obtained first: the value of the site is Q, and its mapping value G is obtained using the formula above. The mapping value G is then searched for in the 2nd and 3rd columns of the index file; if, for some functional component region j, S_j ≤ G ≤ E_j, it is further judged whether Q satisfies s_j ≤ Q ≤ e_j, and if so, the site to be annotated is annotated as located in the j-th functional component region, so that annotation is completed quickly.

Example 2 application of the annotation method based on site mapping.

To make a quick annotation of the functional components using the method established in example 1, the inventors chose q =10 and thus N =3.

Taking the above promoter region [10300, 13000] as an example, for the promoter region start site (10300): S_i = 10300, L_i = 5, and G_i = INT(10300 / 10^3) + 5 × 10^2 = 510.

Similarly, the index value for the promoter region end site (13000) is G_j = 513.
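Written out step by step, and assuming the mapping formula has the form G = INT(S / 10^N) + L × 10^(L − N) with N = 3, the end-site value follows in the same way as the start-site value. The third line illustrates the rule for short sites (L ≤ N, exponent taken as 1) with a hypothetical site 540 that does not appear in the source.

```latex
\begin{aligned}
G(10300) &= \mathrm{INT}\!\left(\tfrac{10300}{10^{3}}\right) + 5 \times 10^{5-3} = 10 + 500 = 510,\\
G(13000) &= \mathrm{INT}\!\left(\tfrac{13000}{10^{3}}\right) + 5 \times 10^{5-3} = 13 + 500 = 513,\\
G(540)   &= \mathrm{INT}\!\left(\tfrac{540}{10^{3}}\right) + 3 \times 10^{1} = 0 + 30 = 30
\quad (L = 3 \le N,\ \text{so } L - N \text{ is taken as } 1).
\end{aligned}
```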

The inventors create the following index entry for this promoter region and store it in the following format:

Chr	S	E	s	e	function
Chr1	510	513	10300	13000	promoter

wherein Chr represents the chromosome position information of the functional component region, S represents the mapping value of the start site of the functional component region, E represents the mapping value of the end site of the functional component region, s represents the start site of the functional component region, e represents the end site of the functional component region, and function represents the category of the functional component region. The categories of functional component regions include, but are not limited to: promoter region (promoter), exon region (exon), intron region (intron), promoter CGIs, intragenic CGIs, 3' transcript CGIs, intergenic CGIs, repeat region, and miRNA region.

A similar index conversion is performed on all known functional components to obtain the index file. It should be noted that each functional component is stored with the complete interval between its start and end positions on the genome.
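Writing the index file can be sketched with the standard `csv` module. The tab-separated layout with a header row follows the table above; the mapping function assumes the form G = INT(S / 10^N) + L × 10^(L − N) inferred from the worked example, and the names `map_site` and `write_index` are hypothetical.

```python
import csv
import io


def map_site(site: int, n: int) -> int:
    """Assumed form of the mapping formula (exponent is 1 when L <= N)."""
    digits = len(str(site))
    exponent = digits - n if digits > n else 1
    return site // 10 ** n + digits * 10 ** exponent


def write_index(regions, n, handle):
    """Write one row per (chrom, start, end, function) region."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["Chr", "S", "E", "s", "e", "function"])
    for chrom, start, end, function in regions:
        writer.writerow([chrom, map_site(start, n), map_site(end, n),
                         start, end, function])


buffer = io.StringIO()   # stands in for a real file handle
write_index([("Chr1", 10300, 13000, "promoter")], 3, buffer)
print(buffer.getvalue())
```

Keeping the original s and e columns next to the mapped S and E columns is what allows the exact check at query time.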

In this way, the inventors compress the original sites: the interval [510, 513] corresponds to the original sites [10000, 13999], i.e., the original 4000 sites are represented by 4 index values, so search efficiency can be improved by a factor of 1000. The effect is very pronounced for genome data with large data volumes.

During annotation, only the site to be annotated and its converted index value are needed:

when searching, the 2nd column (S) and 3rd column (E) of the index file are searched according to G_k; if S ≤ G_k ≤ E and s ≤ Q_k ≤ e, the site can be annotated as the corresponding functional component.

For example, for the above site 10540 (Q_k), the corresponding index value is obtained by the method above:

G_k = INT(10540 / 10^3) + 5 × 10^2 = 510

The site satisfies 510 ≤ 510 ≤ 513 and 10300 ≤ 10540 ≤ 13000, so the site Q_k is annotated as promoter.
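The text does not specify the data structure used to search the S and E columns. One possible lookup, sketched here as an assumption rather than the inventors' implementation, sorts the rows by S and uses the standard-library `bisect` module to skip all rows whose S exceeds the query's index value. The names `map_site` and `SortedIndex` are hypothetical, and the mapping function again assumes the form inferred from the worked example.

```python
from bisect import bisect_right


def map_site(site: int, n: int) -> int:
    """Assumed form of the mapping formula (exponent is 1 when L <= N)."""
    digits = len(str(site))
    exponent = digits - n if digits > n else 1
    return site // 10 ** n + digits * 10 ** exponent


class SortedIndex:
    def __init__(self, rows):
        # rows: (Chr, S, E, s, e, function) tuples, as in the index file
        self.rows = sorted(rows, key=lambda r: r[1])
        self.starts = [r[1] for r in self.rows]

    def query(self, chrom, q, n):
        g = map_site(q, n)
        hi = bisect_right(self.starts, g)   # rows[:hi] all have S <= G
        return [r[5] for r in self.rows[:hi]
                if r[0] == chrom and g <= r[2] and r[3] <= q <= r[4]]


idx = SortedIndex([("Chr1", 510, 513, 10300, 13000, "promoter")])
print(idx.query("Chr1", 10540, 3))  # ['promoter']
```

The binary search only removes rows with S greater than G; the remaining candidates are still checked against E and against the exact interval [s, e].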

Example 3 application of a method for rapid annotation of functional Components based on DNA methylation sequencing to the entire chromosome

In this example, chromosome 1 is taken as an example. Its sequence contains 249250621 bases in total, so the traditional site traversal method needs up to 249250621 steps to find the site to be annotated. With the site-mapping search method of Example 1 and N = 4, the chromosome is divided into at most 9249 index values (with about 50 functional components), i.e., the site can be found in at most 9299 searches, and search efficiency is improved by a factor of 26804.
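The quoted improvement factor is the ratio of the two worst-case search counts stated above, with the 9299 searches read as the sum of the 9249 index values and the roughly 50 functional components:

```latex
\frac{249250621}{9249 + 50} = \frac{249250621}{9299} \approx 26804
```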

The specific efficiency comparison is shown in Table 4:

TABLE 4 comparison of efficiency

Therefore, the search performance can be greatly improved by using the site mapping search method of the embodiment 1.

Example 4 genome sequencing data quick annotation System based on site mapping

As shown in fig. 3, this embodiment provides a system to implement the above fast annotation method, where the system includes:

an index library module for storing the constructed index file;

an input module for receiving the sequencing data, obtaining a site Q to be annotated, and calculating the index value G of the site to be annotated using the formula; and

a result output module for outputting the annotation result.
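The module structure can be sketched as a handful of small Python classes. This is an illustrative arrangement only: the search module is included as it is described in the Disclosure section, the class and method names are hypothetical, the "unannotated" label is an invented placeholder, and the mapping function assumes the form G = INT(S / 10^N) + L × 10^(L − N).

```python
def map_site(site: int, n: int) -> int:
    """Assumed form of the mapping formula (exponent is 1 when L <= N)."""
    digits = len(str(site))
    exponent = digits - n if digits > n else 1
    return site // 10 ** n + digits * 10 ** exponent


class IndexLibraryModule:
    """Stores index rows (Chr, S, E, s, e, function)."""
    def __init__(self, rows):
        self.rows = list(rows)


class InputModule:
    """Receives a site Q and computes its index value G."""
    def __init__(self, n):
        self.n = n

    def receive(self, chrom, q):
        return chrom, q, map_site(q, self.n)


class SearchModule:
    """Connected to the input and index library modules."""
    def __init__(self, library):
        self.library = library

    def search(self, chrom, q, g):
        return [function for c, S, E, s, e, function in self.library.rows
                if c == chrom and S <= g <= E and s <= q <= e]


class ResultOutputModule:
    def emit(self, chrom, q, functions):
        label = ",".join(functions) if functions else "unannotated"
        return f"{chrom}\t{q}\t{label}"


library = IndexLibraryModule([("Chr1", 510, 513, 10300, 13000, "promoter")])
reader, searcher, writer = InputModule(n=3), SearchModule(library), ResultOutputModule()
chrom, q, g = reader.receive("Chr1", 10540)
print(writer.emit(chrom, q, searcher.search(chrom, q, g)))
```

Running the example prints a tab-separated line containing Chr1, 10540 and promoter.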

All documents mentioned in this application are incorporated by reference in this application as if each were individually incorporated by reference. Furthermore, it should be understood that various changes and modifications of the present invention can be made by those skilled in the art after reading the above teachings of the present invention, and these equivalents also fall within the scope of the present invention as defined by the appended claims.

Claims

1. A method for quickly annotating genome sequencing data based on site mapping is characterized by comprising the following steps:

S1, establishing an index file:

wherein G_i represents the mapping value of the i-th site, INT represents rounding down to an integer, S_i represents the value of the i-th site, N is a value determined according to the chromosome length of the source species, and L_i represents the number of digits of the i-th site; if L_i ≤ N, then L_i − N is taken as 1,

Chr S E s e function

2. The method for rapidly annotating genome sequencing data according to claim 1, wherein the determination method of N is specifically as follows:

(2) Obtaining a representative number MN of CL/GN over all chromosomes and dividing it by a value q; the number of digits of the integer part of MN/q is the value N, where q = 1 to 100.

3. The method for rapid annotation of genome sequencing data according to claim 2, wherein the representative number is selected from one of median, mode and mean.

4. The method for rapid annotation of genome sequencing data according to claim 1, wherein the source species is mammal.

5. The method for rapid annotation of genome sequencing data according to claim 4, wherein said functional component regions comprise promoter regions, exon regions, intron regions, promoter CGIs, intragenic CGIs, 3' transcript CGIs, intergenic CGIs, repeat regions and miRNA regions.

6. A genome sequencing data quick annotation system based on site mapping is characterized by comprising the following modules:

the index library module is used for storing index files, wherein the construction method of the index files comprises the following steps:

wherein G_i represents the mapping value of the i-th site, S_i represents the value of the i-th site, N is a value determined according to the chromosome length of the source species, and L_i represents the number of digits of the i-th site; if L_i ≤ N, then L_i − N is taken as 1,

Chr S E s e function

the input module is used for receiving sequencing data, obtaining a site to be annotated, and calculating an index value of the site to be annotated using formula (1),

and the result output module is used for outputting the annotation result.

7. The system for rapid annotation of genome sequencing data according to claim 6, wherein the method for determining N is as follows:

8. The rapid annotation system of genome sequencing data of claim 7, wherein the representative number is selected from one of median, mode, and mean.

9. The system for rapid annotation of genome sequencing data according to claim 6, wherein the source species is a mammal.

10. The rapid annotation system of genome sequencing data according to claim 9, wherein the functional component regions comprise promoter regions, exon regions, intron regions, promoter CGIs, intragenic CGIs, 3' transcript CGIs, intergenic CGIs, repeat regions and miRNA regions.