WO2013186306A1 - Method for identifying transcriptional regulatory elements - Google Patents

Publication number
WO2013186306A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
method
candidate nucleic
acid molecule
promoter
Prior art date
Application number
PCT/EP2013/062260
Other languages
French (fr)
Inventor
Cosmas ARNOLD
Alexander Stark
Original Assignee
Boehringer Ingelheim International Gmbh
Priority date
Filing date
Publication date
Priority to EP12004520.8
Application filed by Boehringer Ingelheim International Gmbh
Publication of WO2013186306A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C12N15/1051: Gene trapping, e.g. exon-, intron-, IRES-, signal sequence-trap cloning, trap vectors

Abstract

The present invention generally relates to the field of molecular biology and more specifically to methods of biomolecule detection and identification. The present invention is related to the field of gene transcription, and in particular, non-coding sequences involved in the regulation of gene transcription. Means and methods are provided that allow comprehensively identifying sequences that can function as transcriptional enhancers or repressors, respectively, in a direct and quantitative manner in, for example, entire genomes.

Description

METHOD FOR IDENTIFYING TRANSCRIPTIONAL REGULATORY ELEMENTS

Field of Invention

[001] The present invention generally relates to the field of molecular biology and more specifically to methods of biomolecule detection and identification. The present invention is related to the field of gene transcription, and in particular, non-coding sequences involved in the regulation of gene transcription.

Background of the Invention

[002] The majority of the mammalian genome is composed of non-coding sequences. These sequences contain different types of regulatory elements which control gene transcription. Some of the regulatory elements are able to regulate the transcription of a gene from a long distance and in an orientation-independent manner. In some instances, regulation is observed even on a gene located on a different chromosome. Regulatory elements which were found to up-regulate gene transcription are called "enhancers," while "repressors" or "silencers" are able to repress or inhibit gene activity.

[003] Eukaryotic transcription is highly regulated by enhancers or repressors. Yet their large-scale identification remains challenging and is dependent on indirect approaches. It has been found that enhancers (Banerji et al., Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences, Cell 27, 299-308 (1981)), upon binding of transcription factors (TFs), regulate the transcription of target genes in a cell-type specific manner. They are thought to be the main determinants of cell-type specific gene expression, govern cell differentiation and development, and drive morphological evolution (reviewed in Buecker et al., Enhancers as information integration hubs in development: lessons from genomics, Trends Genet 28, 276-284 (2012); Carroll, Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution, Cell 134, 25-36 (2008); Visel et al., Genomic views of distant-acting enhancers, Nature 461, 199-205 (2009); and Levine et al., Transcription regulation and animal diversity, Nature 424, 147-151 (2003)). Transcription factors contain transcription activation domains. In the prior art, scientists were looking for transcription activation domains that act on, e.g., enhancer or promoter sequences. A typical example thereof is Stamminger (2002), J. Virol. 76(10), 4836-4847. This publication discloses the isolation of transcriptional activation domains (proteins), i.e. trans-acting regulatory proteins from human Cytomegalovirus (HCMV), by translationally fusing potential ORFs or portions thereof to the DNA-binding domain of GAL4 (the typical principle of an activator trap). If an activation domain is present, it will activate transcription of the T-antigen and the plasmid can replicate, thereby allowing the detection and isolation of the activation domain. For this purpose Stamminger cloned fragments by regular cloning and could thus not reach high complexity. A number of studies have shown that variations of transcriptional regulatory elements (enhancers or promoters) can contribute to diseases, including thalassemias, preaxial polydactyly, and Hirschsprung disease. Therefore, the identification of transcriptional regulatory elements will enable human genetic studies to explore the role of disease-causing mutations in these elements. However, despite the importance of enhancers and ongoing efforts to comprehensively map and characterize them, enhancer discovery within animal genomes has remained challenging. In fact, rather few enhancers have been described and functionally characterized, likely due to their versatile genomic locations with respect to their target genes and the diversity of enhancer sequences (Visel et al. 2009 and Buecker et al. 2012).

A standard assay applied to the evaluation of putative enhancers involves cloning potential enhancer sequences into a plasmid-based reporter construct for analysis in vitro or in vivo. Cultured cells, zebrafish embryos, and mouse embryos have been used as systems into which the constructs were introduced for analysis. An enhancer sequence will drive the expression of the reporter gene, which is detected using various reporter strategies, including cell-based reporter readout (e.g. luciferase), live embryo readout of a fluorescent reporter (e.g. GFP), and fixed embryo readout of β-galactosidase activity (e.g. LacZ). However, the method is time-consuming and inefficient, as it requires testing the candidate sequences one by one or, if performed in batch, requires allocating a "barcode" sequence to each candidate (Patwardhan et al., Massively parallel functional dissection of mammalian enhancers in vivo, Nat Biotechnol 30, 265-270 (2012) and Melnikov et al., Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay, Nat Biotechnol 30, 271-277 (2012)).

Specifically, Patwardhan (2012) assesses the activity of transcriptional regulatory sequences ('enhancers') by a heterologous reporter transcript that contains a DNA barcode. Accordingly, a classical enhancer screening assay is used that places a candidate nucleic acid molecule suspected to be an enhancer or repressor upstream of a promoter in the hope of observing transcriptional enhancement or repression. Similarly, WO 2008/073303 refers to testing 'transcription regulatory sequences' in the classical and well-established setup in which candidates are combined with a minimal promoter and a "heterologous reporter sequence in an expression vector such that the expression of the reporter sequences is under transcriptional control of the transcription regulatory sequence."

Two recently developed methods, DHS-seq and ChIP-seq, enable the prediction of cellular enhancers by assessing enhancer-associated chromatin features across entire genomes. DHS-seq (deep sequencing of DNase I hypersensitive sites) allows the mapping of open chromatin (Boyle et al., High-resolution mapping and characterization of open chromatin across the genome, Cell 132, 311-322 (2008)). ChIP-seq (chromatin immunoprecipitation followed by deep sequencing; Johnson et al., Genome-wide mapping of in vivo protein-DNA interactions, Science 316, 1497-1502 (2007)) allows the detection of regulator (e.g. transcription factor or co-factor) binding sites and enhancer-associated histone modifications (e.g. H3K4me1 or H3K27ac). Combined, these methods allow the genome-wide prediction of putative cellular enhancers (Heintzman et al., Histone modifications at human enhancers reflect global cell-type-specific gene expression, Nature 459, 108-112 (2009) and Heintzman et al., Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome, Nat Genet 39, 311-318 (2007)). Neither method, however, provides a direct functional and quantitative readout of enhancer activity. Furthermore, the methods are not scalable to the millions of tests required for genome-wide enhancer identification.

[006] Since current enhancer methods are either limited in scale or do not provide a quantitative picture of the genomic regulatory potential, there is still a need to allow the genome-wide discovery of novel transcriptional regulatory elements with a direct read-out of their regulatory potential.

Summary

[007] The present inventors have developed STARR-seq (self-transcribing-active-regulatory-region-sequencing), a massively parallel reporter assay. STARR-seq allows comprehensively identifying sequences that can function as transcriptional enhancers or repressors, respectively, in a direct and quantitative manner in, for example, entire genomes. STARR-seq makes use of the fact that enhancers or repressors, respectively, function independently of their position relative to their target gene and places candidate nucleic acid molecules downstream of the transcription start site (TSS) into the reporter transcript (see Figure 1). Therefore, active sequences enhance or repress, respectively, their own transcription such that their activity is reflected quantitatively by their abundance or lack, respectively, among RNA. This direct coupling of candidate nucleic acid molecules to their enhancer or repressor activity, respectively, allows the parallel assessment of millions of fragments from arbitrary sources of DNA in a single assay, i.e., in batch.

[008] The present invention provides, in one aspect, a method of identifying or obtaining a transcriptional regulatory element, such as an enhancer or repressor, and a screening system comprising a vector constructed to carry out the method. The vector comprises a candidate transcriptional regulatory element downstream of a promoter, so that the transcriptional regulatory element is transcribed. The transcripts are then quantified so that the presence or absence of a transcriptional regulatory element can be determined. A candidate transcriptional regulatory element can be identified to have enhancer or repressor activity if the candidate contributes to enhance or repress its own transcription driven by the promoter used in the vector, by observing the abundance or lack of the transcripts of the candidate. This method is illustrated in Figure 1.

[009] Preferably, the present invention comprises the steps of preparing a reporter library by constructing vectors in which candidate nucleic acid molecules are inserted downstream of a preferably pre-selected promoter; subjecting the library to conditions allowing transcription from the preferably pre-selected promoter; optionally reverse transcribing the obtained RNA into cDNA, and quantifying RNA or cDNA.

[0010] In one preferred embodiment, the method comprises

(a) optionally providing candidate nucleic acid molecules,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of the preferably pre-selected promoter,

(c) subjecting the library to conditions allowing transcription from the preferably pre-selected promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d), and

(f) determining the presence of one or more transcriptional regulatory elements from the candidate nucleic acid molecules based on the quantification.
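The quantification in steps (e) and (f) can be illustrated with a minimal sketch. It assumes that each candidate has already been reduced to a read count in the input library and in the cDNA library; the function names, the per-million normalization, and the pseudocount are illustrative assumptions and are not prescribed by the method.

```python
def reads_per_million(counts):
    """Scale raw read counts so that each library sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library contains no reads")
    return {cand: n * 1_000_000 / total for cand, n in counts.items()}


def fold_enrichment(cdna_counts, input_counts, pseudocount=1.0):
    """Ratio of normalized cDNA abundance over normalized input abundance.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for candidates that are rare in the input library.
    """
    cdna_rpm = reads_per_million(cdna_counts)
    input_rpm = reads_per_million(input_counts)
    return {
        cand: (cdna_rpm.get(cand, 0.0) + pseudocount) / (rpm + pseudocount)
        for cand, rpm in input_rpm.items()
    }


# Hypothetical counts for two candidate fragments.
input_counts = {"fragment_a": 100, "fragment_b": 100}
cdna_counts = {"fragment_a": 300, "fragment_b": 100}
enrichment = fold_enrichment(cdna_counts, input_counts, pseudocount=0.0)
```

In this toy example, a value above 1 indicates that a candidate is over-represented among the transcripts relative to the input, and a value below 1 indicates that it is under-represented.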

[0011] Furthermore, the present invention provides transcriptional regulatory elements identified by the method described herein, which include, but are not limited to, SEQ ID NO: 1-1500. Sequences having at least 50% identity with any of SEQ ID NO: 1-1500 are also encompassed by the present invention.

[0012] The candidate nucleic acid molecules used for screening can be obtained by any means and from any sources. They can be either DNA or RNA molecules (either double-stranded (ds) or single-stranded (ss) or both, i.e., partially single-stranded or double-stranded, or vice versa) and can be either naturally occurring or artificial. A skilled person will readily recognize that the present invention is applicable to millions of candidate fragments from any arbitrary sources of DNA or RNA in parallel.

[0013] In accordance with one aspect of the invention, a method of determining the level of transcriptional regulatory activity of nucleic acid molecules is provided. The method comprises:

(a) providing a candidate nucleic acid molecule,

(b) inserting the molecule into a vector downstream of a promoter,

(c) subjecting the vector to conditions allowing transcription from the promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d), and

(f) determining the level of transcriptional regulatory activity of the nucleic acid molecule based on the quantification.

[0014] In accordance with another aspect of the invention, a method of optimizing a transcriptional regulatory element is provided. The method comprises:

(a) providing candidate nucleic acid molecules comprising a transcriptional regulatory element and mutants thereof,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of a promoter,

(c) subjecting the library to conditions allowing transcription from the promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d),

(f) determining the level of transcriptional regulatory activity of the candidate nucleic acid molecules based on the quantification, and

(g) selecting at least one candidate nucleic acid molecule which has higher transcriptional regulatory activity than the transcriptional regulatory element.
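Step (g) of the optimization method can be sketched as a simple comparison against the starting element. The sketch assumes that an activity value (for example, a cDNA-over-input enrichment) has already been determined for the reference element and for each mutant; the identifiers and values below are hypothetical.

```python
def select_improved(activities, reference_id):
    """Return (candidate, activity) pairs that exceed the reference element,
    ordered from strongest to weakest."""
    reference_activity = activities[reference_id]
    improved = [
        (cand, activity)
        for cand, activity in activities.items()
        if cand != reference_id and activity > reference_activity
    ]
    return sorted(improved, key=lambda pair: pair[1], reverse=True)


# Hypothetical activities for a reference element and three mutants.
activities = {"reference": 2.0, "mutant_1": 3.5, "mutant_2": 1.0, "mutant_3": 2.5}
selected = select_improved(activities, "reference")
```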

[0015] Furthermore, transcriptional regulatory elements optimized by the methods described herein are included in the scope of the present invention.

[0016] In a further aspect, the present invention provides a method of providing a transcription or expression vector, which comprises

(a) optionally providing candidate nucleic acid molecules,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of a promoter,

(c) subjecting the library to conditions allowing transcription from the promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d),

(f) determining the presence of one or more transcriptional regulatory elements from the candidate nucleic acid molecules based on the quantification, and

(g) constructing a transcription or expression vector comprising the transcriptional regulatory element and the promoter.

[0017] The present invention therefore provides a transcription or expression vector comprising the identified sequences as transcriptional regulatory elements, including any of SEQ ID NO: 1-1500 or sequences having at least 50% identity with any of SEQ ID NO: 1-1500.

Brief Description of the Drawings

[0018] Figure 1 shows the principle of STARR-seq - a genome-wide quantitative enhancer assay.

[0019] Figure 2 shows the distribution of STARR-seq enrichments for putative enhancer regions in S2 cells (Fig. 2a) and OSC cells (Fig. 2b).

[0020] Figure 3 is a view of the srp locus showing STARR-seq cDNA (blue) and input (grey) read densities using the UCSC genome browser (UCSC GB).

[0021] Figure 4 shows that STARR-seq enrichments are linearly correlated with luciferase activity of individually tested peaks (whiskers indicate the min. and max. of two independent biological replicates).

[0022] Figure 5a shows that there is strong linear correlation between STARR-seq and luciferase assay for sequences that occur upstream or within transcribed regions in their endogenous genomic contexts.

[0023] Figure 5b shows that cDNA fragments are not substantially depleted by transcript-destabilizing elements.

[0024] Figure 6 shows the reproducibility of STARR-seq in significantly enriched putative STARR-seq regions and genome-wide.

[0025] Figure 7 shows (A) fragment size distribution within two STARR-seq input libraries with median fragment sizes of 588 bp and 642 bp, respectively. (B) Cumulative and non-cumulative coverage of sequence fragments assessed on the non-repetitive euchromatic portion of the Drosophila genome. More than 90% of the genome is covered by more than 10 independent fragments in both STARR-seq input libraries. (C) The GC-content over the full genome was determined in non-overlapping 25 bp windows and then binned into 10 groups ranging from low (1) to high (10) GC-content. Each boxplot (10th, 25th, 50th, 75th, 90th percentiles) shows the read depth distribution of all single positions within the respective region. Both STARR-seq input libraries and an input library from a twist ChIP-seq experiment [Bardet AF et al., 2011] are shown. (D) Read depth distribution across STARR-seq and ChIP-seq input libraries in different genomic regions.
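The GC-content binning described for Figure 7(C) can be sketched as follows. The description does not state how the 10 groups were delimited, so equal-width bins over the range 0 to 1 are an assumption of this sketch, as are the function names.

```python
def gc_windows(sequence, window=25):
    """GC fraction of each complete, non-overlapping window of the sequence."""
    sequence = sequence.upper()
    fractions = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window]
        fractions.append(sum(base in "GC" for base in chunk) / window)
    return fractions


def gc_bin(fraction, n_bins=10):
    """Map a GC fraction in [0, 1] to a group from 1 (low) to n_bins (high).

    Equal-width bins are assumed here; quantile-based groups would be an
    equally plausible reading of the figure description.
    """
    return min(int(fraction * n_bins), n_bins - 1) + 1


# Toy sequence: three complete 25 bp windows plus two trailing bases that are ignored.
toy_sequence = "G" * 25 + "A" * 25 + "G" * 11 + "A" * 14 + "AT"
bins = [gc_bin(f) for f in gc_windows(toy_sequence)]
```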

[0026] Figure 8 shows (A) Distribution of STARR-seq enrichments for putative enhancer regions in S2 cells and (B) OSC within a range of 30-fold.

[0027] Figure 9 shows (A) Genomic regions with and without significant STARR-seq enrichment located in a 2 kb upstream and 2 kb downstream window around the TSS were tested for their enhancer potential in a luciferase assay. Both up- and downstream tested fragments are indicated and independent linear fits were computed for both data sets, as indicated by the R2, the slope and intercept values, as well as the dotted lines. (B) STARR-seq (cDNA) fragments are not substantially depleted by transcript-destabilizing elements. Even at genomic sites that contain annotated microRNA hairpins, 3' gene ends, splice acceptors, or splice donors, only slightly more fragments were anti-sense to these elements compared to sense. A similar result was obtained for regions that contained at least 5 poly-adenylation motifs (AATAAA) or 3 seed sites for the microRNAs bantam, miR-14, miR-34, miR-2a, or miR-2b, which are all highly expressed in S2 cells. Also genome-wide, we observed that all significant peaks had equal contribution from sense and anti-sense fragments, with no significant deviations. (C) Reproducibility of STARR-seq in significantly enriched putative STARR-seq regions and genome-wide. Read counts are normalized to 1 million mapped reads in each library.

[0028] Figure 10 shows genomic distribution of S2 enhancers (A,B) and OSC enhancers (C,D). Panels (B) and (D) show the enrichment or depletion of peaks in the respective regions for S2-peaks and OSC-peaks, respectively.

[0029] Figure 11 shows reproducibility of RNA-seq between two biological replicates in S2 (A) and OSC (B) cells. A pseudo-count of 0.001 is added to each RPKM value prior to log2 scaling. The first panel shows the correlation between the RPKM values for each gene, while the second shows the correlation on all exonic nucleotides. (C) Density plot of RPKM distributions between and across cell-types.
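The scaling described for Figure 11 can be illustrated with a short sketch. The RPKM formula used is the conventional definition (reads per kilobase of transcript per million mapped reads); the function names and the example numbers are assumptions for illustration only.

```python
import math


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_rpkm(value, pseudo_count=0.001):
    """log2 scaling after adding the pseudo-count, so that genes with
    zero reads still yield a finite value."""
    return math.log2(value + pseudo_count)


# Hypothetical gene: 100 reads, 1 kb long, library of one million mapped reads.
expressed = rpkm(100, 1000, 1_000_000)
silent = log2_rpkm(0.0)
```

Without the pseudo-count, a gene with an RPKM of zero could not be placed on a log2 axis at all.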

[0030] Figure 12 shows reproducibility of DHS-seq between two biological replicates in S2 (A) and OSC (B). The first panel shows correlation of DHS data on DHS peak regions called by MACS (5% FDR) while the second plot shows the genome-wide correlation.

[0031] Figure 13 shows (A) length distribution of DHS open regions called by MACS. (B) Number of DNase I accessible sites with regard to the closest TSS.

[0032] Figure 14 shows (A) peak ranks of STARR-seq elements in S2 cells and OSC plotted against the expression of the respective target genes as measured by RNA-seq RPKM values. Boxplots show the 10th, 25th, 50th, 75th, and 90th percentile of the data with the median values drawn as a white line within the box. The grey box is a control built from 500 randomly chosen genomic locations reflecting the same genomic feature composition (intronic, intergenic, etc.) as the STARR-seq enhancer elements. (B) DHS-seq enrichment values are plotted against the STARR-seq peak ranks and represented as boxplots. (C) and (D) show the medians for panels (A) and (B) with both cell types plotted on top.

[0033] Figure 15 shows a UCSC genome browser view of HOX genes. STARR-seq enhancers in closed chromatin are shown, which are marked by H3K4me1. (A) Abd-A, (B) Abd-B and (C) Antp.

[0034] Figure 16 shows (A) STARR-seq enhancers in open and closed chromatin tested for their enhancer potential in a luciferase assay. Both open and closed enhancers are indicated and independent linear fits were computed, as indicated by the R2, the slope and intercept values, as well as the solid lines. (B) Venn diagram showing DHS open regions with no STARR-seq enrichment. Promoter: H3K4me3, Repressed: H3K9me3 or H3K27me3, Enhancers: H3K4me1, Insulator: CP190 or CTCF. (C) UCSC genome browser view of the Fmr1 locus showing overlap of open chromatin not exhibiting STARR-seq enhancers and insulators.

[0035] Figure 17 shows STARR-seq enrichment correlates strongly with the luciferase activity of the respective tested putative enhancer sequence fragments covering a wide range of enhancer strength. In OSC, an enrichment of 1.4, 1.7, and 0.7-fold over input is the minimum found in our significant enhancer elements (p<0.001, FDR=0.8%). Error bars show the maximum and the minimum luciferase measurement from two independent replicates with the median plotted as dot. Pearson correlation coefficients and R2 values are indicated above the graphs. Fitted linear regressions are plotted on top of the data points.

Figure 18 shows a UCSC genome browser view of the shn locus, showing luciferase validations for cell type-specific enhancers, which are open in both cell types.
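The statistics named in the description of Figure 17 (Pearson correlation coefficient, R2, and a fitted linear regression) can be computed as in the following sketch. It uses ordinary least squares on paired measurements; whether the original analysis used raw or log-transformed values is not stated here, and the data points below are hypothetical.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys on xs.

    Returns (slope, intercept, pearson_r, r_squared). For a simple linear
    regression, R2 equals the square of the Pearson correlation coefficient.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired measurements")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r, pearson_r ** 2


# Hypothetical paired values: enrichment from sequencing vs. luciferase signal.
starr_enrichment = [1.0, 2.0, 3.0, 4.0]
luciferase_signal = [3.0, 5.0, 7.0, 9.0]
slope, intercept, pearson_r, r_squared = linear_fit(starr_enrichment, luciferase_signal)
```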

Items of the Invention

The present invention can also be characterized by the following items:

1. A method of identifying a transcriptional regulatory element which regulates a promoter comprising:

(a) optionally providing candidate nucleic acid molecules,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of the promoter,

(c) subjecting the library to conditions allowing transcription from the promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d), and

(f) determining the presence of one or more transcriptional regulatory elements from the candidate nucleic acid molecules based on the quantification.

2. The method of item 1, wherein the transcriptional regulatory element is an enhancer.

3. The method of item 1 or 2, wherein determination in (f) comprises comparing abundance of the candidate nucleic acid molecule in an input library and the cDNA.

4. The method of any one of the preceding items, wherein a candidate nucleic acid molecule having enhancer activity transcribes itself, thereby increasing the number of its own transcripts.

5. The method of any one of the preceding items, wherein the abundance of a candidate nucleic acid molecule is a read-out for the enhancer activity of the candidate nucleic acid molecule.

6. The method of any one of the preceding items, wherein direct coupling of the candidate nucleic acid molecule to the transcriptional readout of its potential enhancer activity allows the identification of an enhancer element.

7. The method of item 1, wherein the transcriptional regulatory element is a repressor.

8. The method of item 1, wherein determination in (f) comprises comparing the lack of the candidate nucleic acid molecule in an input library and the cDNA.

9. The method of item 7 or 8, wherein a candidate nucleic acid molecule having repressor activity represses transcription of itself, thereby decreasing the number of its own transcripts.

10. The method of any one of items 7 to 9, wherein the lack of a candidate nucleic acid molecule is a read-out for the repressor activity of the candidate nucleic acid molecule.

11. The method of any one of items 7 to 10, wherein direct coupling of the candidate nucleic acid molecule to the transcriptional readout of its potential repressor activity allows the identification of a repressor element.

12. The method of any one of the preceding items, wherein the insertion of the candidate nucleic acid molecule into the vector places the candidate nucleic acid molecule on the transcript produced in step (c).

13. The method of any one of the preceding items, wherein the quantifying step (e) is carried out by next generation sequencing or microarray hybridization.

14. The method of any one of the preceding items, wherein the candidate nucleic acid molecule is obtained from eukaryote, prokaryote, or virus.

15. The method of any one of the preceding items, wherein the candidate nucleic acid molecules are obtained from cDNA, bacterial artificial chromosome, yeast artificial chromosome, bacterial vectors or eukaryotic vectors.

16. The method of any one of the preceding items, wherein the candidate nucleic acid molecule is naturally occurring or artificial DNA or RNA.

17. The method of any one of the preceding items, wherein the vector comprises a polyadenylation site which is downstream of the candidate nucleic acid molecule.

18. The method of any one of the preceding items, wherein the vector is linear or circular.

19. The method of any one of the preceding items, wherein linkers are added to both ends of the nucleic acid molecule before inserting it into the vector.

20. The method of item 19, wherein the linkers are made compatible for bacterial recombination.

21. The method of any one of the preceding items, wherein step (c) takes place in vitro.

22. The method of any one of the preceding items, wherein step (c) takes place in a host or host cell.

23. The method of any one of the preceding items, wherein reverse transcription of step (d) is coupled with an amplification step (RT-PCR).

24. The method of item 22, wherein the host cell is a prokaryotic or eukaryotic host cell.

25. The method of any one of the preceding items, wherein the promoter is a core promoter.

26. The method of any one of the preceding items, wherein the promoter is a naturally occurring or artificial promoter.

27. The method of any one of the preceding items, wherein the promoter is a cell-type specific promoter.

28. The method of any one of the preceding items, wherein the reporter library comprises at least 10^7 members of nucleic acid molecules.

29. A transcriptional regulatory element identified according to any one of items 1-28.

30. The transcriptional regulatory element in item 29 comprising any of SEQ ID NO: 1-1500.

31. A transcriptional regulatory element which is at least 50% identical with any of the sequences as recited in SEQ ID NO: 1-1500.

32. A vector comprising a transcriptional regulatory element of any one of items 29 to 31.

33. The vector of item 32 further comprising a nucleic acid molecule of interest, wherein expression of said nucleic acid molecule is driven by a promoter and is additionally regulated by a transcriptional regulatory element of any one of items 29 to 31 or a vector of item 32.

34. A host or host cell comprising a transcriptional regulatory element of any one of items 29 to 31 or a vector of item 32.

35. A method for the production of a polypeptide of interest comprising culturing a host cell of item 34 under conditions allowing expression of said polypeptide and recovering said polypeptide.

36. A method of determining the level of transcriptional regulatory activity of nucleic acid molecules comprising

(a) providing a candidate nucleic acid molecule,

(b) inserting the molecule into a vector downstream of a promoter,

(c) subjecting the vector to conditions allowing transcription from the promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d), and

(f) determining the level of transcriptional regulatory activity of the nucleic acid molecule based on the quantification.

37. A method of optimizing a transcriptional regulatory element comprising

(a) providing candidate nucleic acid molecules comprising a transcriptional regulatory element and mutants thereof,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of a promoter,

(c) subjecting the library to conditions allowing transcription from the promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d),

(f) determining the level of transcriptional regulatory activity of the candidate nucleic acid molecules based on the quantification, and

(g) selecting at least one candidate nucleic acid molecule which has higher transcriptional regulatory activity than the transcriptional regulatory element.

Detailed Description of the Invention

Gene expression is regulated by genomic enhancers, the identification of which has remained challenging and depended on indirect measures of activity. The present inventors provide STARR-s e q (self-transcribing-active-regulatory-region-sequencing) which allows directly and quantitatively assessing transcriptional regulation, thereby the identification enhancer and/or repressor activity for millions of candidates from arbitrary sources of nucleic acid molecules, and therefore, enabling screens across entire genomes including unknown genomes. Exemplarily applied to the Drosophila genome in two cell-types, STARR-seq identifies thousands of cell-type specific enhancers across a broad continuum of strengths, linking differential gene expression to differences in enhancer activity, and revealing several independent enhancers for many - even ubiquitously expressed - genes. Endogenously, most enhancers display active chromatin marks but one-third carry general and repressive marks; conversely, some are inactive in one cell-type despite active marks in both, suggesting regulation at the level of both chromatin structure and regulator-DNA binding. However, the means and the methods of the present invention allow the identification of even otherwise "hidden" enhancers. Also, the means and the methods of the present invention allow the identification of enhancers that become active when induced, i.e., inducible enhancers. Induction of enhancers may be caused , e.g . by a chemical or biological compound. Non-limiting examples of biological and chemical compounds are hormones, signal transduction molecules such as cytokines, interferons, interleukins, cAMP, neurotransmitters, hormones, pathogens, such as viruses or bacteria. Of course, as described and taught herein, the present invention also allows the identification of repressors by applying the means and methods described herein. 
As with inducible enhancers, the means and methods of the present invention allow the identification of repressors that become active as repressors when induced, i.e., inducible repressors. Inducing agents may be selected from the non-limiting examples described above in the context of inducible enhancers. In fact, it is known that both biological and chemical agents can either induce or repress enhancer activity. The identification of inducible enhancers or repressors may be particularly useful in analyzing the effects of a medicament or a treatment regimen on gene expression in order to find potentially advantageous or disadvantageous effects on gene expression. Such an approach may be particularly useful when, e.g., chemotherapeutic agents are developed that may, because of their nature, influence gene expression or chromatin structure.

Where an inducible enhancer or repressor is to be isolated, the source of candidate nucleic acids, such as mammalian cells or any other source as described herein, is brought into contact with, or treated with, a biological and/or chemical substance as described herein. Accordingly, in a preferred embodiment the methods described herein comprise a step of bringing a source of candidate nucleic acids into contact with, or treating it with, a biological and/or chemical substance as described herein, preferably prior to all subsequent steps of the methods of the present invention.

It is thus for the first time possible to identify or obtain or improve transcriptional regulatory elements in a direct and quantitative manner, even on a genome-wide scale. Without being limited to particular theory, the invention is based on the finding that the transcriptional regulatory element, when introduced downstream of a promoter which it regulates as enhancer, is transcribed more frequently such that there is a quantitative relationship between the strength of the transcriptional regulatory element and the number of the transcripts. This finding allows a large-scale genome-wide assay to identify or select transcriptional regulatory elements and assess the enhancer activity at quantitative levels.

[0039] In case of the identification of a repressor, without being limited to particular theory, the invention is based on the finding that the transcriptional regulatory element, when introduced downstream of a promoter which it regulates as a repressor, is less frequently transcribed such that there is a quantitative relationship between the strength of the transcriptional regulatory element and the number of the transcripts. This finding allows a large-scale genome-wide assay to identify or select transcriptional regulatory elements and assess the repressor activity at quantitative levels.

[0040] Specifically, the present inventors have created a reporter library in which candidate nucleic acid molecules are part of the transcript driven by a preferably pre-selected promoter such that active enhancers transcribe themselves while inactive fragments do not, i.e., the abundance of each enhancer fragment in the RNA population is a read-out for the candidate nucleic acid molecule's enhancer activity or the lack of each repressor fragment in the RNA population is a read-out for the candidate nucleic acid molecule's repressor activity.

[0041] Also, the reporter library can be used to isolate or obtain or improve enhancers or repressors. Namely, for that purpose the candidate nucleic acid molecules are not or at least only in low amounts part of the transcript driven by a preferably pre-selected promoter such that active repressors do not or at least do not essentially transcribe themselves while active fragments (see the above described approach to isolate or obtain or improve enhancers) do not, i.e., the lack of each repressor fragment in the RNA population is a read-out for the candidate nucleic acid molecule's repressor activity.

[0042] In brief, the claimed method can be performed as follows: In a first 'library preparation step', candidate nucleic acid molecules are cloned into an acceptor site between a preferably pre-selected promoter, preferably containing a transcription start site, and preferably a poly-adenylation site (so they are part of a transcript), i.e. candidate nucleic acid molecules are inserted downstream of the preferably pre-selected promoter. This set-up is sometimes referred to herein as "reporter construct" or "reporter library". Preferably, the cloning protocol for the candidate nucleic acid molecules that is applied by the present inventors allows the cloning of random fragments, e.g. sheared DNA as obtained by ultrasound, at a very large scale (e.g. sheared BAC DNA with several hundred kb or an entire eukaryotic genome); see appended Examples.

[0043] In a second 'screening step', the library is introduced into the cells of interest (e.g. by electroporation), RNA is isolated, and the reporter RNA is selectively amplified and made ready for any quantification of transcript abundance (in case of enhancers) or lack of transcripts (in case of repressors), such as next-generation sequencing (NGS) or microarray hybridization. Alternatively, the library can be transcribed in vitro, RNA is isolated, and the reporter RNA is selectively amplified and made ready for next-generation sequencing (NGS) or microarray hybridization. However, it is also possible to refrain from NGS by sequencing RNA as described herein.

[0044] In a last 'data analysis step', the sequenced nucleic acid molecules are quantified and the enhancer or repressor activity, respectively, is determined from the abundance or lack, respectively, of each fragment in the RNA pool. The quantification can either be done on cDNA level or RNA level. When RNA that is obtained by transcription as described herein is meant, "RNA" means preferably mRNA.
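The data analysis step can be illustrated with a minimal sketch in Python. It assumes that each sequenced read has already been assigned to a candidate fragment and that read counts are available both for the RNA pool and for the plasmid library that was introduced into the cells. Comparing the RNA pool with the library, the pseudocount, and the function names `reads_per_million` and `enrichment` are assumptions made for illustration and are not taken from the description.

```python
from typing import Dict


def reads_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the pool contains no reads")
    return {frag: 1e6 * n / total for frag, n in counts.items()}


def enrichment(rna: Dict[str, int],
               library: Dict[str, int],
               pseudocount: float = 1.0) -> Dict[str, float]:
    """Ratio of normalized RNA abundance to normalized library abundance.

    Assumption: values above 1 suggest enhancer-like behaviour and values
    below 1 suggest repressor-like behaviour. The pseudocount (hypothetical
    choice) keeps the ratio finite for fragments absent from the RNA pool.
    """
    rna_rpm = reads_per_million(rna)
    lib_rpm = reads_per_million(library)
    return {
        frag: (rna_rpm.get(frag, 0.0) + pseudocount) / (lib_rpm[frag] + pseudocount)
        for frag in library
    }


if __name__ == "__main__":
    rna_counts = {"frag_A": 900, "frag_B": 100}
    library_counts = {"frag_A": 500, "frag_B": 500}
    for frag, value in enrichment(rna_counts, library_counts).items():
        print(frag, round(value, 2))
```

Normalizing both pools to the same depth before taking the ratio is one simple way to keep differences in sequencing depth from being mistaken for differences in activity.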

[0045] The present method makes use of the fact that transcriptional regulatory elements function independently of their position relative to their target gene and places candidate sequences downstream of the transcription start site (TSS). It has been surprisingly found that active sequences are able to enhance their own transcription such that their activity is reflected quantitatively by their abundance among transcribed RNA. This direct coupling of candidate sequences to their enhancer activity allows the parallel assessment of millions of fragments from arbitrary sources of DNA in batch.

[0046] Accordingly, in a first aspect, the present invention provides a method for identifying a putative nucleic acid sequence which acts as a transcriptional regulatory element for a given promoter.

As defined herein, a transcriptional regulatory element is any element involved in regulating the transcription of a nucleic acid molecule such as a gene or a target gene. The transcriptional regulatory element is a nucleic acid. It may act in "cis" or "trans"; preferably it acts in "cis", i.e. it activates expression of genes located on the same nucleic acid molecule, e.g. a chromosome or plasmid, where the transcriptional regulatory element is located. There are no limits on the distance at which the transcriptional regulatory element exerts its effect, e.g., it may act over a distance of 1 bp to more than ~1000 kb, as observed for naturally occurring regulatory elements. The transcriptional regulatory element is preferably a cis-acting transcriptional regulatory element or a trans-acting transcriptional regulatory element. "Trans" means that the transcriptional regulatory element acts on the expression of genes located on a nucleic acid molecule, e.g. a chromosome, that is different from the nucleic acid molecule where the transcriptional regulatory element is located. The transcriptional regulatory element is preferably an enhancer or repressor or may even act as both enhancer and repressor. The enhancer may be an inducible enhancer as explained herein. The repressor may be an inducible repressor as also explained herein. The nucleic acid molecule regulated by a transcriptional regulatory element does not necessarily have to encode a functional peptide or polypeptide, but it is not excluded that the nucleic acid molecule can encode a functional peptide or polypeptide.

An enhancer (which may also be called an activator herein) is defined herein as any nucleic acid molecule that increases transcription of a nucleic acid molecule when functionally linked to a promoter, regardless of its relative position.

A repressor (sometimes also called a silencer herein) is defined as any nucleic acid molecule which inhibits transcription when functionally linked to a promoter, regardless of its relative position. "Functionally linked" is to be understood broadly and means that there is an influential relationship between two or more nucleotide regions.

The method of identifying a transcriptional regulatory sequence comprises:

(a) optionally providing candidate nucleic acid molecules,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of a promoter,

(c) subjecting the library to conditions allowing transcription from the promoter,

(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d), and

(f) determining the presence of one or more transcriptional regulatory elements from the candidate nucleic acid molecules based on the quantification.
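Step (f) can be sketched as a simple decision rule. The sketch below assumes that step (e) has produced, for each candidate, a relative transcript level in which 1.0 corresponds to a reference level (for example, the level expected from the promoter alone). The reference level, the two-fold cutoff and the function name `call_elements` are hypothetical choices for illustration; the description does not prescribe a particular threshold.

```python
from typing import Dict


def call_elements(relative_level: Dict[str, float],
                  min_fold: float = 2.0) -> Dict[str, str]:
    """Label each candidate from its relative transcript level.

    Candidates at or above `min_fold` are labelled enhancers, candidates at
    or below 1/min_fold are labelled repressors, all others get no call.
    The cutoff is an illustrative assumption.
    """
    if min_fold <= 1.0:
        raise ValueError("min_fold must be greater than 1")
    calls = {}
    for candidate, level in relative_level.items():
        if level >= min_fold:
            calls[candidate] = "enhancer"
        elif level <= 1.0 / min_fold:
            calls[candidate] = "repressor"
        else:
            calls[candidate] = "no call"
    return calls


if __name__ == "__main__":
    levels = {"cand_1": 4.0, "cand_2": 0.25, "cand_3": 1.1}
    print(call_elements(levels))
```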

Outline of library generation

[0047] In brief, the generation of a library is preferably as follows: in a first step, standard DNA linkers are ligated to both ends of the (random) candidate nucleic acid molecule fragments. In a second step, the linkers are extended to make them compatible with established bacterial recombination technologies (e.g., Gateway or In-Fusion). In a third step, all fragments are cloned in batch into an entry plasmid using preferably bacterial recombination, thus avoiding restriction digestion and preserving the original nucleic acid molecules. The resulting library can, e.g., be amplified in E. coli. The resulting library, however, does not need to be transformed or transfected into a host cell, but can be used as it is for being transcribed. Thus, transcription is done in vivo or in vitro. "In vitro" means in a system or environment free of intact cells such as the host cells described herein. Accordingly, the library is transcribed and the transcripts are either directly quantified or are reverse transcribed and then quantified.

Step (a) Providing Candidate Nucleic Acid Molecules

[0048] In accordance with the present invention, the method comprises optionally providing candidate nucleic acid molecules (sometimes referred to as "candidate fragments" or simply "candidates" herein) for the screening or identification of a putative transcription regulatory element.

A "candidate" is a nucleic acid molecule that has or is suspected or assumed to have potential enhancer or repressor activity, respectively, and is preferably subjected to the methods of the present invention, for example, with the aim of identifying as to whether said candidate has enhancer or repressor activity, respectively. A nucleic acid molecule that is subjected to the methods of the present invention includes fragments of nucleic acid molecules of various length, preferably as described herein, that originate from the sources as described herein.

Preferably, a plurality of candidates is provided, such as at least 2, 3, 4, 5, 10, 50, 100, 200, 300, 500, 1000 or more members. Since the present invention is suitable for genome-wide identification of transcriptional regulatory elements, as discussed earlier, the number of candidate nucleic acid molecules may be more than 10^2, 10^3, 10^4, 10^5, 10^6, or 10^7 members.

[0049] The size of the candidate nucleic acid molecules may be between 10 and 10^4 bp, such as at least 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 120, 130, 140, 150, 160, 170, 180, 190, 200, 500, 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000 bp long. Preferably, the candidate nucleic acid molecules are between 100 bp and 10 kb long, such as 150-1000 bp long.
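As a minimal illustration of applying such a size window to a set of fragments, the following Python sketch keeps only sequences whose length falls within a chosen range. The defaults mirror the 150-1000 bp example given above; the function name `select_by_size` is hypothetical, and in practice size selection would typically be carried out physically (for example on a gel) rather than in software.

```python
from typing import Iterable, List


def select_by_size(fragments: Iterable[str],
                   min_bp: int = 150,
                   max_bp: int = 1000) -> List[str]:
    """Return the fragments whose length lies in [min_bp, max_bp]."""
    if min_bp > max_bp:
        raise ValueError("min_bp must not exceed max_bp")
    return [seq for seq in fragments if min_bp <= len(seq) <= max_bp]


if __name__ == "__main__":
    pool = ["A" * 100, "C" * 150, "G" * 1000, "T" * 1001]
    print([len(seq) for seq in select_by_size(pool)])
```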

[0050] The nucleic acid molecules may be DNA or RNA, including dsDNA, ssDNA, dsRNA, ssRNA and/or combinations thereof, i.e. hybrid or chimeric DNA/RNA molecules. The source of the candidate nucleic acid molecules is not limited in any way; it can be naturally-occurring or artificial. Artificial nucleic acid molecules can be derived from naturally-occurring sequences by addition, substitution and deletion of one or more nucleic acids. By "derived from" is meant that the nucleic acid molecule was either made or designed from a given nucleic acid molecule. Preferably, the candidates are prepared from genomic DNA extracted from any organism, such as mammals including, e.g., humans, horse, sheep, cow, pig, dog, mouse, rat, rabbit, or cell lines or tissue including healthy and/or diseased tissue from any of the aforementioned mammals; archaea, prokaryotes such as gram-positive or gram-negative bacteria or eukaryotes including plants, insects, spiders, fungi, yeasts, algae, or it can be extracted from viruses, such as DNA or RNA viruses.

To demonstrate that STARR-seq is, for example, applicable to human cells, the present inventors screened pools of BACs each containing ~150 kb of human genomic DNA (~1 Mb total) in HeLa cells. This resulted in strongly enriched peaks that could be validated using luciferase assays, while regions that were marked by typical enhancer-associated chromatin marks but had no STARR-seq signal did not function. For the screen in HeLa cells, a modified screening vector based on the pGL4.10 backbone (Promega) was used, to which an adapted STARR-seq screening cassette containing the Super Core Promoter 1 (SCP1), a synthetic intron (pIRESpuro3, Clontech), sgGFP (Qbiogene Inc), a ccdB suicide gene flanked by homology arms, and the pGL3's SV40 late polyA-signal was added. This proof-of-principle demonstrates that STARR-seq is applicable to human cells and to human genomic DNA. Luciferase tests showed that STARR-seq displays high specificity and sensitivity (see Arnold (2013), Science 339, 1074-1077).

Preferably, the nucleic acid molecules are obtained from cDNA or genomic libraries. In some preferred embodiments, the candidates are obtained from cDNA, bacterial artificial chromosome (BAC), yeast artificial chromosome (YAC), bacterial vectors or eukaryotic vectors. The candidates may also be obtained from healthy or diseased tissues or cells. They may be, for example, from cells defective in cellular processes, such as tumor suppression, cell cycle control, or cell surface adhesion. The nucleic acids may also be from cells infected with pathogenic organisms, for example, cells infected with viruses or bacteria. Particularly preferred nucleic acid molecules are obtained from bacterial, fungal, viral and mammalian DNA or RNA. In further embodiments, the candidates are obtained from cells or cell lines, such as S2, OSC, BG3, CI.8, Kc167, embryonic stem cells (ESCs), neuronal precursors (NPs), HeLa, 3T3, mbn-2, CHO. As an alternative, randomized nucleic acid sequences are used. Randomized nucleic acid sequences can be formed by any number of methods. For example, automated DNA synthesis can be used to generate multiple random sequences by providing mixtures of the different nucleic acid residues at each coupling step.
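The idea of providing a mixture of residues at each coupling step can be mimicked in silico. The following Python sketch draws each position of each sequence independently from a residue mixture. The equimolar default mixture, the seed parameter and the function name `random_candidates` are assumptions for illustration only and do not describe any particular synthesizer.

```python
import random
from typing import Dict, List, Optional


def random_candidates(n: int,
                      length: int,
                      mix: Optional[Dict[str, float]] = None,
                      seed: Optional[int] = None) -> List[str]:
    """Generate n random DNA sequences of the given length.

    `mix` gives the relative proportion of each residue offered at every
    coupling step (assumed equimolar if omitted). `seed` makes the draw
    reproducible.
    """
    if mix is None:
        mix = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    rng = random.Random(seed)
    residues = list(mix)
    weights = [mix[r] for r in residues]
    return [
        "".join(rng.choices(residues, weights=weights, k=length))
        for _ in range(n)
    ]


if __name__ == "__main__":
    for seq in random_candidates(3, 20, seed=1):
        print(seq)
```

Skewing the weights in `mix` corresponds to a biased residue mixture, which could be used to enrich the library for a particular base composition.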

For example, the means and methods of the present invention can thus be used to screen different sources of DNA, including the genomic DNA of closely related species in defined cells, to assess the functional consequences of sequence mutations/changes. This includes the screening of DNA from patients and/or specific tissue samples (e.g. tumor samples) that harbor germline or somatic mutations potentially related to the disease, such that it will be applicable to genetic diseases, somatic mutations, or cancer genomes. It is assumed that the functional assessment of these mutations will provide insights into potential disease causes and allow the discovery of specific biomarkers.

Step (b) Preparation of Vector

The method provided by the present invention comprises the preparation of a reporter library of the candidate nucleic acid molecules. A "library" refers to a plurality of nucleic acids in the form of vectors. A reporter library is formed by inserting the candidate nucleic acid molecules downstream of a promoter, preferably a pre-selected promoter, such that when the library is subjected to suitable conditions, transcription of the candidate nucleic acid molecules will take place. As such, the "reporter" preferably contains the candidate nucleic acid molecule itself. Namely, nucleic acid molecules that may enhance their own transcription, such that their activity is reflected quantitatively by their abundance among RNA, then act as "reporter". This direct coupling of candidate nucleic acid molecules to their enhancer activity allows the parallel assessment of millions of fragments from arbitrary sources of DNA in a single assay.

Conversely, nucleic acid molecules that may repress their own transcription such that their activity is reflected quantitatively by their lack among RNA act then as "reporter". This direct coupling of candidate nucleic acid molecules to their repressor activity then allows the parallel assessment of millions of fragments from arbitrary sources of DNA in a single assay.

A promoter is defined as an array of nucleic acids that directs the transcription of a nucleic acid molecule, e.g., a gene, and includes necessary nucleic acid sequences near the start site of transcription. In the present invention, the promoter directs the transcription of the candidate nucleic acid molecules into RNA. Preferably, the promoter used in the present invention is a core promoter (also known as minimal promoter). In general, a core promoter contains a TATA box and a GC rich region associated with a CAAT box. These elements act to bind RNA polymerase II to the promoter and assist the polymerase in locating the RNA initiation site. Some promoters do not have a TATA box or CAAT box but instead contain an initiator element that encompasses the transcription initiation site. A core promoter is the minimal sequence required to direct transcription initiation. The selection of a suitable promoter is within the skill of the artisan. For the screening or identification of enhancers, promoters that have low basal activity or are engineered to reduce basal activity are preferred. On the other hand, for the screening or identification of repressors, promoters that have high basal activity are preferred.

The promoter that is used in the embodiments of the invention is preferably a "preselected promoter", i.e. a promoter having a pre-selected transcriptional activity. In particular, if a transcriptional regulatory element is to be identified that should function as an enhancer, it is preferred to use (pre-select) a promoter whose transcriptional activity can be increased, preferably a promoter that has either a weak or essentially no transcriptional activity, more preferably a promoter that has no detectable transcriptional activity. Fragments of a size between about 100-2500 bp are cloned, for example, before a reporter gene and the activity of said reporter gene is measured. Conversely, if a transcriptional regulatory element is to be identified that is a repressor, a promoter is preselected that has transcriptional activity. A wide variety of promoters functional in viruses, prokaryotic cells and eukaryotic cells are known in the art and may be employed for the present invention. The selection of the promoter may depend on the host cell, if used, for the transcription step. The selection of a suitable promoter may also depend on the source of candidate nucleic acid sequences. Since specificity between transcriptional regulatory elements and promoters has been observed in some cases, the promoters and the candidate nucleic acid sequences may be derived from the same source.

[0052] The core promoter may include a TATA-box consensus element and an Initiator (INR), TFIIB recognition element (BRE), motif ten element (MTE), downstream promoter element (DPE), downstream core element (DCE), TCT motif or combinations thereof. Preferably, the promoter selected is minimally active when silent and is inducible. An inducible promoter is a promoter under environmental control.

[0053] In a preferred embodiment, the promoter is a cell type-specific promoter. Such promoters primarily drive expression in certain cell types or tissue types. Examples of promoters which can be used in the application include Hsp70, DSCP, SCP1, SCP2, CMV, CMV mini, 4.26 and EF1a.

To demonstrate that STARR-seq can be combined with any minimal or core promoter, the present inventors screened Drosophila S2 cells with a screening vector that contained the core promoter of the heat shock protein 70 (hsp70). The screen was reproducible across two biological replicates with independent transfections. It revealed highly enriched enhancer candidate peaks, which demonstrates that STARR-seq can be combined with any minimal or core promoter (see Arnold (2013), Science 339, 1074-1077 and its Supplementary materials).

[0054] The term "vector" refers to a carrier nucleic acid molecule which has the ability to incorporate and transcribe heterologous nucleic acid sequences in a host, host cell or in vitro. The vector may be an expression vector or transcription vector. Selection of appropriate expression or transcription vectors is within the knowledge of those skilled in the art. Many prokaryotic and eukaryotic expression vectors are commercially available. Examples of vectors used in the present invention include plasmids, viruses, phagemids, bacteriophages, retroviruses, cosmids or F-factors. Specific vectors may be used for specific host or host cell types. Numerous examples of vectors are known in the art and are commercially available (Sambrook and Russell, Molecular Cloning: A Laboratory Manual, 3rd edition (Jan. 15, 2001) Cold Spring Harbor Laboratory Press, ISBN: 0879695765). Examples of vectors commonly used with bacteria include the pET series (Novagen), pGEX series (GE Healthcare), pBAD series (Invitrogen). Examples of vectors in yeasts are the pPic series for Pichia (Invitrogen), the pKlac system from Kluyveromyces lactis (New England Biolabs), S. cerevisiae vectors (Patel et al. Biotechnol Lett. 2003 25(4):331-334) and the pYes system for S. cerevisiae (Invitrogen). Examples of vectors for use in fungi are the pBAR series (described in Pall et al. 1993. Fungal Genetics Newsletter 40: 59-61). The pIEx plasmid based system (Merck) or the baculovirus based system (Merck) are two examples of systems useful for insect cells. Examples of vectors for use in insect cells include the tetracycline regulated systems pTet and pTre, the adenovirus-based system Adeno-X, the retrovirus-based system Retro-X (Clontech) and the pcDNA vectors (Invitrogen). Examples of in vitro transcription vectors include pSP64 or pSP65. The vector may be naturally-occurring or artificial, linear or circular. The vector may also contain an intron.

Preferably, the vector is capable of replication in a host cell. As defined herein, a host cell includes any cultivatable cell that can be modified by the introduction of heterologous DNA. Heterologous DNA may be integrated into the host genome and replicated as part of the chromosomal DNA, or it may be DNA which replicates autonomously, as in the case of a plasmid. A host cell of the present invention includes prokaryotic cells and eukaryotic cells. Prokaryotes include gram negative or gram positive organisms, for example, E. coli or Bacilli. Suitable prokaryotic host cells for transformation include, for example, E. coli, Bacillus subtilis, Salmonella typhimurium, and various other species within the genera Pseudomonas, Streptomyces, and Staphylococcus. Eukaryotic cells include, but are not limited to, yeast cells, plant cells, fungal cells, insect cells (e.g., baculovirus), mammalian cells, and the cells of parasitic organisms, e.g., trypanosomes. As used herein, the term "yeast" includes not only yeast in a strict taxonomic sense, i.e., unicellular organisms, but also yeast-like multicellular fungi or filamentous fungi. Exemplary species include Kluyveromyces lactis, Schizosaccharomyces pombe, and Ustilago maydis, with Saccharomyces cerevisiae being preferred. Other yeast which can be used in practicing the present invention are Neurospora crassa, Aspergillus niger, Aspergillus nidulans, Pichia pastoris, Candida tropicalis, and Hansenula polymorpha. Mammalian host cell culture systems include, but are not limited to, established cell lines such as COS cells, L cells, 3T3 cells, Chinese hamster ovary (CHO) cells, embryonic stem cells, and HeLa cells. The host cells can be used in step (c) to allow transcription of the nucleic acid molecules. A skilled person will recognize that the vector used will depend on the host, host cell or the in vitro transcription system in which the vector will be used.

[0056] In a preferred embodiment, the vector comprises a polyadenylation site which is downstream of the candidate nucleic acid molecule. The site is used to terminate transcription and produce a truncated message. Linkers may be added to both ends of the nucleic acid molecule before inserting it into the vector. Linkers are generally short segments of DNA that promote recombinational joining of unrelated DNA fragments. The linkers are preferably made compatible for bacterial recombination by incorporating suitable restriction sites. In one preferred embodiment, the linkers serve as sequencing tags for next generation sequencing.

[0057] In other embodiments, the vector contains a screenable marker gene or reporter gene linked to the candidate nucleic acid molecules. However, the use of a screenable marker or reporter is not necessary. A screenable marker or reporter can be used to detect the presence of the vector in a host or host cell or to detect the transcript of the nucleic acid molecule. In accordance with the present invention, the marker may be any marker or marker gene that, upon integration of a vector containing the selectable marker into the host cell genome, permits the selection of a cell containing or expressing the marker gene. Suitable such selectable markers include, but are not limited to, a neomycin gene, a hypoxanthine phosphoribosyl transferase gene, a puromycin gene, a dihydroorotase gene, a glutamine synthetase gene, a histidine D gene, a carbamyl phosphate synthase gene, a dihydrofolate reductase gene, a multidrug resistance gene, an aspartate transcarbamylase gene, a xanthine-guanine phosphoribosyl transferase gene, an adenosine deaminase gene, a chloramphenicol acetyltransferase gene and a thymidine kinase gene. A reporter gene may be, for example, a gene encoding a fluorescent protein such as GFP, YFP or BFP, or it may be lacZ or luciferase.

[0058] The vector may also contain an intron preceding a candidate nucleic acid molecule. It is not relevant from which source the intron is derived. Preferably, the intron originates from the same source as the candidate nucleic acid molecules. However, the intron can also be heterologous to the candidate nucleic acid molecules, i.e., it is from a source other than the candidate nucleic acid molecules.

[0059] A candidate nucleic acid molecule can be inserted into a vector by ligation into a cloning site by way of restriction sites and/or by recombination as is known in the art. Prior to being transformed or transfected into a host cell or host as described herein, it is preferred that the vector comprising the candidate nucleic acid molecules is, after said candidate nucleic acid molecules have been inserted, ethanol-precipitated, optionally washed, optionally dried, frozen for at least 30 min at -80°C and overnight (for at least 3 hours) at -20°C.

[0060] It is preferably envisaged that the set-up of the nucleic acid molecules inserted in a vector as described herein is such that nonsense-mediated decay (NMD) of RNA transcribed from said vector does essentially not occur, preferably does not occur. For example, the vector may, as described herein, comprise downstream of a promoter optionally an intron; optionally a reporter or a nucleic acid molecule encoding a peptide or a polypeptide; and a candidate nucleic acid molecule. In such a set-up, NMD does essentially not occur, preferably does not occur.
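The order of elements in such a set-up can be written down as a small data structure. The Python sketch below lists the elements in 5' to 3' order: promoter, optional intron, optional reporter, candidate, and a polyadenylation site downstream of the candidate as in the preferred embodiment described earlier. The function name `cassette_layout` is hypothetical, the element names in the usage example are those of the HeLa screening vector mentioned in this description, and the sketch only records element order, not sequence.

```python
from typing import List, Optional


def cassette_layout(promoter: str,
                    candidate: str,
                    polya_site: str,
                    intron: Optional[str] = None,
                    reporter: Optional[str] = None) -> List[str]:
    """Return the cassette elements in 5' to 3' order.

    The intron and the reporter are optional and, if present, sit between
    the promoter and the candidate, so that the candidate is the last
    element before the polyadenylation site.
    """
    layout = [promoter]
    if intron is not None:
        layout.append(intron)
    if reporter is not None:
        layout.append(reporter)
    layout.append(candidate)
    layout.append(polya_site)
    return layout


if __name__ == "__main__":
    print(" > ".join(cassette_layout(
        promoter="SCP1",
        candidate="candidate fragment",
        polya_site="SV40 late polyA signal",
        intron="synthetic intron",
        reporter="sgGFP",
    )))
```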

Step (c) Transcription of Candidate Nucleic Acid Molecule

[0061] In accordance with the present invention, the reporter library is subjected to appropriate conditions which allow the candidate nucleic acid molecules to be transcribed. The insertion of the candidate nucleic acid molecule into the vector in step (b) places the molecule on the transcript produced. Appropriate conditions refer to the environmental conditions which promote transcription. A skilled person will readily recognize the conditions suitable for the initiation of transcription. This step can be performed, for example, by introducing the vector into an appropriate host cell by any appropriate means and methods known in the art, e.g., by electroporation, calcium phosphate precipitation, or the like, and subjecting the host cell to conditions in which the nucleic acid molecules are allowed to be transcribed under the control of the upstream promoter. A skilled person will be able to determine the host cell suitable for transcription, for example as described in Sambrook et al., Molecular Cloning: A Laboratory Manual, 3rd edition (Jan. 15, 2001) Cold Spring Harbor Laboratory Press, ISBN: 0879695765.

[0062] In one embodiment, the vector comprising the nucleic acid molecule may be directly delivered into a host. Direct injection of naked DNA into a host is known. The possibility of detecting gene expression by directly injecting naked DNA into animal tissues was first demonstrated by Dubenski et al., Proc. Nat. Acad. Sci. USA, 81:7529-33, who showed that viral or plasmid DNA injected into the liver or spleen of mice was expressed at detectable levels. Others have directly injected genes into rat hearts or muscles. Other delivery methods include the Sendai virus-liposome delivery systems, cationic liposomes, polymeric delivery gels or matrices, and porous balloon catheters. Liposomes allow for the incorporation into the lumen of high molecular weight molecules, particularly nucleic acids of 1 kbp or more.

[0063] In yet a further embodiment, the transcription may take place in vitro. In vitro transcription is also known and used in the art, for example, as described in Melton et al. Nucl. Acid. Res. 12:7035 1984 or Tymms In Vitro Transcription and Translation Protocols, Methods in Molecular Biology, Vol. 37 ISBN: 978-0-89603-288-0.

Steps (d) and (e) Reverse Transcription and Quantification

[0064] The present invention is based on the surprising finding that a candidate nucleic acid molecule having transcription regulatory activity is able to up-regulate or down-regulate its own transcription. The direct coupling of the candidate nucleic acid molecule to the transcriptional read-out of its potential enhancer/repressor activity allows the identification of an enhancer or repressor. Transcribed candidates are first isolated from cellular RNA, if transcription took place in a host cell or in the liver of the host, or from the in vitro transcription system. To quantify the transcripts of the candidate nucleic acid molecules, either the RNA is reverse transcribed to cDNA and the cDNA is preferably amplified, or the RNA is not reverse transcribed but quantified as such, for example by RNA sequencing (Ozsolak, F and Milos, PM (2010a). Direct RNA Sequencing. Experimental Medicine 28: 2574-2580; Ozsolak, F and Milos, PM (2010b). RNA sequencing: advances, challenges and opportunities. Nat Rev Genet. 2011 Feb;12(2):87-98). Reverse transcription can be performed using any known technique in the art (e.g. as described in Sambrook et al., Molecular Cloning: A Laboratory Manual, fourth Ed. Cold Spring Harbor Press 2001; Khan et al., Biochem. Biophys. Acta. 1423:17-28 1999). Reverse transcription involves production of a DNA complement to an RNA sequence mediated by a reverse transcriptase, i.e. a DNA polymerase that can use RNA as a template for replication. Reverse transcriptases are generally RNA-dependent DNA polymerases. In one preferred embodiment, reverse-transcriptase polymerase chain reaction (RT-PCR) is used. The PCR reaction may be performed more than once, such as 2, 3, 4, 5, 6, 7 or more times, and cDNA generated from different reactions can be pooled to form pooled cDNA. In one embodiment, one-step RT-PCR that combines the cDNA synthesis and amplification of cDNA is used.

[0065] The term "quantification" should be understood broadly. It may be performed by measuring the amount (such as raw count) or concentration of the candidate nucleic acid molecule, semi-quantitatively or quantitatively. Quantification may be carried out by any technique known to a skilled person. Suitable methods include Real Time PCR, quantitative PCR (Sagner et al. Biochemica 3, 15-17, 2001) and hybridization onto a DNA microarray (Kawasaki et al. Ann. N. Y. Acad. Sci. 1020 (2004) 92-100). DNA microarrays provide a platform for exploring the genome, including analysis of gene expression by hybridization with sequence specific oligonucleotide probes attached to chips in precise arrays (e.g., Schena et al., Science 270:467-470, 1995; Shalon et al., Genome Res. 6:639-645, 1996; Pease et al., Proc. Natl. Acad. Sci. USA 91:5022-26, 1994). Microarray technology is an extension of previous hybridization-based methods, such as Southern and Northern blotting, that have been used to identify and quantify nucleic acids in biological samples (Southern, J. Mol. Biol. 98:503-17, 1975; Pease et al., Proc. Natl. Acad. Sci. USA 93:10614-19, 1996). Identification of a target nucleic acid in a sample generally involves fluorescent detection of the nucleic acid hybridized to an oligonucleotide at a particular location on the array.

[0066] Preferably, the quantification step is performed by next generation sequencing (NGS).

The phrase "next generation sequencing" refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches. The advantage of NGS is the high throughput production of sequence data, for example with the ability to generate millions of small sequence reads at a time. Some examples of next generation sequencing include, but are not limited to, 454's high throughput pyrosequencing (454 Life Sciences) (Margulies et al. (2005) Nature 437, 376-380; Wheeler et al. (2008) Nature 452, 872-876; Ronaghi et al. (1996) Anal Biochem 242, 84-89; Ronaghi et al. (1998) Science 281, 363-365), Illumina/Solexa sequencing by synthesis from single clones on a surface (Illumina) (Margulies et al. (2005) Nature 437, 376-380; Wheeler et al. (2008) Nature 452, 872-876), ABI's SOLiD technology ("Supported Oligonucleotide Ligation and Detection", Applied Biosystems) (Cloonan et al. (2008) Nat Methods 5, 613-619), and PacBio RS and SMRT sequencing by Pacific Biosciences (www.pacificbiosciences.com). True Single Molecule sequencing (tSMS) by Helicos Biosciences (www.helicosbio.com) as described in U.S. Pat. Nos. 7,875,440 and 7,897,345 can also be used, which has the advantage that amplification of cDNA is not required.

[0067] Next generation sequencing technologies simplify and accelerate sequencing by eliminating the need for individual cloning in sample preparation as required in traditional sequencing, by enabling the parallel preparation of millions of sequences to be analyzed, and by simultaneously detecting sequencing signals in millions of events. Various next-generation sequencing techniques are reviewed, e.g., in Metzker (2010) "Sequencing technologies-The next generation" Nature Reviews Genetics 11:31-46, Voelkerding et al. (2009) "Next-generation sequencing: From basic research to diagnostics" Clin Chem 55:641-658, Dhiman et al. (2009) "Next-generation sequencing: A transformative tool for vaccinology" Expert Rev Vaccines 8:963-967, and Turner et al. (2009) "Next-generation sequencing of vertebrate experimental organisms" Mamm Genome 20:327-338. Nanopore sequencing is reviewed, e.g., in Branton et al. (2008) "The potential and challenges of nanopore sequencing" Nature Biotech 26:1146-1153.

[0068] However, the quantification of the candidate nucleic acid molecules can also be done at the RNA level, for example by RNA sequencing; see Ozsolak and Milos (2010a), Ozsolak and Milos (2010b), Ozsolak and Milos (2009), all cited hereinabove.

Step (f) Determination of transcriptional regulatory element.

[0069] The quantification of the cDNA allows the determination of the presence or absence of a transcriptional regulatory element. The determination is based on observing whether any candidate nucleic acid molecule is overtranscribed or undertranscribed. A putative enhancer can be identified by an increased or high number of transcripts. Conversely, a candidate having repressor activity can be identified by a low or decreased number of transcripts.

[0070] In one embodiment, the determination is based on counting the abundance of the cDNA of a given candidate nucleic acid molecule in the total obtained cDNA and in the input library. "Input library" refers to all the candidate nucleic acid molecules provided in step (a). A putative enhancer will have a higher abundance (be more frequent) in the cDNA than in the input library. If a DNA microarray is used for quantification, the determination can be based on observing a stronger microarray signal. Likewise, a putative repressor will have a lower abundance (be less frequent) in the cDNA than in the input library. This is termed a "lack" of candidate nucleic acid molecules.
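The comparison of cDNA abundance with input-library abundance can be illustrated with a minimal Python sketch. It assumes that per-candidate counts are already available as dictionaries. The function names, the pseudocount and the 2-fold and 0.5-fold cut-offs are hypothetical choices made for illustration and are not values taken from this disclosure.

```python
def enrichment(cdna_counts, input_counts, pseudocount=1.0):
    """Fold enrichment of each candidate in cDNA relative to the input library.

    Counts are converted to frequencies (count / library total) so that
    libraries of different sequencing depth can be compared. The pseudocount
    (an assumption of this sketch) avoids division by zero.
    """
    cdna_total = sum(cdna_counts.values())
    input_total = sum(input_counts.values())
    result = {}
    for candidate, n_input in input_counts.items():
        cdna_freq = (cdna_counts.get(candidate, 0) + pseudocount) / cdna_total
        input_freq = (n_input + pseudocount) / input_total
        result[candidate] = cdna_freq / input_freq
    return result


def classify(fold, up=2.0, down=0.5):
    """Label a candidate from its fold enrichment (thresholds are hypothetical)."""
    if fold >= up:
        return "putative enhancer"
    if fold <= down:
        return "putative repressor"
    return "no call"


if __name__ == "__main__":
    cdna = {"cand_A": 900, "cand_B": 90, "cand_C": 10}
    library = {"cand_A": 300, "cand_B": 350, "cand_C": 350}
    for name, fold in enrichment(cdna, library).items():
        print(name, round(fold, 2), classify(fold))
```

Working with frequencies rather than raw counts reflects the statement above that abundance in the cDNA is compared with abundance in the input library.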

Transcriptional Regulatory elements obtained

[0071] The present invention also encompasses transcriptional regulatory elements which are obtained or obtainable by the method as disclosed herein. Some of the enhancers which have been identified by the inventors are recited in SEQ ID NO: 1 to 1500. The present invention accordingly provides a transcriptional regulatory element comprising any one of the sequences as recited under SEQ ID NO: 1-1500.

[0072] Also comprised herein are transcriptional regulatory elements having at least 30% identity, such as at least 40%, 50%, 60%, 70%, 80%, 85%, 90%, 92%, 95% or 98% identity, with one of SEQ ID NO: 1-1500. Further comprised are transcriptional regulatory elements that hybridize to any of the nucleotide sequences shown in SEQ ID NO: 1-1500. The term "hybridizes" as used in accordance with the present invention may relate to hybridizations under stringent or non-stringent conditions. If not further specified, the conditions are preferably non-stringent. Said hybridization conditions may be established according to conventional protocols described, for example, in Sambrook, Russell "Molecular Cloning, A Laboratory Manual", Cold Spring Harbor Laboratory, N.Y. (2001); Ausubel, "Current Protocols in Molecular Biology", Green Publishing Associates and Wiley Interscience, N.Y. (1989), or Higgins and Hames (Eds.) "Nucleic acid hybridization, a practical approach" IRL Press Oxford, Washington DC, (1985). The setting of conditions is well within the skill of the artisan and can be determined according to protocols described in the art. Thus, the detection of only specifically hybridizing sequences will usually require stringent hybridization and washing conditions such as 0.1xSSC, 0.1% SDS at 65°C. Non-stringent hybridization conditions for the detection of homologous or not exactly complementary sequences may be set at 6xSSC, 1% SDS at 65°C.

[0073] Also encompassed by the present invention are transcriptional regulatory elements which are derivatives of the nucleotide sequences shown in SEQ ID NO: 1-1500. It is known in the art that a transcriptional regulatory sequence can be mutagenized, and that deletions and/or insertions and/or substitutions of nucleotides can be made, without losing the transcriptional activity. Such derivatives preferably include the transcriptional regulatory elements described herein that share the degree of identity with the nucleotide sequence as shown in any one of SEQ ID NO: 1-1500 as described herein and those which hybridize with the nucleotide sequence as shown in any one of SEQ ID NO: 1-1500 as described herein.

[0074] It is known that transcriptional regulatory sequences still retain their function even with mutations, substitutions, deletions and/or insertions (see Meireles-Filho AC, Stark A. (2009), Curr Opin Genet Dev. 2009 Dec;19(6):565-570; Fisher et al. (2006), Science 312, 276).

[0075] Accordingly, it is preferred that a derivative of a transcriptional regulatory element as described herein still retains its function, i.e., has transcriptional activity, e.g., either as enhancer or repressor. The method described below can be used to assess whether the regulatory function is retained.

[0076] A variety of sequence-based alignment methodologies, which are well known to those skilled in the art, are useful in determining identity among sequences. These include, but are not limited to, the local identity/homology algorithm of Smith, T. F. and Waterman, M. S. (1981) Adv. Appl. Math. 2: 482-89, the homology alignment algorithm of Pearson, W. R. and Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85: 2444-48, the Basic Local Alignment Search Tool (BLAST) described by Altschul, S. F. et al. (1990) J. Mol. Biol. 215: 403-10, the Best Fit program described by Devereux, J. et al. (1984) Nucleic Acids Res. 12: 387-95, and the FastA and TFASTA alignment programs, preferably using default settings or by inspection. In one preferred embodiment, identity is calculated by Fast alignment algorithms based upon the following parameters: mismatch penalty of 1.0; gap size penalty of 0.33; joining penalty of 30 (see "Current Methods in Comparison and Analysis" in Macromolecule Sequencing and Synthesis: Selected Methods and Applications, p. 127-149, Alan R. Liss, Inc., 1998). Another example of a useful algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng, D. F. and Doolittle, R. F. (1987) J. Mol. Evol. 25, 351-60, which is similar to the method described by Higgins, D. G. and Sharp, P. M. (1989) CABIOS 5: 151-3. Useful parameters include a default gap weight of 3.00, a default gap length weight of 0.10, and weighted end gaps. Another example of a useful algorithm is the family of BLAST alignment tools initially described by Altschul et al. (see also Karlin, S. et al. (1993) Proc. Natl. Acad. Sci. USA 90: 5873-87). A particularly useful BLAST program is the WU-BLAST-2 program described in Altschul, S. F. et al. (1996) Methods Enzymol. 266: 460-80. WU-BLAST uses several search parameters, most of which are set to default values. The adjustable parameters are set with the following values: overlap span=1, overlap fraction=0.125, word threshold (T)=11. The HSP S and HSP S2 parameters are dynamic values and are established by the program itself depending upon the composition of the particular sequence and composition of the particular database against which the sequence of interest is being searched; however, the values may be adjusted to increase sensitivity. An additional useful algorithm is gapped BLAST as reported by Altschul, S. F. et al. (1997) Nucleic Acids Res. 25: 3389-402. Gapped BLAST uses BLOSUM-62 substitution scores; threshold parameter set to 9; the two-hit method to trigger ungapped extensions; charges gap lengths of k at cost of 10+k; Xu set to 16, and Xg set to 40 for database search stage and to 67 for the output stage of the algorithms. Gapped alignments are triggered by a score corresponding to about 22 bits. Specific programs have been developed to map and assemble NGS data, e.g. the program BOWTIE.
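Once two sequences have been aligned by any of the programs named above, a percent identity value can be derived from the alignment. The following Python sketch is illustrative only: it assumes the two aligned strings (with "-" for gaps) are already available, and it counts every column that is not a gap in both sequences in the denominator. That is one convention among several and is not asserted to be the definition applied by BLAST, FastA or PILEUP; the function name is hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two already-aligned sequences.

    Columns where both sequences carry a gap are ignored; a column with a gap
    in only one sequence counts as a mismatch (a convention of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # One gap column out of six: five identical positions.
    print(percent_identity("ACGT-A", "ACGTTA"))
```

A derivative could then be checked against an identity threshold such as those recited above by comparing the returned value with the chosen percentage.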

[0077] The present invention also provides a vector comprising a transcriptional regulatory element identifiable in accordance with the methods described herein or as described herein. Said vector preferably further comprises a nucleic acid molecule of interest, wherein expression of said nucleic acid molecule is driven by a promoter and is additionally regulated by a transcriptional regulatory element of the present invention.

[0078] Also provided by the present invention is a host, e.g., mouse, rat, Xenopus or zebrafish, or a host cell such as a eukaryotic or prokaryotic host cell, comprising a transcriptional regulatory element or a vector described herein.

[0079] Furthermore, the present invention envisages a method for the production of a polypeptide of interest comprising culturing a host cell as described herein under conditions allowing expression of said polypeptide and recovering said polypeptide.

Method of determining the level of transcriptional regulatory activity

[0080] The present invention can be advantageously applied to determine the level of transcriptional regulatory activity of a nucleic acid molecule. The method comprises:

(a) providing a candidate nucleic acid molecule,

(b) inserting the molecule into a vector downstream of a promoter,

(c) subjecting the vector to conditions allowing transcription from the promoter,

(d) reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying the cDNA obtained in step (d), and

(f) determining the level of transcriptional regulatory activity of the nucleic acid molecule based on the quantification.

The determination may be carried out by comparing the quantity of the cDNA with a pre-set reference or with values obtained from known transcriptional regulatory elements that are analyzed in parallel.

Method of optimizing a transcriptional regulatory element

[0081] A skilled person will readily appreciate that the present invention is not restricted to identifying transcriptional regulatory elements. For example, the present invention is also applicable for constructing or improving a transcription or expression vector by incorporating an enhancer or repressor as identified herein.

[0082] Furthermore, the present invention can be used to optimize a transcriptional regulatory element. The term "optimization" or "optimize" means altering the sequence of a transcriptional regulatory element such that its enhancer or repressor regulatory activity is improved as compared to the starting element. The method comprises

(a) providing candidate nucleic acid molecules comprising a transcriptional regulatory element and mutants thereof,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of a promoter,

(c) subjecting the library to conditions allowing transcription from the promoter,

(d) reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying the cDNA obtained in step (d),

(f) determining the level of transcriptional regulatory activity of the candidate nucleic acid molecules based on the quantification, and

(g) selecting at least one candidate nucleic acid molecule which has improved transcriptional regulatory activity compared to the transcriptional regulatory element, thereby obtaining an optimized transcriptional regulatory element.

[0083] To optimize a transcriptional regulatory element, the candidate nucleic acid molecules are preferably mutants or derivatives of the transcriptional regulatory element. Mutants or derivatives can be obtained by any techniques known to those skilled in the art. For example, the mutants can be obtained by mutagenesis, such as random mutagenesis, exposure to mutagens, or error-prone PCR. Mutants may also be obtained by addition, substitution and deletion of one or more nucleotides of the transcriptional regulatory element. To obtain an optimized transcriptional regulatory element, the candidate(s) having improved transcriptional regulatory activity compared to the transcriptional regulatory element should be selected. An "improved" enhancer activity refers to an increased transcription of a target gene; and an "improved" repressor activity refers to a decreased transcription of a target gene.
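The selection in step (g) can be sketched in Python as follows. The sketch assumes that step (f) has produced one activity value per candidate (for example a cDNA-over-input enrichment) and that the starting element was measured in the same library. The function name, the dictionary layout and the example values are hypothetical.

```python
def select_optimized(activity, reference, mode="enhancer"):
    """Return candidates whose activity is improved relative to the reference.

    activity  : dict mapping candidate name -> measured activity value
    reference : name of the starting transcriptional regulatory element
    mode      : "enhancer" (higher is better) or "repressor" (lower is better)
    Candidates are returned best first.
    """
    ref_value = activity[reference]
    if mode == "enhancer":
        improved = {k: v for k, v in activity.items() if v > ref_value}
        return sorted(improved, key=improved.get, reverse=True)
    if mode == "repressor":
        improved = {k: v for k, v in activity.items() if v < ref_value}
        return sorted(improved, key=improved.get)
    raise ValueError("mode must be 'enhancer' or 'repressor'")


if __name__ == "__main__":
    measured = {"start": 4.0, "mut1": 6.5, "mut2": 3.0, "mut3": 9.0}
    print(select_optimized(measured, "start", mode="enhancer"))
    print(select_optimized(measured, "start", mode="repressor"))
```

The two modes mirror the definitions given above: improved enhancer activity means increased transcription, and improved repressor activity means decreased transcription.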

Transcription/expression system

[0084] Furthermore, transcriptional regulatory elements identified by the present invention can be advantageously used in a transcription or expression vector to increase or decrease transcription or expression of a target gene. Accordingly, a method of providing a transcription or expression vector is further provided. The method comprises

(a) optionally providing candidate nucleic acid molecules,

(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of a promoter,

(c) subjecting the library to conditions allowing transcription from the promoter,

(d) reverse transcribing RNA obtained in step (c) into cDNA,

(e) quantifying the cDNA obtained in step (d),

(f) determining the presence of one or more transcriptional regulatory elements from the candidate nucleic acid molecules based on the quantification, and

(g) constructing a transcription or expression vector comprising the transcriptional regulatory element and the promoter.

[0085] Preferably, the transcriptional regulatory element is inserted upstream of the promoter.

Those skilled in the art will be able to construct such vectors. The vectors may also contain more than one promoter or any combination of markers, such as negative or positive selection markers, reporter genes or amplifiable genes. The additional promoter may be the same as or different from the, preferably pre-selected, promoter. The promoters may promote constitutive or regulated expression as described herein above. Regulated expression may be inducible or repressible expression or both.

[0086] As described earlier, positive and negative regulation of transcriptional activity are mediated by enhancers and repressors, respectively. These regulatory elements are position and orientation independent, and may be proximal to the promoter of the target gene or distal and active over a large distance. Therefore, the transcriptional regulatory element can be upstream or downstream of the promoter, but is preferably upstream.

Examples

Source of DNA for library generation

[0087] Genome-wide libraries were generated from genomic DNA, isolated by standard phenol/chloroform extraction, including RNaseA digestion, from D. melanogaster embryos of the sequenced strain (y; cn bw sp) (Drosophila melanogaster reference strain; Adams et al., The genome sequence of Drosophila melanogaster, Science 287, 2185-2195 (2000)). BAC libraries were generated from BAC DNA obtained from the BACPAC Resource Center (BPRC), Oakland, California, USA, and isolated from DH10B bacterial culture using the QIAGEN large construct kit (cat. no. 12462).

Generation of screening libraries

[0088] DNA was sheared by sonication (Covaris S220) and DNA fragments (500bp-700bp length) were size-selected on a 1% agarose gel. Illumina Multiplexing Adapters (Illumina Inc.; cat. no. PE-400-1001) were ligated to 1 μg - 5 μg of size-selected DNA fragments following the instructions of the NEBNext® DNA Library Prep Reagent Set for Illumina® (NEB), except for the final PCR amplification step. Five 10-cycle PCRs (98°C for 45s; followed by 10 cycles of 98°C for 15s, 65°C for 30s, 72°C for 30s) with 1 μl adaptor-ligated DNA as template were performed, using KAPA Hifi Hot Start Ready Mix (KAPA Biosystems; cat. no. KK2602) and primers PE1.0_AgeI and MP2.0_SalI, adding a specific 15nt extension to both adapters for directional cloning using recombination (Clontech In-Fusion HD). The five PCR reactions were pooled, purified, and size selected with Agencourt AMPureXP DNA beads (ratio beads/PCR 0.7; cat. no. A63881), followed by column purification (QIAquick PCR purification kit; cat. no. 28106).

[0089] The purified PCR products were recombined (Clontech In-Fusion HD; cat. no. 639650) to the screening vector (linearized by a 3h digestion with NEB AgeI-HF and SalI-HF, followed by agarose gel electrophoresis, QIAquick gel extraction [cat. no. 28706], QIAquick PCR purification, and Qiagen MinElute PCR purification [cat. no. 28006]) in a total of 20 reactions of 10 μl each. Four In-Fusion HD reactions were pooled and ethanol precipitated (elution in 12.5 μl 10mM Tris-HCl pH8), washed, dried, frozen for 30min at -80°C and overnight (for at least 3 hours) at -20°C. 20 aliquots (20 μl each) of Invitrogen MegaX DH10B Electrocompetent Bacteria (cat. no. C6400-03) were transformed with 2.5 μl DNA each according to the manufacturer's protocol. After one hour recovery at 37°C, five transformation reactions were pooled, transferred to 500ml LBAMP medium, and grown to OD 0.6-0.9. If further amplification was needed, each of the 500ml cultures was transferred to 2 l LBAMP medium and further incubated. Bacterial cultures were always harvested at OD 0.6-0.9. The plasmid libraries were extracted using QIAGEN Plasmid Plus Mega Kit (cat. no. 12981).

Screening vector

[0090] The present inventors constructed a screening vector based on the pGL3-Promoter backbone (Promega; cat. no. E1751) with a DSCP core promoter (Pfeiffer et al. Tools for neuroanatomy and neurogenetics in Drosophila. PNAS 2008; 105(28): 9715-9720), followed by the constitutively spliced mhc16 intron, a sgGFP ORF (Qbiogene, Inc.), a ccdB suicide gene flanked by homology arms for cloning of the enhancer candidates, and the pGL3's SV40 late polyA-signal.

Cell culture and transfection

[0091] The present invention (termed "self-transcribing-active-regulatory sequencing" or "STARR-seq") was applied to three D. melanogaster cell-types: S2 cells, OSC cells and BG3 cells.

[0092] S2 (Invitrogen) and OSC (Saito et al., A regulatory circuit for piwi by the large Maf gene traffic jam in Drosophila. Nature 461, 1296-1299 (2009)) were cultured in Schneider's Medium (Gibco; cat. no. 21720-024) supplemented with 10% FCS and 1% P/S at 27°C, and in Shields and Sang M3 (Sigma; cat. no. S3652) supplemented with 10% FCS, 1% glutathione, 1% Insulin, 2% fly extract and 1% P/S at 27°C, respectively. Transfection of plasmid libraries (1 μg DNA/1x10^6 cells) was performed with 1x10^9 cells at 70-80% confluence using the Gene Pulser MXcell™ Electroporation System (24 well plate; Bio-Rad; cat. no. 165-2682). 1x10^7 cells in 800 μl K-PBS were subjected to each well (corresponding to a standard 0.4mm electroporation cuvette), containing 10 μg of plasmid library in 100 μl EB. After 15' incubation, S2 and OSC were pulsed with 450V-250μF-1000Ω and 450V-350μF-1000Ω, respectively. Every 1x10^7 cells were transferred to 9.2ml growth medium and incubated for 24h before RNA isolation.

[0093] ML-DmBG3-c2 (BG3) (DGRC) (Cherbas et al., The transcriptional diversity of 25 Drosophila cell lines, Genome Res. 21(2): 301-314 (2011)) were cultured in M3 BP YE supplemented with 10% FCS, 10μg/ml insulin and 1% P/S at 25°C. Transfection of plasmid libraries (1 μg DNA/1x10^6 cells) was performed with 1x10^9 cells at 70-80% confluence using the Gene Pulser MXcell™ Electroporation System (24 well plate; v+c.-nr). 1x10^7 cells in 800 μl K-PBS were subjected to each well (corresponding to a standard 0.4mm electroporation cuvette), containing 10 μg of plasmid library in 100 μl EB. After 5' incubation, BG3 cells were pulsed with 500V-250μF-1000Ω and spun down in batches of 6x10^7 cells. Each batch of cells was resuspended in 10ml growth medium and incubated for 24h before RNA isolation. RNA was isolated, reverse transcribed and quantified using the same method as described above.

RNA isolation from cells

[0094] 24h post electroporation, cells were counted, washed in 1xPBS and concentrated. Total RNA was extracted from all surviving cells using the Qiagen RNeasy maxi prep kit (cat. no. 75162), and the polyA+ RNA fraction was isolated using Invitrogen Dynabeads Oligo-dT25 (scaling up the manufacturer's protocol 5-fold per tube; cat. no. 610-05) and subjected to treatment with Ambion turboDNase (cat. no. AM2239) at a concentration of at most 150ng/μl for 30' at 37°C. Every three reactions (50 μl) were pooled and subjected to Qiagen RNeasy MinElute reaction clean up (cat. no. 74204), to inactivate turboDNase and concentrate the RNA.

Reverse transcription

[0095] First strand cDNA synthesis was performed with Invitrogen SuperScript III (50°C 60', 70°C 15'; cat. no. 18080085) using a reporter-RNA specific primer (GSP10) and 2.5 μg of polyA+ RNA in 20 reactions. Five reactions were pooled and 1 μl of 10mg/ml RNaseA was added (37°C 1h) followed by column purification (QIAquick PCR purification kit).

Quantification

[0096] The present inventors amplified the reporter cDNA for Solexa sequencing by a 2-step nested PCR with the KAPA Hifi Hot Start Ready Mix. In the first PCR (10-20, e.g. 15 cycles), 35-50ng cDNA were amplified using 2 reporter-specific primers (junction1.0 & junction2.0), one of which spans the splice junction of the mhc16 intron (5 nts at the 3' end protected by phosphorothioate bonds). This specifically amplifies the reporter cDNA and suppresses residual plasmid background. The second PCR (8-13 cycles) uses the Illumina primers (PE1.0 & MP2.0 or IDX1-IDX48; template already present in the reporter at both ends of the candidate enhancers) to prepare the sample for Solexa sequencing. After each PCR, the PCR products are purified by Agencourt AMPureXP DNA beads (ratio beads/PCR 0.7). Finally, the concentration and quality of the library is determined by qPCR and a DNA-Chip1000 (Agilent Bioanalyzer 2100). Each library is sequenced on a GAIIX platform, following the manufacturer's protocol.

[0097] Four to eight reactions with 100ng reporter library constructs (template) were amplified using the same conditions as described for the cDNA; these reactions serve as input. Only the forward primer for the junction PCR was different (it binds inside the non-spliced intron; junction3.0). The PCR products were run on and isolated from a 1% agarose gel (QIAquick gel extraction kit) after both rounds of amplification.

Computational processing of STARR-seq data

[0098] To find peaks or regions in which the cDNA is significantly enriched over input, we used only paired-end fragments with unique genomic locations and strand, to exclude potential biases from PCR duplicates. We refer to this set of reads as sequence fragments or position-merged reads, as opposed to all reads. From the cDNA peaks we seeded our peak calling pipeline with the top 10% of positions showing the highest fragment coverage (up to 250 independent fragments in S2 cells). Based on these putative overlapping regions we extracted the nucleotide position with the highest coverage in cDNA within a non-overlapping fixed window of 500 bp. We then computed corrected enrichment values for the putative STARR-seq enhancer summit position. The final data set contained a set of non-overlapping STARR-seq enhancer regions with a fixed window of 500bp, a summit location, an enrichment value of cDNA over input and an associated binomial p-value.
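The enrichment value and binomial p-value attached to each summit can be illustrated with a short Python sketch. The exact test formulation and the correction applied to the enrichment values are not specified above, so the sketch makes an explicit assumption: the input library supplies the expected per-fragment probability of falling at the summit, and the p-value is the upper tail of a binomial distribution over all cDNA fragments. All function and parameter names are hypothetical, and no correction of the enrichment is attempted.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def binomial_upper_tail(k, n, p, rel_tol=1e-12):
    """P(X >= k) for X ~ Binomial(n, p), summed upward from k in log space."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        term = math.exp(log_binom_pmf(i, n, p))
        total += term
        # Past the mean the terms only shrink, so stop once they are negligible.
        if i > n * p and term <= rel_tol * total:
            break
    return min(total, 1.0)


def region_enrichment(cdna_at_summit, cdna_total, input_at_summit, input_total):
    """Return (fold enrichment of cDNA over input, binomial p-value)."""
    if input_at_summit <= 0:
        raise ValueError("input coverage must be positive in this sketch")
    expected_p = input_at_summit / input_total
    fold = (cdna_at_summit / cdna_total) / expected_p
    p_value = binomial_upper_tail(cdna_at_summit, cdna_total, expected_p)
    return fold, p_value


if __name__ == "__main__":
    fold, p = region_enrichment(30, 1000, 10, 1000)
    print(fold, p)
```

Summing the tail upward from the observed count avoids the loss of precision that occurs when a very small p-value is computed as one minus the lower tail.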

[0099] In order to assign an enhancer peak to a putative target gene we used the closest TSS with regard to the summit of the STARR-seq peak region.
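The assignment of a peak to the closest TSS can be sketched in Python with a binary search. The sketch assumes a single chromosome whose TSS coordinates are held in a sorted list with a parallel list of gene names; handling of several chromosomes and of strand is omitted. The function name, the data layout and the gene names in the example are hypothetical.

```python
import bisect


def closest_tss(summit, tss_positions, tss_genes):
    """Return (gene, distance) for the TSS nearest to a peak summit.

    tss_positions must be sorted in ascending order; tss_genes[i] is the gene
    whose TSS lies at tss_positions[i].
    """
    if not tss_positions:
        raise ValueError("no TSS annotations supplied")
    i = bisect.bisect_left(tss_positions, summit)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(tss_positions)]
    best = min(candidates, key=lambda j: abs(tss_positions[j] - summit))
    return tss_genes[best], abs(tss_positions[best] - summit)


if __name__ == "__main__":
    positions = [100, 500, 900]
    genes = ["geneA", "geneB", "geneC"]
    print(closest_tss(450, positions, genes))
```

Because the list is sorted, each lookup takes logarithmic time, which matters when several thousand peaks are assigned against a genome-wide annotation.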

Statistical analysis

[00100] The present inventors used the publicly available, open-source program "R" (http://www.r-project.org) for all statistical analysis.

Identified Enhancers

[00101] This yielded 5499 regions that were significantly enriched over input ("peaks"; binomial p-value < 0.001; empirical FDR=1.8%) and supported by between 7 and 250 independent fragments (9 to 94873 sequence reads; the genome average was 2 fragments or 34 reads). Peaks were found at various genomic positions, both near housekeeping genes and developmental regulators, and included weak and strong (1953 with ≥ 3 fold enrichment) enhancers over a wide dynamic range (21-fold or 1900-fold when using all reads). Fig. 2 shows the distribution of STARR-seq enrichments for putative enhancer regions in S2 cells (Fig. 1a) and OSC cells (Fig. 1b).

[00102] Fig. 3 shows the view of STARR-seq cDNA (blue) and input (grey) read densities in the srp locus using the UCSC genome browser (UCSC GB; Fujita et al., The UCSC Genome Browser database: update 2011, Nucleic Acids Research 39, D876-82 (2011)).

[00103] As defined, strong enhancers have an enrichment level of at least 3-fold above input with a p-value lower than 1e-3, which results in 1953/5499 (36%) strong enhancers. The top 500 strongest enhancers for S2, OSC and BG3 cells are listed, according to their strength, under SEQ ID NO: 1-500 for S2 cells, SEQ ID NO: 501-1000 for OSC cells and SEQ ID NO: 1001-1500 for BG3 cells.
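The definition of strong enhancers and the ranking by strength can be expressed as a simple filter and sort. The Python sketch below assumes each peak is a dictionary with "fold" and "p_value" keys; this layout and the function name are hypothetical. The last lines merely re-derive the 36% figure from the counts stated above.

```python
def strong_enhancers(peaks, min_fold=3.0, max_p=1e-3, top_n=500):
    """Filter peaks by enrichment and p-value, then rank by enrichment.

    peaks is a list of dicts with keys "fold" and "p_value" (assumed layout).
    Returns at most top_n peaks, strongest first.
    """
    strong = [p for p in peaks if p["fold"] >= min_fold and p["p_value"] < max_p]
    strong.sort(key=lambda p: p["fold"], reverse=True)
    return strong[:top_n]


if __name__ == "__main__":
    example = [
        {"name": "peak1", "fold": 5.2, "p_value": 1e-6},
        {"name": "peak2", "fold": 2.1, "p_value": 1e-5},
        {"name": "peak3", "fold": 8.0, "p_value": 1e-9},
    ]
    print([p["name"] for p in strong_enhancers(example)])
    # Share of strong enhancers among all peaks, from the counts given above.
    print(round(100 * 1953 / 5499))
```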

Correlation with Luciferase Assay

[00104] To test the enhancer activity of identified STARR-seq peaks, we individually assayed 71 peak sequences chosen across a wide range of enrichments and 39 negative controls by standard luciferase reporter assays.

[00105] In the luciferase reporter assay, the SV40 promoter of pGL3-Promoter (pGL3_attLR_luc+) was replaced by DSCP and a Gateway cassette was inserted in the MCS, just upstream of the core promoter, to allow Gateway cloning. Selected regions were PCR amplified, cloned into pCR8-TOPO-GW (Invitrogen; cat. no. K252020) and shuttled to pGL3_attLR_luc+ by LR Clonase II recombination (Invitrogen; cat. no. 11791100). Individual constructs were tested by cotransfecting cells with the respective firefly construct and a ubiquitously expressing Renilla plasmid (ubi-63E-RL). Using the Promega Dual Luciferase Assay kit (cat. no. E1960), luciferase activity was measured on a Bio-Tek Synergy fluorescence plate reader and relative luciferase activity was determined by normalizing firefly luciferase to Renilla luciferase activity.

[00106] The normalized luciferase values were used for determining the activity of a specific putative enhancer element. We used Student's t-test to derive a p-value from our enhancer set in relation to a negative control set of 39 constructs with normalized luciferase values of 19.4 ± 15.2 (mean and standard deviation). For plotting the luciferase activity of the constructs versus the STARR-seq enrichment values we used the center position of the construct to derive the STARR-seq enhancer enrichment at that location. All luciferase constructs we used for the final analysis were non-overlapping regarding their genomic loci.
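The normalization of firefly to Renilla luciferase and the comparison with negative controls can be sketched in Python as follows. The readings in the example are invented for illustration, the function names are hypothetical, and the t-test itself is not reproduced here.

```python
from statistics import mean


def relative_luciferase(firefly, renilla):
    """Normalize a firefly reading to the co-transfected Renilla reading."""
    if renilla <= 0:
        raise ValueError("Renilla reading must be positive")
    return firefly / renilla


def fold_over_controls(construct_values, control_values):
    """Mean normalized activity of a construct relative to the control mean."""
    return mean(construct_values) / mean(control_values)


if __name__ == "__main__":
    # Hypothetical replicate readings (firefly, Renilla) for one construct.
    replicates = [(4000.0, 100.0), (6300.0, 150.0)]
    normalized = [relative_luciferase(f, r) for f, r in replicates]
    # Hypothetical normalized values for negative control constructs.
    controls = [18.0, 22.0, 20.0]
    print(normalized, fold_over_controls(normalized, controls))
```

Dividing by the Renilla signal compensates for well-to-well differences in transfection efficiency and cell number, which is why the fold change is computed on normalized values rather than raw firefly readings.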

[00107] The result shows that a strong linear relation exists between STARR-seq enrichment and luciferase activity over the entire range of enrichment values (PCC=0.83; Fig. 4). In particular, 92% of all tested peaks showed luciferase activities that were significantly (Fisher's exact test p-value ≤ 0.05) between 2 and 1000-fold above the negative controls.
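A Pearson correlation coefficient (PCC) such as the one reported above can be computed from paired values as in the following Python sketch. Whether the values were transformed (for example to a logarithmic scale) before the correlation was computed is not stated here, so the sketch operates on whatever paired numbers it is given; the function name and the example values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    starr_enrichment = [1.0, 2.5, 4.0, 8.0]      # hypothetical values
    luciferase_fold = [1.2, 3.0, 3.8, 9.5]       # hypothetical values
    print(pearson(starr_enrichment, luciferase_fold))
```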

[00108] Among the enhancers identified as strong enhancers, 53 out of the 55 (96%) were validated as positive in the luciferase assay, with a median luciferase enrichment of 41 fold above the negative control. 50 out of these 53 were enriched at least 3 fold, 45 at least 5 fold and 37 at least 10 fold.

[00109] This establishes that STARR-seq quantitatively assesses the ability of candidate sequences to enhance transcription and is suitable for genome-wide functional enhancer identification.

Assessment of Reproducibility

[00110] There is no evidence that the position of enhancer candidates inside the transcript during STARR-seq biases their assessment: sequences that in their endogenous genomic contexts occur upstream of or within transcribed regions showed the same strong linear correlation between STARR-seq and luciferase assays. Fig. 5a shows genomic regions with and without significant STARR-seq enrichment, located in a window 2kb upstream and 2kb downstream of the TSS, that were tested for their enhancer potential in a luciferase assay. Both up- and downstream tested fragments are indicated, and independent linear fits were computed for both data sets as indicated by the R2, slope and intercept values, as well as the dotted lines.

[00111] It has been found that even sequences that contain transcript-destabilizing elements were not substantially depleted during STARR-seq. Fig. 5b shows that the STARR-seq (cDNA) fragments are not substantially depleted by transcript-destabilizing elements. Even at genomic sites that contain annotated microRNA hairpins, 3' gene ends, splice acceptors, or splice donors, only slightly more fragments were anti-sense to these elements compared to sense. A similar result was obtained for regions that contained at least 5 poly-adenylation motifs (AATAAA) or 3 seed sites for the microRNAs bantam, miR-14, miR-34, miR-2a, or miR-2b, which are all highly expressed in S2 cells. Also genome-wide, we observed that all significant peaks had equal contribution from sense and anti-sense fragments, with no significant deviations.

[00112] We used biological replicates to show the reproducibility of the methods of the present invention. Accordingly, all steps were repeated at least twice in parallel.

[00113] The biological replicate yielded highly similar results with a Pearson correlation coefficient (PCC) of 0.84 along the entire genome and 0.92 for the 5499 peak summits, indicating that STARR-seq is highly reproducible. Fig. 6 shows the reproducibility of STARR-seq in significantly enriched putative STARR-seq regions and genome-wide. Read counts are normalized to 1 million mapped reads in each library.
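The replicate comparison can be sketched as follows: raw read counts per position (or per peak summit) are scaled to 1 million mapped reads, and the Pearson correlation coefficient is computed between the two replicates. This is an illustrative sketch only; the function names, the `total_mapped` parameter and the example counts are assumptions.

```python
from math import sqrt

def reads_per_million(counts, total_mapped):
    """Scale raw read counts to a library size of 1 million mapped reads."""
    return [c * 1_000_000 / total_mapped for c in counts]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical read counts at the same five positions in two replicates.
rep1 = reads_per_million([12, 40, 7, 95, 3], total_mapped=8_000_000)
rep2 = reads_per_million([15, 52, 5, 130, 4], total_mapped=10_000_000)
print(f"PCC between replicates: {pearson(rep1, rep2):.2f}")
```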

[00114] The present invention is unique in its ability to report quantitatively on enhancer strength and to discover regulatory elements directly based on their ability to enhance transcription. It is widely applicable to test candidate fragments from arbitrary sources of DNA in any cell-type or tissue that allows the introduction of candidate nucleic acid molecules. With the method described herein the inventors have successfully identified thousands of sequences that can function as cell-type specific enhancers with a continuum of enhancer activities over a wide range.

STARR-seq reveals transcription regulatory elements

[00115] Gene expression is regulated by genomic enhancers, the identification of which has remained challenging and depended on indirect measures of activity. The present application provides STARR-seq to directly and quantitatively assess enhancer activity for millions of candidates from arbitrary sources of nucleic acid molecules, enabling screens across entire genomes. Exemplarily applied to the Drosophila genome in two cell-types, STARR-seq identifies thousands of cell-type specific enhancers across a broad continuum of strengths, linking differential gene expression to differences in enhancer activity, and revealing several independent enhancers for many - even ubiquitously expressed - genes.

[00116] Specifically, the present inventors cloned a genome-wide reporter library from randomly sheared genomic DNA of a Drosophila melanogaster reference strain. This library contained at least 11.3 million independent candidate fragments with a median length of ~600bp as revealed by paired-end sequencing (Fig. 7A). It covered 96% of the non-repetitive genome at least 10 fold and is therefore sufficiently complex to comprehensively represent the entire 169Mb D. melanogaster genome (Fig. 7B, C, D; SOM).
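The coverage statement above (fraction of the genome covered at least 10 fold) can be illustrated with a small sketch. It assumes fragments are given as half-open (start, end) intervals on a single sequence and ignores the restriction to the non-repetitive genome; the function name and example fragments are hypothetical.

```python
def fraction_covered(fragments, genome_length, min_depth=10):
    """Fraction of positions covered by at least `min_depth` fragments.

    `fragments` are half-open (start, end) intervals on one sequence.
    """
    # Difference array: +1 where a fragment starts, -1 where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in fragments:
        diff[start] += 1
        diff[end] -= 1
    depth = 0
    covered = 0
    for pos in range(genome_length):
        depth += diff[pos]  # running sum gives the depth at this position
        if depth >= min_depth:
            covered += 1
    return covered / genome_length

# Hypothetical toy example: three overlapping fragments on a 10 bp sequence.
toy_fragments = [(0, 6), (2, 8), (4, 10)]
print(fraction_covered(toy_fragments, genome_length=10, min_depth=2))
```

The difference-array approach keeps the computation linear in the number of fragments plus the sequence length, which matters for libraries with millions of fragments.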

[00117] Next, the present inventors transfected the library into 1 billion D. melanogaster S2 cells and isolated STARR-seq reporter transcripts as part of the entire poly-adenylated cellular RNA pool. We selectively reverse-transcribed, PCR amplified, and sequenced candidate fragments by Illumina Solexa paired-end sequencing and mapped the paired sequence tags to the reference genome, quantifying for each position the enrichment of cDNA over input.

[00118] This yielded 5499 regions that were significantly enriched over input ("peaks"; binomial p-value < 0.001; empirical FDR=1.8%) and supported by 7 to 250 independent fragments (9 to 94873 sequence reads; the genome average was 2 fragments or 34 reads). Peaks were found at various genomic positions, both near housekeeping genes and developmental regulators, and included weak and strong (1953 with ≥ 3 fold enrichment) enhancers over a wide dynamic range (21-fold or 1900-fold when using all reads; Fig. 8A). Importantly, the present inventors did not find any evidence that the position of enhancer candidates inside the transcript during STARR-seq would bias their assessment: sequences that in their endogenous genomic contexts occur upstream or within transcribed regions showed the same strong linear correlation between STARR-seq and luciferase assays (Fig. 9A). Even sequences that contain transcript-destabilizing elements were not substantially depleted during STARR-seq (Fig. 9B; SOM). A biological replicate yielded highly similar results with a Pearson correlation coefficient (PCC) of 0.84 along the entire genome and 0.92 for the 5499 peak summits, indicating that STARR-seq is highly reproducible (Fig. 9C).
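A binomial enrichment test of cDNA over input can be sketched as below. The exact formulation used for peak calling is not given in this passage, so the sketch assumes one common framing: under the null hypothesis, each fragment at a region comes from the cDNA library with a probability equal to the cDNA share of all fragments. The function names and counts are hypothetical.

```python
from math import comb

def binomial_sf(k, n, p):
    """Upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

def enrichment_pvalue(cdna_count, input_count, cdna_total, input_total):
    """One-sided binomial p-value for cDNA enrichment over input at one region.

    Assumed null model: a fragment at the region is a cDNA fragment with
    probability cdna_total / (cdna_total + input_total).
    """
    p_null = cdna_total / (cdna_total + input_total)
    return binomial_sf(cdna_count, cdna_count + input_count, p_null)

# Hypothetical region: 20 cDNA fragments versus 2 input fragments,
# drawn from libraries of equal size.
p = enrichment_pvalue(20, 2, cdna_total=1000, input_total=1000)
print(f"p = {p:.2e}, called as peak at p < 0.001: {p < 0.001}")
```

The exact sum is adequate for the small counts of a toy example; for real data a library routine with a numerically stable tail computation would be preferable.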

[00119] To test the enhancer activity of identified STARR-seq peaks, the present inventors individually assayed 71 peak sequences chosen across a wide range of enrichments and 39 negative controls by standard luciferase reporter assays. This revealed a strong linear relation between STARR-seq enrichment and luciferase activity over the entire range of enrichment values (PCC=0.83). In particular, 92% of all tested peaks showed luciferase activities that were significantly (t-test p-value < 0.05) between 2 and 1000-fold above the negative controls. This establishes that STARR-seq quantitatively assesses the ability of candidate sequences to enhance transcription and is suitable for genome-wide functional enhancer identification.

[00120] The majority (55.6%) of the identified sequences that function as enhancers are located within introns, especially in the first intron (37.2%) and in intergenic regions (22.6%) (Fig. 10A; SOM). Surprisingly, 4.5% of the STARR-seq enhancers were located inside core promoters and overlapped annotated TSS, suggesting that these sequences can both initiate transcription and enhance transcription from a remote core promoter.

[00121] Interestingly, of the 284 strong enhancers within 500bp of a TSS, the vast majority (79%) were located downstream of the TSS within the 5' UTR and the first intron, emphasizing the importance of these regions for transcriptional regulation.

[00122] The strongest enhancers were neighboring housekeeping genes such as enzymes (e.g. CTP:phosphocholine cytidylyltransferase 1; Cct1) or constituents of the cytoskeleton (e.g. Actin5C) but also close to developmental regulators such as the TFs luna (#37), shn, or pnt, or the fly FGF receptor heartless (htl). In fact, the strongest identified enhancer was located in the intron of the TF zfh1, and 18 of the top 100 and 364 of all strong enhancers were in TF gene loci. The only prominent class of genes with poorly ranking enhancers were the ribosomal protein genes (e.g. RpS3), presumably because the enhancers of those genes require a "TCT" motif containing core promoter.

[00123] STARR-seq revealed an unexpected complexity of transcriptional regulation as even in a single cell-type many genes appeared to be regulated by several enhancers, each of which can function independently during STARR-seq (e.g. shn): 434 genes had at least two enhancers within 2kb of their TSS and 56 genes had three or more. This trend was even stronger when considering the entire locus for each gene: 203 genes had more than 5 and 26 more than 10 independently functioning enhancers. Among the 56 genes with significant enhancer clustering within 2kb around the TSS (p < 0.001) are 14 transcription factors but also 30 ubiquitously expressed housekeeping genes, including Actin5C and Cct1.
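Counting enhancers within 2kb of each TSS, as in the paragraph above, can be sketched as follows. The data layout (one TSS per gene, peaks represented by their summit position) and the function name are assumptions made for illustration.

```python
from collections import Counter

def enhancers_near_tss(tss_by_gene, peak_summits, window=2000):
    """Count peak summits within `window` bp of each gene's TSS.

    tss_by_gene: {gene: (chromosome, tss_position)}
    peak_summits: iterable of (chromosome, summit_position)
    """
    counts = Counter()
    for gene, (chrom, tss) in tss_by_gene.items():
        for peak_chrom, summit in peak_summits:
            if peak_chrom == chrom and abs(summit - tss) <= window:
                counts[gene] += 1
    return counts

# Hypothetical coordinates for two genes and four peak summits.
tss = {"geneA": ("chr2L", 10_000), "geneB": ("chr2L", 50_000)}
summits = [("chr2L", 9_000), ("chr2L", 11_500), ("chr2L", 53_000), ("chr3R", 10_000)]
counts = enhancers_near_tss(tss, summits)
multi = [gene for gene in tss if counts[gene] >= 2]
print(multi)
```

The nested loop is quadratic and only meant for small examples; for genome-wide data a sorted or interval-indexed lookup would be more appropriate.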

[00124] Interestingly, while the number of peaks per gene does not correlate with the gene's expression level as measured by RNA-seq (PCC: -0.12; Fig. 11A, B, C; SOM), the sum of the peaks' STARR-seq signals correlates well on average (PCC: 0.91). This directly links the expression level of a gene with the activity of its enhancers and provides a causal explanation for the wide range of gene expression levels observed (Fig. 11C; SOM).
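The two per-gene quantities contrasted above, the number of peaks and the sum of their STARR-seq signals, can be derived from a peak-to-gene assignment as in the following sketch. The assignment itself is taken as given; the function name and the example values are hypothetical.

```python
from collections import defaultdict

def summarize_peaks_per_gene(peaks):
    """Return {gene: (number_of_peaks, summed_enrichment)}.

    peaks: iterable of (gene, enrichment) pairs, one per assigned peak.
    """
    acc = defaultdict(lambda: [0, 0.0])
    for gene, enrichment in peaks:
        acc[gene][0] += 1
        acc[gene][1] += enrichment
    return {gene: (n, total) for gene, (n, total) in acc.items()}

# Hypothetical assignment: geneA has two weak peaks, geneB one strong peak.
assigned = [("geneA", 2.0), ("geneA", 3.5), ("geneB", 10.0)]
for gene, (n_peaks, total) in summarize_peaks_per_gene(assigned).items():
    print(gene, n_peaks, total)
```

In this toy example the gene with fewer peaks has the larger summed signal, which is the kind of situation in which peak count and summed activity would correlate differently with expression.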

[00125] Together, this suggests that transcription for a large number of genes - even in a single cell-type - is controlled by many enhancers, which presumably function additively or to ensure robustness (15, 16). Surprisingly, this was even true for ubiquitously expressed housekeeping genes.

[00126] STARR-seq assesses the ability of a DNA sequence to enhance transcription in a heterologous context given the regulatory trans environment within a cell, which can be viewed as the sequence's regulatory potential. The complementary DHS-seq and ChIP-seq determine enhancer-associated characteristics such as DNA accessibility and histone modifications in the endogenous genomic context of the cell. We sought to compare and combine the information provided by all three methods.

[00127] We performed DHS-seq in S2 cells (Fig. 12A, 13; SOM) and found that the vast majority (69%) of strong STARR-seq enhancers were accessible (DHS enrichment p<0.05) and all weak enhancers showed above random DHS enrichment on average (Fig. 14B, D), suggesting that they are active in their endogenous genomic context.

[00128] Interestingly however, this appeared not to be the case for 604 (31%) strong STARR-seq enhancers that were not accessible. Such closed STARR-seq enhancers occurred for example in introns of the homeobox (Hox) transcription factors Antp, Ubx, abd-A, and Abd-B, which are all not expressed in S2 cells (RNA-seq RPKM values < 0.1; Fig. 15). Indeed globally, genes next to closed STARR-seq enhancers were expressed at significantly lower levels compared to genes next to open STARR-seq enhancers (25-fold difference on median RNA-seq RPKM values; Wilcoxon p<2.2x10^-16).

[00129] As open and closed enhancers also function in luciferase assays and show a linear relation between STARR-seq and luciferase signals (Fig. 16A), this suggests that these sequences have enhancer potential, yet are silenced in their endogenous genomic contexts, presumably at the chromatin level.

[00130] Indeed, in stark contrast to open STARR-seq enhancers, closed enhancers are not marked by H3K27ac, a histone modification associated with active enhancers, but lie in broad domains of repressive H3K27me3, suggestive of Polycomb-mediated repression. Strikingly, open and closed enhancers are marked to similar extents by H3K4me1, which labels enhancers irrespective of their activity. The precise labeling of closed enhancers by H3K4me1 is particularly evident in Hox gene loci (Fig. 15) and holds genome-wide, suggesting that these sequences are recognized as functional enhancers in their endogenous genomic contexts, yet are actively repressed.

[00131] The present inventors also identified accessible regions directly (i.e. independently of STARR-seq) by their DHS-seq enrichment using MACS (FDR < 5%). This revealed 4544 accessible regions, of which 3066 overlapped annotated TSS and presumably constitute open core promoters (1342 [44%] also functioned as enhancers). Of the 1478 TSS distal regions, the majority (877, 60%) overlapped with STARR-seq peaks and an additional 112 might constitute weak enhancers with significant STARR-seq enrichment (p<0.05) that did not reach the stringent cutoffs required during genome-wide enhancer identification. The remaining 489 regions showed strong ChIP signals for insulator proteins, particularly CP190 and CTCF, which was significant (p<0.05) for 393 regions suggesting that they might function as insulators (Fig. 16B, C).
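The partitioning of accessible regions described above (TSS-overlapping versus TSS-distal, and distal regions with or without a STARR-seq peak) can be sketched with simple interval logic. The representation of regions as (chromosome, start, end) tuples with half-open coordinates, the label strings and the function names are assumptions for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def classify_accessible_regions(dhs_regions, tss_sites, starr_peaks):
    """Label each accessible region by TSS overlap and STARR-seq peak overlap.

    dhs_regions, starr_peaks: lists of (chrom, start, end)
    tss_sites: list of (chrom, position)
    """
    labels = {}
    for region in dhs_regions:
        chrom, start, end = region
        if any(c == chrom and start <= pos < end for c, pos in tss_sites):
            labels[region] = "TSS-overlapping"
        elif any(overlaps(region, peak) for peak in starr_peaks):
            labels[region] = "distal, STARR-seq peak"
        else:
            labels[region] = "distal, no peak"
    return labels

# Hypothetical coordinates.
dhs = [("chr2L", 100, 200), ("chr2L", 1000, 1200), ("chr2L", 5000, 5100)]
tss = [("chr2L", 150)]
peaks = [("chr2L", 1100, 1600)]
for region, label in classify_accessible_regions(dhs, tss, peaks).items():
    print(region, label)
```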

[00132] The present inventors asked whether some of the open regions act as strong enhancers in a different cell type and applied STARR-seq to D. melanogaster adult ovarian somatic cells (OSC (22)), which have a largely distinct gene expression profile (SOM). This identified a comparable number of enhancers (4682; p<0.001; FDR=0.2%) with a similar genomic distribution (Fig. 10C, D, 11A, B, 12B) and range of enhancer strengths, which - as for S2 cells - could be confirmed quantitatively by luciferase assays (PCC 0.75; Fig. 17).

[00133] Out of 8659 enhancers found in S2 or OSC cells, 5404 (62.4%) changed at least 2-fold and 2138 (24.7%) at least 4-fold between both cell types. Importantly, luciferase assays confirmed the differences of STARR-seq signals between both cell-types quantitatively with a strong linear agreement (R2=0.72; PCC=0.85). Changes in enhancer strengths between the two cell-types were reflected in the differential mRNA abundance of the flanking genes. For example, tj, which is exclusively expressed in OSC is next to a strong OSC-specific enhancer and the S2 cell specific gene nvy is next to three S2-specific peaks. Overall, 88% of all peaks near genes that are two-fold up-regulated in S2 cells are at least two-fold higher in S2 cells (42% are at least 4-fold higher), and the same trend holds for OSC enriched genes. This establishes a direct and causal link between quantitative differences in genome-wide enhancer activities and differential gene expression. Interestingly however, we also observed 514 genes for which cell-type specific enhancers had compensating effects: while individual enhancers changed more than two-fold between the cell-types, the sum of enhancer activities and the gene expression levels remained constant (<2-fold change; e.g. kuz).
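The fractions of enhancers that change at least 2-fold or 4-fold between the two cell types can be computed as in the following sketch, which assumes a strictly positive STARR-seq enrichment value per enhancer in each cell type. The function names and example values are hypothetical.

```python
def fold_change(a, b):
    """Symmetric fold change between two positive values (always >= 1)."""
    return max(a, b) / min(a, b)

def fraction_changed(pairs, threshold):
    """Fraction of (s2, osc) enrichment pairs changing at least `threshold`-fold."""
    changed = sum(1 for s2, osc in pairs if fold_change(s2, osc) >= threshold)
    return changed / len(pairs)

# Hypothetical (S2, OSC) enrichment values for four enhancers.
pairs = [(8.0, 2.0), (3.0, 1.0), (2.0, 3.0), (4.0, 4.0)]
print(fraction_changed(pairs, 2), fraction_changed(pairs, 4))
```

With the counts reported above, 5404 of 8659 enhancers corresponds to 62.4% and 2138 of 8659 to 24.7%.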

[00134] 2593 S2 and 1901 OSC enhancers were strictly cell-type specific as they were not detectably active in the respective other cell-type (STARR enrichment < 1.1-fold; p>0.05). Surprisingly, 484 (19%) of the S2-specific and 193 (10%) of the OSC-specific enhancers nevertheless appeared to be open and accessible in OSC and S2 cells, respectively. This included an enhancer in the shn locus for which we validated the S2-specific activity with luciferase assays (Fig. 18). This suggests that enhancer regions can be open in cells in which they are not active, presumably by regulatory proteins that bind to accessible DNA.

[00135] Taken together, the present inventors present STARR-seq, which complements ChIP-seq and DHS-seq as the third principal method to comprehensively study transcriptional regulatory elements at a genome-wide level. STARR-seq is unique in its ability to report quantitatively on enhancer strength and to discover regulatory elements directly based on their ability to enhance transcription. It is widely applicable to test candidate fragments from arbitrary sources of DNA in any cell-type or tissue as described herein that allows the introduction of candidate fragments.

[00136] Applied to two distinct Drosophila cell-types, STARR-seq revealed thousands of sequences that can function as cell-type specific enhancers with a continuum of enhancer activities over a wide range (Fig. 8). The cell-type specific enhancer activities correlated with the expression levels of inferred target genes, providing a direct and causal link between sequence-encoded enhancer activities and differential gene expression. The enhancers' genomic distribution reveals a complex picture of transcriptional regulation in which even broadly expressed "housekeeping" genes have multiple enhancers in a single cell-type, which might act additively or redundantly to increase robustness.

[00137] Combining STARR-seq with information on open chromatin and histone modifications revealed that a vast majority of the identified enhancer sequences were utilized in vivo.

[00138] The above-described STARR-seq method is thus applicable for the identification of transcription regulatory sequences. Though the above exemplifies the identification of enhancers, it is equally applicable for the identification of repressors as well as for any other purpose as described herein by applying the teaching described herein.

Claims

1. A method of identifying a transcriptional regulatory element which regulates a promoter comprising:
(a) optionally providing candidate nucleic acid molecules,
(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of the promoter,
(c) subjecting the library to conditions allowing transcription from the promoter,
(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,
(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d), and
(f) determining the presence of one or more transcriptional regulatory element from the candidate nucleic acid molecules based on the quantification.
2. The method of claim 1, wherein the transcriptional regulatory element is an enhancer.
3. The method of claim 1 or 2, wherein determination in (f) comprises comparing abundance of the candidate nucleic acid molecule in an input library and the cDNA.
4. The method of any one of the preceding claims, wherein a candidate nucleic acid molecule having enhancer activity transcribes itself, thereby increasing the number of its own transcripts.
5. The method of any one of the preceding claims, wherein the abundance of a candidate nucleic acid molecule is a read-out for the enhancer activity of the candidate nucleic acid molecule.
6. The method of any one of the preceding claims, wherein direct coupling of the candidate nucleic acid molecule to the transcriptional readout of its potential enhancer activity allows the identification of an enhancer element.
7. The method of claim 1, wherein the transcriptional regulatory element is a repressor.
8. The method of claim 1, wherein determination in (f) comprises comparing the lack of the candidate nucleic acid molecule in an input library and the cDNA.
9. The method of claim 7 or 8, wherein a candidate nucleic acid molecule having repressor activity represses transcription of itself, thereby decreasing the number of its own transcripts.
10. The method of any one of claims 7 to 9, wherein the lack of a candidate nucleic acid molecule is a read-out for the repressor activity of the candidate nucleic acid molecule.
11. The method of any one of claims 7 to 10, wherein direct coupling of the candidate nucleic acid molecule to the transcriptional readout of its potential repressor activity allows the identification of a repressor element.
12. The method of any one of the preceding claims wherein the insertion of the candidate nucleic acid molecule into the vector places the candidate nucleic acid molecule on the transcript produced in step (c).
13. The method of any one of the preceding claims, wherein the quantifying step (e) is carried out by next generation sequencing or microarray hybridization.
14. The method of any one of the preceding claims, wherein the candidate nucleic acid molecule is obtained from eukaryote, prokaryote, or virus.
15. The method of any one of the preceding claims, wherein the candidate nucleic acid molecules are obtained from cDNA, bacterial artificial chromosome, yeast artificial chromosome, bacterial vectors or eukaryotic vectors.
16. The method of any one of the preceding claims, wherein the candidate nucleic acid molecule is naturally occurring or artificial DNA or RNA.
17. The method of any one of the preceding claims, wherein the vector comprises a polyadenylation site which is downstream of the candidate nucleic acid molecule.
18. The method of any one of the preceding claims, wherein the vector is linear or circular.
19. The method of any one of the preceding claims, wherein linkers are added to both ends of the nucleic acid molecule before inserting it into the vector.
20. The method of claim 19, wherein the linkers are made compatible for bacterial recombination.
21. The method of any one of the preceding claims, wherein step (c) takes place in vitro.
22. The method of any one of the preceding claims, wherein step (c) takes place in a host or host cell.
23. The method of any one of the preceding claims, wherein reverse transcription of step (d) is coupled with an amplification step (RT-PCR).
24. The method of claim 22, wherein the host cell is a prokaryotic or eukaryotic host cell.
25. The method of any one of the preceding claims, wherein the promoter is a core promoter.
26. The method of any one of the preceding claims, wherein the promoter is a naturally occurring or artificial promoter.
27. The method of any one of the preceding claims, wherein the promoter is a cell-type specific promoter.
28. The method of any one of the preceding claims, wherein the reporter library comprises at least 10^7 members of nucleic acid molecules.
29. A method of determining the level of transcriptional regulatory activity of nucleic acid molecules comprising
(a) providing a candidate nucleic acid molecule,
(b) inserting the molecule into a vector downstream of a promoter,
(c) subjecting the vector to conditions allowing transcription from the promoter,
(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,
(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d), and
(f) determining the level of transcriptional regulatory activity of the nucleic acid molecule based on the quantification.
30. A method of optimizing a transcriptional regulatory element comprising
(a) providing candidate nucleic acid molecules comprising a transcriptional regulatory element and mutants thereof,
(b) preparing a reporter library of the candidate nucleic acid molecules by inserting the molecules into a vector downstream of a promoter,
(c) subjecting the library to conditions allowing transcription from the promoter,
(d) optionally reverse transcribing RNA obtained in step (c) into cDNA,
(e) quantifying RNA obtained in step (c) or the cDNA obtained in step (d),
(f) determining the level of transcriptional regulatory activity of the candidate nucleic acid molecules based on the quantification, and
(g) selecting at least one candidate nucleic acid molecule which has higher transcriptional regulatory activity than the transcriptional regulatory element.

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP12004520.8 2012-06-15
EP12004520 2012-06-15

Publications (1)

Publication Number Publication Date
WO2013186306A1 (en) 2013-12-19

Family

ID=48699739

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2013/062260 WO2013186306A1 (en) 2012-06-15 2013-06-13 Method for identifying transcriptional regulatory elements

Country Status (1)

Country Link
WO (1) WO2013186306A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008073303A2 (en) 2006-12-07 2008-06-19 Switchgear Genomics Transcriptional regulatory elements of biological pathways, tools, and methods
US7875440B2 (en) 1998-05-01 2011-01-25 Arizona Board Of Regents Method of determining the nucleotide sequence of oligonucleotides and DNA molecules
US7897345B2 (en) 2003-11-12 2011-03-01 Helicos Biosciences Corporation Short cycle methods for sequencing polynucleotides

Non-Patent Citations (66)

* Cited by examiner, † Cited by third party
Title
"A typical example thereof is Stamminger", J. VIROL., vol. 76, no. 10, 2002, pages 4836 - 4847
"Macromolecule Sequencing and Synthesis: Selected Methods and Applications", 1998, ALAN R. LISS, INC., article "Current Methods in Comparison and Analysis", pages: 127 - 149
"Nucleic acid hybridization, a practical approach", 1985, IRL PRESS OXFORD
ADAMS ET AL.: "The genome sequence of Drosophila melanogaster", SCIENCE, vol. 287, 2000, pages 2185 - 2195
ALTSCHUL, S. F. ET AL., J. MOL. BIOL., vol. 215, 1990, pages 403 - 10
ALTSCHUL, S. F. ET AL., METHODS ENZYMOL., vol. 266, 1996, pages 460 - 80
ALTSCHUL, S. F. ET AL., NUCLEIC ACIDS RES., vol. 25, 1997, pages 3309 - 402
ARNOLD COSMAS D ET AL: "Genome-Wide Quantitative Enhancer Activity Maps Identified by STARR-seq", SCIENCE (WASHINGTON D C), vol. 339, no. 6123, March 2013 (2013-03-01), pages 1074 - 1077, XP002712625, ISSN: 0036-8075 *
ARNOLD, SCIENCE, vol. 339, 2013, pages 1074 - 1077
AUSUBEL: "Current Protocols in Molecular Biology", 1989, GREEN PUBLISHING ASSOCIATES AND WILEY INTERSCIENCE
BANERJI ET AL.: "Expression of a beta-globin gene is enhanced by remote SV40 DNA sequences", CELL, vol. 27, 1981, pages 299 - 308
BOYLE ET AL.: "High-resolution mapping and characterization of open chromatin across the genome", CELL, vol. 132, 2008, pages 311 - 322
BRANTON ET AL.: "The potential and challenges of nanopore sequencing", NATURE BIOTECH, vol. 26, 2008, pages 1146 - 1153
BUECKER CHRISTA ET AL: "Enhancers as information integration hubs in development: lessons from genomics", TRENDS IN GENETICS, vol. 28, no. 6, 7 April 2012 (2012-04-07), pages 276 - 284, XP002685092, ISSN: 0168-9525 *
BUECKER ET AL.: "Enhancers as information integration hubs in development: lessons from genomics", TRENDS GENET, vol. 28, 2012, pages 276 - 284
CARROLL: "Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution", CELL, vol. 134, 2008, pages 25 - 36
CHERBAS ET AL.: "The transcriptional diversity of 25 Drosophila cell lines", GENOME RES., vol. 21, no. 2, 2011, pages 301 - 314
CLOONAN ET AL., NAT METHODS, vol. 5, 2008, pages 613 - 619
DEVEREUX, J. ET AL., NUCLEIC ACIDS. RES., vol. 12, 1984, pages 387 - 95
DHIMAN ET AL.: "Next-generation sequencing: A transformative tool for vaccinology", EXPERT REV VACCINES, vol. 8, 2009, pages 963 - 967
DUBENSKI ET AL., PROC. NAT. ACAD. SCI. US, vol. 81, pages 7529 - 33
FENG, D. F.; DOOLITTLE, R. F., J. MOL. EVOL., vol. 25, 1987, pages 351 - 60
FISHER ET AL., SCIENCE, vol. 312, 2006, pages 276
FUJITA ET AL.: "The UCSC Genome Browser database: update 2011", NUCLEIC ACIDS RESEARCH, vol. 39, 2011, pages D876 - 82
HEINTZMAN ET AL.: "Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome", NAT GENET, vol. 39, 2007, pages 311 - 318
HEINTZMAN ET AL.: "Histone modifications at human enhancers reflect global cell-type-specific gene expression", NATURE, vol. 459, 2009, pages 108 - 112
HIGGINS, D. G.; SHARP, P. M., CABIOS, vol. 5, 1989, pages 151 - 3
JOHNSON ET AL.: "Genome-wide mapping of in vivo protein-DNA interactions", SCIENCE, vol. 316, 2007, pages 1497 - 1502
KARLIN, S. ET AL., PROC. NATL. ACAD. SCI. USA, vol. 90, 1993, pages 5873 - 87
KAWASAKI ET AL., ANN. N. Y. ACAD. SCI., vol. 1020, 2004, pages 92 - 100
KHAN ET AL., BIOCHEM. BIOPHYS. ACTA., vol. 1423, 1999, pages 17 - 28
LEVINE ET AL.: "Transcription regulation and animal diversity", NATURE, vol. 424, 2003, pages 147 - 151
MARGULIES ET AL., NATURE, vol. 437, 2005, pages 376 - 380
MEIRELES-FILHO AC; STARK A, CURR OPIN GENET DEV., vol. 19, no. 6, December 2009 (2009-12-01), pages 565 - 570
MELNIKOV ET AL.: "Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay", NAT BIOTECHNOL, vol. 30, 2012, pages 271 - 277
MELTON ET AL., NUCL. ACID. RES., vol. 12, 1984, pages 7035
METZKER: "Sequencing technologies--The next generation", NATURE REVIEWS GENETICS, vol. 11, 2010, pages 31 - 46
OZSOLAK, F; MILOS, PM, DIRECT RNA SEQUENCING. EXPERIMENTAL MEDICINE, vol. 28, 2010, pages 2574 - 2580
OZSOLAK, F; MILOS, PM: "Direct RNA Sequencing", EXPERIMENTAL MEDICINE, vol. 28, 2010, pages 2574 - 2580
OZSOLAK, F; MILOS, PM: "RNA sequencing: advances, challenges and opportunities", NAT REV GENET., vol. 12, no. 2, February 2011 (2011-02-01), pages 87 - 98
PALL, FUNGAL GENETICS NEWSLETTER, vol. 40, 1993, pages 59 - 61
PATEL ET AL., BIOTECHNOL LETT., vol. 25, no. 4, 2003, pages 331 - 334
PATWARDHAN ET AL.: "Massively parallel functional dissection of mammalian enhancers in vivo", NAT BIOTECHNOL, vol. 30, 2012, pages 265 - 270
PATWARDHAN RUPALI P ET AL: "Massively parallel functional dissection of mammalian enhancers in vivo", NATURE BIOTECHNOLOGY, vol. 30, no. 3, March 2012 (2012-03-01), pages 265 URL, XP002685091 *
PEASE ET AL., PROC. NATL. ACAD. SCI. USA, vol. 91, 1994, pages 5022 - 26
PEASE ET AL., PROC. NATL. ACAD. SCI. USA, vol. 93, 1996, pages 10614 - 19
PEARSON, W. R.; LIPMAN, D. J., PROC. NATL. ACAD. SCI. USA, vol. 85, 1988, pages 2444 - 48
PFEIFFER ET AL.: "Tools for neuroanatomy and neurogenetics in Drosophila", PNAS, vol. 105, no. 28, 2008, pages 9715 - 9720
RONAGHI ET AL., ANAL BIOCHEM, vol. 242, 1996, pages 84 - 89
RONAGHI ET AL., SCIENCE, vol. 281, 1998, pages 363 - 365
SAGNERET, BIOCHEMICA, vol. 3, 2001, pages 15 - 17
SAITO ET AL.: "A regulatory circuit for piwi by the large Maf gene traffic jam in. Drosophila", NATURE, vol. 461, 2009, pages 1296 - 1299
SAMBROOK ET AL.: "Molecular Cloning: A Laboratory Manual", 15 January 2001, COLD SPRING HARBOR LABORATORY PRESS
SAMBROOK ET AL.: "Molecular Cloning: A Laboratory Manual", 2001, COLD SPRING HARBOR PRESS
SAMBROOK; RUSSELL: "Molecular Cloning, A Laboratory Manual", 2001, COLD SPRING HARBOR LABORATORY
SAMBROOK; RUSSELL: "Molecular Cloning: A Laboratory Manual", 15 January 2001, COLD SPRING HARBOR LABORATORY PRESS
SCHENA ET AL., SCIENCE, vol. 270, 1995, pages 467 - 470
SHALON ET AL., GENOME RES., vol. 6, 1996, pages 639 - 645
SMITH, F.; WATERMAN, M. S., ADV. APPL. MATH., vol. 2, 1981, pages 482 - 89
SOUTHERN, J. MOL. BIOL., vol. 98, 1975, pages 503 - 17
STAMMINGER THOMAS ET AL: "Open reading frame UL26 of human cytomegalovirus encodes a novel tegument protein that contains a strong transcriptional activation domain", May 2002, JOURNAL OF VIROLOGY, VOL. 76, NR. 10, PAGE(S) 4836-4847, ISSN: 0022-538X, XP002685090 *
TURNER ET AL.: "Next-generation sequencing of vertebrate experimental organisms", MAMM GENOME, vol. 20, 2009, pages 327 - 338
TYMMS: "In Vitro Transcription and Translation Protocols", METHODS IN MOLECULAR BIOLOGY, vol. 37
VISEL ET AL.: "Genomic views of distant-acting enhancers", NATURE, vol. 461, 2009, pages 199 - 205
VOELKERDING ET AL.: "Next-generation sequencing: From basic research to diagnostics", CLIN CHEM, vol. 55, 2009, pages 641 - 658
WHEELER ET AL., NATURE, vol. 452, 2008, pages 872 - 876

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13731712

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13731712

Country of ref document: EP

Kind code of ref document: A1