WO2005058931A2 - Methods and algorithms for identifying genomic regulatory sites - Google Patents

Info

Publication number
WO2005058931A2
Authority
WO
WIPO (PCT)
Prior art keywords
genomic
fragments
sub
dna
cells
Prior art date
Application number
PCT/US2004/042172
Other languages
French (fr)
Other versions
WO2005058931A3 (en
Inventor
John A. Stamatoyannopoulos
Michael Mcarthur
Original Assignee
Regulome Corporation
Priority date
Filing date
Publication date
Application filed by Regulome Corporation
Publication of WO2005058931A2
Publication of WO2005058931A3

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C12N15/1072: Differential gene expression library synthesis, e.g. subtracted libraries, differential screening

Definitions

  • the invention relates to methods for identifying regulatory sites in a genomic locus on the basis of the relative sensitivity of their chromatin to a DNA modifying agent.
  • the invention relates generally to methods of DNA analysis and more specifically to methods for analysis of genomic sequences.
  • the invention also relates to the use of these regulatory sites, databases comprising the same, and their use in regulating gene expression, disease diagnosis and therapy, and identification of therapeutic drugs.
  • Chromatin architecture plays a defining role in the control of eukaryotic genes in vivo as it determines the accessibility of critical genomic sequences to the regulatory and transcriptional machineries [Felsenfeld G. (1996) Cell 86, 13-9; Felsenfeld, G. & Groudine, M. (2003) Nature 421, 448-53]. Active regulatory foci within genomic sequences are detectable experimentally on the basis of pronounced sensitivity to cleavage when intact nuclei are exposed to DNA modifying agents, canonically the non-specific endonuclease DNase I [Gross, D. S., and Garrard, W. T. (1988) Annu. Rev. Biochem. 57, 159-197; Elgin, S. C.
  • DNase I hypersensitive sites (HSs)
  • cis-active elements spans the spectrum of known transcriptional and chromosomal regulatory activities including transcriptional enhancers, promoters, and silencers, insulators, locus control regions, and domain boundary elements [Felsenfeld G. (1996) Cell 86, 13-9; Gross, D. S., and Garrard, W. T. (1988) Annu. Rev. Biochem.
  • Such arrays have the potential to detect transcripts from virtually all actively transcribed regions of a cell or cell population, provided the availability of an organism's complete genomic sequence, or at least a sequence or library comprising all of its gene transcripts.
  • Such arrays may be employed to monitor simultaneously large numbers of expressed genes within a given cell population.
  • the simultaneous monitoring technologies particularly relate to identifying genes implicated in disease and in identifying drug targets (see, e.g., U.S. Patent Nos. 6,165,709; 6,218,122; 5,811,231; 6,203,987; and 5,569,588).
  • these array technologies generally rely on direct detection of expressed genes and therefore reveal only indirectly the activity of genetic regulatory pathways that control gene expression itself.
  • a detection system directed toward sensing the activity of particular genetic regulatory pathways or cis-acting regulatory elements could provide deeper information concerning a cell's regulatory state. Accordingly, the detection of active regulatory elements, particularly in related and interacting groups, potentially could become extremely important for delineation of regulatory pathways, and provide critical knowledge for design and discovery of disease diagnostics and therapeutics.
  • Most research in the area of gene regulation has focused on finding and using individual sequences either upstream or downstream of individual coding gene targets. Generally, the presence or absence of a particular DNA sequence is linked with increased or decreased expression of a nearby gene when determining the regulatory effect of the sequence.
  • the beta-like globin gene was shown to contain four major DNase I hypersensitive sites of possible regulatory function by studies that removed or added these sequences and that looked for an effect on gene expression in erythroid cells. See Grosveld et al., U.S. Patent No. 5,532,143. From related studies, Townes et al. asserted that two of the four DNase I hypersensitive sites might control genes generally in cells of erythroid lineage. Although an interesting development, these observations generally are limited to detection of effects on nearby coding sequences of known genes. Multiple regulatory units that behave coordinately are not readily amenable to analysis by these techniques, even though multiple gene and protein elements interact in even simple biological processes.
  • any tool that can provide simultaneous regulation system information would give rich benefits in terms of improved diagnosis, clinical treatment and drug discovery.
  • the basic chromatin fiber consists of an array of nucleosomes, each packaging around 200 base pairs of DNA; 146 is wound around the histone octamer, with the remainder forming a link to the next nucleosome.
  • all genomic DNA in the nucleus is packaged into chromatin, the architecture of which plays a central role in regulating gene expression (for reviews see Felsenfeld, G.
  • this packaging serves two purposes: (i) it is physically necessary to condense the mass of sequence information into a well-ordered regular structure that can be contained within the nucleus; and (ii) it imparts a level of site-specific 'epigenomic' information (Felsenfeld, G., 1992, Nature 355, 219-24), for example discriminating between sequences which are never to be transcribed and are stored in highly condensed heterochromatin, and those sequences which are actively transcribed and are maintained in a more accessible chromatin state.
  • Gene expression is regulated by several different classes of cis-regulatory DNA sequences including enhancers, silencers, insulators, and core promoters (Felsenfeld and Groudine, 2003, Nature 421, 448-53; Butler and Kadonaga, 2002, Genes Dev 16: 2583-2592; Gill, G., 2001, Essays Biochem 37: 33-43).
  • the core promoter is the site of formation of the RNA pol II transcription complex.
  • Enhancers and silencers act over distances of several kilobases (or more) to potentiate or silence pol II function. Insulator sequences prevent enhancers and silencers targeted to one gene from inappropriately regulating a neighbouring gene.
  • tissue-specific genes during development and differentiation occurs first at the level of chromatin accessibility and results in the formation of transcriptionally-competent genetic loci characterized by increased sensitivity (relative to inactive loci) to digestion with DNase I (Groudine et al, 1983, Proc Natl Acad Sci USA. 80:7551-7555; Tuan et al, 1985, Proc Natl Acad Sci USA. 82:6384-6388; Forrester et al, 1986, Proc Natl Acad Sci USA. 83:1359-1363).
  • Loci in an accessible chromatin configuration can subsequently respond to acutely activating signals, often conveyed by non-tissue-specific transcriptional factors that can gain access to the open locus and recruit or activate the basal transcriptional machinery.
  • the initial observation that active genes reside within domains of generally increased sensitivity to nucleases was made nearly 30 years ago (Weintraub, H. & Groudine, M., 1976, Science 193, 848-56).
  • The literature connecting DNase I hypersensitive sites (HSs) with genomic regulatory elements is extensive. DNase hypersensitivity studies had been employed to delineate the transcriptional regulatory elements of over 100 human gene loci. Typically, between 1 and 5 hypersensitive sites had been visualized for each of these loci. However, only a fraction of these had been precisely localized at the sequence level.
  • a critical defining feature of HSs is that the function of the DNA sequence component, i.e. its complex-forming activity, is intrinsic. The principal evidence for this is the fact that these sequences can be excised and inserted into other positions in the genome, where they exhibit the same functional chromatin activities.
  • HSs can form when included in constructs used to create either stably transfected cell lines (Fraser et al, 1990, Nucleic Acids Res 18:3503-3508) or transgenic animals (Lowrey et al, 1992, Proc Natl Acad Sci U S A 89, 1143-7; Levy-Wilson et al, 2000, Mol Cell Biol Res Commun 4, 206-11).
  • An important finding has been that HS sequences are rendered functional only upon assembly into nuclear genomic chromatin. These DNA sequences are thought to potentiate formation of a nucleoprotein complex in a manner that dramatically increases its probability of activation vs. neighboring DNA regions.
  • the stochasticity of nucleoprotein complex formation can be manipulated through the introduction of point mutations or small deletions or insertions in critical DNA binding bases or in juxtaposed sequences that affect overall stability (e.g., Stamatoyannopoulos et al., 1995, EMBO J 14, 106-16). Cooperative binding of regulatory factors in the context of chromatin results in sequence-specific 'remodeling' of the local chromatin architecture (Felsenfeld and Groudine, 2003. Nature 421; 448-453).
  • This focal 'remodeling' is the signature of active regulatory foci within genomic sequences and is detectable experimentally on the basis of pronounced sensitivity to cleavage when intact nuclei are exposed to DNA modifying agents, canonically the non-specific endonuclease DNase I (Gross and Garrard 1988. Annu. Rev. Biochem. 57; 159-197, Elgin 1984. Nature 309; 213-4, Wu 1980. Nature 286; 854-860).
  • cis-active elements spans the spectrum of known transcriptional and chromosomal regulatory activities including transcriptional enhancers, promoters, and silencers, insulators, locus control regions, and domain boundary elements (Felsenfeld 1996. Cell 86, 13-9, Gross and Garrard 1988. Annu. Rev. Biochem. 57; 159-197, Burgess-Beusse et al, 2002. Proc. Natl. Acad. Sci. USA 99; 16433-7).
  • HSs have also been observed to coincide with sequences governing fundamental genomic processes including attachment to the nuclear matrix (Jarman and Higgs 1988. EMBO J.
  • DNase hypersensitivity studies collectively comprise the most successful and extensively validated methodology for discovery of regulatory sequences in vivo, and had been employed to delineate the transcriptional regulatory elements of > 100 human gene loci.
  • Over 25 years of experimentation and legion publications by many investigators have established an inviolable connection between sites of DNase hypersensitivity in vivo and functional non-coding sequences that regulate the genome.
  • a genomic regulatory activity has ultimately been disclosed, even if such function is not immediately apparent due to temporal or spatial restriction of activity (e.g., Wai et al, 2003. EMBO J. 22; 4489-4500).
  • DNase I HSs are biological phenomena of independent significance; they are extensively reported even without specific studies of their contribution to transcription. Conversely, in every published case where a regulatory sequence with documented in vivo activity (e.g., a promoter or enhancer discovered by other means) has been assayed for nuclease hypersensitivity, the expected result has been found. It is now generally accepted that DNase HSs mark genomic sequences that bind regulatory factors in vivo with consequent disruption of the nucleosome array (Felsenfeld 1996. Cell 86; 13-19).
  • Nuclease hypersensitive sites are biologically bounded by (a) the positions of flanking nucleosomes and (b) limits on the area of DNA over which thermodynamically stable nucleoprotein complexes may form.
  • the extent of the regulatory domain is contained within the inter-nucleosomal interval, approximately 150-250bp. This interval corresponds to the size of sequence that is needed to place a canonical nucleosome and it has been a common assumption that HSs represent a break in the nucleosomal array that defines the vast majority of chromatin.
  • a core domain can be identified which is restricted to a region of approximately 80-120 base pairs in length, over which DNA-protein interactions take place (e.g., Lowrey et al, 1992, Proc Natl Acad Sci USA 89, 1143-7). Cooperative binding of transcription factors to such core regions is sufficient to exclude a nucleosome in vitro (Adams and Workman, 1995, Mol Cell Biol 15, 1405-1421) and this has been proposed as a common mechanism for how these sites may form in vivo. Nucleosomal mapping experiments have shown that HSs such as the Drosophila hsp26 promoter (Lu et al, 1995 EMBO J.
  • Flanking sequences surrounding the core region appear to modulate the activity of this core region, though this effect tapers off sharply.
  • the boundaries of the sequences needed for hypersensitivity can be defined functionally by performing deletion analyses followed by stable transfection of cells (Philipsen et al, 1993, EMBO J 12, 1077-85) or transgenic studies (Lowrey et al, 1992, Proc Natl Acad Sci U S A 89, 1143-7). These approaches define the minimum extent of sequence required to retain the biological function associated with the HS under examination. Many hypersensitive sites are observed to occur within broader domains of increased DNase sensitivity and therefore appear to be components of higher-order chromatin structures.
  • the nuclei are aliquoted and treated with a series of increasing intensities of DNase I (typically with increasing concentrations of the nuclease at a fixed incubation time, or alternatively with a fixed DNase I concentration and increasing incubation times).
  • the products are then deproteinated.
  • samples from each aliquot are digested with a restriction enzyme, run over an agarose gel, and transferred to a membrane.
  • a probe is selected that is proximal to either the 5' or 3' end of the restriction fragment. Fragments are often probed from both ends to visualize cutting over both strands.
  • Hybridization of a radiolabeled probe with the membrane highlights the parental band and sites that increase in intensity with increasing DNase concentration.
  • numerous technical barriers have prevented the broader application of conventional hypersensitivity assays to systematic detection of cis-active sequences on a genomic scale.
  • the protocol (a) is extremely labor intensive; (b) is dependent on the presence of suitably-positioned restriction sites; (c) is further dependent on the availability of a suitable sequence of 500+ bp juxtaposed to a restriction site that can function as a specific probe (i.e., does not contain any repetitive sequences); (d) is highly consumptive of tissue resources, and therefore quite vulnerable to tissue preparation-to-preparation variability; (e) suffers from numerous technical sources of variability including gel composition and running conditions, success of membrane transfer, success of probe labeling, hybridization conditions, wash conditions, and exposure conditions; and (f) does not provide quantitative data.
  • Cis-regulatory variation could manifest functionally in a variety of ways by impacting (a) the magnitude of gene expression; (b) regulation of tissue-specificity; (c) control over timing of expression during development and differentiation; (d) response to environmental stimuli (such as pharmacologic agents); or (e) some combination thereof.
  • lesions in one or more of the cognate cis-regulatory sites should be comparatively common.
  • cis-variation would provide the ideal substrate for a complex, semi-quantitatively varying phenotype.
  • Hum. Mutat. 12; 289) Gene induction is a well-described response to a variety of external stimuli, classically xenobiotics. Metabolism of diverse pharmaceuticals is also heavily influenced by inter-individual variation in expression of metabolizing genes.
  • enzymes which are known to be impacted by regulatory polymorphism are acetylcholinesterase (Shapira et al, 2000. Hum. Mol. Genet. 9; 1273-1281), glutathione-S-transferase (Coles et al, 2001. Pharmacogenetics 11; 663-669), monoamine oxidase (Denney et al, 1999. Hum. Genet. 105; 542-551; Sabol et al, 1998.
  • transcription factor binding site (TFBS)
  • This class includes algorithms such as the Gibbs sampler (Lawrence et al., 1993. Science, 262(5131):208-214), MEME (Bailey and Elkan, 1994. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pages 28-36) and Consensus (Hertz and Stormo, 1999. Bioinformatics, 15(7):563-577). Recent research in this area focuses on building richer motif models (Xing et al., 2003. Advances in Neural Information Processing Systems, Cambridge, MA, 2003.
  • Algorithms in the second class operate on much larger sequence databases; however, these algorithms generally assume that the statistical properties of a small collection of transcription factor binding sites are known a priori. Here, the problem is to locate statistically significant clusters of these binding sites, called regulatory modules, in genomic DNA. Three groups of algorithms for recognizing regulatory modules have been proposed. Algorithms in the first group use a sliding window approach, scoring each subsequence that appears in the window with respect to a given collection of motifs (Prestridge, 1995. Journal of Molecular Biology, 249:923-932, Kondrakhin et al, 1995.
  • hidden Markov models (HMMs)
  • the Fisher kernel support vector machine (SVM) method uses a discriminative algorithm based upon a hidden Markov model. In the presence of a small amount of data, discriminative techniques typically achieve better performance than similar, generative techniques.
  • Non-motif-based methods. The third class of algorithms for identifying cis-regulatory elements is the most general, requiring as input only a database of genomic DNA and producing as output, for example, the predicted locations of promoter regions or CpG islands. Many techniques in this class are non-motif based, capitalizing instead on compositional statistics (see Zhang (2002) Nature Reviews Genetics, 3:698-710, for a review). Some methods augment these statistics using libraries of known TFBSs (Crowley et al., 1997. Journal of Molecular Biology, 268:8-14) or libraries of words extracted in an unsupervised fashion from sequence databases (Scherf et al., 2000. Journal of Molecular Biology, 297:599-606). While most promoter recognition techniques are generative, at least one discriminative method has been described (Davuluri et al., 2001. Nature Genetics, 29(4):412-417).
  • This DNA can be efficiently isolated on paramagnetic streptavidin-coated beads, while the remaining fragments are washed away.
  • a second adapter is added to the captured DNA and the product cut from the beads with a rare-cutter. This population is enriched in DNase I-cut sites and is retained for the subtraction step.
  • a second population, depleted in DNase I cut sites, is prepared for the subtraction step. It is made by cutting DNase I-treated genomic DNA with NlaIII, or other restriction enzymes that introduce a four-nucleotide 3' overhang. Digestion of this DNA with Exonuclease III will preserve the NlaIII-NlaIII fragments, while any end generated by DNase I will be efficiently digested.
  • the resultant single-stranded DNA is removed by treatment with Mung Bean Nuclease and the remaining population biotinylated in vitro. An excess of this population is mixed with the first, and the sample is denatured and allowed to reanneal. Those (non-biotinylated) fragments generated by repeated DNase I digestion at a specific site (i.e., a hypersensitive site) will be more likely to self-anneal than to find a partner in the depleted population to form a heteroduplex. Extraction of the mixture with paramagnetic beads isolates the non-biotinylated homoduplexes, which are enriched in DNase I cut sites and hypersensitive sites. These are then amplified and cloned to make the genomic libraries.
  • promoter elements may be situated several hundred bases or even >1 kb distant from the TSS [Davuluri R.V., Grosse I., Zhang M.Q. (2001). Nat Genet. 29:412-7].
  • the promoter region may be located downstream from the TSS within the first intron [Reisman, D., Greenberg, M., and Rotter, V.
  • ACS clusters provide higher discriminant value. Poor correspondence between ACSs and DNase I HSs in intergenic regions prompted us to search for meta-features that might be more predictive of hypersensitive sites.
  • DNase I HSs are defined by a high frequency of DNase I cut sites over a given genomic interval. However, hypersensitivity only becomes manifest when cutting is averaged across a large population of individual chromosomes, each of which is cut in a stochastic fashion. We therefore hypothesized that in the context of a library of ACSs where each member represents a unique cutting event, DNase I hypersensitive sites would ultimately appear as clusters of ACSs mapping over small genomic intervals (on the order of 1000 bp) as larger numbers of clones were analyzed.
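The clustering hypothesis above can be illustrated with a minimal sketch. It assumes that ACS tags have already been mapped to integer coordinates on a single chromosome; the function name `cluster_positions`, the 1000 bp default span, and the minimum of two members per cluster are illustrative assumptions, not parameters taken from the source.

```python
from typing import Iterable, List, Tuple

def cluster_positions(
    positions: Iterable[int],
    max_span: int = 1000,   # assumed window size, adjustable
    min_members: int = 2,   # assumed minimum tags per cluster
) -> List[Tuple[int, int, int]]:
    """Group mapped tag positions on one chromosome into clusters.

    A cluster is a run of sorted positions whose total span (last minus
    first) does not exceed max_span. Returns (start, end, count) tuples
    for clusters with at least min_members positions.
    """
    clusters: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        # Close the current run once the new position would stretch it
        # beyond the allowed span.
        if current and pos - current[0] > max_span:
            if len(current) >= min_members:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_members:
        clusters.append((current[0], current[-1], len(current)))
    return clusters

if __name__ == "__main__":
    tags = [100, 350, 900, 5000, 20000, 20400]
    print(cluster_positions(tags))
```

This greedy pass only groups positions; whether a given cluster is denser than chance would allow is a separate statistical question.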
  • Evolutionarily conserved non-genic sequences (CNGs)
  • cloning and analysis of ACSs should be applicable to any eukaryotic cell type, providing the basis for the accumulation of comprehensive databases of cis-regulatory sequences.
  • Analysis of a large library of active chromatin sequences has provided several novel insights into the relationship between chromatin structure and gene expression.
  • a remarkable feature of the distribution of ACSs around transcriptional start sites was its symmetry. This suggests that proximal intron regions may be a rich reservoir of regulatory sequences [Aronow, B., D.
  • the present invention provides libraries and methods for creating libraries comprising arrays of cloned genomic fragments adjacent or co-incident with regulatory sites, such as nuclease hypersensitive sites, in the chromatin of any given organism or tissue type.
  • Another embodiment of the process comprises protocols that render the sequence of each concatamerized genomic fragment recognizable and of sufficient length to routinely and unambiguously or stochastically define the genomic location of the parental fragment, with concomitant benefit for the efficiency and scope of the genome-wide study of, for example, regulatory sites.
  • Other embodiments of the invention include the statistical analysis of the distribution of the mapped genomic locations of a collection of the fragments.
  • genomic locations where there is a high incidence of genomic tags in order to increase the predictive value of genomic tags for mapping, for example, regulatory sites.
  • methods for ascertaining the effect of an agent or other environmental perturbation on an active chromatin element profile of a genetic locus by obtaining and analysing a concatamerised library associated with a biological sample unexposed to an agent or perturbation; obtaining and analysing a second concatamerised library associated with a biological sample exposed to the agent or perturbation; and comparing the first analysis with the second to determine regulatory sites that are affected by the agent or perturbation.
  • the perturbation occurs before obtaining the sample from a tissue
  • the environmental perturbation is selected from the group consisting of an infection of the eukaryotic organism from a microorganism, loss in immune function of the eukaryotic organism, exposure of the tissue to high temperature, exposure of the tissue to low temperature, cancer of the tissue, cancer of another tissue in the eukaryotic organism, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound; and aging.
  • the present invention provides methods for the large-scale isolation of regulatory sites in chromatin, comprising treating chromatin with an agent that modifies DNA at regulatory sites and fragmenting the modified chromatin, and isolating sub-fragments that contain or are adjacent to DNA modifications.
  • the modified genomic DNA is prepared by treating chromatin with an enzyme, a chemical agent, radiation, a shearing device, or a combination thereof.
  • the chromatin can be obtained from cell nuclei of a biological sample, for example samples of primary cells, mammalian cells, human cells, murine cells, plant cells, fly cells, worm cells, fish cells, diseased cells, cancerous cells, yeast cells, transformed cells and cell lines, embryonic cells, stem cells, yeast artificial chromosomes containing mammalian DNA sequences, plant artificial chromosomes containing eukaryotic DNA sequences, and nuclear extracts and combinations thereof.
  • the agent is selected from the group consisting of enzymes (e.g., a nuclease, a non-specific endonuclease, a sequence-specific endonuclease, a topoisomerase (e.g., topoisomerase II), a methylase, a histone acetylase, a histone deacetylase, or a combination thereof), radiation (e.g., UV light, lasers, and ionizing radiation), chemical agents (e.g., a clastogen or a cross-linker), shearing, centrifugation, or electrophoretic devices, and combinations thereof.
  • the enzyme can be endogenous or exogenous.
  • the DNA modifying enzyme is a non-specific endonuclease such as DNase I or a sequence-specific endonuclease such as a restriction endonuclease.
  • one or more oligonucleotide linkers are ligated to fragments containing modifications directly or indirectly resulting from treatment of chromatin with the DNA-modifying agent.
  • the oligonucleotide linkers contain one member of a binding pair, such as biotin. In specific modes of the embodiment, the other member of the binding pair is streptavidin-coated paramagnetic beads.
  • the fragments can be isolated by binding the biotinylated oligonucleotides to streptavidin-coated paramagnetic beads.
  • the fragments are amplified prior to isolation, for example by ligating one or more oligonucleotide linker adapters to the isolated fragments and amplifying the fragments with primers complementary to said oligonucleotide linkers.
  • the methods of the invention further comprise the step of releasing a sub-fragment from the ends of the isolated fragments.
  • a preferred embodiment of the foregoing methods further includes a step where the sub-fragments are self-ligated to form higher-molecular weight concatamers.
  • the present invention provides methods for obtaining a library of concatamerized genomic sub-fragments, comprising: (a) treating a sample with a DNA modifying agent; (b) isolating genomic sub-fragments associated with the site of modification; (c) self-ligating said genomic sub-fragments to form concatamers, each concatamer comprising two or more sub-fragments; and (d) creating a library containing the concatamers.
  • the present invention provides methods for detecting genomic regions of co-incidence of positions of genomic sub-fragments, comprising detecting genomic regions of co-incidence of the positions of genomic sub-fragments, e.g., unique genomic sub-fragments, within a collection of sub-fragments that is greater than the co-incidence expected if the sub-fragments were distributed uniformly within the genome.
  • the present invention further provides methods for detecting genomic positions that show increased sensitivity to a DNA modifying agent, comprising detecting genomic regions of co-incidence of the positions of genomic sub-fragments, e.g., unique genomic sub-fragments, within a collection of sub-fragments that is greater than the co-incidence expected if the sub-fragments were distributed uniformly within the genome, wherein the collection of sub-fragments is generated from a sample treated with said modifying agent.
  • the methods further entail mapping such sub-fragments to genomic locations of origin.
  • the present invention provides, inter alia, methods for detecting genomic regions of co-incidence of positions of genomic sub-fragments, comprising: (a) treating a sample with a DNA modifying agent; (b) isolating genomic sub-fragments associated with the site of modification; (c) self-ligating said genomic sub-fragments to form concatamers, each concatamer comprising two or more sub-fragments; (d) creating a library containing the concatamers; (e) sequencing the genomic sub-fragments; and (f) mapping the genomic sub-fragments to unique positions within the genome.
  • a library, e.g., a library of concatamerized genomic sub-fragments, contains anywhere from 10 to 1 million members, more preferably from 100 to 100,000 members. In a specific embodiment, a library contains from 1,000 to 10,000 members. In certain embodiments, the library contains at least 100, more preferably at least 250, 500, 1,000, 2,000, or 5,000 members. In other embodiments, the library contains more than 10,000 members.
  • the library preferably comprises the sub-genomic fragments or concatamers of sub-genomic fragments in a plasmid or phage vector, although any other vehicle suitable for construction of a library may be used.
  • the tag sequences e.g., of the genomic sub-fragments
  • MPSS Massively Parallel Signature Sequencing
  • the present methods can be utilized to ascertain the effect of an agent or other environmental perturbation on the composition of a concatamerised library.
  • the method encompasses (a) obtaining a first concatamerised library from a biological sample unexposed to the agent or perturbation; (b) obtaining a second concatamerised library from a biological sample exposed to the agent or perturbation; and (c) comparing the composition of the first library with that of the second to determine the regulatory sites affected by the agent or perturbation.
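  • The comparison in step (c) can be illustrated with a minimal sketch. It assumes, purely for illustration, that each library has already been reduced to a list of (chromosome, position) sites, and that two sites within a hypothetical `tolerance` of each other count as the same site. The function name and parameters are not taken from the specification.

```python
from bisect import bisect_left

def sites_changed(unexposed, exposed, tolerance=250):
    """Return (gained, lost) sites between two libraries.

    Each library is a list of (chromosome, position) tuples.
    `tolerance` (bp) is an assumed matching distance, not a value
    given in the text.
    """
    def index(sites):
        # Group positions by chromosome and sort them for binary search.
        by_chrom = {}
        for chrom, pos in sites:
            by_chrom.setdefault(chrom, []).append(pos)
        for positions in by_chrom.values():
            positions.sort()
        return by_chrom

    def has_match(idx, chrom, pos):
        # True if any indexed site lies within `tolerance` of pos.
        positions = idx.get(chrom, [])
        i = bisect_left(positions, pos - tolerance)
        return i < len(positions) and positions[i] <= pos + tolerance

    idx_unexposed, idx_exposed = index(unexposed), index(exposed)
    gained = [s for s in exposed if not has_match(idx_unexposed, *s)]
    lost = [s for s in unexposed if not has_match(idx_exposed, *s)]
    return gained, lost
```

  • Sites reported as gained or lost are only candidates for regulatory sites affected by the agent; a real comparison would also need to account for differences in library depth.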
  • the perturbation occurs before obtaining the sample from a tissue.
  • the perturbation can be, for example, an infection of the eukaryotic organism from a microorganism, loss in immune function of the eukaryotic organism, exposure of the tissue to high temperature, exposure of the tissue to low temperature, cancer of the tissue, cancer of another tissue in the eukaryotic organism, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound; and aging.
  • the perturbation can occur after obtaining the sample from a tissue.
  • the perturbation can be exposure of the tissue to high temperature, exposure of the tissue to low temperature, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound, and aging.
  • the present invention provides computer readable media comprising the genomic locations of genomic sub-fragments associated with a particular treatment of a sample.
  • the present invention provides computer readable media comprising the genomic locations of co-incidences of genomic sub-fragments associated with a particular treatment of a sample.
  • the computer readable medium can optionally contain information relating genomic sub-fragments associated with a particular disease or disorder, such as cancer, or a specific cell, e.g., a mammalian cell, a diseased cell, and/or a cell that has been treated with a drug or agent.
  • the present invention yet further provides methods of detecting a disease or disorder in a subject, comprising: a. creating a computer readable medium of the invention associated with a diseased state; and b.
  • the present invention yet further provides methods of qualifying a patient for a clinical trial or therapy, comprising: (a) creating a computer readable medium of the invention associated with a patient treated with an agent; and (b) comparing said computer-readable medium with a computer readable medium of the invention associated with a suitable patient treated with same agent.
  • the methods of the present invention are facilitated by the use of clustering algorithms, referred to herein as a "HSC algorithm" or HSCA.
  • the HSC algorithm works by identifying genomic regions of statistical discrepancy subject to a uniformity assumption.
  • The Algorithm. The HSCA works as follows.
  • 1. A tag library, a series of windowed neighborhoods, a set of density weights for those windows, a minimum required density, and a minimum required standard deviation for observation for clusters are input.
  • 2. The algorithm considers windows of increasing size over the range of input windows around each library tag, looking for cases where the number of observed clones has exceeded the expected mean plus the minimum number of standard deviations.
  • 3. The positions of the clones in these windows are averaged and a measure of their dispersion is computed.
  • 4. A cluster-merging step then takes place where consecutive, overlapping regions of high density are merged subject to the windowing requirements. The algorithm looks for structure in the ranges from the minimum to the maximum input range, but prefers optimizing to the set of input weighting factors and the larger window sizes.
  • 5. Finally, cluster centers, chromosome, position, and structural information about the average number of clones per window and dispersion are output.
  • the following refinements may advantageously be applied to the HSCA algorithms of the invention:
  • a. Incorporate recognition of known HS sites for input data simplification.
  • b. Remove Ensembl transcripts and replace them with gene names where possible for better genetic marking.
  • c. Replace binary intron region marking with closest distance to an exon boundary.
  • d. Distinguish between tags upstream in the probable promoter region and those in the first exons or introns.
  • e. Distinguish between tags in the first intron and other introns.
  • f. Add a category for tags inside coding regions.
  • g. Flag as a separate category the case where a cluster is 3' of gene I but closer to the 5' end of that gene than to the 5' end of gene II downstream.
  • h. Reduce the significance threshold for non-cluster tags that fall in conserved regions.
  • i. Recognition of mitochondrial and alpha-satellite DNA sequences and subsequent filtering of those tags.
  • j. Flag if near an expressed gene.
  • FIG. 1 Distribution of Active Chromatin Sequences parallels genes. Distribution of Active Chromatin Sequences (ACSs) (small vertical bars, top) and genes (Ensembl; HG12) are shown along 33.1Mb of human chromosome 21. Stacking of ACSs and genes is due to compactness of the horizontal axis.
  • ACSs Active Chromatin Sequences
  • FIG. 2 Density of Active Chromatin Sequences peaks at transcription start sites and CpG islands.
  • X-axis: normalized distance (bp) relative to transcription start sites (panels 2a, 2b) or 3' transcription termini (panel 2c) of 16,169 RefSeq genes, or to 28,890 CpG islands (panel 2d).
  • Y-axis: average number of ACSs per 100 bp bin. Centered distances of the ACSs from each genomic feature were computed. To avoid the problem of multiply assigned ACSs, a fractional counting technique was used whereby the number of times an ACS is assigned is recorded.
  • a histogram corresponding with equal subdivisions is constructed where the number of ACSs assigned to each class is scaled by the fractional multiple assignment count. Thus, if an ACS is assigned to two distinct transcription start sites, a value of 1/2 is assigned to each histogram class. Finally, normalizing the classes by the total number of assigned tags gives the average tag density in the class as depicted in the diagrams. Peaks in ACS density at transcription start sites and at CpG islands are evident, whereas no peak is found at transcription 3' termini. The peak at CpG islands remains even when non-promoter associated CpGs are considered (panel 2d).
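  • The fractional counting described for this figure can be sketched as follows. This is a minimal illustration only: the function name, the input layout (an ACS identifier mapped to its signed distances from every feature it was assigned to) and the default bin parameters are assumptions rather than details taken from the specification.

```python
def fractional_histogram(assignments, bin_size=100, half_width=5000):
    """Build a density histogram with fractional counting.

    assignments: dict mapping an ACS id to a list of signed distances
    (bp) to each genomic feature the ACS was assigned to.
    An ACS assigned k times contributes 1/k to each of its bins.
    """
    n_bins = (2 * half_width) // bin_size
    hist = [0.0] * n_bins
    assigned = 0
    for distances in assignments.values():
        in_range = [d for d in distances if -half_width <= d < half_width]
        if not in_range:
            continue
        assigned += 1
        weight = 1.0 / len(in_range)  # fractional multiple-assignment count
        for d in in_range:
            hist[(d + half_width) // bin_size] += weight
    if assigned:
        # Normalize by the total number of assigned tags.
        hist = [h / assigned for h in hist]
    return hist
```

  • With this weighting, every assigned ACS contributes a total of one count regardless of how many features it neighbors, so densely annotated regions are not over-represented.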
  • FIG. 3 Gene expression as a determinant of ACS distribution. The expression status of RefSeq genes was determined by cDNA microarray analysis of K562 cDNA. Genes were categorized according to whether or not they were expressed, and the average density of tags within a 10 kb window around the transcriptional start sites was compared. ACSs show a preference for expressed genes. However, a prominent peak in ACS density is still evident at non-expressed genes, suggesting that many of these lie within open chromatin domains.
  • ACS clusters provide more powerful discrimination.
  • ACS clusters are better predictors of DNase I hypersensitivity (see text) and show more prominent aggregation around known or suspected functional genomic landmarks including TSSs (panel 4a), CpG islands (not shown), and evolutionarily-conserved non-genic sequences (panel 4b).
  • Relative densities were calculated as described in Fig 2.
  • Figure 6 illustrates the approach to forming concatamerised tag libraries.
  • Processes of embodiments of the invention to generate concatamer fragment libraries generally start with isolation of intact nuclei from cells and then treatment of nuclei with an agent capable of modifying chromatin at ACSs. DNA is recovered and sequence fragments containing ACSs are isolated. The isolated fragments are then sub-cloned into cloning vectors.
  • a representative overall process may be divided into the following three stages: (I) Preparation of DNA which contains one or more single-stranded or double-stranded modification sites within domains defined by ACSs; (II) Isolation of short segments of DNA fragments associated with ACSs (typically 16 to 21 bp, referred to as tags); and (III) Ordered cloning of concatamerised tags isolated in (II) to create a library representing the ACSs found within the DNA source employed in (I).
  • Each of these stages may be carried out in a variety of ways and has utility for a number of uses as will be appreciated by a skilled artisan.
  • Forming concatamer libraries is a powerful strategy that allows the concatamerization and cloning of sub-fragments from clones within a library. Most importantly, the discovery allows the cloning and concatamerization of far smaller fragments than are usually manipulated and isolated in previously known library production methods. Protocols used in the strategy allow clear identification of small fragments as independent entities. Generally, in this context, each fragment in a concatamerized clone is sequenced over a length sufficient to allow its identification and placement by comparison to a relevant database pertaining to the source material, such as a human genome database.
  • Overview
  • The concatamer approach expedites the sequencing and/or analysis of libraries or similar collections of nucleic acid fragments.
  • the DNA may, for example be prepared as an enriched fraction of genomic sequences in a manner that identifies clearly each fragment.
  • fragments may be identified by size or by combination with an artificially introduced marker sequence, which is referred to herein as a 'Tag'.
  • Sequencing of such concatamer clones is more efficient because many Tags may be read for a single sequencing reaction.
  • the length of sequence that needs to be read to locate the Tag in a database depends on the nature of the database to which the Tag is being mapped (a genomic or EST database, for example). It has been calculated that for the human genome the sequence length can be as short as 16 nucleotides. Hence in this scenario a single sequencing reaction can map approximately 30 locations in the genome. This ability to read multiple Tags simultaneously provides substantial savings in time, resources and money for the sequencing effort.
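  • The arithmetic behind these figures can be reproduced with a short sketch. It rests on assumptions that are not stated in the text: a genome of roughly 3 x 10^9 bp treated as random sequence, and a usable read length of about 500 bases per sequencing reaction. Real genomes are repetitive, so the result is a lower bound on useful tag length rather than a guarantee of uniqueness.

```python
def shortest_distinguishing_length(genome_size):
    """Smallest tag length k for which the number of possible
    k-mers (4**k) exceeds the number of positions in the genome."""
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k

def tags_per_read(read_length, tag_length):
    """Number of whole tags that fit in one sequencing read,
    ignoring any linker or punctuation sequence between tags."""
    return read_length // tag_length

HUMAN_GENOME_BP = 3_000_000_000   # assumed approximate genome size
READ_LENGTH = 500                 # assumed usable read length (bases)

k = shortest_distinguishing_length(HUMAN_GENOME_BP)
n = tags_per_read(READ_LENGTH, k)
```

  • Under these assumptions `k` is 16 and `n` is 31, in line with the figures of 16 nucleotides and approximately 30 locations per reaction given above.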
  • DNA may be derived from any eukaryotic cell population including animal cells, plant cells, virus-infected cells, immortalized cell lines, cultured primary tissues such as mouse or human fibroblasts, stem cells, embryonic cells, diseased cells such as cancerous cells, transformed or untransformed cells, fresh primary tissues such as mouse fetal liver, or extracts or combinations thereof. Chromatin may also be obtained from natural or recombinant artificial chromosomes and the like. Still further, the DNA also may be assembled into chromatin in vitro using previously sub-cloned large genomic fragments or human or yeast artificial chromosomes. Sample preparation often begins with chromatin from cellular material.
  • the chromatin is extracted from a eukaryotic cell population such as a population of animal cells, plant cells, virus-infected cells, immortalized cell lines, cultured primary tissues such as mouse or human fibroblasts, stem cells, embryonic cells, diseased cells such as cancerous cells, transformed or untransformed cells, fresh primary tissues such as mouse fetal liver, or extracts or combinations thereof.
  • Chromatin may also be obtained from natural or recombinant artificial chromosomes.
  • the chromatin may have been assembled in vitro using previously sub-cloned large genomic fragments or human or yeast artificial chromosomes.
  • multiple ACS sequences and/or location sites are obtained from a eukaryotic cell sample by first extracting and purifying nuclei from the sample as described, for example, in U.S. No. 09/432,576. Briefly, a sample is treated to yield preferably between about 1,000,000 and 1,000,000,000 separated cells. The cells are washed and nuclei removed by, for example, NP-40 detergent treatment followed by pelleting of nuclei. After obtaining the DNA, ACSs are labeled, preferably with an agent that preferentially reacts with genomic DNA at ACSs and marks the DNA, typically by cutting or binding to the DNA. This alteration often will involve breaking or making a covalent bond within specific ACSs.
  • a nuclease may mark by cutting the ACS.
  • a non-specific nuclease such as DNase I cuts DNA at ACSs to produce DNase I ACSs.
  • DNase I is used to form two single strand breaks near each other, typically within 5 bases of each other. After reaction with hypersensitive DNA sites the reacted DNA is, if not already, converted into smaller fragments and the reacted fragments optionally are amplified and separated into a library.
  • agents and methods that may be used to mark eukaryotic DNAs at ACSs include, for example, radiation such as ultraviolet radiation, chemical agents such as chemotherapeutic compounds that covalently bind to DNA or become bound after irradiation with ultraviolet radiation, other clastogens such as methyl methane sulphonate, ethyl methane sulphonate, ethyl nitrosourea, Mitomycin C, and Bleomycin, enzymes such as specific endonucleases, non-specific endonucleases, topoisomerases, topoisomerase II, single-stranded DNA-specific nucleases such as S1 or P1 nuclease, restriction endonucleases, EcoRI, NlaIII, Hsp92I, StyI, methylases, histone acetylases, histone deacetylases, and any combination thereof.
  • clastogens may be used to break DNA and the broken ends tagged and separated by a variety of techniques.
  • Compounds that covalently attach to DNA are particularly useful as conjugated forms to other moieties that are easily removable from solution via binding reactions such as biotin.
  • the field of antibody or antibody fragment technology has advanced such that antibody antigen binding reactions may form the basis of removing labeled, nicked or cut DNA from an ACS.
  • Genomic DNA is isolated from DNase I-treated nuclei and the repaired ends ligated to a biotinylated adaptor containing a recognition site for BsgI, a type IIs restriction enzyme which cuts a fixed distance downstream of its recognition site, and NotI. Digestion with BsgI generates fragments of uniform size.
  • Figure 6A shows that the fragments are captured on beads whilst the remaining genomic fragments are washed away. The DNA can be recovered from the beads by digestion with NotI and these fragments ligated together to form concatamers.
  • Figure 6B shows that size selection of the concatamers produces the reagent which is cloned to make the library.
  • recognition sites for type IIS restriction enzymes are introduced, for example by ligating adapters to a target sequence or by cloning a fragment into a designed vector that contains such sites immediately adjacent to the cloning site.
  • Enzymes such as BsgI or MmeI, which cut 16 and 20 nucleotides downstream of their recognition sites, respectively, are particularly useful, and allow isolation of stretches of novel sequence adjacent to common sequence derived from the linker or vector. Variations of this method are contemplated and included within the ambit of this embodiment of the invention.
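  • The effect of a type IIS enzyme can be mimicked in silico. The sketch below assumes the BsgI recognition sequence GTGCAG with a top-strand cut 16 nucleotides downstream, and it models only the top strand in one orientation; the function name and defaults are illustrative, not part of the described method.

```python
def extract_tags(sequence, site="GTGCAG", reach=16):
    """Return the `reach` nucleotides immediately downstream of each
    occurrence of a type IIS recognition `site` (top strand only).

    Defaults model BsgI as an assumption; the bottom-strand cut and
    sites in the reverse orientation are ignored in this sketch.
    """
    sequence = sequence.upper()
    tags = []
    start = sequence.find(site)
    while start != -1:
        tag_start = start + len(site)
        tag = sequence[tag_start:tag_start + reach]
        if len(tag) == reach:      # skip sites too close to the end
            tags.append(tag)
        start = sequence.find(site, start + 1)
    return tags

# Linker ending in the recognition site, joined to adjacent genomic sequence.
linker = "GGCTCTCATGATTATGTGCAG"
genomic = "ACGTACGTACGTACGTTTTT"
tags = extract_tags(linker + genomic)
```

  • Here the single tag returned is the first 16 nucleotides of the genomic sequence adjacent to the linker, which is the "novel sequence adjacent to common sequence" referred to above.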
  • physical or enzymatic fractionation of the DNA which has previously been cut with a restriction enzyme, is used to produce more heterogeneous fragments. These fragments, when concatamerized and sequenced are recognizable due to the presence of the known restriction site.
  • restriction fragments of known length are isolated by hybridization to a set of oligonucleotides, all of which contain the restriction site. Over the desired length of random bases, duplexes between the oligonucleotide and the target fragment will be insensitive to digestion by single strand specific nucleases. This treatment generates a population of restriction fragments of the same length.
  • Concatamer Libraries from Fragmented DNA
  • Concatamers may be formed from cut fragments by ligation of linkers into the breakpoints. It was discovered that such cloning of breakages in genomic DNA can be very informative for studying the cutting or shearing patterns of enzymes (e.g. DNase I, S1 nuclease, which are useful probes for chromatin structure), chemical agents (such as medically important intercalators or molecules which show high specificity for certain DNA structures), physical agents (such as UV irradiation or shearing) and natural processes (such as apoptosis). After cutting, linkers are ligated onto the breakpoints.
  • the site of the breakpoint is repaired by treatment with T4 DNA polymerase, an enzyme capable of converting both 5' and 3' overhangs into blunted ends.
  • a linker constructed by the annealing of equimolar amounts of synthetic oligonucleotides designed to contain a recognition site for a type IIs restriction enzyme, such as BsgI, in such a place that following ligation it would be adjacent to the repaired breakpoint as well as containing other restriction sites and a biotin molecule to allow separation.
  • oligonucleotides 5'-GGC TCT CAT GAT TAT GTG CAG-3' (SEQ ID NO 1), and 5'-CTG CAC ATA ATC ATG AGA GCC-Biotin-3' (SEQ ID NO 2).
  • the blunted ends may be A-tailed by the action of Taq polymerase in the presence of dATP to create an end with an overhang to facilitate ligation of the linker, which now is designed to incorporate the complementary overhang.
  • a ligase is used.
  • the linker attachment is not necessarily formed by a ligase.
  • Alternative methods include, for example the use of commercially available custom oligonucleotides with a stalled topoisomerase II molecule attached which effects a joining reaction.
  • a DNA having added biotinylated linker now can be separated for example by a solid phase.
  • the linker may be captured onto streptavidin coated paramagnetic beads (such as those supplied by Dynal, Norway).
  • one embodiment of the invention is the combination of NlaIII to improve solid phase capture in these reactions as well as other reactions not limited to those described herein.
  • To remove linker that has not been incorporated into the DNA it can be necessary to clean the reaction prior to capture on commercially available columns (such as DNeasy from Qiagen, CA) and/or to treat with exonuclease III.
  • the latter degrades DNA in a 3' to 5' direction but will not cut sites which are blocked by the addition of biotin or have a four nucleotide 3' overhang, such as that created by NlaIII.
  • the prepared DNA optionally is treated with a restriction enzyme prior to adding the first linker. When this occurs a second linker can be added to the end created by the digestion.
  • a linker made from the following oligonucleotides can be added: 5'-GCG TAC TCC GAC TCG CTA TAG ATC ATG-3' (SEQ ID NO 3), and 5'-ATC TAT AGC GAG TCG GAG TAC GC-3' (SEQ ID NO 4).
  • This step creates a PCR competent molecule that can be amplified. Amplification can be necessary if the amount of starting material is limiting or if the cutting event within a large DNA population is very rare.
  • PCR amplified material then can either be cloned into commercially available vectors designed to capture PCR fragments or, if one of the primers used in the PCR reaction contained a 5' biotin, the product can be directly captured onto beads.
  • Other binding partners and separation methodologies can be used as well, as will be appreciated by a skilled artisan.
  • the cloned PCR products can be further processed as, for example described below in the section entitled 'Deriving concatamer libraries from single insert libraries'.
  • the captured fragments subsequently can be treated with BsgI and the solid phase washed, leaving the biotinylated linker and the genomic fragments on the solid phase (beads in this example).
  • a linker can be added onto the site cut by BsgI, which leaves a two nucleotide 3' overhang (an example of which would be one created by the annealing of 5'-Phosphate-GCA TGC ATG GGA CTG GAA TTC CGT-3' (SEQ ID NO 5), and 5'-ACG GAA TTC CAG TCC CAT GCA TGC NN-3' (SEQ ID NO 6)).
  • PCR amplification can be performed on either the supernatant or on the recovered beads to amplify the desired genomic fragment surrounded by linker DNA.
  • the first primer can contain a biotin in the 5' position. This gives a biotinylated PCR product which can be captured onto beads and sequentially digested (and washed to remove digestion products) with SphI then NlaIII.
  • this DNA represents the Tag and is an NlaIII fragment which can be concatamerized by treatment with T4 DNA ligase, for example, and subcloned into the SphI site of the cloning vector pGEM5z (Promega, WI) to form a concatamerized library.
  • the PCR product generated from the beads is subcloned into a PCR cloning vector and bacterially amplified before release of the Tag by digestion of the plasmid DNA with NlaIII.
  • the product is gel purified, concatamerized and cloned into the SphI site of the cloning vector pGEM5z, in this example.
  • the captured fragments (before ligation of a second linker but after BsgI digestion) can be denatured and the DNA released into the supernatant can be treated with Tsc ligase in the presence of oligonucleotides designed to introduce a second priming site to the single stranded DNA (for example, the oligonucleotides 5'-GCA TGC ATG GGA CTG GAA TTC CGT-3' (SEQ ID NO 8) and 5'-CAG TCC CAT TGC ATG CNN NN-3' (SEQ ID NO 9), with the ligation performed for 30 cycles with an annealing and ligation temperature of 40°C and an intervening melting step at 95°C).
  • the resulting product is a PCR competent molecule that can be treated as above for the supernatant following denaturation of the second ligation reaction.
  • Concatamerising Sequences Associated with Restriction Sites
  • The previous embodiment captured genomic sequence adjacent to a breakpoint. The position of the break may be unknown. This embodiment uses restriction sites having known positions in the context of sequenced genomes. The utilities of forming concatamers from these sites include mapping deletions or replications within the genomes of tissue culture cells and mapping the restriction fragments associated with the introduced breakpoints.
  • Mapping Copy Number Differences in Tissue Culture Cells
  • the genomic DNA is digested with rare cutting enzymes to generate a low resolution map, or with frequent cutters to deliver higher resolution.
  • methylation-sensitive restriction enzymes would generate information about the epigenetic status of the genome.
  • This embodiment of the invention may be carried out in alternative ways.
  • One advantageous way is to attach a biotinylated linker containing a restriction site for a type IIs enzyme, such as BsgI, with a complementary site for the restriction enzyme used (an example of the sequences of the primers used in combination with an NlaIII digest is 5'-GCG TAC TCC GAC TCG CTA TAG ATC ATG-3' (SEQ ID NO 10) and 5'-Phosphate-ATC TAT AGC GAG TCG GAG TAC GC-3' (SEQ ID NO 11)).
  • a BsgI digestion is used if appropriate and the product captured on the solid phase such as paramagnetic beads.
  • the concatamer libraries then can be formed by one of the approaches as described above. Physical shearing of the restricted DNA, by sonication or shearing for example, can be used to generate a population of molecules with a small average size, the standard deviation of which can be greatly reduced by size fractionation and a common recognition sequence. The fragments are then repaired by treatment with T4 DNA polymerase to create blunt molecules which then, for example, may be A-tailed and cloned into a PCR cloning vector, such as pGEM-T Easy (Promega, WI).
  • the fragments can be released by digestion with EcoRI and the purified products concatamerized and cloned into the EcoRI site of a second cloning vector.
  • An alternative to the step of cloning the Tags (with or without PCR amplification) and bacterial expansion has been devised, in order to counter bias introduced in the amplification steps.
  • the Tag is made single-stranded and has a second priming site attached to both the 3' and 5' ends via a Tsc reaction (using the following set of primers, for example: 5'-Phosphate-TAT GCG GCC GCT TAG TAC-3' (SEQ ID NO 12); 5'-NNN NGT ACT AAG-3' (SEQ ID NO 17); 5'-CCG CAT ANN NN-3' (SEQ ID NO 13); with the ligation performed for 30 cycles with an annealing and ligation temperature of 30°C and an intervening melting step at 95°C).
  • the product forms a template for Rolling Circle Amplification (due to the presence of single stranded circles) which can be performed with an oligonucleotide complementary to the NotI site (5'-GCG GCC GC-3'; SEQ ID NO 14) in the presence of Bst polymerase (NEB, NE; performed for 20 hours at 60°C).
  • the resulting product can then be digested with NotI to generate a Tag molecule with complementary ends which can be used to form a concatamer library.
  • a third alternative is to use a hybridization approach, that is, to digest the genomic DNA with an enzyme such as NotI, denature the DNA and hybridize with a 5'-biotinylated PNA molecule containing the recognition site for the enzyme at its 5' end followed by a number of random bases (up to 16). After annealing the PNA to the DNA to preferentially form PNA:DNA hybrids, the remaining single stranded DNA can be digested by the action of a single-stranded specific nuclease. The PNA:DNA hybrids can be captured on beads and the DNA strand recovered by denaturation.
  • a NotI oligonucleotide (5'-GCG GCC GC-3'; SEQ ID NO 15) then can convert the single stranded DNA into double stranded material in the presence of Taq polymerase and dNTPs, and the Tag cloned individually, and processed further as discussed above, or concatamerized as blunt molecules before cloning to form a concatamer library.
  • Indirectly Mapping Breakpoints
  • The site of a repaired breakpoint is first labeled enzymatically with biotin, using either terminal transferase or an exchange reaction with T4 polynucleotide kinase in the presence of a modified donor nucleotide.
  • the resultant reaction product is cleaned to remove entirely the labeling activity and then digested with a restriction enzyme. Capture of these products on solid phase such as beads will isolate those breakpoints which have been successfully labeled.
  • a ligation reaction can be performed as above; a non-biotinylated linker with the appropriate restriction sites can be attached to the site exposed by the digest. In this iteration the linker contains an SphI site which can be cut to expose an NlaIII site at the 3' end; a BsgI digest will then release a Tag molecule which can be precipitated.
  • the Tag can be blunted, A-tailed and cloned into a PCR vector, bacterially amplified and then cut out of the plasmid by appropriate restriction enzymes and gel purified to generate a Tag with compatible ends which can then be concatamerized and cloned.
  • the Tag molecules can be ligated together directly after being recovered from the beads to form DiTags (a series of Tags ligated head-to-head on a concatamer molecule with either NlaIII or BsgI sites at its termini).
  • DiTag products either can be digested (with SphI) to release single DiTags, which can be cloned into the SphI site of a vector, or the whole concatamer molecule cloned (following modification to blunt and A-tail it).
  • a second alternative is to ligate a second linker to the site exposed by the BsgI digestion and process as discussed above.
  • Deriving concatamer libraries from single insert libraries
  • Cloning vectors in embodiments of the invention have either been altered to contain, for example, a BsgI site adjacent to the cloning site or the insert of interest has a site introduced by attachment of an appropriate linker.
  • the Tag can be generated by digestion with BsgI and a second unique enzyme in the polylinker 5' of the BsgI site. This Tag can be gel purified. Most commonly concatamer libraries are made by the formation of DiTags as described above.
  • Tags that are prepared as described above are concatamerized either in double stranded form, with standard DNA ligases, or in single stranded form, using for example the thermostable Tsc DNA ligase.
  • the Tags themselves may be ligated to each other, to form DiTags, before concatamerization, in a process that introduces control steps for any bias in the representation of Tags in the population as well as for errors in the sequencing reactions.
  • Various other methods are also alluded to in Section 5.1.2.
  • An Example is given in Section 6.3.
  • Genomic tags were bioinformatically filtered to select those which occurred uniquely within the human genome. For sequences exceeding sixteen nucleotides in length it was found that approximately one half of the tags mapped uniquely to the human genome.
  • HSCA Hot Spot Cluster Algorithm
  • the HSCA takes as input mapped and localized DNA sequence tags extracted from concatamer prepared libraries. These tags are filtered for unique mappability under the Merbase genomic localization system (Hawrylycz et al, manuscript in preparation). Their start, stop, and orientation in the genome are known. At this point a number of genomic markers and features of interest are obtained. In the current version of the algorithm these include
  • 14. Replace binary intron region marking with closest distance to an exon boundary. 15. Distinguish between tags upstream in the probable promoter region from those in the first exons or introns. 16. Distinguish between tags in the first intron and other introns. 17. Add a category for tags inside coding regions. 18.
  • 20. Recognition of mitochondrial and alpha-satellite DNA sequences and subsequent filtering of those tags.
  • 21. Flag if near an expressed gene. 22. CpG Island with designable primers
  • although the tags input to HSCA are filtered so as to be uniquely mappable onto the genome, there may be duplicates as a result of PCR amplification, multiple hits on opposite strands of the DNA, and other peculiarities of the genome or operations.
  • duplicate tags are removed prior to clustering, and a small neighborhood around the tag is taken (50bp) in which a check for repeat content in the genome is made.
  • the standard for this latter step is RepeatMasker, and so any neighborhood of the tag containing a lower-case, repeat-masked nucleotide causes the tag to be removed prior to clustering.
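  • The pre-clustering filter can be sketched as below. The sketch assumes a soft-masked genome held in memory as a dictionary of strings (repeats in lower case), tags given as (chromosome, start, end, strand) tuples with 0-based half-open coordinates, and a 50 bp neighborhood taken on each side of the tag. All of these representation choices, and the function name, are illustrative assumptions.

```python
def filter_tags(tags, genome, flank=50):
    """Drop duplicate tags and tags whose neighborhood contains
    repeat-masked (lower-case) sequence.

    tags:   iterable of (chrom, start, end, strand) tuples
    genome: dict mapping chrom -> soft-masked sequence string
    flank:  bases examined on each side of the tag (assumed reading
            of the 50 bp neighborhood)
    """
    seen = set()
    kept = []
    for chrom, start, end, strand in tags:
        key = (chrom, start, end)  # strand ignored: opposite-strand hits collapse
        if key in seen:
            continue
        seen.add(key)
        seq = genome[chrom]
        lo = max(0, start - flank)
        hi = min(len(seq), end + flank)
        if any(base.islower() for base in seq[lo:hi]):
            continue               # neighborhood overlaps a masked repeat
        kept.append((chrom, start, end, strand))
    return kept
```

  • Collapsing tags by position alone is one possible policy for opposite-strand duplicates; keeping strand in the key would treat them as distinct observations.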
  • the HSC algorithm works by identifying regions of statistical discrepancy subject to a uniformity assumption. Clusters are therefore identified with respect to the null hypothesis of a uniform distribution of tag library cut sites across the genome.
  • the assumed statistical model is a simple binomial model where each DNase cut is made independently.
  • the Algorithm works intuitively as follows.
  • a tag library, a series of windowed neighbourhoods, a set of density weights for those windows, a minimum required density, and a minimum required standard deviation for observation for clusters are input.
  • the algorithm considers windows of increasing size over the range of input windows around each library tag, looking for cases where the number of observed clones has exceeded the expected mean plus the minimum number of standard deviations
  • the positions of the clones in these windows is averaged and a measure of their dispersion is computed.
  • a cluster-merging step takes place where consecutive, overlapping regions of high density are merged subject to the windowing requirements.
  • the algorithm looks for structure in the ranges from the minimum to the maximum input range, but prefers optimizing to the set of input weighting factors and the larger window sizes.
  • Cluster centers, chromosome, position, and structural information about the average number of clones per window and dispersion are output at the end. See Example 6.4 for the algorithm details.
  • K562 cells are grown to confluence (5 x 10^5 cells per milliliter as assayed by hemocytometer). Nuclei are prepared from a suitable volume (e.g., 100 ml) as described (Reitman et al MCB 13:3990). Nuclei are then re-suspended at a concentration of 8 OD/ml with 10 µl of 2 U/µl DNaseI [Sigma] at 37°C for 3 min. The DNA is purified by phenol-chloroform extractions and ethanol precipitated.
  • The DNA is repaired in a 100 µl reaction containing 10 µg DNA and 6 U T4 DNA polymerase (New England Biolabs) in the manufacturer's recommended buffer and incubated for 15 min at 37°C and then 15 min at 70°C. 1.5 U Taq polymerase (Roche) is added and the incubation continued at 72°C for a further 10 min.
  • The DNA is then recovered using a Qiagen PCR Clean-up Kit and eluted in 50 µl of 10 mM Tris.HCl, pH 8.0.
  • Genomic tags: 10 µg of clean genomic DNA was precipitated and resuspended in 20 µl of 0.2 x TE buffer (2 mM Tris.HCl, 0.2 mM EDTA, pH 8.0). The DNA was mixed in 100 µl of 1 x Taq DNA polymerase buffer (Roche) supplemented with 200 µM dNTPs, 3 U T4 DNA polymerase (NEB) and 5 U Taq DNA polymerase (Roche) and incubated for 10 min at 37°C followed by a 20 min incubation at 72°C. The DNA was cleaned by use of a Qiagen PCR Purification column and eluted in 90 µl of Elution buffer.
  • The DNA was incubated overnight at 16°C in 100 µl of 1 x T4 DNA ligase buffer (NEB) containing 10 pmol of the A-Adaptor (formed by the annealing of A-Af 5'-Biotin-TEG-CTC TGG CGC GCC GTC CTC TCA CGC GTC CGA CT-3' and A-Ar 5'-Phosphate-GTC GGA CGC GTG AGA GGA CGG CGC GCC AGA GC-3') and 400 U T4 DNA ligase (NEB).
  • The adapted DNA was subsequently digested to completion with MmeI (NEB) before the biotinylated DNA was separated by binding to paramagnetic streptavidin-coated M-270 beads (Dynal). Following washes the beads were resuspended in 30 µl of 1 x T4 DNA ligase buffer containing 20 pmol of NN-Adaptor (formed by the annealing of A-NNf 5'-GAG AGC GGT GCA GAA GGA GAC GTA CGA NN-3' and A-NNr 5'-TCG TAC GTC TCC TTC TGC ACC GCT CTC-3') and incubated overnight with continual rotation at 4°C.
  • The beads are then captured and washed in three changes of 1 x TE buffer supplemented with 50 mM NaCl before finally being resuspended in 100 µl TE.
  • 2 µl is used as a template in a PCR reaction containing 10 µl water, 0.225 µl of 20 pmol/µl PCR-Af (5'-CGC CGT CCT CTC ACG CGT CCG A-3'), 0.225 µl pmol/µl PCR-NNf (5'-GAG AGC GGT GCA GAA GGA GAC GTA CGA-3'), 1.2 µl 25 mM MgCl2 and 1.5 µl 10 x Fast Start SYBR-Green master mix (Roche), which is cycled using a Lightcycler Real-time PCR system (Roche).
  • The machine is used to determine the maximum number of cycles in which amplification is still exponential. Typically this value is between ten and fifteen cycles.
  • PCR reactions performed with the chosen cycling conditions are separated by 12% PAGE and the 76 bp band is purified by excising the band and eluting the tag DNA into 100 µl TE by incubation overnight.
  • PCR reactions: typically eight 100 µl PCR reactions are performed in a mixture containing 0.8 µl tag DNA, 80 µM dNTPs, 40 pmol b-PCR-Af (a 5' biotinylated version of primer PCR-Af), 40 pmol of PCR-NNf and 2.5 U Taq polymerase (Roche) with nine cycles of amplification (94°C for 20 s, 60°C for 20 s and 72°C for 25 s). The correct size band is again excised from the gel.
  • The amplified tags are digested with BsiWI by incubation at 55°C. Complete digestion was generally achieved by adding serial aliquots of the enzyme.
  • The reaction was captured on M270 Dynal beads, as per the manufacturer's instructions, and the beads finally resuspended in 30 µl of 1 x NEB Buffer 3 supplemented with 20 U MluI and incubated at 37°C with continual rotation for 2 h to cleave the tags from the beads. Following digestion the beads were recaptured and the concentration of the tags in the supernatant assessed with a Picogreen quantitation kit.
  • Example 6.3 Formation and cloning of high molecular weight concatamers A 30 µl ligation reaction was set up with 30 pmol of tags and 0.5 pmol of BsiWI-Adaptor (formed by the annealing of B-Af 5'-GAG TGT GGC GCG CCT TGT AGA C-3' and B-Ar 5'-GTA CGT CTA CAA GGC GCG CCA CAC TC-3') and incubated with 400 U T4 DNA ligase (NEB) overnight at 16°C.
  • 5 µl of the ligation was subsequently used in a 50 µl PCR reaction with 20 pmol B-Af primers, 100 µM dNTPs and 1 U Taq polymerase, which was cycled twenty times with the following conditions: 94°C for 20 s, 60°C for 20 s and 72°C for 1 min.
  • The DNA was precipitated and resuspended in a 25 ml volume of 1 x NEB4 buffer containing 10 U AscI and digested at 37°C for 2 hours before being separated on a 1.5% agarose/TBE gel. All concatamers of tags greater than 500 bp in size were isolated using a Qiagen Gel Extraction kit and eluted in 50 µl EB. 10 µl of the eluted DNA was used in an overnight ligation into pGEM5z cut with MluI.
  • Example 6.4 The HSC Algorithm We describe the basic algorithm of the HSCA. Assume a binomial model B(p, L) for DNase cut site distribution, where p is the probability of a single DNase cut in an HS site, and L is the total library size. Let μ denote the expected number of cuts in a region of HS size in genome G and σ its standard deviation.
  • Input: window range (W_L, W_R), a set of density weights for each window, d_t ∈ (d_L, d_R), a minimum standard deviation threshold ⁇, and a minimum required density ⁇_min. Sort the library L, removing duplicates and repeat regions.
  • Step 1 Identify Hot Spots
  • T4 DNA polymerase treatment DNaseI-digested DNA + 10 µl DNA (@ 1 µg/µl) + 10 µl 5 x T4 DNA polymerase buffer (Roche) + 1 µl 10 mM dNTPs (Roche) + 2 µl T4 DNA polymerase (? U/µl) + 27 µl water
  • Exonuclease I treatment To 25 µl DNA sample add + 3 µl 10 x Exonuclease I buffer (USB) + 1 µl Exonuclease I (USB; 10 U/µl) + 1 µl water
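The clustering steps summarized in the snippets above (duplicate removal, a binomial null model, windows of increasing size around each tag, and merging of overlapping enriched windows) can be sketched as follows. This is a minimal illustration under stated assumptions, not the HSCA implementation itself: the function name `find_hot_spots` and its parameters are hypothetical, the per-window density weights are omitted, the minimum density is interpreted as a minimum tag count per window, and the null model treats each tag as falling in a window of width w with probability w divided by the genome size.

```python
import math
from bisect import bisect_left, bisect_right
from statistics import mean, pstdev


def find_hot_spots(tags, genome_size, windows, min_sd, min_density):
    """Hypothetical sketch of windowed enrichment clustering.

    tags        -- iterable of (chromosome, position) pairs, already mapped
    genome_size -- total mappable genome length in bp
    windows     -- window widths (bp) to try around each tag
    min_sd      -- standard deviations above the expected count required
    min_density -- assumed here to mean minimum tags per window
    """
    # Drop duplicate tags (e.g. PCR duplicates) and group by chromosome.
    by_chrom = {}
    for chrom, pos in set(tags):
        by_chrom.setdefault(chrom, []).append(pos)
    n_tags = sum(len(v) for v in by_chrom.values())

    clusters = []
    for chrom in sorted(by_chrom):
        positions = sorted(by_chrom[chrom])
        hits = []  # (start, end) spans of enriched windows
        for center in positions:
            best = None
            for w in sorted(windows):  # increasing window size
                p = w / genome_size  # uniform null: chance a tag lands here
                expected = n_tags * p
                sd = math.sqrt(n_tags * p * (1 - p))
                lo = bisect_left(positions, center - w // 2)
                hi = bisect_right(positions, center + w // 2)
                count = hi - lo
                if count >= min_density and count > expected + min_sd * sd:
                    # Later (larger) qualifying windows overwrite smaller ones.
                    best = (positions[lo], positions[hi - 1])
            if best is not None:
                hits.append(best)

        # Merge consecutive, overlapping enriched spans.
        merged = []
        for start, end in sorted(hits):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])

        # Report the center (mean position) and dispersion of each cluster.
        for start, end in merged:
            members = positions[bisect_left(positions, start):
                                bisect_right(positions, end)]
            clusters.append({
                "chrom": chrom,
                "start": start,
                "end": end,
                "center": mean(members),
                "n_tags": len(members),
                "dispersion": pstdev(members),
            })
    return clusters


if __name__ == "__main__":
    demo = [("chr1", p) for p in (1000, 1010, 1020, 1030, 1040)]
    demo += [("chr1", 500000), ("chr2", 300000)]
    for cluster in find_hot_spots(demo, 3_000_000, [100, 200], 3, 3):
        print(cluster)
```

With a small library the standard-deviation threshold is exceeded by almost any pair of neighbouring tags, which is why the sketch also enforces a minimum count per window; a real library would rely more heavily on the statistical threshold.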

Abstract

The invention is directed towards the comprehensive identification of genomic regulatory sites from the genomic DNA of any cell or tissue type. The method comprises the steps of: (i) treating intact chromatin with an agent capable of modifying regulatory sites; (ii) isolating a fragment of genomic sequence associated with the site of modification; (iii) creating libraries consisting of concatamerised genomic sequences in such a manner that each individual sequence is recognisable; (iv) identifying the genomic location of each unique genomic sequence; and (v) analysing the distribution to identify genomic locations with significant enrichment of cloned genomic sequences ('clusters'). The positions of individual genomic sequences, and more so of the 'clusters', predict the position of genomic regulatory sites. Comparative analysis of such libraries from different biological sources is a method for defining genomic responses to a range of biological effectors, including pharmaceuticals, diseases, aging and the environment.

Description

Methods And Algorithms For Identifying Genomic Regulatory Sites
RELATED APPLICATIONS This application claims benefit under 35 U.S.C. § 119(e) of U.S. Provisional Application No. 60/530,320, filed on December 15, 2003, the entire disclosure of which is incorporated by reference herein in its entirety.
1. FIELD OF INVENTION The invention relates to methods for identifying regulatory sites in a genomic locus on the basis of the relative sensitivity of chromatin to a DNA modifying agent. The invention relates generally to methods of DNA analysis and more specifically to methods for analysis of genomic sequences. The invention also relates to the use of these regulatory sites, databases comprising the same, and their use in regulating gene expression, disease diagnosis and therapy, and identification of therapeutic drugs.
2. BACKGROUND OF INVENTION
2.1 Regulation of the genome Understanding the human genome and those of other complex organisms will require comprehensive delineation of the functional elements that regulate transcription and other chromosomal processes. In vivo, regulatory sequences are found to coincide with focal alterations in chromatin structure [Felsenfeld G. (1996) Cell 86, 13-9; Felsenfeld, G. & Groudine, M. (2003) Nature 421, 448-53; Gross, D. S., and Garrard, W. T. (1988) Annu. Rev. Biochem. 57, 159-197; Elgin, S. C. (1984) Nature 309, 213-4]. Chromatin architecture plays a defining role in the control of eukaryotic genes in vivo as it determines the accessibility of critical genomic sequences to the regulatory and transcriptional machineries [Felsenfeld G. (1996) Cell 86, 13-9; Felsenfeld, G. & Groudine, M. (2003) Nature 421, 448-53]. Active regulatory foci within genomic sequences are detectable experimentally on the basis of pronounced sensitivity to cleavage when intact nuclei are exposed to DNA modifying agents, canonically the non-specific endonuclease DNaseI [Gross, D. S., and Garrard, W. T. (1988) Annu. Rev. Biochem. 57, 159-197; Elgin, S. C. (1984) Nature 309, 213-4; Wu, C. (1980) Nature 286, 854-60]. The co-localization of DNaseI Hypersensitive Sites (HSs) with cis-active elements spans the spectrum of known transcriptional and chromosomal regulatory activities including transcriptional enhancers, promoters, and silencers, insulators, locus control regions, and domain boundary elements [Felsenfeld G. (1996) Cell 86, 13-9; Gross, D. S., and Garrard, W. T. (1988) Annu. Rev. Biochem. 57, 159-197; Burgess-Beusse, B., Farrell, C., Gaszner, M., Litt, M., Mutskov, V., et al. (2002) Proc. Natl Acad. Sci. U S A 99, 16433-7]. It is therefore expected that a comprehensive library of DNaseI hypersensitive sites from the human genome would contain many (if not all) of these classical cis-regulatory sequences.
We sought to exploit in vivo DNaseI hypersensitivity as the basis of a powerful and generic approach for de novo identification of functional non-coding sequences on a genome-wide level. We developed an approach for isolating and cloning sequences flanking DNaseI cut sites introduced in the context of intact nuclei, and for enriching sequences associated with DNaseI hypersensitive sites using a subtractive procedure. Sequencing and genomic mapping of the resulting collection of active chromatin sequences (ACSs) provides the basis for genome-wide localization of DNaseI hypersensitive sites and for global analysis of the relationship between chromatin structure and gene expression.
2.1.1 Summary Conventional gene expression studies generally employ immobilized DNA molecules that are complementary to gene transcripts (either the entire transcript or selected regions thereof) that are transcribed and spliced into mRNA. Recent advances in this field utilize arrays or microarrays of such molecules that enable simultaneous monitoring of multiple distinct transcripts (see, e.g., Schena et al., Science 270:467-470 (1995); Lockhart et al., Nature Biotechnology 14:1675-1680 (1996); Blanchard et al., Nature Biotechnology 14, 1649 (1996); and U.S. Pat. No. 5,569,588, issued Oct. 29, 1996 to Ashby et al. entitled "Methods for Drug Screening."). Such arrays have the potential to detect transcripts from virtually all actively transcribed regions of a cell or cell population, provided that the organism's complete genomic sequence, or at least a sequence or library comprising all of its gene transcripts, is available. In the case of the human, where the complete gene set remains unclear, such arrays may be employed to monitor simultaneously large numbers of expressed genes within a given cell population. The simultaneous monitoring technologies particularly relate to identifying genes implicated in disease and to identifying drug targets (see, e.g., U.S. Patent Nos. 6,165,709; 6,218,122; 5,811,231; 6,203,987; and 5,569,588). Unfortunately, these array technologies generally rely on direct detection of expressed genes and therefore reveal only indirectly the activity of genetic regulatory pathways that control gene expression itself. On the other hand, a detection system directed toward sensing the activity of particular genetic regulatory pathways or cis-acting regulatory elements could provide deeper information concerning a cell's regulatory state.
Accordingly, the detection of active regulatory elements, particularly in related and interacting groups, potentially could become extremely important for delineation of regulatory pathways, and provide critical knowledge for design and discovery of disease diagnostics and therapeutics. Most research in the area of gene regulation has focused on finding and using individual sequences either upstream or downstream of individual coding gene targets. Generally, the presence or absence of a particular DNA sequence is linked with increased or decreased expression of a nearby gene when determining the regulatory effect of the sequence. For example, the beta-like globin gene was shown to contain four major DNase I hypersensitive sites of possible regulatory function by studies that removed or added these sequences and that looked for an effect on gene expression in erythroid cells. See Grosveld et al. U.S. Patent No. 5,532,143. From related studies, Townes et al. asserted that two of the four DNase hypersensitive sites might control genes generally in cells of erythroid lineage. Although an interesting development, these observations generally are limited to detection of effects on nearby coding sequences of known genes. Multiple regulatory units, which behave coordinately, are not readily amenable to analysis by these techniques. Multiple gene and protein elements interact for even simple biological processes. Because of this, a one-at-a-time strategy of targeting a single coding gene and nearby non-coding sequences to determine their effects on the preselected gene insufficiently addresses the true in vivo situation. Accordingly, any tool that can provide simultaneous regulation system information would give rich benefits in terms of improved diagnosis, clinical treatment and drug discovery.
2.1.2 Regulation of genes in vivo Understanding the human genome requires comprehensive identification of DNA elements that are functional in vivo. A major class of such sequences are those which have a role in regulating genomic activity. Regulatory factors interact with chromatin in a site-specific fashion to bring the genome to life. All genes are controlled at multiple levels through the interaction of regulatory factors with gene-proximal or, in some cases, distant cis-regulatory sites. The nucleoprotein complexes formed by such interactions may be tissue or developmental stage-specific, or they may be constitutive, depending on the regulatory requirements of their cognate gene. While our knowledge of the patterns of gene expression in diverse tissues and under a wide-ranging set of conditions has grown substantially in recent years, this growth has not been paralleled by a comparable increase in our knowledge of regulatory factors that control specific genes affecting specific cellular or disease processes. The basic chromatin fiber consists of an array of nucleosomes, each packaging around 200 base pairs of DNA; 146 base pairs are wound around the histone octamer, with the remainder forming a link to the next nucleosome. In eukaryotic cells, all genomic DNA in the nucleus is packaged into chromatin, the architecture of which plays a central role in regulating gene expression (for reviews see Felsenfeld, G. & Groudine, M., 2003, Nature 421, 448-53; Felsenfeld, G., 1992, Nature 355, 219-24; Brownell, J. E. & Allis, C. D., 1996, Curr Opin Genet Dev 6, 176-84; Kingston, R. E., Bunker, C. A. & Imbalzano, A. N., 1996, Genes Dev 10, 905-20; Tsukiyama, T. & Wu, C., 1997, Curr Opin Genet Dev 7, 182-91; Wolffe, A. P., Wong, J. & Pruss, D., 1997, Genes Cells 2, 291-302; Kadonaga, J. T., 1998, Cell 92, 307-13; Struhl, K., 2001, Science 293:1054-1055).
At a global level, this packaging serves two purposes: (i) it is physically necessary to condense the mass of sequence information into a well-ordered regular structure that can be contained within the nucleus; and (ii) it imparts a level of site-specific 'epigenomic' information (Felsenfeld, G., 1992, Nature 355, 219-24), for example discriminating between sequences which are never to be transcribed and are stored in highly condensed heterochromatin, and those sequences which are actively transcribed and are maintained in a more accessible chromatin state. Gene expression is regulated by several different classes of cis-regulatory DNA sequences including enhancers, silencers, insulators, and core promoters (Felsenfeld and Groudine, 2003, Nature 421, 448-53; Butler and Kadonaga, 2002, Genes Dev 16: 2583-2592; Gill, G., 2001, Essays Biochem 37: 33-43). The core promoter is the site of formation of the RNA pol II transcription complex. Enhancers and silencers act over distances of several kilobases (or more) to potentiate or silence pol II function. Insulator sequences prevent enhancers and silencers targeted to one gene from inappropriately regulating a neighbouring gene. Larger, more complex elements comprising multiple enhancers and/or silencers have come to light which coordinate the activity of linked genes over large chromosomal domains ('Locus Control Regions' or 'Domain Control Regions') (reviewed in Li et al., 2002, Blood 100, 3077-86; Hardison, R.C., 2001, Proc Natl Acad Sci USA 98:1327-1329). Activation of cis-regulatory elements in the context of chromatin requires the cooperative binding of regulatory factors (Felsenfeld, G., 1996, Cell 86, 13-9). This active state is most commonly addressed by measuring the sensitivity of the underlying DNA sequences to digestion with nucleases (e.g., DNaseI) in the context of chromatin (Weintraub, H. & Groudine, M., 1976, Science 193, 848-56; Elgin, S. C., 1981, Cell 21, 413-5).
Multiprotein complexes exist in cells that allow specific destabilization of nucleosomes at promoters, facilitating the binding of sequence-specific factors and the general transcriptional machinery (Kingston, R. E., Bunker, C. A. & Imbalzano, A. N., 1996, Genes Dev 10, 905-20; Svaren, J., Horz, W., 1996, Curr Opin Genet Dev 6:164-170; Tsukiyama, T. & Wu, C., 1997, Curr Opin Genet Dev 7, 182-91). Post-translational modifications of chromatin components, particularly histone acetylation, play important roles in regulating chromatin structure and gene activity (Brownell, J. E. & Allis, C. D., 1996, Curr Opin Genet Dev 6, 176-84; Grunstein, M., 1997, Nature 389:349-352; Wolffe, A. P., Wong, J. & Pruss, D., 1997, Genes Cells 2, 291-302; Kadonaga, J. T., 1998, Cell 92, 307-13; Struhl, K., 1998, Genes Dev 12, 599-606). Activation of tissue-specific genes during development and differentiation occurs first at the level of chromatin accessibility and results in the formation of transcriptionally-competent genetic loci characterized by increased sensitivity (relative to inactive loci) to digestion with DNaseI (Groudine et al, 1983, Proc Natl Acad Sci USA 80:7551-7555; Tuan et al, 1985, Proc Natl Acad Sci USA 82:6384-6388; Forrester et al, 1986, Proc Natl Acad Sci USA 83:1359-1363). Loci in an accessible chromatin configuration can subsequently respond to acutely activating signals, often conveyed by non-tissue-specific transcriptional factors that can gain access to the open locus and recruit or activate the basal transcriptional machinery. The initial observation that active genes reside within domains of generally increased sensitivity to nucleases was made nearly 30 years ago (Weintraub, H. & Groudine, M., 1976, Science 193, 848-56).
Since that time, such data have been accumulated for a number of human gene loci (Pullner et al, 1996, J Biol Chem 271: 31452-31457) and those in other vertebrates (Koropatnick and Duerksen, 1987, Dev Biol 122: 1-10; Stratling et al., 1986, Biochemistry 25: 495-502). The chromatin domain phenomenon is particularly striking in Drosophila, where distinct transitions between DNase-sensitive and DNase-resistant chromatin can be documented (Farkas et al, 2000, Gene 253: 117-136). Focal alterations in chromatin structure are the hallmark of active regulatory sequences in eukaryotic genomes. The literature connecting DNaseI-hypersensitive sites with genomic regulatory elements is extensive. DNase hypersensitivity studies have been employed to delineate the transcriptional regulatory elements of over 100 human gene loci. Typically, between 1 and 5 hypersensitive sites have been visualized for each of these loci. However, only a fraction of these have been precisely localized at the sequence level. A critical defining feature of HSs is that the function of the DNA sequence component, i.e. its complex-forming activity, is intrinsic. The principal evidence for this is the fact that these sequences can be excised and inserted into other positions in the genome, where they exhibit the same functional chromatin activities. Substantial experimental experience from model systems has revealed that HSs can form when included in constructs used to create either stably transfected cell lines (Fraser et al, 1990, Nucleic Acids Res 18:3503-3508) or transgenic animals (Lowrey et al, 1992, Proc Natl Acad Sci U S A 89, 1143-7; Levy-Wilson et al, 2000, Mol Cell Biol Res Commun 4, 206-11). An important finding has been that HS sequences are rendered functional only upon assembly into nuclear genomic chromatin. These DNA sequences are thought to potentiate formation of a nucleoprotein complex in a manner that dramatically increases its probability of activation vs. neighboring DNA regions.
They are hypothesized to adopt a particular topological conformation, which lowers the free energy for coalescence of a limited set of proteins, some in contact with DNA, and some in contact only with another protein in the complex. This results in the formation of a nucleoprotein complex which is precisely correlated with a particular sequence. The formation of this complex takes place in an 'all-or-none' fashion (e.g., Felsenfeld et al, 1996, Cell 86, 13-9; Boyes & Felsenfeld, 1996, EMBO J 15:2496-2507). The stochasticity of nucleoprotein complex formation can be manipulated through the introduction of point mutations or small deletions or insertions in critical DNA binding bases or in juxtaposed sequences that affect overall stability (e.g., Stamatoyannopoulos et al., 1995, EMBO J 14, 106-16). Cooperative binding of regulatory factors in the context of chromatin results in sequence-specific 'remodeling' of the local chromatin architecture (Felsenfeld and Groudine, 2003. Nature 421; 448-453). This focal 'remodeling' is the signature of active regulatory foci within genomic sequences and is detectable experimentally on the basis of pronounced sensitivity to cleavage when intact nuclei are exposed to DNA modifying agents, canonically the non-specific endonuclease DNaseI (Gross and Garrard 1988. Annu. Rev. Biochem. 57; 159-197, Elgin 1984. Nature 309; 213-4, Wu 1980. Nature 286; 854-860). The co-localization of DNaseI Hypersensitive Sites (HSs) with cis-active elements spans the spectrum of known transcriptional and chromosomal regulatory activities including transcriptional enhancers, promoters, and silencers, insulators, locus control regions, and domain boundary elements (Felsenfeld 1996. Cell 86, 13-9, Gross and Garrard 1988. Annu. Rev. Biochem. 57; 159-197, Burgess-Beusse et al, 2002. Proc. Natl. Acad. Sci. USA 99; 16433-7).
HSs have also been observed to coincide with sequences governing fundamental genomic processes including attachment to the nuclear matrix (Jarman and Higgs 1988. EMBO J. 7; 3337-44, Kieffer et al, 2002. J. Immunol. 168; 3915-3922), and recombination (Zhang et al, 2002. Proc. Natl. Acad. Sci. USA 99; 3070-3075), though their association with these lower level chromosomal processes is less easy to document owing to their ephemeral nature or cell-cycle specific appearance.
[Table of regulatory element classes. The opening rows survive only as an image (imgf000007_0001) and are not recoverable; the last fragment of those rows reads "222: 305-318." The remaining rows follow.]
Insulator. Function: demarcates gene regulatory domains. Examples: Beta-globin HS5 (Li and Stamatoyannopoulos 1994. Blood 84; 1399-1401); H19/Igf2 (Jones et al, 2001. Hum. Mol. Genet. 10; 807-814).
Locus Control Region. Function: determines long-range chromatin structure and control of multiple linked genes. Examples: Beta-globin (Grosveld 1999. Curr. Opin. Genet. Dev. 9; 152-157); CD2 (Festenstein et al, 1996. Science 271; 1123-5); Adenine Deaminase (Aronow et al, 1992. Mol. Cell. Biol. 12; 4170-4185).
Transcriptional Silencer. Function: down-regulates transcription from linked gene. Example: GATA3 silencer (Gregoire and Romeo, 1999. J. Biol. Chem. 274; 6567-6578).
Matrix Attachment Region. Function: tethers chromatin to protein backbone. Example: CD8 gene complex MARs (Kieffer et al, 2002. J. Immunol. 168; 3915-22).
Origin of Replication (ORI). Function: origin of DNA replication. Example: PuffII/9A ORI (Urnov et al, 2002. Chromosoma 11; 291-303).
Recombination Sites. Function: sites of frequent chromosome recombination or translocation. Example: AML1/RUNX1 breakpoints in t(8;21) leukemia (Zhang et al, 2002. Proc. Natl. Acad. Sci. USA 99; 3070-3075).
DNase hypersensitivity studies collectively comprise the most successful and extensively validated methodology for discovery of regulatory sequences in vivo, and have been employed to delineate the transcriptional regulatory elements of > 100 human gene loci. Over 25 years of experimentation and legion publications by many investigators have established an inviolable connection between sites of DNase hypersensitivity in vivo and functional non-coding sequences that regulate the genome. In essentially every case where a major DNase HS has been adequately studied, a genomic regulatory activity has ultimately been disclosed, even if such function is not immediately apparent due to temporal or spatial restriction of activity (e.g., Wai et al, 2003. EMBO J. 22; 4489-4500). This is not merely a phenomenon of negative publication bias: since DNaseI HSs are biological phenomena of independent significance, they are extensively reported even without specific studies of their contribution to transcription. Conversely, in every published case where a regulatory sequence with documented in vivo activity (e.g., a promoter or enhancer discovered by other means) has been assayed for nuclease hypersensitivity, the expected result has been found. It is now generally accepted that DNase HSs mark genomic sequences that bind regulatory factors in vivo with consequent disruption of the nucleosome array (Felsenfeld 1996. Cell 86; 13-19). Nuclease hypersensitive sites are biologically bounded by (a) the positions of flanking nucleosomes and (b) limits on the area of DNA over which thermodynamically stable nucleoprotein complexes may form. The extent of the regulatory domain is contained within the inter-nucleosomal interval, approximately 150-250 bp. This interval corresponds to the size of sequence that is needed to place a canonical nucleosome and it has been a common assumption that HSs represent a break in the nucleosomal array that defines the vast majority of chromatin.
A core domain can be identified which is restricted to a region of approximately 80-120 base pairs in length, over which critical DNA-protein interactions take place (e.g., Lowrey et al, 1992. Proc. Natl. Acad. Sci. USA 89; 1143-1147). Cooperative binding of transcription factors to such core regions is sufficient to exclude a nucleosome in vitro (Adams and Workman, 1995. Mol. Cell Biol. 15; 1405-1421) and this is now accepted as a common mechanism for how these sites form in vivo (Boyes and Felsenfeld, 1996. EMBO J. 15; 2496-2507; Wallrath et al, 1994. Bioessays 16; 165-170; Struhl, 2001. Science 293; 1054-1055). In summary, DNase HSs are extensively validated markers of sequence-specific in vivo functionality and should therefore be presumed to be involved in regulation of neighboring genes until proven otherwise (Urnov 2003. J. Cell Biochem. 88; 684-694). DNaseI hypersensitivity studies thus represent a powerful, in vivo approach to detection and analysis of biologically active sequences. Nuclease hypersensitive sites are biologically bounded by (1) the positions of flanking nucleosomes and (2) limits on the area of DNA over which thermodynamically stable nucleoprotein complexes may form. The extent of the regulatory domain is contained within the inter-nucleosomal interval, approximately 150-250 bp. This interval corresponds to the size of sequence that is needed to place a canonical nucleosome and it has been a common assumption that HSs represent a break in the nucleosomal array that constitutes the vast majority of chromatin. A core domain can be identified which is restricted to a region of approximately 80-120 base pairs in length, over which DNA-protein interactions take place (e.g., Lowrey et al, 1992, Proc Natl Acad Sci USA 89, 1143-7).
Cooperative binding of transcription factors to such core regions is sufficient to exclude a nucleosome in vitro (Adams and Workman, 1995, Mol Cell Biol 15, 1405-1421) and this has been proposed as a common mechanism for how these sites may form in vivo. Nucleosomal mapping experiments have shown that HSs such as the Drosophila hsp26 promoter (Lu et al, 1995 EMBO J. 2; 4738-46) and the human β-globin HS2 (Kim and Murray, 2001, Int J Biochem Cell Biol 33, 1183-92) are non-nucleosomal. It is thought that most HSs are non-nucleosomal in nature (Boyes and Felsenfeld, 1996, EMBO J 15:2496-2507; Wallrath et al, 1994, Bioessays 16:165-170). These conclusions are well-supported in the literature (e.g., Struhl, 2001, Science 293:1054-1055). However, several HSs are known to retain histone proteins alongside transcription factors, suggesting that HSs may exist in conjunction with a modified or partial nucleosome. Flanking sequences surrounding the core region appear to modulate the activity of this core region, though this effect tapers off sharply. The boundaries of the sequences needed for hypersensitivity can be defined functionally by performing deletion analyses followed by stable transfection of cells (Philipsen et al, 1993, EMBO J 12, 1077-85) or transgenic studies (Lowrey et al, 1992, Proc Natl Acad Sci U S A 89, 1143-7). These approaches define the minimum extent of sequence required to retain the biological function associated with the HS under examination. Many hypersensitive sites occur within broader domains of increased DNase sensitivity and therefore appear to be components of higher-order chromatin structures. Based on published data, such sites also appear to harbor increased biological significance and are perhaps the most important functionally.
Several investigators have observed that the regions flanking the hypersensitive foci of active elements exhibit an increased level of sensitivity to nuclease digestion compared with the increased general sensitivity of an active locus. This phenomenon has been referred to as 'intermediate sensitivity' (Kunnath and Locker, 1985, Nucleic Acids Res. 13; 115-29). For more than two decades, the standard approach for measurement of chromatin accessibility has been nuclease hypersensitivity assays. In a conventional DNase hypersensitivity assay, intact nuclei are isolated from a cell type of interest and gently permeabilized. The nuclei are aliquoted and treated with a series of increasing intensities of DNase I (typically with increasing concentrations of the nuclease at a fixed incubation time or, alternatively, with a fixed DNase I concentration and increasing incubation times). The products are then deproteinated. Following DNA extraction and purification, samples from each aliquot are digested with a restriction enzyme, run over an agarose gel, and transferred to a membrane. To detect hypersensitive sites that are located within a particular restriction fragment, a probe is selected that is proximal to either the 5' or 3' end of the restriction fragment. Fragments are often probed from both ends to visualize cutting over both strands. Hybridization of a radiolabeled probe with the membrane highlights the parental band and sites that increase in intensity with increasing DNase concentration. In spite of its extensively documented utility for localization of regulatory sequences, numerous technical barriers have prevented the broader application of conventional hypersensitivity assays to systematic detection of cis-active sequences on a genomic scale. 
The protocol (a) is extremely labor intensive; (b) is dependent on the presence of suitably-positioned restriction sites; (c) is further dependent on the availability of a suitable ~500+bp sequence juxtaposed to a restriction site that can function as a specific probe (i.e., does not contain any repetitive sequences); (d) is highly consumptive of tissue resources, and therefore quite vulnerable to tissue preparation-to-preparation variability; (e) suffers from numerous technical sources of variability including gel composition and running conditions, success of membrane transfer, success of probe labeling, hybridization conditions, wash conditions, and exposure conditions; and (f) does not provide quantitative data. In practice, localization of the precise sequences which are hypersensitive is a difficult and laborious process requiring a series of restriction digests and probes positioned immediately proximal to the site itself. Typically, probing from both sides of the site is desirable, and this process is necessary when more than one site is present on a given restriction fragment owing to a 'shadowing' effect by probe-proximal sites of those positioned more distally to the probe.
2.1.3 Significance of cis-regulatory sequences for studies of common diseases and environmental exposures
2.1.3.1 Inter-individual Variation in Gene Expression
Inter-individual variation in gene expression has been recognized for a number of human genes and is expected to underlie numerous quantitative phenotypes. For example, genes involved in the metabolism of xenobiotics and of certain pharmaceutical agents (e.g., Cyp3A4, Cyp2, Thymidylate synthase, Nat1) encode classical examples of enzymes that exhibit wide (up to 40- or even >100-fold) inter-individual variation in activity, much of which is attributable to transcriptional variation. Several surveys have now documented the fact that a large proportion (at least 25%) of human genes are subject to such heritable variation in expression (Cheung et al, 2002. Nature Genet. 32; 522-525; Schadt et al., 2003. Nature 422; 297-302; Cheung et al., 2003. Nature Genet. 33; 422-425). Comparable studies have also been performed in model organisms including the mouse (Cowles et al, 2002. Nature Genet. 32; 432-437), Fundulus (Oleksiak et al, 2002. Nature Genet. 32; 261-266), and even yeast (Brem et al., 2002. Science 296; 752-755; Yvert et al, 2003. Nature Genet. 35; 57-64). Although elegantly executed, all of the aforementioned studies were capable of detecting only relatively large (>2-fold) changes in expression. Considerable data have emerged, however, to indicate that in vivo, even small differences in allelic expression can have dramatic phenotypic consequences. For example, a modest (<25%) decrease in total APC expression can result in a nearly 24-fold increase in risk of development of adenomatous polyposis coli and malignant lesions (Yan et al, 2002. Nature Genet. 30; 25-26). In the case of genes that exhibit a 'threshold' effect in activity (as do many enzymes and receptors), the effect may be more pronounced. For example, even a 10% difference in the amount of CFTR transcript can dramatically attenuate the cystic fibrosis phenotype (Rave-Harel et al, 1997. Am. J. Hum. Genet. 60; 87-94; Ramalho et al, 2002. Am. J. Respir. Cell. Mol. Biol. 27; 619-627).
2.1.4 Importance of cis-regulatory sequences for quantitative phenotypes
Common diseases are characterized by polygenic inheritance and by quantitative (i.e., continuous) variation in specific phenotypic traits. A major biological mechanism contributing to quantitative phenotypic variation is heritable variation in the regulation of gene expression. In humans, such variation is expected to reside principally within cis-regulatory sequences (Rockman and Wray 2002. Mol. Biol. Evol. 19; 1991-2004). Since individual trans-regulatory transcriptional factors typically interact with a wide network of genes, variation affecting these proteins would be expected to have pleiotropic effects and comparatively dramatic phenotypes, and is therefore anticipated to be quite rare. An example of this phenomenon may be found in inherited defects in transcriptional factors which give rise to marked early-onset Type 2 diabetes (MODY) phenotypes (Lehto et al, 1999. Diabetes 48; 423-425; Chang et al, 1997. Eur. J. Biochem. 247; 148-159). Since transcriptional factors require interaction with cis-regulatory sites in order for their effects to be manifest, defects in the genomic target sites of these factors may produce similar (though quantitatively more subtle) physiological consequences. However, cis-regulatory variations should directly affect only their cognate gene(s). Cis-regulatory variation could manifest functionally in a variety of ways by impacting (a) the magnitude of gene expression; (b) regulation of tissue-specificity; (c) control over timing of expression during development and differentiation; (d) response to environmental stimuli (such as pharmacologic agents); or (e) some combination thereof. Given the overall prevalence of human genetic variation, lesions in one or more of the cognate cis-regulatory sites should be comparatively common. 
When the multiple regulatory factors that interact with each regulatory sequence of each gene are considered, such cis-variation would provide the ideal substrate for a complex, semi-quantitatively varying phenotype. There presently exist hundreds of reports in the literature of associations between genetic variation in known or suspected regulatory regions and phenotypic manifestations or disease risk (see extensive tabulations in Rockman and Wray 2002. Mol. Biol. Evol. 19; 1991-2004; Haukim et al., 2002. Genes Immun. 3; 313-330). Because the region immediately upstream of the transcriptional start site of human genes often (though not universally) demarcates the proximal promoter region, it is not surprising that the vast majority of efforts to locate polymorphisms that impact transcriptional regulation have focused on this region. While it is tempting to conclude that any polymorphism within the upstream region of genes is regulatory in nature, this overlooks the fact that the specific sequences which are active in vivo (i.e., those to which transcriptional factors are complexed) are in fact highly compartmentalized into discrete domains of remodeled chromatin (Felsenfeld 1996. Cell 86; 13-19; Struhl 2001. Science 293; 1054-1055). It is thus presently the case that many reports of regulatory polymorphism in the literature likely represent cases that would more correctly be classified simply as 'non-coding polymorphism of undetermined significance'. The availability of a molecular method capable of localizing actual cis-regulatory sequences would therefore have a major impact on studies of genetic variation. Even in cases where functional documentation has been undertaken, the focus on the proximal upstream region has resulted in a significant ascertainment bias, which is reflected in the fact that nearly 80% of all documented regulatory polymorphisms described are found within the first 600bp upstream of transcription start sites (Rockman and Wray, 2002. 
Mol. Biol. Evol. 19; 1991-2004). A clear illustration of the effect of regulatory polymorphism in modulating quantitative phenotypes is provided by serum lipids. An extensive literature has now emerged relating dyslipidemias with regulatory polymorphism in major apolipoprotein and lipolytic genes including ApoA1 (Smith et al, 1992. J. Clin. Invest. 89; 1796-1800; Barre et al, 1994. J. Lipid Res. 35; 1292-1296; Juo et al, 1999. Am. J. Med. Genet. 82; 235-241), ApoC3 (Dammerman et al, 1993. Proc. Natl. Acad. Sci. USA 90; 4562-4566; Hegele et al, 1997. Arterioscler. Thromb. Vasc. Biol. 17; 2753-2758), ApoB (Van Hooft et al, 1999. J. Lipid Res. 40; 1686-1694), ApoE (Nickerson et al, 2000. Genome Res. 10; 1532-1545), ApoC1 (Xu et al, 1999. J. Lipid Res. 40; 50-58), hepatic lipase (Guerra et al., 1997. Proc. Natl. Acad. Sci. USA 94; 4532-4537; Deeb and Peng 2000. J. Lipid Res. 41; 155-158; Zambon et al, 2003. Curr. Opin. Lipidol. 14; 179-189; Murtomaki et al, 1997. Arterioscler. Thromb. Vasc. Biol. 17; 1879-1884), lipoprotein lipase (Hall et al, 1997. Arterioscler. Thromb. Vasc. Biol. 17; 1969-1976; Talmud et al, 1998. Biochem. Biophys. Res. Commun. 252; 661-668), hormone-sensitive lipase (Pihilajamaki et al., 2001. Eur. J. Clin. Invest. 31; 302-308; Talmud et al, 1998. J. Lipid. Res. 39; 1189-1196), and cholesteryl ester transfer protein (Dachet et al, 2000. Arterioscler. Thromb. Vasc. Biol. 20; 507-515). Many of these functional polymorphisms have further been shown to influence atherosclerosis (Ye et al, 1996. J. Biol. Chem. 271; 13055-13060; Jansen et al, 1997. Arterioscler. Thromb. Vasc. Biol. 17; 2837-2842; Corbex et al, 2000. Nature Genet. 32; 432-437), myocardial infarction (Lambert et al, 2000. Hum. Mol. Genet. 9; 57-61; Ericksson et al, 1995. Proc. Natl. Acad. Sci. USA 92; 1851-1855), and stroke (Ito et al, 2000. Stroke 31; 2661-2664; Nakayama et al, 2000. Am. J. Hypertens. 13; 1263-1267). 
2.1.5 Regulatory polymorphism in common diseases with known or suspected environmental components
Compelling evidence now exists for the involvement of regulatory polymorphism in diverse diseases that have a major environmental component. Relevant examples include:
2.1.5.1 Pulmonary diseases. Regulatory polymorphism has recently emerged as a centerpiece of studies of the genetic determinants of airway reactivity, and has been described in several genes associated with asthma (In et al, 1997. J. Clin. Invest. 99; 1130-1137; Silverman et al, 1998. Am J. Respir. Cell Mol. Biol. 19; 316-323; Scott et al, 1999. Br. J. Pharmacol. 126; 841-844; Drazen et al, 1999. Nature Genet. 22; 168-170; Sanak et al., 2000. Am J. Respir. Cell Mol. Biol. 23; 290-296; Drysdale et al, 2000. Proc. Natl. Acad. Sci. USA 97; 168-170), chronic respiratory disease (Morgan et al, 1993. Hum. Mol. Genet. 2; 253-257) including COPD (Keatings et al, 2000. Chest 118; 971-975) and environmental susceptibility to emphysema (Yamada et al, 2000. Am J. Hum. Genet. 66; 187-195).
2.1.5.2 Allergic and autoimmune diseases. Functional non-coding polymorphisms have also been implicated in allergic (Nickel et al, 2000. J. Immunol. 164; 1612-1616) and autoimmune diseases including juvenile rheumatoid arthritis (Crawley et al, 1999. Arthritis Rheum. 42; 1101-1108; Fishman et al, 1998. J. Clin. Invest. 102; 1369-1376), SLE (Stevens et al, 2001. Arthritis Rheum. 44; 2358-2366), myasthenia gravis (Kaluza et al, 2000. J. Invest. Dermatol. 114; 1180-1183), systemic sclerosis (Hata et al., 2000. Biochem. Biophys. Res. Commun. 272; 36-40), and Type I diabetes (Kennedy et al, 1995. Nature Genet. 9; 293-298; Lew et al, 2000. Proc. Natl. Acad. Sci. USA 97; 12508-12512; Pugliese et al, 1997. Nature Genet. 15; 293-297).
2.1.5.3 Cancer. Regulatory polymorphisms in a variety of genes have been associated with cancers of the ovary (Phelan et al, 1996. Nature Genet. 12; 309-311), aerodigestive tract (Cascorbi et al, 2000. Cancer Res. 60; 644-649), lung (Zhu et al., 2001. Cancer Res. 61; 7825-7829), endometrium (Nishioka et al, 2000. 91; 612-615), prostate (Rebbeck et al, 2000. J. Natl. Cancer Inst. 92; 76; Rebbeck et al, 1998. J. Natl. Cancer Inst. 90; 1225-1229), and skin (Foster et al, 2000. Blood 96; 2562-2567; Ye et al, 2001. Cancer Res. 61; 1296-1298).
2.1.5.4 Common birth defects. At least one report has specifically connected regulatory polymorphism of PDGF-alpha with neural tube defects during gestation (Joosten et al, 2001. Nature Genet. 27; 215-217).
2.1.6 Functional polymorphism in sequences mediating specific physiological responses
Regulatory factor recognition motifs within cis-regulatory elements can be said to comprise the components of 'nodes' in transcriptional regulatory networks. Mutations disrupting or otherwise modifying specific factor motifs may thus shed light on the physiological connections of multi-gene pathways. Regulatory polymorphism has been described in cis-regulatory sequences which are known to respond to specific physiological stimuli including insulin (Groenendijk et al, 1999. J. Lipid Res. 40; 1036-1044; Waterworth et al, 2000. J. Lipid Res. 41; 1103-1109), low-density lipoproteins (Eriksson et al, 1998. Arterioscler. Thromb. Vasc. Biol. 18; 20-26), sterols (Yang et al, 1998. J. Lipid Res. 39; 2054-2064), retinoic acid (Piedrafita et al, 1996. J. Biol. Chem. 271; 14412-14420), and estrogen (Morgan et al, 2000. J. Hypertens. 18; 553-557). Mutations in specific drug responsive elements (e.g., nifedipine) have also been described (Walker et al, 1998. Hum. Mutat. 12; 289). Gene induction is a well-described response to a variety of external stimuli, classically xenobiotics. Metabolism of diverse pharmaceuticals is also heavily influenced by inter-individual variation in expression of metabolizing genes. Among enzymes which are known to be impacted by regulatory polymorphism are acetylcholinesterase (Shapira et al, 2000. Hum. Mol. Genet. 9; 1273-1281), glutathione-S-transferase (Coles et al, 2001. Pharmacogenetics 11; 663-669), monoamine oxidase (Denney et al, 1999. Hum. Genet. 105; 542-551; Sabol et al, 1998. Hum. Genet. 103; 273-279), thymidylate synthase (Mandola et al, 2003. Cancer Res. 63; 2898-2904), ornithine decarboxylase (Guo et al, 2000. Cancer Res. 
60; 6314-6317), and tyrosine hydroxylase (Albanese et al., 2001. Hum. Mol. Genet. 10; 1785-1792; Meloni et al, 1998. Hum. Mol. Genet. 1; 423-428). Regulatory polymorphisms of several genes involved in alcohol metabolism have also been described (Chou et al, 1999. Alcohol Clin. Exp. Res. 23; 963-968; Edenberg et al, 1999. Pharmacogenetics 9; 25-30) and at least one has been linked with clinical alcoholism (Harada et al, 1999. Alcohol Clin. Exp. Res. 23; 958-962). Regulatory polymorphism also appears to be prevalent within p450 enzymes including CYP1A2 (Aitchison et al, 2000. Pharmacogenetics 10; 695-704), CYP2E1 (Hayashi et al, 1991. J. Biochem. 110; 559-565; Watanabe et al, 1994. J. Biochem. 116; 321-326; Hildesheim et al, 1995. Cancer Epidemiol. Biomarkers Prev. 4; 607-610; Fairbrother et al, 1998. Pharmacogenetics 8; 543-552; Marchand et al, 1999. Cancer Epidemiol. Biomarkers Prev. 8; 495-500; Chabra et al, 1999. Carcinogenesis 20; 1031-1034), CYP2A6 (Pitarque et al, 2001. Biochem. Biophys. Res. Commun. 284; 455-460), and CYP3A4 (Rebbeck et al, 1998. J. Natl. Cancer Inst. 90; 1225-1229; Amirimani et al, 1999. J. Natl. Cancer Inst. 91; 1588-1590; Rebbeck 2000. J. Natl. Cancer Inst. 92; 76). The aforementioned examples provide powerful evidence of the existence and physiological relevance of regulatory polymorphism affecting a wide spectrum of human genes. While promoter sequences are clearly necessary for expression, a recurring theme in the study of human gene regulation is that promoters alone are typically not sufficient for high-level expression, for tissue-specific expression, or for both. The enzymes encoded by the Cyp3A genes catalyze the metabolism of structurally diverse endobiotics, drugs, and protoxic and procarcinogenic molecules and provide a relevant example. These genes exhibit substantial (>30-fold) interindividual variability in expression which is linked in cis. 
However, comprehensive sequencing of their promoter regions has thus far failed to disclose the responsible molecular lesions (Kuehl et al 2001). The distal regulatory sequences of Cyp3A genes have not been delineated. This example provides clear rationale for the necessity of searching for polymorphism in distal regulatory sequences. Because of the difficulty in locating distal regulatory sequences using conventional methods, however, non-promoter regulatory variants have not been amenable to systematic study. Nonetheless, several cases of non-promoter regulatory polymorphism have come to light, often with clear clinical correlates. Examples include alpha1 immunoglobulin (Denizot et al 2001), ornithine decarboxylase (Martinez et al, 2003. Proc. Natl. Acad. Sci. USA 100; 7859-7864), apolipoprotein(a) (Wade et al, 1991. Atherosclerosis 91; 63-72; Wade et al, 1994. J. Biol. Chem. 269; 19757-19767; Wade et al, 1997. J. Biol. Chem. 272; 30387-30399; Puckey and Knight 2003. Atherosclerosis 166; 119-127), the Calpain10 gene implicated in Type 2 diabetes (Horikawa et al, 2000. Nature Genet. 26; 163-175; Cox 2001. Hum. Mol. Genet. 20; 2301-2305), the Renin gene enhancer (Fuchs et al, 2002. J. Hypertens. 20; 2391-2398), and an intronic enhancer of PDCD1, associated with development of systemic lupus erythematosus (Prokunina et al, 2002. Nature Genet. 32; 666-669). A functional lesion within a regulatory sequence located >17kb distant to the acetylcholinesterase gene has been identified and characterized in vivo (Shapira et al, 2000. Hum. Mol. Genet. 9; 1273-1281). The example of acetylcholinesterase provides further proof-of-principle for the existence of functional polymorphism in distant regulatory sequences that have pronounced and heritable phenotypic manifestations. Regulatory polymorphisms may also interact with protein coding lesions to potentiate or ameliorate their phenotypic consequences. Examples of this phenomenon are found in CFTR (Romey et al, 1999. J. Med. Genet. 
36; 263-264; Romey et al, 2000. J. Biol. Chem. 275; 3561-3567; Romey et al, 1999. Hum. Genet. 105; 145-150) and in LTA, where cooccurrence of a functional intronic enhancer polymorphism and a non-synonymous coding variant substantially increases the risk of myocardial infarction in homozygotes (Ozaki et al, 2002. Nature Genet. 32; 650-654). These examples and others highlight the value of the approach we propose to employ in this study, namely, targeted interrogation of candidate cis-regulatory sequences to discover functional regulatory alleles that may modulate important clinical traits and disease phenotypes. The fact that examples of extra-promoter regulatory polymorphism such as the above have come to light in spite of the limited database of known distal regulatory sequences highlights the promise of systematic, large-scale mining of such elements over a gene set of broad physiological relevance. Comparatively 'deep' surveys of genetic variation are a logical approach to regions of the genome in which polymorphisms would be expected to alter gene function or expression, and thereby contribute to phenotypic variation. Polymorphisms with functional consequences are expected to have lower allele frequencies and, in fact, the majority of coding region SNPs (cSNPs) that change an amino acid have allele frequencies below 5% (Cargill et al, 1999. Nature Genet. 22; 231-238; Halushka et al, 1999. Nature Genet. 22; 239-247). Target population sizes sufficient for comprehensive identification of alleles with frequencies of 1-5% are therefore most desirable and have motivated the sample sizes used in this proposal. Cis-regulatory regions are of the greatest scientific and clinical interest though they are extremely difficult to delineate and study using conventional approaches. Identification of regulatory regions is expected to be of central importance to our understanding of common diseases, quantitative traits, and environmental exposures.
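The relationship between allele frequency and the population size needed to observe an allele at least once can be illustrated with a short calculation. The sketch below is not taken from the text: it assumes diploid individuals whose chromosomes are sampled independently, and the function names and the 95% detection target are illustrative choices only.

```python
import math


def detection_probability(allele_freq: float, n_individuals: int) -> float:
    """Chance of seeing an allele at least once among 2 * n chromosomes.

    Assumes diploid individuals and independent sampling of chromosomes
    (an idealization; relatedness and population structure are ignored).
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


def min_individuals(allele_freq: float, target: float = 0.95) -> int:
    """Smallest number of individuals giving at least `target` detection probability."""
    # Solve 1 - (1 - p)^(2n) >= target for n.
    n_chromosomes = math.log(1.0 - target) / math.log(1.0 - allele_freq)
    return math.ceil(n_chromosomes / 2.0)


if __name__ == "__main__":
    for freq in (0.01, 0.05):
        n = min_individuals(freq)
        print(f"allele frequency {freq:.2f}: {n} individuals "
              f"(detection probability {detection_probability(freq, n):.3f})")
```

Under these assumptions, rarer alleles dominate the sample-size requirement, which is consistent with the emphasis above on the 1-5% frequency range.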
2.1.7 Computational approaches to the study of cis-regulatory sequences
The search, via computational methods, for cis-regulatory elements in genomic DNA has been pursued using three different classes of techniques: motif discovery algorithms, algorithms for recognizing cis-regulatory modules, and non-motif-based algorithms. The problem is particularly challenging in the human genome, owing not only to its size and sequence diversity, but mainly to the fact that human gene regulation is characterized by coordinate action of multiple cis-regulatory elements over distances of many kilobases.
2.1.7.1 Algorithms for de novo discovery of TFBS motifs
The first class of algorithms performs de novo discovery of transcription factor binding site (TFBS) motifs in relatively small sets of DNA sequences. This class includes algorithms such as the Gibbs sampler (Lawrence et al., 1993. Science, 262(5131):208-214), MEME (Bailey and Elkan, 1994. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pages 28-36) and Consensus (Hertz and Stormo, 1999. Bioinformatics, 15(7):563-577). Recent research in this area focuses on building richer motif models (Xing et al., 2003. Advances in Neural Information Processing Systems, Cambridge, MA, 2003. MIT Press), on developing provably optimal algorithms (Eskin et al., 2003. Proceedings of the Pacific Symposium on Biocomputing, pages 29-40, New Jersey, 2003. World Scientific), on finding pairs of co-occurring binding sites (Eskin and Pevzner, 2002. Bioinformatics, 18:S354-S363; van Helden et al, 2000. Nucleic Acids Research, 28(8):1808-1818), and on searching simultaneously with sequence information and other types of data (Loots et al., 2002. Genome Res. 12, 832-9; Blanchette and Tompa, 2002. Genome Research, 12(5):739-748; McCue et al., 2001. Nucleic Acids Research, 29(3):774-782; Bussemaker et al., 2001. Nature Genetics, 27:167-171; Holmes and Bruno, 2000. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 202-210). However, because these algorithms are appropriate only for relatively small data sets, they all require prior knowledge of the approximate locations of a collection of similar TFBSs.
2.1.7.2 Algorithms for discovery of cis-regulatory modules
Algorithms in the second class, in contrast, operate on much larger sequence databases; however, these algorithms generally assume that the statistical properties of a small collection of transcription factor binding sites are known a priori. Here, the problem is to locate statistically significant clusters of these binding sites, called regulatory modules, in genomic DNA. Three groups of algorithms for recognizing regulatory modules have been proposed. Algorithms in the first group use a sliding window approach, scoring each subsequence that appears in the window with respect to a given collection of motifs (Prestridge, 1995. Journal of Molecular Biology, 249:923-932; Kondrakhin et al, 1995. Computer Applications in the Biosciences, 11:477-488; Frech et al., 1997. Journal of Molecular Biology, 270:674-687; Berman et al., 2002. Proc Natl Acad Sci USA, 99:757-762; Markstein et al., 2002. Proc Natl Acad Sci U S A. 99:763-8; Levy and Hannenhalli, 2002. Mammalian Genome, 13:510-514; Johansson et al., 2003. Bioinformatics, 19(Suppl. 1):i169-i176; Sharan et al., 2003. Bioinformatics, 19(Suppl. 1):i292-i301). The sliding window approach has intuitive appeal, and has yielded good results in analyses of motif clusters in Drosophila (Berman et al., 2002. Proceedings of the National Academy of Sciences of the United States of America, 99:757-762; Markstein et al., 2002. Proc Natl Acad Sci U S A. 99:763-8). The second group of search algorithms uses a probabilistic modeling framework called hidden Markov models (HMMs) (Frith et al., 2001. Bioinformatics, 17(10):878-889, 2002; Bailey and Noble, 2003. Bioinformatics, 19(Suppl. 2):ii16-ii25). The HMM approach is more theoretically rigorous and offers more accurate statistics than the relatively ad hoc sliding window approach. 
However, both the sliding window and the HMM approaches to the regulatory module search problem are generative: both rely upon a model (implicit or explicit) of a regulatory module. The third group of algorithms uses a discriminative technique. These methods model the difference between the regulatory module and non-regulatory sequence. Logistic regression analysis (LRA) is a discriminative technique based upon a sliding window, which has been used successfully to build predictors for muscle-specific (Wasserman and Fickett, 1998. Journal of Molecular Biology, 278:167-181) and liver-specific (Krivan and Wasserman, 2001. Genome Research, 11:1559-1566) regulatory modules. The Fisher kernel support vector machine (SVM) method (Pavlidis et al., 2001. Proceedings of the Pacific Symposium on Biocomputing, pages 151-163) uses a discriminative algorithm based upon a hidden Markov model. In the presence of a small amount of data, discriminative techniques typically achieve better performance than similar, generative techniques.
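The sliding window group of module-finding algorithms described in this section can be sketched in a few lines. The example below is a deliberately simplified illustration, not the method of any cited paper: motifs are exact strings rather than position weight matrices, the score is a plain hit count, and the function names, window size, step, and threshold are all assumptions made for the sketch.

```python
def find_motif_hits(seq: str, motifs: list[str]) -> list[tuple[int, str]]:
    """Return (start, motif) for every exact occurrence of each motif in seq."""
    hits = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            hits.append((start, motif))
            start = seq.find(motif, start + 1)  # allow overlapping occurrences
    return sorted(hits)


def sliding_window_clusters(seq, motifs, window=200, step=50, min_hits=3):
    """Report windows that fully contain at least `min_hits` motif occurrences.

    Each result is (window_start, window_end, hit_count), end exclusive.
    """
    hits = find_motif_hits(seq, motifs)
    clusters = []
    last_start = max(0, len(seq) - window)
    for w_start in range(0, last_start + 1, step):
        w_end = w_start + window
        count = sum(1 for pos, motif in hits
                    if pos >= w_start and pos + len(motif) <= w_end)
        if count >= min_hits:
            clusters.append((w_start, w_end, count))
    return clusters


if __name__ == "__main__":
    # Toy sequence: three motif occurrences packed into about 40 bp.
    toy = "A" * 100 + "TGACTCA" + "A" * 10 + "GATAAG" + "A" * 10 + "TGACTCA" + "A" * 100
    print(sliding_window_clusters(toy, ["TGACTCA", "GATAAG"], window=50, step=10))
```

A generative scanner of this kind only asks whether a window resembles a cluster; a discriminative method would instead be trained on labeled regulatory and non-regulatory windows.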
2.1.7.3 Non-motif-based methods
The third class of algorithms for identifying cis-regulatory elements is the most general, requiring as input only a database of genomic DNA and producing as output, for example, the predicted locations of promoter regions or CpG islands. Many techniques in this class are non-motif based, capitalizing instead on compositional statistics (see Zhang (2002) Nature Reviews Genetics, 3:698-710, for a review). Some methods augment these statistics using libraries of known TFBSs (Crowley et al., 1997. Journal of Molecular Biology, 268:8-14) or libraries of words extracted in an unsupervised fashion from sequence databases (Scherf et al., 2000. Journal of Molecular Biology, 297:599-606). While most promoter recognition techniques are generative, at least one discriminative method has been described (Davuluri et al., 2001. Nature Genetics, 29(4):412-417).
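As an illustration of what compositional statistics can mean in practice, the sketch below computes two quantities often used when flagging candidate CpG islands: GC content and the ratio of observed to expected CpG dinucleotides. The thresholds (length of at least 200 bp, GC content of at least 0.5, observed/expected of at least 0.6) are commonly used values assumed here for illustration; they are not taken from the text, and the function names are hypothetical.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c_count + g_count) / n
    expected_cpg = c_count * g_count / n
    obs_exp = cpg_count / expected_cpg if expected_cpg else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply the assumed length, GC-content, and observed/expected thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    print(cpg_stats("CGCGCGCG"))
    print(looks_like_cpg_island("CGCGCGCG" * 30))
    print(looks_like_cpg_island("ATATATAT" * 30))
```

Statistics of this kind need no motif library, which is why such methods can be run directly on raw genomic sequence.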
2.1.7.4 Data fusion
Increasingly, the analysis of regulatory elements in DNA faces problems related to data fusion, i.e., drawing inferences from a collection of heterogeneous data. For any of the search problems described above, a solution that operates only on the given DNA sequences suffers from a loss of power relative to a competing method that capitalizes on various types of auxiliary data. The simplest approach to data fusion is to treat each type of data independently. For example, co-expression of genes in microarray experiments may be used to select a collection of upstream regions for analysis by a motif discovery algorithm (Chu et al., 1998. Science, 282:699-705). Similarly, conservation of human DNA with respect to the mouse genome may be used to reduce the size of a database to be scanned. More powerful techniques learn simultaneously from two or more types of data, e.g., from DNA sequence and microarray data (Bussemaker et al, 2001. Nature Genetics, 27:167-171), or from DNA from multiple species (Duret and Bucher, 1997. Current Opinions in Structural Biology, 7:399-405; Blanchette and Tompa, 2002. Genome Research, 12(5):739-748). Indeed, the problem of discovering motifs in the presence of multi-species sequence data is called phylogenetic footprinting (Tagle et al., 1988. Journal of Molecular Biology, 203:439-455) and has recently seen success in an analysis of four yeast genomes (Kamvysselis et al., 2003. In Proceedings of the Seventh Annual International Conference on Computational Molecular Biology, pages 157-166; Kellis et al, 2003. Nature 423:241-54).
2.1.7.5 In vivo molecular validation of computational predictions
To date, there have been few published efforts to perform in vivo validation of computational predictions, owing mainly to the painstaking nature and cost of conventional molecular methodologies. All have been performed in lower-complexity genomes than the human, principally Drosophila (see references above) and C. elegans (Gaudet et al 2002. Science 295(5556):821-5), and generally under idealized situations such as a restricted developmental window when the action of specific morphogenic transcription factors predominates. Furthermore, all published studies have relied on motif-based approaches and it is observable that the findings forthcoming from the majority have pertained to homotypic regulatory elements (i.e., those which contain clusters of binding sites for a single transcriptional factor). Finally, the predicted sensitivity of the approaches is poor, since only a few dozen statistically-significant predictions were made even in genome-wide searches. Significantly, in no case has any computational methodology undergone rigorous in vivo validation sufficient to establish (or reject) its predictive value.
2.1.7.6 Use of comparative genomic approaches to predict regulatory sequences
Comparative genomic analyses represent a conceptually attractive approach for identification of regulatory sequences (Ureta-Vidal et al. 2003. Nat. Rev. Genet. 4, 251-62). The central hypothesis of such studies is that functionally important sequences will exhibit selective pressures that propagate over evolutionary distances (Dermitzakis et al. 2002. Nature 420, 578-82). However, in reality the situation is complex. For example, while it is clear that certain regulatory elements have been highly conserved during vertebrate and particularly mammalian evolution (Elnitski et al. 2003. Genome Res. 13, 64-72), it is also evident that many such elements exhibit little or no selective conservation above local background (Flint et al. 2001. Hum. Mol. Genet. 10, 371-82). Given that a surprisingly large proportion of the human genome appears to be under selection (Waterston et al 2002. Nature 420(6915):520-62), the task that we ask of a comparative genomics-based method is: can functional elements in the human genome be reliably and specifically discriminated from background levels of conservation? To date, there is little evidence that this can be accomplished in a manner that displays adequate sensitivity and specificity and that generalizes well across the genome. Very few studies have evaluated elements identified purely on the basis of comparative genomics (predominantly mouse-human) approaches, and in no case has the comparative genomic hypothesis been rigorously examined. Furthermore, an interesting feature of several such studies is the fact that the elements which were reported to be identified on the basis of comparative genomics had in fact been reported previously to be DNase I hypersensitive sites (Loots et al 2002. Genome Research, 12(5):832-839; Mohrs et al 2001; Gottgens et al 2000. Nat Biotechnol. 18(2):181-6). 
For example, in one study of the interleukin cluster on chromosome 5 (Loots et al 2002. Genome Research, 12(5):832-839), 90 conserved non-coding sequences were identified, but the only one selected for in vivo studies was in fact a previously described and studied DNase I hypersensitive site (Takemoto et al 1998. Int Immunol. 10(12):1981-5). The recent availability of comparative sequence information from a range of vertebrate and mammalian species has now made practical the description and evaluation of sequence elements conserved across multiple species (so-called multi-species-conserved elements or 'MCSs' (Thomas et al 2003. Nature 424(6950):788-93)). However, although this information imparts some specificity, it does not seem to improve sensitivity, as evidenced by poor performance in identifying previously-characterized regulatory elements. For example, only a small fraction of the numerous DNase I hypersensitive sites identified within and flanking the CFTR gene (Nuthall et al 1999a. Biochem J. 1999 341 (Pt 3):601-11; Nuthall et al 1999b. Eur J Biochem. 1999 266(2):431-43; Smith et al 2000. Genomics 64(1):90-6) were found to coincide with MCSs, in spite of the fact that hundreds of MCSs were identified in this region. The availability of a generic high-throughput, in vivo functional method to identify candidate regulatory sequences would obviate the need to rely on comparative analyses as a primary discovery vehicle. Rather, their value could be realized mainly by further illumination of functionally-derived information. Such a functional method is described below and will be applied in the proposal.
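The comparison drawn above, between computationally predicted conserved elements and experimentally mapped hypersensitive sites, amounts to an interval-overlap calculation. The sketch below shows one way such a comparison might be scored; the coordinates are invented for illustration, the function names are hypothetical, and "any overlap" is an assumed matching criterion rather than one stated in the text.

```python
Interval = tuple[int, int]  # (start, end), end exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def sensitivity_and_ppv(known_sites: list[Interval],
                        predictions: list[Interval]) -> tuple[float, float]:
    """Score predictions against experimentally mapped sites.

    sensitivity: fraction of known sites overlapped by some prediction.
    ppv: fraction of predictions overlapping some known site.
    """
    found = sum(1 for k in known_sites
                if any(overlaps(k, p) for p in predictions))
    supported = sum(1 for p in predictions
                    if any(overlaps(p, k) for k in known_sites))
    sensitivity = found / len(known_sites) if known_sites else 0.0
    ppv = supported / len(predictions) if predictions else 0.0
    return sensitivity, ppv


if __name__ == "__main__":
    # Invented coordinates: four mapped sites and five conserved elements.
    hypersensitive_sites = [(100, 300), (1000, 1200), (5000, 5200), (9000, 9250)]
    conserved_elements = [(250, 400), (2000, 2100), (3000, 3050),
                          (6000, 6100), (7000, 7100)]
    print(sensitivity_and_ppv(hypersensitive_sites, conserved_elements))
```

A low sensitivity together with a large number of predictions corresponds to the situation described for the CFTR region, where many conserved elements were found but few coincided with mapped sites.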
2.2 Mapping genomic regulatory sites
2.2.1 Principle of cloning active chromatin sequences. We developed a strategy to create genomic DNA libraries containing sequences flanking DNase I cut sites introduced under limiting ('hypersensitive') conditions. The ends of DNase I-treated genomic DNA are enzymatically repaired and modified by the addition of a 3' dATP overhang; the A-tailed ends are then selected for by ligation of a biotinylated adapter specific for the A-tailed ends. The DNA is then restricted with the restriction enzyme NlaIII. At this stage the genome has been fragmented into two types of fragments: NlaIII-NlaIII fragments, derived from the background, and fragments with one NlaIII end and one A-tailed end, enriched for DNase I cut sites. The latter can be efficiently isolated on paramagnetic streptavidin-coated beads, while the remaining fragments are washed away. A second adapter is added to the captured DNA and the product is cut from the beads with a rare-cutter. This population is enriched in DNase I cut sites and is retained for the subtraction step. A second population, depleted in DNase I cut sites, is prepared for the subtraction step. It is made by cutting DNase I-treated genomic DNA with NlaIII, or other restriction enzymes that introduce a four-nucleotide 3' overhang. Digestion of this DNA with Exonuclease III will preserve the NlaIII-NlaIII fragments, while any end generated by DNase I will be efficiently digested. The resultant single-stranded DNA is removed by treatment with Mung Bean Nuclease and the remaining population is biotinylated in vitro. An excess of this population is mixed with the first, and the sample is denatured and allowed to reanneal. Those (non-biotinylated) fragments generated by repeated DNase I digestion at a specific site (i.e., a hypersensitive site) will be more likely to self-anneal than to find a partner in the depleted population to form a heteroduplex.
Extraction of the mixture with paramagnetic beads isolates the non-biotinylated homoduplexes, which are enriched in DNase I cut sites and hypersensitive sites. These are then amplified and cloned to make the genomic libraries.
2.2.2 Active chromatin sequences are G+C-rich and repeat-poor. We applied these methods to produce a library (K5008) of ACSs from K562 cells, which display an erythroid phenotype. The motivation for selecting this cell type was the extensive experience with it in chromatin studies, on the basis of which more DNase I hypersensitive sites have been identified and evaluated than in any other human cell type. We cloned and sequenced a total of 92,115 ACSs from the K5008 library. After filtering to remove ACSs that fell within repeated sequences or were not present in the human genome, we recovered a total of 61,561 ACSs. The mean length of filtered sequences was 86bp, which permitted unique localization within the human genome sequence. Only 12% of ACSs coincided with repetitive elements as determined by RepeatMasker. This represents a marked depletion of repetitive sequences compared with the genomic average of approximately 44% [International Human Genome Sequencing Consortium (2001). Nature 409, 860-921]. The mean G+C content of K5008 library sequences was 51%, significantly higher than the genomic average of 38-41% [International Human Genome Sequencing Consortium (2001). Nature 409, 860-921; Venter JC et al. (2001). Science 291:1304-51]. This difference cannot be accounted for by statistical overrepresentation of CpG islands (see below), since the absolute number of such sequences within the overall K5008 set was small. Its significance is further emphasized by the fact that the subtractive technique used to produce the ACS library will tend to select against G+C-rich sequences; on average these fragments have higher annealing temperatures and will cross-anneal (and thus be eliminated) more readily under the low-stringency hybridization conditions used to create the library.
2.2.3 ACSs derive predominantly from genic regions.
The distribution and density of 1,174 ACSs from the K5008 library that mapped to chromosome 21 are shown in Figure 1. Striking correspondence in both parameters is evident between Chr. 21 ACSs and genes. The overall distribution of ACSs relative to known genomic sequence classes differed from expectation in several categories. To quantify expectation, we programmed a simulation that selected random genomic sequences of the same average length as K5008 sequences and filtered them in an identical manner; several iterations were run, each producing 2,000 unique mapping events, which were then characterized with respect to the closest genomic feature. 19.9% of ACSs fell within introns of known genes compared with an average of 27.5% from the simulation runs (P<0.001). 1.85% of ACSs fell within exons of known genes (predominantly first exons), significantly greater than expectation (simulation 1.2%; P<0.01). 13.4% of ACSs overlapped a known CpG island, compared with 1.3% expected by chance (P<0.01).
Approximately 17.5% of ACSs mapped >50kb away from any known genic feature. Well-demarcated multi-gene domains of open chromatin have been proposed to be a regular feature of the higher-order organization of complex genomes [van Driel R. et al. (2003). J. Cell. Sci. 116:4067-4075]. Aside from global correspondence with genic regions, we found no evidence for large-scale well-circumscribed domains. However, it is unclear whether more subtly demarcated regions would have been detected given the current ACS sample size.
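The expectation simulation described in this section can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the simulation actually used: it draws single random positions rather than sequences of ACS-like length, applies no repeat or uniqueness filtering, and treats the genome as one linear coordinate space. The function names and the interval representation (sorted, non-overlapping, half-open `(start, end)` pairs) are hypothetical.

```python
import bisect
import random

def make_lookup(intervals):
    """Build a membership test for non-overlapping half-open (start, end) intervals."""
    ivs = sorted(intervals)
    starts = [s for s, _ in ivs]

    def inside(pos):
        # Find the last interval starting at or before pos, then check its end.
        i = bisect.bisect_right(starts, pos) - 1
        return i >= 0 and pos < ivs[i][1]

    return inside

def simulated_fraction(genome_length, intervals, n_events=2000, seed=0):
    """Fraction of n_events uniformly random positions that fall in a feature class."""
    rng = random.Random(seed)
    inside = make_lookup(intervals)
    hits = sum(inside(rng.randrange(genome_length)) for _ in range(n_events))
    return hits / n_events
```

Running such a function several times with different seeds, once per feature class (introns, exons, CpG islands), would give the chance expectation against which the observed ACS percentages are compared.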
2.2.4 High relative density of ACSs at transcription start sites and CpG islands. To examine the correspondence between ACSs and specific genomic landmarks, we computed the normalized density of ACSs relative to (i) the canonical 5' transcriptional start sites (TSSs) of known genes, (ii) the 3' transcription termini of the same genes, and (iii) CpG islands (Figure 2). Viewed on both a 50kb (+/- 25kb) (Figure 2a) and a closer 10kb (+/- 5kb) horizon (Figure 2b), a clear and symmetrical peak in the relative density of ACSs was observed around TSSs. No significant peak was observed at 3' termini (Figure 2c), confirming the specificity of this finding for TSSs. We also observed a prominent peak in ACS density relative to CpG islands (Figure 2d). Since CpG islands are a regular (though by no means universal) feature of human promoters, we performed a second analysis which included only CpG islands located >2.5kb distant (5' or 3') from a known TSS (Figure 2d). This also revealed a prominent peak, suggesting that a proportion of intergenic CpG islands lie within active chromatin domains. We also considered what component of the observed ACS peak around TSSs could be explained by CpG islands. We found that the strong peak in ACS density at TSSs persisted even when only non-CpG island-associated TSSs were analyzed, further confirming that this is due to a chromatin feature intrinsic to TSSs. One explanation for this distribution of ACSs around TSSs is that it reflects a large-scale chromatin disruption. An alternative and readily testable hypothesis is that this finding instead signifies continuous (though non-linear) distribution of hypersensitive sites both 5' and 3' of TSSs (see below).
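A density profile of the kind described above can be sketched as follows. This is an illustration only: it assumes all positions lie on one chromosome, ignores gene strand (so "upstream" and "downstream" are simply lower and higher coordinates), and normalizes by the total number of ACSs, which may differ from the normalization used for Figure 2. The function name and parameters are hypothetical.

```python
import bisect
from collections import Counter

def density_profile(acs_positions, tss_positions, half_window=5000, bin_size=1000):
    """Fraction of ACSs per distance bin around the nearest TSS.

    Keys are the lower edge of each bin, in bp relative to the TSS.
    """
    tss = sorted(tss_positions)
    total = len(acs_positions)
    if total == 0 or not tss:
        return {}
    counts = Counter()
    for pos in acs_positions:
        i = bisect.bisect_left(tss, pos)
        # The nearest TSS is either the one just below or just above pos.
        candidates = tss[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda t: abs(pos - t))
        d = pos - nearest
        if -half_window <= d < half_window:
            counts[(d // bin_size) * bin_size] += 1
    return {b: c / total for b, c in sorted(counts.items())}
```

A symmetrical peak around the TSS would appear as roughly equal values in bins on either side of 0, falling off with distance.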
2.2.5 ACSs show a preference for expressed genes. We next asked whether the peak in ACS density at TSSs was confined to expressed genes. To evaluate this, we assayed expression of 17,976 genes in K562 cells using a standard microarray platform (see Materials and Methods). We defined genes to be expressed (irrespective of relative magnitude) if their mean intensity exceeded the mean intensity of the background by one standard deviation, a rigorous and widely-accepted empirical threshold [Wodicka, L., Dong, H., Mittmann, M., Ho, M. H., and Lockhart, D. J. (1997) Nature Biotechnology 15, 1359-67; Epstein, C. B., Hale, W., TV, and Butow, RA. (2001) Methods Cell Biol. 65, 439-452]. 8333 genes (46.3%) met these criteria; 9654 genes (53.6%) were considered non-expressed. Genes for which no data were available due to lack of inclusion on the array or to technical issues were not included in the analysis. We then plotted ACS density relative to the TSSs of this gene set, and also relative to the TSSs of the non-expressed genes (Figure 3). This showed a statistically meaningful increase in the density of ACSs relative to expressed genes, but the overall difference was surprisingly modest. This suggests that a large number of non-expressed genes reside within open chromatin domains.
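The expression call used above can be written as a one-line threshold. The sketch below assumes that the standard deviation in question is that of the background intensities and uses the sample standard deviation; neither detail is stated in the text, and the function name is hypothetical.

```python
from statistics import mean, stdev

def is_expressed(probe_intensities, background_intensities):
    """Call a gene expressed if its mean intensity exceeds background mean + 1 SD."""
    # Assumption: the SD is taken over the background measurements.
    threshold = mean(background_intensities) + stdev(background_intensities)
    return mean(probe_intensities) > threshold
```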
2.2.6 Correspondence between ACSs and DNase I hypersensitive sites. To determine the approximate overall percentage of ACSs that overlapped a DNase I hypersensitive site, we randomly selected 48 ACSs and assayed them for hypersensitivity in K562 cells using HSqPCR. Of these, 3 (6.25%) overlapped a DNase I HS. This suggests that in total the ACS library contained at least ~3,800 hypersensitive sites. However, the fact that ACSs display marked distributional preferences for certain genomic features suggested that the correspondence between ACSs and DNase I hypersensitive sites would likewise be dependent on genomic context. We hypothesized specifically that the proportion of ACSs coinciding with DNase I hypersensitive sites would parallel the distribution of ACSs; namely, it would be maximal at the TSS and would diminish rapidly and symmetrically in both 5' and 3' directions. To test this, we randomly sampled 3 classes of ACS: (i) ACSs mapping between 0 and 250bp upstream of the TSS; (ii) ACSs mapping 1000+/-100bp upstream of the TSS; and (iii) ACSs mapping 1000+/-100bp downstream of the TSS. Primers to 48 randomly selected members of each class were designed and assayed for hypersensitivity in K562 cells using HSqPCR. 23/48 (47.9%) of ACSs mapping 0-250bp upstream of the TSS coincided with hypersensitive sites. However, promoter sequences are expected to contain DNase I hypersensitive sites, though their position relative to the TSS may vary. To assess the significance of this finding, we therefore determined the background prevalence of DNase I hypersensitivity at transcriptional start sites. Since we observed high ACS densities at both expressed and non-expressed genes, our gene selection was stratified accordingly. Primers were designed to encompass the first 250bp upstream of a total of 192 genes: 92 randomly selected genes from the top quartile of K562-expressed genes, and 92 non-expressed genes (see above).
Quantitative determination of DNase I hypersensitivity was then performed as described (Materials and Methods). We found that 17/92 (18.5%) of high-expressing and 18/92 (19.6%) of non- or background-expressing genes harbored DNase I HSs immediately upstream of their canonical transcriptional start sites. While all expressing genes are expected to display hypersensitivity over their promoter regions, the precise location of the promoter region has been determined functionally for only a fraction of genes. In many cases, promoter elements may be situated several hundred bases or even >1kb distant from the TSS [Davuluri R.V., Grosse I., Zhang M.Q. (2001). Nat Genet. 29:412-7]. Moreover, for many genes, the promoter region may be located downstream from the TSS within the first intron [Reisman, D., Greenberg, M., and Rotter, V. (1988) Proc. Natl Acad. Sci. USA 85, 5146-5150]. The finding that a comparable proportion of non-expressed genes also have hypersensitive sites immediately upstream of their TSS is not altogether surprising, and is compatible with previous observations in tissue-specific and developmentally-regulated genes [Groudine, M., Kohwi-Shigematsu, T., Gelinas, R., Stamatoyannopoulos, G., and Papayannopoulou, T. (1983) Proc. Natl Acad. Sci. USA 80, 7551-5; Radomska, H. S., Satterthwaite, A. B., Burn, T. C., Oliff, I. A., Huettner, C. S., and Tenen, D. G. (1998) Gene 222, 305-318; Fraser, P., and Grosveld, F. (1998) Curr Opin Cell Biol. 10, 361-5], suggesting that a significant number of human genes may be 'poised' for transcription. Of ACSs mapping -1000bp and +1000bp relative to the TSS, 2/47 (4.2%) and 2/46 (4.3%), respectively, were hypersensitive sites.
Taken together with the TSS-proximal results above, these findings parallel the distribution of ACSs relative to the TSS (including their symmetry), and suggest further that this distribution primarily reflects the average distribution of DNasel hypersensitive sites relative to the transcriptional start.
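The library-wide estimate quoted at the start of this section follows from simple arithmetic, shown here as a sketch. It assumes the 48 randomly sampled ACSs are representative of the 61,561 filtered ACSs reported in section 2.2.2; the variable names are illustrative.

```python
# Illustrative arithmetic only; variable names are not from the source.
filtered_acs = 61_561             # ACSs remaining after filtering (section 2.2.2)
sampled, hypersensitive = 48, 3   # random sample assayed by HSqPCR

hs_fraction = hypersensitive / sampled         # 0.0625, i.e. 6.25%
estimated_hs_acs = filtered_acs * hs_fraction  # about 3,848, quoted as ~3,800
print(f"{hs_fraction:.2%} of ACSs, roughly {estimated_hs_acs:,.0f} HS-overlapping ACSs")
```

With only 3 positives in 48, the sampling uncertainty on this point estimate is large, which is consistent with the text treating it as an approximate figure.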
2.2.7 ACSs in distal intergenic regions. Regulatory sequences and associated DNase I HSs have been reported to occur many tens or up to hundreds of kilobases distant from their cognate genes [Antoniv, T. T., De Val, S., Wells, D., Denton, C. P., Rabe, C., de Crombrugghe, B., Ramirez, F., and Bou-Gharios, G. (2001). J. Biol. Chem. 276, 21754-64; Spitz F, Gonzalez F, Duboule D. (2003). Cell 113:405-17]. To test the utility of ACSs for identification of distal DNase I HSs, we randomly selected 48 ACSs mapping >15kb (but <100kb) distant from the nearest gene (in either the 5' or the 3' direction) and tested them for hypersensitivity in K562 cells using HSqPCR. Of these, 0/48 (0%) were found to be hypersensitive. In light of the overall prevalence of HS-overlapping ACSs in the K5008 library (6.25%), this finding suggests significant relative depletion of DNase I HSs in the population of ACSs mapping within distal intergenic regions. However, given the size of the intergenic space and the a priori expected low density of DNase I HSs within it, the proportion of DNase I cut sites within HSs relative to sites of non-specific background cutting is expected to be low, resulting in low relative enrichment of HSs within ACSs from these regions.
2.2.8 ACS clusters provide higher discriminant value. Poor correspondence between ACSs and DNase I HSs in intergenic regions prompted us to search for meta-features that might be more predictive of hypersensitive sites. DNase I HSs are defined by a high frequency of DNase I cut sites over a given genomic interval. However, hypersensitivity only becomes manifest when cutting is averaged across a large population of individual chromosomes, each of which is cut in a stochastic fashion. We therefore hypothesized that in the context of a library of ACSs where each member represents a unique cutting event, DNase I hypersensitive sites would ultimately appear as clusters of ACSs mapping over small genomic intervals (<1000bp) as larger numbers of clones were analyzed. To test this, we clustered DNase I cut sites along chromosomes to identify statistically significant groups of cut sites subject to the null hypothesis of uniformly distributed sites against the genomic background. This revealed that, subject to the null hypothesis, clusters of size 2 contained within a 250bp interval are statistically significant for our ACS library size. We therefore defined an ACS cluster to be 2 or more DNase I cut sites contained within a 250bp window. This yields a hierarchy of ACS clusters containing 2 or more points. One source of false-positive clustering is the presence of segmental duplications within the human genome sequence [Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE. (2002). Science 297:1003-7], which are currently not formally incorporated into genome builds. We therefore filtered our preliminary cluster results against a library of known segmental duplications and rejected any cluster occurring within 10kb of these duplications. Applying these criteria, we identified a total of 3,293 clusters comprising 2-10 members.
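The cluster definition above (2 or more cut sites within a 250bp window) can be sketched in a few lines. The text does not say how overlapping windows are resolved, so the greedy left-to-right grouping below is an assumption, as are the function name and the single-chromosome input; the segmental-duplication filter is not shown.

```python
def find_clusters(cut_sites, window=250, min_size=2):
    """Greedily group cut sites on one chromosome into clusters spanning < window bp."""
    clusters, current = [], []
    for pos in sorted(cut_sites):
        # Close the current group once the next site falls outside the window
        # measured from the group's first (leftmost) site.
        if current and pos - current[0] >= window:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters
```

For example, `find_clusters([100, 180, 1000, 5000, 5100, 5240, 9000])` groups the sites into a 2-member cluster and a 3-member cluster and discards the two isolated sites.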
Plotting the distribution and density of these clusters relative to TSSs of known genes (Figure 4a) produces a significantly more prominent peak than ACSs alone, suggesting that they are considerably more enriched in DNase I hypersensitive sites. To test this directly, we designed primers to encompass the centroids (defined as the mean of the cluster data points) of all clusters of size >=4 and assayed these regions for DNase I hypersensitivity in K562 cells. We identified DNase I hypersensitive sites in 27/95 (28.4%) of 4-member clusters, 12/29 (41.4%) of 5-member clusters, 7/9 (77.8%) of 6-member clusters, 1/2 (50%) of 7-member clusters, and 2/5 (40%) of 8-member clusters. Since these 140 clusters were widely distributed across the genome, these results confirm a substantial overall enrichment for DNase I HSs within ACS clusters. Of 2 clusters with 10 members, neither showed DNase I hypersensitivity. We hypothesize that these 10-clusters identify as-yet-undocumented segmental duplications.
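The centroid definition and the per-size validation counts above can be combined into a short worked example. The counts are taken from the text; the pooled rate and the comparison with the 6.25% library-wide figure are derived here for illustration and are not stated in the source. Names are hypothetical.

```python
from statistics import mean

def centroid(cluster):
    """Centroid of a cluster, defined as the mean of its cut-site positions."""
    return mean(cluster)

# (HS-positive, tested) per cluster size, as reported for sizes 4 to 8.
results = {4: (27, 95), 5: (12, 29), 6: (7, 9), 7: (1, 2), 8: (2, 5)}

tested = sum(n for _, n in results.values())     # 140 clusters
positive = sum(k for k, _ in results.values())   # 49 HS-positive
pooled_rate = positive / tested                  # 0.35
fold_over_single_acs = pooled_rate / (3 / 48)    # relative to 6.25% for single ACSs
```

The pooled figure hides a wide spread between cluster sizes, and the 6- to 8-member classes are each based on fewer than ten clusters.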
2.2.9 Coincidence between ACSs and evolutionarily conserved non-genic sequences.
Evolutionarily conserved non-genic sequences (CNGs) have been proposed to mark functional elements such as regulatory sequences [Ureta-Vidal A, Ettwiller L, Birney E. (2003). Nat Rev Genet. 4:251-62]. We analyzed the correspondence between ACSs and a global set of mouse-human conserved sequences described previously [Alexandersson M, Cawley S, Pachter L. (2003) Genome Res. 13:496-502]. For additional stringency, we did not consider sequences with <70% sequence conservation irrespective of length. A total of 420,431 such CNGs (mean length of 157bp) were analyzed. Overall, we found that 4.1% of the DNase I-cut ends of ACSs fell within CNGs, significantly greater than expected by chance. This suggests that a measurable proportion of CNGs may harbor DNase I hypersensitive sites, and is further supported by a peak in the density of ACSs at CNGs, which is substantially more prominent when ACS clusters are considered (Figure 4b). When CNGs proximal to the TSS are excluded from this analysis, the effect diminishes moderately, suggesting that CNGs in intergenic regions or distal introns are enriched in DNase I HSs, though not dramatically so. For this reason, CNGs were not formally evaluated for hypersensitivity as a separate class. Nevertheless, of all the randomly selected ACSs described above, 22 overlapped CNGs. Of these, 6/22 (27%) coincided with hypersensitive sites. All HS-positive CNGs, however, were located within the first 137bp 5' of TSSs. Of 9 CNGs located >200bp from the TSS region, none were found to be hypersensitive.
2.2.10 Quantitative approximation of the distribution of HSs in the human genome. The distribution of DNase I HSs in the human genome is of considerable interest given the close correspondence between HSs and classical cis-regulatory sequences. However, quantitative data that permit even preliminary estimation of this distribution have been lacking. ACSs and particularly ACS clusters may provide reliable surrogate markers for the distribution of HSs. We calculated binned (1000bp) percentages of ACSs and ACS clusters within a 10kb window centered on the TSSs of known genes (Figure 5). Although this interval (+/-5kb from the TSS) encompasses the entire peak density regions (Figure 1), it contains only 22.4% of ACSs (Figure 5a) and 41.6% of ACS clusters (Figure 5b). However, DNase I HSs are markedly more prevalent in the vicinity of TSSs. We therefore used the results for DNase I HS prevalence at the TSS and at +/-1000bp described above to approximate the absolute proportion of DNase I HSs that lie within +/- 5kb of the TSS by assuming that the prevalence of HSs was distributed in parallel with ACSs or ACS clusters. This suggested that approximately 60% of all DNase I HSs are expected to lie within this interval. However, this figure still implies that approximately 40% of DNase I HSs and, by extension, cis-regulatory sequences are located >5kb from the TSS.
The study of gene regulation in complex organisms has been severely constrained by the lack of functionally-based methodologies for large-scale identification of cis-regulatory sequences. A variety of computational [Markstein M, Markstein P, Markstein V, Levine MS. (2002) Proc Natl Acad Sci U S A. 99:763-8; Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB. (2002) Proc Natl Acad Sci U S A. 99:757-62; Stathopoulos A, Van Drenth M, Erives A, Markstein M, Levine M. (2002) Cell. 111:687-701] and phylogenetic [Pennacchio LA, Rubin EM. (2001). Nat Rev Genet. 2:100-9] approaches have been developed to address this deficit but their utility has been curtailed by poor sensitivity and specificity for functional elements. By contrast, the use of DNase I hypersensitivity studies for identification of in vivo-functional regulatory sequences is well-established and has underpinned the discovery of hundreds of regulatory sequences controlling human genes and those of other eukaryotes. We have described an approach for extending the DNase I hypersensitivity paradigm to a genomic level through large-scale cloning and mapping of individual in vivo DNase I cutting events. Although demonstrated in human tissue, cloning and analysis of ACSs should be applicable to any eukaryotic cell type, providing the basis for the accumulation of comprehensive databases of cis-regulatory sequences. Analysis of a large library of active chromatin sequences has provided several novel insights into the relationship between chromatin structure and gene expression. We observed peaks in the density of ACSs at the transcriptional start sites of known genes, at non-gene-associated CpG islands, and, to a lesser degree, at evolutionarily-conserved non-coding sequences. A remarkable feature of the distribution of ACSs around transcriptional start sites was its symmetry. This suggests that proximal intron regions may be a rich reservoir of regulatory sequences [Aronow, B., D.
Lattier, et al. (1989). Genes Dev 3: 1384-400; Bates, N. P. and Hurst H. C. (1997). Oncogene 15: 473-81; Rowntree, R. K., G. Vassaux, et al. (2001). Hum Mol Genet 10: 1455-64; Tone, M., L. E. Diamond, et al. (1999). J Biol Chem 274: 710-6]. Another surprising finding was the approximately equal representation of expressed and non-expressed genes. This suggests that a majority of genes reside within open chromatin domains. The fact that a large proportion of ACSs and ACS clusters are found within a 10kb interval centered on transcription start sites of known genes is perhaps not surprising. However, the implication that ~40% of DNase I hypersensitive sites (and, by extension, cis-regulatory sequences) lie outside this interval highlights the need for approaches such as the one described here for efficient culling of such sequences from the vastness of the genome. All of the ACS cloning results described above employed a single round of subtraction. However, this procedure may be applied iteratively to produce a population highly enriched for DNase I hypersensitive sites. Additionally, generation of larger numbers of ACS library sequences will permit fuller exploitation of the clustering effect. In combination, these techniques may permit definitive HS probability thresholds to be associated with clusters of different sizes, eliminating the need for direct hypersensitivity testing of large numbers of candidate sequences.
3. SUMMARY OF INVENTION The present invention provides libraries and methods for creating libraries comprising arrays of cloned genomic fragments adjacent or co-incident with regulatory sites, such as nuclease hypersensitive sites, in the chromatin of any given organism or tissue type. Another embodiment of the process comprises protocols that render the sequence of each concatamerized genomic fragment recognizable and of sufficient length to routinely and unambiguously or stochastically define the genomic location of the parental fragment, with concomitant benefit for the efficiency and scope of the genome-wide study of, for example, regulatory sites. Other embodiments of the invention include the statistical analysis of the distribution of the mapped genomic locations of a collection of the fragments, with the intention of defining genomic locations where there is a high incidence of genomic tags, in order to increase the predictive value of genomic tags for mapping, for example, regulatory sites. According to another aspect of the present invention, there are provided methods for ascertaining the effect of an agent or other environmental perturbation on an active chromatin element profile of a genetic locus by obtaining and analysing a concatamerised library associated with a biological sample unexposed to an agent or perturbation; obtaining and analysing a second concatamerised library associated with a biological sample exposed to the agent or perturbation; and comparing the first analysis with the second to determine regulatory sites that are affected by the agent or perturbation.
In certain illustrative embodiments of this aspect of the invention, the perturbation occurs before obtaining the sample from a tissue, wherein the environmental perturbation is selected from the group consisting of an infection of the eukaryotic organism from a microorganism, loss in immune function of the eukaryotic organism, exposure of the tissue to high temperature, exposure of the tissue to low temperature, cancer of the tissue, cancer of another tissue in the eukaryotic organism, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound; and aging. The present invention provides methods for the large-scale isolation of regulatory sites in chromatin, comprising treating chromatin with an agent that modifies DNA at regulatory sites and fragmenting the modified chromatin, and isolating sub-fragments that contain or are adjacent to DNA modifications. In various embodiments, the modified genomic DNA is prepared by treating chromatin with an enzyme, a chemical agent, radiation, a shearing device, or a combination thereof. The chromatin can be obtained from cell nuclei of a biological sample, for example samples of primary cells, mammalian cells, human cells, murine cells, plant cells, fly cells, worm cells, fish cells, diseased cells, cancerous cells, yeast cells, transformed cells and cell lines, embryonic cells, stem cells, yeast artificial chromosomes containing mammalian DNA sequences, plant artificial chromosomes containing eukaryotic DNA sequences, and nuclear extracts and combinations thereof. 
In certain specific embodiments, the agent is selected from the group consisting of enzymes (e.g., a nuclease, a non-specific endonuclease, a sequence-specific endonuclease, a topoisomerase (e.g., topoisomerase II), a methylase, a histone acetylase, a histone deacetylase, or a combination thereof), radiation (e.g., UV light, lasers, and ionizing radiation), chemical agents (e.g., a clastogen or a cross-linker), shearing, centrifugation, or electrophoretic devices, and combinations thereof. Where the DNA modifying agent is an enzyme, the enzyme can be endogenous or exogenous. In specific embodiments, the DNA modifying enzyme is a non-specific endonuclease such as DNase I or a sequence-specific endonuclease such as a restriction endonuclease. In certain embodiments, one or more oligonucleotide linkers are ligated to fragments containing modifications directly or indirectly resulting from treatment of chromatin with the DNA-modifying agent. In a preferred embodiment, the oligonucleotide linkers contain one member of a binding pair, such as biotin. In specific modes of the embodiment, the other member of the binding pair is streptavidin, provided on streptavidin-coated paramagnetic beads. In such embodiments, the fragments can be isolated by binding the biotinylated oligonucleotides to streptavidin-coated paramagnetic beads. Optionally, the fragments are amplified prior to isolation, for example by ligating one or more oligonucleotide linker adapters to the isolated fragments and amplifying the fragments with primers complementary to said oligonucleotide linkers. Optionally, the methods of the invention further comprise the step of releasing a sub-fragment from the ends of the isolated fragments. A preferred embodiment of the foregoing methods further includes a step where the sub-fragments are self-ligated to form higher-molecular weight concatamers.
Thus, the present invention provides methods for obtaining a library of concatamerized genomic sub-fragments, comprising: (a) treating a sample with a DNA modifying agent; (b) isolating genomic sub-fragments associated with the site of modification; (c) self-ligating said genomic sub-fragments to form concatamers, each concatamer comprising two or more sub-fragments; and (d) creating a library containing the concatamers. The present invention provides methods for detecting genomic regions of co-incidence of positions of genomic sub-fragments, comprising detecting genomic regions of co-incidence of the positions of genomic sub-fragments, e.g., unique genomic sub-fragments, within a collection of sub-fragments that is greater than the co-incidence expected if the sub-fragments were distributed uniformly within the genome. The present invention further provides methods for detecting genomic positions that show increased sensitivity to a DNA modifying agent, comprising detecting genomic regions of co-incidence of the positions of genomic sub-fragments, e.g., unique genomic sub-fragments, within a collection of sub-fragments that is greater than the co-incidence expected if the sub-fragments were distributed uniformly within the genome, wherein the collection of sub-fragments is generated from a sample treated with said modifying agent. Optionally, the methods further entail mapping such sub-fragments to genomic locations of origin.
Accordingly, the present invention provides, inter alia, methods for detecting genomic regions of co-incidence of positions of genomic sub-fragments, comprising: (a) treating a sample with a DNA modifying agent; (b) isolating genomic sub-fragments associated with the site of modification; (c) self-ligating said genomic sub-fragments to form concatamers, each concatamer comprising two or more sub-fragments; (d) creating a library containing the concatamers; (e) sequencing the genomic sub-fragments; and (f) mapping the genomic sub-fragments to unique positions within the genome. In certain embodiments of the foregoing methods for identifying genomic locations, the incidence of two or more sub-fragments occurs within a 50-1000 bp window, for example a 50-, 100-, 150-, 200-, 250-, 300-, 350-, 500-, 750-, or 1000-bp window. In certain embodiments of the foregoing methods and compositions, a library, e.g., a library of concatamerized genomic sub-fragments, contains anywhere from 10 to 1 million members, more preferably from 100 to 100,000 members. In a specific embodiment, a library contains from 1,000 to 10,000 members. In certain embodiments, the library contains at least 100, more preferably at least 250, 500, 1000, 2000, or 5,000 members. In other embodiments, the library contains more than 10,000 members. The library preferably comprises the sub-genomic fragments or concatamers of sub-genomic fragments in a plasmid or phage vector, although any other vehicle suitable for construction of a library may be used. In a preferred embodiment of the foregoing methods, the tag sequences (e.g., of the genomic sub-fragments) are obtained by a high-throughput, more preferably an ultra-high-throughput, sequencing method, for example Massively Parallel Signature Sequencing (MPSS) (see, e.g., Brenner et al, 2000, "Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays." Nat Biotechnol.
18(6):630-4), signature pyrosequencing (see, e.g., Agaton et al, 2002, "Gene expression analysis by signature pyrosequencing." Gene 289(1-2):31-9), polony sequencing (see, e.g., Mitra et al, 2003, "Digital genotyping and haplotyping with polymerase colonies." Proc Natl Acad Sci U S A. 100(10):5926-31) or solid phase sequencing (see, e.g., Hultman et al, 1994, "Solid-phase cloning to create sublibraries suitable for DNA sequencing." J Biotechnol. 35(2-3):229-38). In a highly preferred embodiment, MPSS is employed. The present methods can be utilized to ascertain the effect of an agent or other environmental perturbation on the composition of a concatamerised library. In an exemplary embodiment, the method encompasses (a) obtaining a first concatamerised library from a biological sample unexposed to the agent or perturbation; (b) obtaining a second concatamerised library from a biological sample exposed to the agent or perturbation; and (c) comparing the composition of the first library with that of the second to determine the regulatory sites affected by the agent or perturbation. In certain embodiments, the perturbation occurs before obtaining the sample from a tissue. The perturbation can be, for example, an infection of the eukaryotic organism from a microorganism, loss in immune function of the eukaryotic organism, exposure of the tissue to high temperature, exposure of the tissue to low temperature, cancer of the tissue, cancer of another tissue in the eukaryotic organism, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound, and aging. Alternatively, the perturbation can occur after obtaining the sample from a tissue. In certain embodiments, the perturbation can be exposure of the tissue to high temperature, exposure of the tissue to low temperature, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound, and aging.
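The comparison in step (c) can be illustrated with a minimal sketch. The text does not specify how library compositions are compared, so everything here is an assumption made for illustration: mapped tags are represented as `(chromosome, position)` pairs, sites are approximated by fixed 250bp bins, and the comparison is a plain set difference with no statistical testing or depth normalization. The function names are hypothetical.

```python
def site_bins(mapped_tags, bin_size=250):
    """Collapse (chromosome, position) tags into a set of (chromosome, bin index) keys."""
    return {(chrom, pos // bin_size) for chrom, pos in mapped_tags}

def compare_libraries(unexposed_tags, exposed_tags, bin_size=250):
    """Report bins seen only without exposure, only with exposure, or in both."""
    before = site_bins(unexposed_tags, bin_size)
    after = site_bins(exposed_tags, bin_size)
    return {
        "lost": before - after,     # present only in the unexposed library
        "gained": after - before,   # present only in the exposed library
        "shared": before & after,
    }
```

In practice, absence of a tag from one library may simply reflect sampling depth, so a comparison of this kind would only suggest candidate sites for direct testing.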
The present invention provides computer readable media comprising the genomic locations of genomic sub-fragments associated with a particular treatment of a sample. The present invention provides computer readable media comprising the genomic locations of co-incidences of genomic sub-fragments associated with a particular treatment of a sample. The computer readable medium can optionally contain information relating genomic sub-fragments to a particular disease or disorder, such as cancer, or to a specific cell, e.g., a mammalian cell, a diseased cell, and/or a cell that has been treated with a drug or agent. The present invention yet further provides methods of detecting a disease or disorder in a subject, comprising: (a) creating a computer readable medium of the invention associated with a diseased state; and (b) comparing said computer readable medium with a second computer readable medium of the invention associated with a non-diseased state to identify locations of genomic sub-fragments specific to one state but not the other. The present invention yet further provides methods of qualifying a patient for a clinical trial or therapy, comprising: (a) creating a computer readable medium of the invention associated with a patient treated with an agent; and (b) comparing said computer readable medium with a computer readable medium of the invention associated with a suitable patient treated with the same agent. The methods of the present invention are facilitated by the use of clustering algorithms, referred to herein as an "HSC algorithm" or HSCA. The HSC algorithm works by identifying genomic regions of statistical discrepancy subject to a uniformity assumption. Clusters of tags are identified with respect to the null hypothesis of a uniform distribution of tag library cut sites across the genome. The salient features of the algorithms of the present invention are described below.
Binomial Model The assumed statistical model is a simple binomial model in which each DNAse cut is made independently. The probability (p) of a cut falling in a given hypersensitive site is computed as the ratio of the expected size of a hypersensitive site to the size of the entire genome. For a single cut this is a very small probability (p = 0.000000078125). This probability is used to compute the expected mean number (μ) of hits in an HS region and its standard deviation (σ). These quantities are given by the following expressions, where L is the tag library size and G is the size of the genome:
[Equation image imgf000033_0001 (expressions for μ and σ) is not reproduced in this text.]
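For orientation, the standard binomial moments implied by the model described above are given here. This is a reconstruction from the surrounding definitions, not a transcription of the original equation image, and the symbol w for the expected size of a hypersensitive site is an assumption introduced for this sketch.

```latex
p = \frac{w}{G}, \qquad
\mu = L\,p = \frac{L\,w}{G}, \qquad
\sigma = \sqrt{L\,p\,(1-p)}
```

The stated value p = 0.000000078125 is consistent with, for example, w = 250 bp and G = 3.2 x 10^9 bp, since 250 / (3.2 x 10^9) = 7.8125 x 10^-8; these particular values of w and G are inferred and are not stated in the text.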
The Algorithm
The HSCA works as follows.
1. A tag library, a series of windowed neighborhoods, a set of density weights for those windows, a minimum required density, and a minimum required number of standard deviations for reporting clusters are input.
2. The algorithm considers windows of increasing size over the range of input windows around each library tag, looking for cases where the number of observed clones exceeds the expected mean plus the minimum number of standard deviations.
3. The positions of the clones in these windows are averaged and a measure of their dispersion is computed.
4. A cluster-merging step then takes place in which consecutive, overlapping regions of high density are merged subject to the windowing requirements. The algorithm looks for structure in the ranges from the minimum to the maximum input range, but prefers optimizing to the set of input weighting factors and the larger window sizes.
5. Finally, cluster centers, chromosome, position, and structural information about the average number of clones per window and dispersion are output.
See Section 6.4 for an exemplary algorithm in accordance with the present invention.
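The steps of the HSCA can be sketched in a few lines of Python. This is a simplified illustration and not the exemplary algorithm of Section 6.4: it handles a single chromosome, omits the density weights, and uses hypothetical names and defaults (`hot_spot_clusters`, `min_sd=3.0`, `min_tags=2`) that do not appear in the text.

```python
import math
import statistics

def binomial_moments(library_size, window_bp, genome_bp):
    """Mean and s.d. of the tag count in one window under the uniform null."""
    p = window_bp / genome_bp
    return library_size * p, math.sqrt(library_size * p * (1.0 - p))

def hot_spot_clusters(positions, windows, library_size, genome_bp,
                      min_sd=3.0, min_tags=2):
    positions = sorted(positions)
    hits = []
    for center in positions:
        best = None
        for w in sorted(windows):  # windows of increasing size
            idx = [i for i, x in enumerate(positions) if abs(x - center) <= w / 2]
            mu, sigma = binomial_moments(library_size, w, genome_bp)
            if len(idx) >= min_tags and len(idx) > mu + min_sd * sigma:
                best = idx  # a larger qualifying window overrides a smaller one
        if best:
            hits.append((positions[best[0]], positions[best[-1]], set(best)))
    # Merge consecutive, overlapping high-density regions.
    merged = []
    for start, end, members in sorted(hits, key=lambda h: (h[0], h[1])):
        if merged and start <= merged[-1][1]:
            p_start, p_end, p_members = merged[-1]
            merged[-1] = (p_start, max(p_end, end), p_members | members)
        else:
            merged.append((start, end, members))
    clusters = []
    for start, end, members in merged:
        pts = [positions[i] for i in sorted(members)]
        clusters.append({"start": start, "end": end, "n_tags": len(pts),
                         "center": statistics.mean(pts),
                         "dispersion": statistics.pstdev(pts)})
    return clusters

clusters = hot_spot_clusters([1000, 1050, 1120, 500000, 900000],
                             windows=[50, 100, 250],
                             library_size=100000, genome_bp=3.2e9)
```

With a library of 100,000 tags and a 3.2 Gb genome (both assumed values), the expected count in a 250 bp window is far below one, so even two or three co-located tags exceed the threshold; the three nearby tags in the toy input are reported as a single merged cluster and the two isolated tags are ignored.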
Additional Embodiments of the Algorithms of the invention
The following variations, alone or in combination, may advantageously be applied to the HSCA algorithms of the invention:
a. Incorporate recognition of known HS sites for input data simplification.
b. Remove Ensembl transcripts and replace them with gene names where possible for better genetic marking.
c. Replace binary intron region marking with closest distance to an exon boundary.
d. Distinguish between tags upstream in the probable promoter region and those in the first exons or introns.
e. Distinguish between tags in the first intron and other introns.
f. Add a category for tags inside coding regions.
g. Flag as a separate category the case where a cluster is 3' of gene I but closer to the 5' end of that gene than to the 5' end of gene II downstream.
h. Reduce the significance threshold for non-cluster tags that fall in conserved regions.
i. Recognize mitochondrial and alpha-satellite DNA sequences and subsequently filter those tags.
j. Flag if near an expressed gene.
4. DESCRIPTION OF FIGURES
Figure 1. Distribution of Active Chromatin Sequences parallels genes. Distribution of Active Chromatin Sequences (ACSs) (small vertical bars, top) and genes (Ensembl; HG12) are shown along 33.1Mb of human chromosome 21. Stacking of ACSs and genes is due to compactness of the horizontal axis.
Figure 2. Density of Active Chromatin Sequences peaks at transcription start sites and CpG islands. X-axis: normalized distance (bp) relative to transcription start sites (panels 2a, 2b) or 3' transcription termini (panel 2c) of 16,169 RefSeq genes, or to 28,890 CpG islands (panel 2d). Y-axis: average number of ACSs per 100 bp bin. Centered distances of the ACSs from each genomic feature were computed. To avoid the problem of multiply assigned ACSs, a fractional counting technique was used whereby the number of times an ACS is assigned is recorded. A histogram with equal subdivisions is constructed in which the number of ACSs assigned to each class is scaled by the fractional multiple assignment count. Thus, if an ACS is assigned to two distinct transcription start sites, a value of 1/2 is assigned to each histogram class. Finally, normalizing the classes by the total number of assigned tags gives the average tag density in the class as depicted in the diagrams. Peaks in ACS density at transcription start sites and at CpG islands are evident, whereas no peak is found at transcription 3' termini. The peak at CpG islands remains even when non-promoter-associated CpGs are considered (panel 2d).
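The fractional counting technique described in this legend can be sketched as follows. The function name, the window and bin parameters, and the decision to ignore gene strand are assumptions made for illustration only.

```python
from collections import defaultdict

def fractional_histogram(acs_positions, feature_positions,
                         half_window=5000, bin_size=100):
    """Average ACS density per distance bin, using fractional counting.

    An ACS lying within `half_window` of n features contributes 1/n to the
    bin of each of its n offsets, so every assigned ACS has total weight 1.
    """
    weights = defaultdict(float)
    assigned = 0
    for acs in acs_positions:
        offsets = [acs - f for f in feature_positions
                   if abs(acs - f) < half_window]
        if not offsets:
            continue
        assigned += 1
        for off in offsets:
            weights[(off // bin_size) * bin_size] += 1.0 / len(offsets)
    if not assigned:
        return {}
    # Normalize by the total number of assigned ACSs.
    return {b: w / assigned for b, w in sorted(weights.items())}

hist = fractional_histogram([1000, 5000], [950, 1200, 4990],
                            half_window=500, bin_size=100)
```

In this toy input the ACS at position 1000 lies near two features and contributes 1/2 to each of two bins, while the ACS at 5000 lies near one feature and contributes a full count, so the normalized bin values sum to one.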
Figure 3. Gene expression as a determinant of ACS distribution. The expression status of RefSeq genes was determined by cDNA microarray analysis of K562 cDNA. Genes were categorized according to whether or not they were expressed, and the average density of tags within a 10 kb window around the transcriptional start sites was compared. ACSs show a preference for expressed genes. However, a prominent peak in ACS density is still evident at non-expressed genes, suggesting that many of these lie within open chromatin domains.
Figure 4. ACS clusters provide more powerful discrimination. We identified 3,293 ACS clusters comprising 2-10 ACSs distributed within a 250 bp window. ACS clusters are better predictors of DNaseI hypersensitivity (see text) and show more prominent aggregation around known or suspected functional genomic landmarks including TSSs (panel 4a), CpG islands (not shown), and evolutionarily-conserved non-genic sequences (panel 4b). Relative densities were calculated as described in Fig 2.
Figure 5. Percentage of ACSs and ACS clusters within +/- 5 kb of TSSs. 22.4% of ACSs and 41.6% of ACS clusters are contained within a 10 kb interval centered on the TSS. Since the prevalence of DNaseI hypersensitive sites appears to parallel the distribution of ACSs and, more closely, ACS clusters, this interval is predicted to contain ~60% of all DNaseI HSs.
Figure 6 illustrates the approach to forming concatamerised tag libraries.
5. DETAILED DESCRIPTION OF INVENTION
5.1 Generation of Concatamer Libraries Processes of embodiments of the invention to generate concatamer fragment libraries (see Figure 6) generally start with isolation of intact nuclei from cells, followed by treatment of the nuclei with an agent capable of modifying chromatin at ACSs. DNA is recovered and sequence fragments containing ACSs are isolated. The isolated fragments are then sub-cloned into cloning vectors. A representative overall process may be divided into the following three stages: (I) Preparation of DNA which contains one or more single-stranded or double-stranded modification sites within domains defined by ACSs; (II) Isolation of short segments of DNA fragments associated with ACSs (typically 16 to 21 bp, referred to as tags); and (III) Ordered cloning of concatamerised tags isolated in (II) to create a library representing the ACSs found within the DNA source employed in (I). Each of these stages may be carried out in a variety of ways and has utility for a number of uses, as will be appreciated by a skilled artisan. Forming concatamer libraries is a powerful strategy that allows the concatamerization and cloning of sub-fragments from clones within a library. Most importantly, the discovery allows the cloning and concatamerization of far smaller fragments than are usually manipulated and isolated in previously known library production methods. Protocols used in the strategy allow clear identification of small fragments as independent entities. Generally, in this context, a concatamerized clone is sequenced over a length sufficient to allow identification and placement of its sub-fragments by comparison to a relevant database pertaining to the source material, such as a human genome database. Overview The concatamer approach expedites the sequencing and/or analysis of libraries or similar collections of nucleic acid fragments. These generic protocols isolate and combine smaller fragments from several parental clones or subsets of DNA.
The DNA may, for example, be prepared as an enriched fraction of genomic sequences in a manner that clearly identifies each fragment. For example, fragments may be identified by size or by combination with an artificially introduced marker sequence, which is referred to herein as a 'Tag'. Sequencing of such concatamer clones is more efficient because many Tags may be read in a single sequencing reaction. The length of sequence that must be read to locate a Tag in a database depends on the nature of the database to which the Tag is being mapped (a genomic or EST database, for example). It has been calculated that for the human genome the sequence length can be as short as 16 nucleotides. Hence, in this scenario, a single sequencing reaction can map approximately 30 locations in the genome. This ability to read multiple Tags simultaneously provides substantial savings in time, resources and money for the sequencing effort.
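The two figures quoted above (16 nucleotides and approximately 30 locations per reaction) can be checked with simple arithmetic. The sketch below assumes a genome of 3.2 x 10^9 bp and a usable read length of 500 bases; neither value is stated in the text, and the function names are illustrative.

```python
def min_tag_length(genome_bp):
    """Smallest k for which the number of possible k-mers (4**k) exceeds
    the number of positions in the genome."""
    k = 1
    while 4 ** k <= genome_bp:
        k += 1
    return k

def tags_per_read(read_length, tag_length, spacer=0):
    """Number of whole tags (plus optional spacer bases) that fit in one read."""
    return read_length // (tag_length + spacer)

k = min_tag_length(3_200_000_000)   # 4**15 is about 1.07e9, 4**16 about 4.29e9
n = tags_per_read(500, k)
```

Under these assumptions k is 16 and n is 31. The counting argument gives only a lower bound on tag length: because genomes are repetitive, a 16-nucleotide tag is not guaranteed to be unique, which is why mapped tags are subsequently filtered for unique occurrence.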
5.1.1 Preparation of DNA DNA may be derived from any eukaryotic cell population including animal cells, plant cells, virus-infected cells, immortalized cell lines, cultured primary tissues such as mouse or human fibroblasts, stem cells, embryonic cells, diseased cells such as cancerous cells, transformed or untransformed cells, fresh primary tissues such as mouse fetal liver, or extracts or combinations thereof. Chromatin may also be obtained from natural or recombinant artificial chromosomes and the like. Still further, the DNA also may be assembled into chromatin in vitro using previously sub-cloned large genomic fragments or human or yeast artificial chromosomes. Sample preparation often begins with chromatin from cellular material. Preferably the chromatin is extracted from a eukaryotic cell population such as a population of animal cells, plant cells, virus-infected cells, immortalized cell lines, cultured primary tissues such as mouse or human fibroblasts, stem cells, embryonic cells, diseased cells such as cancerous cells, transformed or untransformed cells, fresh primary tissues such as mouse fetal liver, or extracts or combinations thereof. Chromatin may also be obtained from natural or recombinant artificial chromosomes. For example, the chromatin may have been assembled in vitro using previously sub-cloned large genomic fragments or human or yeast artificial chromosomes. In many embodiments multiple ACS sequences and/or location sites are obtained from a eukaryotic cell sample by first extracting and purifying nuclei from the sample as, for example, described in U.S. No. 09/432,576. Briefly, a sample is treated to yield preferably between about 1,000,000 and 1,000,000,000 separated cells. The cells are washed and nuclei removed by, for example, NP-40 detergent treatment followed by pelleting of nuclei.
After the DNA is obtained, ACSs are labeled: an agent that preferentially reacts with genomic DNA at ACSs is added and marks the DNA, typically by cutting or binding to the DNA. This alteration often will involve breaking or making a covalent bond within specific ACSs. For example, a nuclease may mark by cutting the ACS. In a preferred embodiment, a non-specific nuclease such as DNAse I cuts DNA at ACSs to produce DNaseI ACSs. In a particularly advantageous embodiment, DNAse I is used to form two single-strand breaks near each other, typically within 5 bases of each other. After reaction with hypersensitive DNA sites, the reacted DNA is converted into smaller fragments, if it is not already in that form, and the reacted fragments optionally are amplified and separated into a library. Other agents and methods that may be used to mark eukaryotic DNAs at ACSs include, for example, radiation such as ultraviolet radiation, chemical agents such as chemotherapeutic compounds that covalently bind to DNA or become bound after irradiation with ultraviolet radiation, other clastogens such as methyl methane sulphonate, ethyl methane sulphonate, ethyl nitrosourea, Mitomycin C, and Bleomycin, enzymes such as specific endonucleases, non-specific endonucleases, topoisomerases, topoisomerase II, single-stranded DNA-specific nucleases such as S1 or P1 nuclease, restriction endonucleases, EcoRI, NlaIII, Hsp92I, StyI, methylases, histone acetylases, histone deacetylases, and any combination thereof. As will be appreciated by skilled artisans, clastogens may be used to break DNA and the broken ends tagged and separated by a variety of techniques. Compounds that covalently attach to DNA are particularly useful in forms conjugated to other moieties, such as biotin, that are easily removable from solution via binding reactions.
The field of antibody or antibody fragment technology has advanced such that antibody-antigen binding reactions may form the basis of removing labeled, nicked or cut DNA from an ACS.
5.1.2 Isolation of genomic tags A particular aspect of the invention is shown in Figure 6A. Genomic DNA is isolated from DNaseI-treated nuclei and the repaired ends are ligated to a biotinylated adaptor containing recognition sites for NotI and for BsgI, a type IIs restriction enzyme which cuts a fixed distance downstream of its recognition site. Digestion with BsgI generates fragments of uniform size. Figure 6A shows that the fragments are captured on beads whilst the remaining genomic fragments are washed away. The DNA can be recovered from the beads by digestion with NotI and these fragments ligated together to form concatamers. Figure 6B shows that size selection of the concatamers produces the reagent which is cloned to make the library. Information loss is reduced: because the fragments are of similar size, there is no bias in the cloning step. Also, because each insert contains many smaller fragments, any effects of under-representation due to inefficient growth of the plasmid in bacteria will be dissipated. Such small fragments, in this case of 16 nucleotides, can be mapped uniquely onto the human genomic database. This embodiment of the invention benefits from two insights. One, access to databases containing information from completed or near-completed sequencing projects, such as the yeast, Drosophila, mouse, human, and various viral and prokaryotic genomes, provides unequivocal placement of small stretches of sequence by simultaneous comparison to the entire dataset. Two, the availability of such detailed databases allows confident identification of sequences as short as 16 nucleotides, or even 14 nucleotides, or surprisingly even 12 nucleotides, at unique positions in the respective genomes. Based on these principles, novel protocols were discovered that allow cloning and post-sequencing identification of these small fragments.
More specifically, according to an embodiment of the invention, recognition sites for type IIs restriction enzymes are introduced, for example by ligating adapters to a target sequence or by cloning a fragment into a designed vector that contains such sites immediately adjacent to the cloning site. Enzymes such as BsgI or MmeI, which cut 16 and 20 nucleotides downstream of their recognition sites, respectively, are particularly useful, and allow isolation of stretches of novel sequence adjacent to common sequence derived from the linker or vector. Variations of this method are contemplated and included within the ambit of this embodiment of the invention. In one representative embodiment, physical or enzymatic fractionation of the DNA, which has previously been cut with a restriction enzyme, is used to produce more heterogeneous fragments. These fragments, when concatamerized and sequenced, are recognizable due to the presence of the known restriction site. In yet another embodiment, restriction fragments of known length are isolated by hybridization to a set of oligonucleotides, all of which contain the restriction site followed by the desired length of random bases. Duplexes between the oligonucleotide and the target fragment will be insensitive to digestion by single-strand-specific nucleases. This treatment generates a population of restriction fragments of the same length.
5.1.2.1 Creating Concatamer Libraries from Fragmented DNA Concatamers may be formed from cut fragments by ligation of linkers into the breakpoints. It was discovered that such cloning of breakages in genomic DNA can be very informative for studying the cutting or shearing patterns of enzymes (e.g. DNaseI and S1 nuclease, which are useful probes for chromatin structure), chemical agents (such as medically important intercalators or molecules which show high specificity for certain DNA structures), physical agents (such as UV irradiation or shearing) and natural processes (such as apoptosis). After cutting, linkers are ligated onto the breakpoints. In many embodiments the site of the breakpoint is repaired by treatment with T4 DNA polymerase, an enzyme capable of converting both 5' and 3' overhangs into blunt ends. A linker is constructed by annealing equimolar amounts of synthetic oligonucleotides designed to contain a recognition site for a type IIs restriction enzyme, such as BsgI, positioned so that following ligation it is adjacent to the repaired breakpoint, as well as other restriction sites and a biotin molecule to allow separation. An example of useful oligonucleotides is: 5'-GGC TCT CAT GAT TAT GTG CAG-3' (SEQ ID NO 1), and 5'-CTG CAC ATA ATC ATG AGA GCC-Biotin-3' (SEQ ID NO 2). Alternatively, following T4 DNA polymerase treatment, the blunted ends may be A-tailed by the action of Taq polymerase in the presence of dATP to create an end with an overhang that facilitates ligation of the linker, which in this case is designed to incorporate the complementary overhang. In one embodiment a ligase is used. However, the linker attachment is not necessarily formed by a ligase. Alternative methods include, for example, the use of commercially available custom oligonucleotides with a stalled topoisomerase II molecule attached, which effects a joining reaction. DNA bearing the biotinylated linker can now be separated, for example on a solid phase.
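The design of the example linker can be checked computationally. The short sketch below confirms that SEQ ID NO 1 and SEQ ID NO 2 are exact reverse complements, so that they anneal into a blunt duplex, and that the linker carries a BsgI recognition sequence at the end that abuts the breakpoint. The recognition sequence GTGCAG is taken from general knowledge of BsgI rather than from the text, and the variable names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

SEQ_ID_1 = "GGCTCTCATGATTATGTGCAG"   # 5'->3', spaces removed
SEQ_ID_2 = "CTGCACATAATCATGAGAGCC"   # 5'->3', carries the 3' biotin
BSGI_SITE = "GTGCAG"                 # assumed BsgI recognition sequence

oligos_anneal = reverse_complement(SEQ_ID_1) == SEQ_ID_2
site_abuts_breakpoint = SEQ_ID_1.endswith(BSGI_SITE)
```

Because the recognition site sits at the very 3' end of SEQ ID NO 1, a cut a fixed distance downstream falls within the adjoining genomic DNA, which is what allows a tag of defined length to be released.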
For example, the linker may be captured onto streptavidin-coated paramagnetic beads (such as those supplied by Dynal, Norway). Commonly, if the starting material was genomic DNA, that material also would have been digested with a restriction enzyme. It was found experimentally that such digestion, typically with NlaIII, can reduce the average fragment size, facilitate enzymatic reactions, and decrease steric hindrance that otherwise inhibits the capture step on the beads. Accordingly, one embodiment of the invention is the use of NlaIII digestion to improve solid phase capture in these reactions as well as in other reactions not limited to those described herein. To remove linker that has not been incorporated into the DNA, it can be necessary to clean the reaction on commercially available columns (such as DNeasy from Qiagen, CA) prior to capture and/or to treat with exonuclease III. The latter degrades DNA in a 3' to 5' direction but will not cut sites which are blocked by the addition of biotin or have a four-nucleotide 3' overhang, such as that created by NlaIII. The prepared DNA optionally is treated with a restriction enzyme prior to adding the first linker. When this occurs, a second linker can be added to the end created by the digestion. For example, after NlaIII digestion, a linker made from the following oligonucleotides can be added: 5'-GCG TAC TCC GAC TCG CTA TAG ATC ATG-3' (SEQ ID NO 3), and 5'-ATC TAT AGC GAG TCG GAG TAC GC-3' (SEQ ID NO 4). This step creates a PCR competent molecule that can be amplified. Amplification can be necessary if the amount of starting material is limiting or if the cutting event within a large DNA population is very rare. PCR amplified material then can either be cloned into commercially available vectors designed to capture PCR fragments or, if one of the primers used in the PCR reaction contained a 5' biotin, be captured directly onto beads.
Other binding partners and separation methodologies can be used as well, as will be appreciated by a skilled artisan. The cloned PCR products can be further processed as described, for example, below in the section entitled 'Deriving concatamer libraries from single insert libraries'. The captured fragments subsequently can be treated with BsgI and the solid phase washed, leaving the biotinylated linker and the genomic fragments on the solid phase (beads in this example). A linker can be added onto the site cut by BsgI, which leaves a two-nucleotide 3' overhang (an example of such a linker would be one created by the annealing of 5'-Phosphate-GCA TGC ATG GGA CTG GAA TTC CGT-3' (SEQ ID NO 5), and 5'-ACG GAA TTC CAG TCC CAT GCA TGC NN-3' (SEQ ID NO 6)). Following denaturation of the duplex captured on the beads, PCR amplification can be performed on either the supernatant or the recovered beads to amplify the desired genomic fragment surrounded by linker DNA. Typically we use the following primers for PCR: 5'-GGC TCT CAT GAT TAT GTG CAG-3' (SEQ ID NO 7), and 5'-ACG GAA TTC CAG TCC CAT GCA TGC-3' (SEQ ID NO 16). In the case of amplification using the supernatant as a template, the first primer can contain a biotin in the 5' position. This gives a biotinylated PCR product which can be captured onto beads and sequentially digested (and washed to remove digestion products) with SphI then NlaIII. The final supernatant is retained and the fragment precipitated; this DNA represents the Tag and is an NlaIII fragment which can be concatamerized, for example by treatment with T4 DNA ligase, and subcloned into the SphI site of the cloning vector pGEM5z (Promega, WI) to form a concatamerized library. The PCR product generated from the beads is subcloned into a PCR cloning vector and bacterially amplified before release of the Tag by digestion of the plasmid DNA with NlaIII. The product is gel purified, concatamerized and cloned into the SphI site of the cloning vector pGEM5z, in this example.
Alternatively, the captured fragments (before ligation of a second linker but after BsgI digestion) can be denatured and the DNA released into the supernatant treated with Tsc ligase in the presence of oligonucleotides designed to introduce a second priming site to the single-stranded DNA (for example the oligonucleotides 5'-GCA TGC ATG GGA CTG GAA TTC CGT-3' (SEQ ID NO 8), and 5'-CAG TCC CAT TGC ATG CNN NN-3' (SEQ ID NO 9), with the ligation performed for 30 cycles with an annealing and ligation temperature of 40°C and an intervening melting step at 95°C). The resulting product is a PCR competent molecule that can be treated as above for the supernatant following denaturation of the second ligation reaction. Concatamerising Sequences Associated with Restriction Sites The previous embodiment captured genomic sequence adjacent to a breakpoint. The position of the break may be unknown. This embodiment uses restriction sites having known positions in the context of sequenced genomes. The utilities of forming concatamers from these sites include mapping deletions or replications within the genomes of tissue culture cells and mapping the restriction fragments associated with the introduced breakpoints. Mapping Copy Number Differences in Tissue Culture Cells The genomic DNA is digested with rare cutting enzymes to generate a low resolution map, or with frequent cutters to deliver higher resolution. It should be noted that the use of methylation-sensitive restriction enzymes would generate information about the epigenetic status of the genome. This embodiment of the invention may be carried out in alternative ways.
One advantageous way is to attach a biotinylated linker containing a restriction site for a type IIs enzyme, such as BsgI, with a complementary site for the restriction enzyme used (an example of the sequences of the oligonucleotides used in combination with an NlaIII digest is 5'-GCG TAC TCC GAC TCG CTA TAG ATC ATG-3' (SEQ ID NO 10) and 5'-Phosphate-ATC TAT AGC GAG TCG GAG TAC GC-3' (SEQ ID NO 11)). A BsgI digestion is used if appropriate and the product captured on a solid phase such as paramagnetic beads. The concatamer libraries then can be formed by one of the approaches described above. Physical shearing of the restricted DNA, for example by sonication, can be used to generate a population of molecules with a common recognition sequence and a small average size, the standard deviation of which can be greatly reduced by size fractionation. The fragments are then repaired by treatment with T4 DNA polymerase to create blunt molecules which then, for example, may be A-tailed and cloned into a PCR cloning vector, such as pGEM-T Easy (Promega, WI). After bacterial amplification the fragments can be released by digestion with EcoRI and the purified products concatamerized and cloned into the EcoRI site of a second cloning vector. An alternative to the step of cloning the Tags (with or without PCR amplification) and bacterial expansion has been devised in order to counter bias introduced in the amplification steps. The Tag is made single-stranded and has a second priming site attached to both the 3' and 5' ends via a Tsc reaction (using the following set of primers, for example: 5'-Phosphate-TAT GCG GCC GCT TAG TAC-3' (SEQ ID NO 12); 5'-NNN NGT ACT AAG-3' (SEQ ID NO 17); 5'-CCG CAT ANN NN-3' (SEQ ID NO 13); and performing the ligation for 30 cycles with an annealing and ligation temperature of 30°C and an intervening melting step at 95°C).
Following this reaction and treatment with exonuclease I, the product forms a template for Rolling Circle Amplification (due to the presence of single-stranded circles), which can be performed with an oligonucleotide complementary to the NotI site (5'-GCG GCC GC-3'; SEQ ID NO 14) in the presence of Bst polymerase (NEB, NE; performed for 20 hours at 60°C). The resulting product can then be digested with NotI to generate a Tag molecule with complementary ends which can be used to form a concatamer library. A third alternative is to use a hybridization approach, that is, to digest the genomic DNA with an enzyme such as NotI, denature the DNA and hybridize with a 5'-biotinylated PNA molecule containing the recognition site for the enzyme at its 5' end followed by a number of random bases (up to 16). After annealing the PNA to the DNA to preferentially form PNA:DNA hybrids, the remaining single-stranded DNA can be digested by the action of a single-strand-specific nuclease. The PNA:DNA hybrids can be captured on beads and the DNA strand recovered by denaturation. A NotI oligonucleotide (5'-GCG GCC GC-3'; SEQ ID NO 15) then can convert the single-stranded DNA into double-stranded material in the presence of Taq polymerase and dNTPs, and the Tag can be cloned individually and processed further as discussed above, or concatamerized as blunt molecules before cloning to form a concatamer library. Indirectly Mapping Breakpoints The site of a repaired breakpoint is first labeled enzymatically with biotin, using either terminal transferase or an exchange reaction with T4 polynucleotide kinase in the presence of a modified donor nucleotide. The resultant reaction product is cleaned to remove the labeling activity entirely and then digested with a restriction enzyme. Capture of these products on a solid phase such as beads will isolate those breakpoints which have been successfully labeled.
A ligation reaction can be performed as above; a non-biotinylated linker with the appropriate restriction sites can be attached to the site exposed by the digest. In this iteration the linker contains an SphI site which can be cut to expose an NlaIII site at the 3' end; a BsgI digest will then release a Tag molecule which can be precipitated. At this stage the Tag can be blunted, A-tailed and cloned into a PCR vector, bacterially amplified and then cut out of the plasmid by appropriate restriction enzymes and gel purified to generate a Tag with compatible ends which can then be concatamerized and cloned. Alternatively, the Tag molecules can be ligated together directly after being recovered from the beads to form DiTags (a series of Tags ligated head-to-head on a concatamer molecule with either NlaIII or BsgI sites at its termini). Such DiTag products either can be digested (with SphI) to release single DiTags, which can be cloned into the SphI site of a vector, or the whole concatamer molecule can be cloned (following modification to blunt and A-tail it). A second alternative is to ligate a second linker to the site exposed by the BsgI digestion and process as discussed above. Deriving concatamer libraries from single insert libraries In embodiments of the invention, either the cloning vector has been altered to contain, for example, a BsgI site adjacent to the cloning site, or the insert of interest has a site introduced by attachment of an appropriate linker. Following bacterial amplification the Tag can be generated by digestion with BsgI and a second unique enzyme in the polylinker 5' of the BsgI site. This Tag can be gel purified. Most commonly, concatamer libraries are made by the formation of DiTags as described above.
5.1.3 Concatamerisation and cloning of tags Tags that are prepared as described above are concatamerized either in double-stranded form, with standard DNA ligases, or in single-stranded form, using for example the thermostable Tsc DNA ligase. The Tags themselves may be ligated to each other, to form DiTags, before concatamerization, in a process that introduces control steps for any bias in the representation of Tags in the population as well as for errors in the sequencing reactions. Various other methods are also alluded to in Section 5.1.2. An Example is given in Section 6.3.
5.2 Bioinformatic recovery of tag sequences and unique mapping Genomic tags were bioinformatically filtered to select those which occurred uniquely within the human genome. For sequences exceeding sixteen nucleotides in length it was found that approximately one half of the tags mapped uniquely to the human genome.
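The uniqueness filter can be illustrated on a toy scale. The sketch below searches both strands of a short reference string and keeps only tags that occur exactly once; it is an illustration of the principle, not the Merbase system referred to in Section 5.3, and a genome-scale implementation would use an index rather than repeated string searches. All names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def find_all(text, pattern):
    """Start indices of all (possibly overlapping) occurrences of pattern."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits

def uniquely_mapping_tags(tags, genome):
    """Map each tag that occurs exactly once (counting both strands) to
    (forward-strand start, strand). Other tags are dropped."""
    rc_genome = reverse_complement(genome)
    result = {}
    for tag in tags:
        fwd = find_all(genome, tag)
        rev = find_all(rc_genome, tag)
        if len(fwd) + len(rev) != 1:
            continue
        if fwd:
            result[tag] = (fwd[0], "+")
        else:
            # Convert a reverse-strand hit to forward-strand coordinates.
            result[tag] = (len(genome) - rev[0] - len(tag), "-")
    return result

toy_genome = "GATTACAGATTACACCGTA"
mapped = uniquely_mapping_tags(["GATTACA", "CCGTA", "TACGG", "TTTTT"], toy_genome)
```

In the toy reference, the repeated tag and the absent tag are discarded, while the two remaining tags map to the same locus on opposite strands. Note that a tag equal to its own reverse complement would be counted on both strands and therefore excluded by this simple rule.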
5.3 Statistical analysis of the distribution of tags The computational tools used for determining clusters are a combination of bioinformatics, statistical analysis and simulation. In developing this analysis we recognized essentially three distinct sources of information: (1) unsupervised data clustering or pileups of library tags, i.e. structural information; (2) supervised clustering of data near known or hypothesized genomic features; and (3) correlations with other data, e.g. direct observation of HS scores from locus profiles, microarray data, and cross-library comparison. For initial analysis of the data, a suite of tools has been developed in C/C++ that concentrates on detecting statistical anomalies in the data. The most sophisticated of these is a data clustering and feature identification algorithm called the Hot Spot Cluster Algorithm (HSCA), which is described below (also see Example 6.4). The HSCA takes as input mapped and localized DNA sequence tags extracted from concatamer-prepared libraries. These tags are filtered for unique mappability under the Merbase genomic localization system (Hawrylycz et al, manuscript in preparation). Their start, stop, and orientation in the genome are known. At this point a number of genomic markers and features of interest are obtained. In the current version of the algorithm these include
1. Minimum number of standard deviations to report deviation from.
2. Distance to closest exon 1, acting as a proxy for the promoter.
3. Distance to closest CpG island.
4. Distance to closest UniGene.
5. Flag if in a region of Tetraodon conservation.
6. Score for Human-Mouse conservation.
7. Flag for region of segmental duplications.
8. Distance to closest TATA box if within pre-specified neighborhood.
9. Distance to closest initiator sequence if within pre-specified neighborhood.
10. DNA sequence in the region of the tag.
The data sources for these markers are downloaded from the UCSC Genome Browser site http://genome.ucsc.edu/downloads.html build hg12, June 2002. Other correlations could be considered, and these would include but not be limited to proximity to, or co-incidence with, the following features, as well as general methods to improve the correlation of the data:
1. Proximal Promoter
2. Distal Promoter
3. Intronic
4. Proximal Intergenic
5. Distal Intergenic
6. Exonic
7. First Intron
8. 3' End (3 kb from end of transcript)
9. Mouse conserved outside exons (the conservation level, i.e. bp/identity, should be stated precisely)
10. Tetraodon conserved outside exons
11. Transcription factor binding sites
12. Incorporate recognition of known HS sites for input data simplification.
13. Remove Ensembl transcripts and replace with gene names where possible for better genetic marking.
14. Replace binary intron region marking with closest distance to an exon boundary.
15. Distinguish between tags upstream in the probable promoter region and those in the first exons or introns.
16. Distinguish between tags in the first intron and other introns.
17. Add a category for tags inside coding regions.
18. Flag a separate category for the case where a cluster is 3' of gene I but closer to the 5' of that gene than to the 5' of gene II downstream.
19. Reduce the significance threshold for non-cluster tags that fall in conserved regions.
20. Recognition of mitochondrial and alpha-satellite DNA sequences and subsequent filtering of those tags.
21. Flag if near an expressed gene.
22. CpG Island with designable primers
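Several of the markers used by the current version are distances from a tag to the closest feature of a given type (exon 1, CpG island, TATA box). A minimal sketch of such a lookup follows, assuming the feature coordinates for one chromosome are already available as a sorted list of integers; the function name `distance_to_closest` is hypothetical and is not taken from the tools described here.

```python
import bisect

def distance_to_closest(position, sorted_features):
    """Distance in bp from `position` to the nearest feature coordinate.

    Assumption: `sorted_features` is an ascending list of single coordinates
    on the same chromosome as the tag. Returns None if the list is empty.
    """
    i = bisect.bisect_left(sorted_features, position)
    # Only the neighbours immediately left and right of the insertion
    # point can be the closest feature.
    candidates = sorted_features[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    return min(abs(position - f) for f in candidates)
```

A binary search keeps each lookup logarithmic in the number of features, which matters when every tag in a library is annotated against several feature tracks.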
5.3.1 Input Preprocessing Although the tags input to HSCA are filtered so as to be uniquely mappable onto the genome, there may be duplicates as a result of PCR amplification, multiple hits on opposite strands of the DNA, and other peculiarities of the genome or operations. In the current version of HSCA, duplicate tags are removed prior to clustering, and a small neighborhood around the tag is taken (50 bp) in which a check for repeat content in the genome is made. The standard for this latter step is RepeatMasker (), and so any neighborhood of the tag containing a lower-case, repeat-masked nucleotide causes the tag to be removed prior to clustering. The combined effect of these steps has been shown to reduce the input library size by approximately one third.
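The two preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the HSCA code: the tag tuple layout, the function name `preprocess`, the choice to treat hits on opposite strands at the same coordinates as duplicates, and the reading of "50 bp" as a flank on each side of the tag are all assumptions.

```python
def preprocess(tags, genome, flank=50):
    """Remove duplicate tags and tags whose neighbourhood is repeat-masked.

    tags   : iterable of (start, stop, strand) on one chromosome
    genome : soft-masked sequence for that chromosome, in which lower-case
             letters mark repeat-masked nucleotides
    flank  : assumed neighbourhood, in bp, checked on each side of the tag
    """
    seen = set()
    kept = []
    for start, stop, strand in sorted(tags):
        key = (start, stop)          # assumption: strand is ignored for duplicates
        if key in seen:
            continue
        seen.add(key)
        lo = max(0, start - flank)
        hi = min(len(genome), stop + flank)
        if any(base.islower() for base in genome[lo:hi]):
            continue                 # neighbourhood touches a masked repeat
        kept.append((start, stop, strand))
    return kept
```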
5.3.2 HSCA The HSC algorithm works by identifying regions of statistical discrepancy subject to a uniformity assumption. Clusters are therefore identified with respect to the null hypothesis of a uniform distribution of tag library cut sites across the genome.
5.3.2.1 Binomial Model The assumed statistical model is a simple binomial model where each DNAse cut is made independently. The probability (p) of a cut is computed as the ratio of the expected size of a hypersensitive site to the size of the entire genome. This is a very small probability (p=0.000000078125) of a single cut. This probability is used to compute an expected mean number (μ) of hits in an HS region and its standard deviation (σ). These numbers are respectively, where L is the tag library size and G is the size of the genome:
$$p = \frac{HS}{G}, \qquad \mu = p \cdot L, \qquad \sigma = \sqrt{p\,(1 - p)\,L}$$
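As a worked illustration of the binomial model, the sketch below computes p, μ and σ from an HS size, a genome size and a library size. The values 250 bp and 3.2 × 10^9 bp are assumptions chosen only because their ratio reproduces the quoted p = 0.000000078125; the library size of 100,000 tags is purely hypothetical, as is the function name.

```python
import math

def binomial_expectation(hs_size, genome_size, library_size):
    """Return (p, mu, sigma) for the binomial cut model B(p, L)."""
    p = hs_size / genome_size                        # probability of a single cut
    mu = p * library_size                            # expected hits per HS-sized region
    sigma = math.sqrt(p * (1 - p) * library_size)    # standard deviation of that count
    return p, mu, sigma

# Assumed inputs: 250 bp HS size, 3.2e9 bp genome, hypothetical 100,000-tag library.
p, mu, sigma = binomial_expectation(250, 3.2e9, 100_000)
threshold = mu + 5 * sigma   # example reporting threshold with gamma = 5 (assumed)
```

Because μ is far below one for libraries of this size, even two or three tags falling in the same HS-sized region lie many standard deviations above expectation.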
5.3.2.2 The Algorithm The HSCA works intuitively as follows. The inputs are a tag library, a series of windowed neighbourhoods, a set of density weights for those windows, a minimum required density, and a minimum number of standard deviations required to report a cluster. The algorithm considers windows of increasing size over the range of input windows around each library tag, looking for cases where the number of observed clones exceeds the expected mean plus the minimum number of standard deviations. The positions of the clones in these windows are averaged and a measure of their dispersion is computed. Once this phase is complete, a cluster-merging step takes place in which consecutive, overlapping regions of high density are merged subject to the windowing requirements. The algorithm looks for structure in the range from the minimum to the maximum input window, but preferentially optimizes toward the input weighting factors and the larger window sizes. Cluster centers, chromosome, position, and structural information about the average number of clones per window and dispersion are output at the end. See Example 6.4 for the algorithm details.
6. EXAMPLES
6.1 Preparation of DNA that contains one or more single-stranded or double-stranded cleavage sites within genomic DNA domains that form hypersensitive sites K562 cells are grown to confluence (5 x 10^5 cells per milliliter as assayed by hemocytometer). Nuclei are prepared from a suitable volume (e.g., 100 ml) as described (Reitman et al., MCB 13:3990). Nuclei are then re-suspended at a concentration of 8 OD/ml and digested with 10 μl of 2 U/μl DNaseI [Sigma] at 37°C for 3 min. The DNA is purified by phenol-chloroform extraction and ethanol precipitated. The DNA is repaired in a 100 μl reaction containing 10 μg DNA and 6 U T4 DNA polymerase (New England Biolabs) in the manufacturer's recommended buffer and incubated for 15 min at 37°C and then 15 min at 70°C. 1.5 U Taq polymerase (Roche) is added and the incubation continued at 72°C for a further 10 min. The DNA is then recovered using a Qiagen PCR Clean-up Kit and eluted in 50 μl of 10 mM Tris.HCl, pH 8.0.
6.2 Production of genomic tags 10 μg of clean genomic DNA was precipitated and resuspended in 20 μl of 0.2 x TE buffer (2 mM Tris.HCl, 0.2 mM EDTA, pH 8.0). The DNA was mixed in 100 μl of 1 x Taq DNA polymerase buffer (Roche) supplemented with 200 μM dNTPs, 3 U T4 DNA polymerase (NEB) and 5 U Taq DNA polymerase (Roche) and incubated for 10 min at 37°C followed by a 20 min incubation at 72°C. The DNA was cleaned by use of a Qiagen PCR Purification column and eluted in a volume of 90 μl of Elution buffer. The DNA was incubated overnight at 16°C in 100 μl of 1 x T4 DNA ligase buffer (NEB) containing 10 pmol of the A-Adaptor (formed by the annealing of A-Af 5'Biotin-TEG-CTC TGG CGC GCC GTC CTC TCA CGC GTC CGA CT-3' and A-Ar 5'Phosphate-GTC GGA CGC GTG AGA GGA CGG CGC GCC AGA GC-3') and 400 U T4 DNA ligase (NEB). The Adapted DNA was subsequently digested to completion with MmeI (NEB) before the biotinylated DNA was separated by binding to paramagnetic streptavidin-coated M-270 beads (Dynal). Following washes the beads were resuspended in 30 μl of 1 x T4 DNA ligase buffer containing 20 pmol of NN-Adaptor (formed by the annealing of A-NNf 5'-GAG AGC GGT GCA GAA GGA GAC GTA CGA NN-3' and A-NNr 5'-TCG TAC GTC TCC TTC TGC ACC GCT CTC-3') and incubated overnight with continual rotation at 4°C. The beads are then captured and washed in three changes of 1 x TE buffer supplemented with 50 mM NaCl before finally being resuspended in 100 μl TE. Of this, 2 μl is used as a template in a PCR reaction containing 10 μl water, 0.225 μl 20 pmol/μl PCR-Af (5'-CGC CGT CCT CTC ACG CGT CCG A-3'), 0.225 μl pmol/μl PCR-NNf (5'-GAG AGC GGT GCA GAA GGA GAC GTA CGA-3'), 1.2 μl 25 mM MgCl2 and 1.5 μl 10 x Fast Start SYBR-Green master mix (Roche), which is cycled using a Lightcycler Real-time PCR system (Roche). The machine is used to determine the maximum number of cycles in which amplification is still exponential. Typically this value is between ten and fifteen cycles.
PCR reactions performed with the chosen cycling conditions are separated by 12% PAGE and the 76 bp band purified by excising the band and eluting the tag DNA into 100 μl TE by incubation overnight. Typically eight 100 μl PCR reactions are performed in a mixture containing 0.8 μl tag DNA, 80 μM dNTPs, 40 pmol b-PCR-Af (a 5' biotinylated version of primer PCR-Af), 40 pmol of PCR-NNf and 2.5 U Taq polymerase (Roche) with nine cycles of amplification (94°C for 20 s, 60°C for 20 s and 72°C for 25 s). The correct size band is again excised from the gel. The amplified tags are digested with BsiWI by incubation at 55°C. Complete digestion was generally achieved by adding serial aliquots of the enzyme. The reaction was captured on M-270 Dynal beads, as per the manufacturer's instructions, and the beads finally resuspended in 30 μl of 1 x NEB Buffer 3 supplemented with 20 U MluI and incubated at 37°C with continual rotation for 2 h to cleave the tags from the beads. Following digestion the beads were recaptured and the concentration of the tags in the supernatant assessed with a Picogreen quantitation kit.
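The adaptors in this Example are each described as formed by annealing two oligonucleotides. A quick way to sanity-check such a pair is to find the longest uninterrupted run over which one oligonucleotide base-pairs with the other. The sketch below does this for the A-Adaptor and NN-Adaptor sequences quoted above; the function names are hypothetical, and the check ignores chemistry such as melting temperature or the degenerate N positions, which are simply treated as unpaired.

```python
def revcomp(seq):
    """Reverse complement; characters other than A/C/G/T are left unchanged."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def longest_paired_run(top, bottom):
    """Length of the longest contiguous stretch of `top` that is perfectly
    complementary to `bottom` (longest common substring of `top` and the
    reverse complement of `bottom`)."""
    target = revcomp(bottom)
    best = 0
    prev = [0] * (len(target) + 1)
    for a in top:
        cur = [0]
        for j, b in enumerate(target, 1):
            # Extend the diagonal run only on an unambiguous base match.
            run = prev[j - 1] + 1 if a == b and a in "ACGT" else 0
            cur.append(run)
            best = max(best, run)
        prev = cur
    return best

# Sequences copied from the text above, with spacing removed.
A_AF = "CTC TGG CGC GCC GTC CTC TCA CGC GTC CGA CT".replace(" ", "")
A_AR = "GTC GGA CGC GTG AGA GGA CGG CGC GCC AGA GC".replace(" ", "")
A_NNF = "GAG AGC GGT GCA GAA GGA GAC GTA CGA NN".replace(" ", "")
A_NNR = "TCG TAC GTC TCC TTC TGC ACC GCT CTC".replace(" ", "")

a_run = longest_paired_run(A_AF, A_AR)     # paired length of the A-Adaptor
nn_run = longest_paired_run(A_NNF, A_NNR)  # paired length of the NN-Adaptor
```

For the sequences as quoted, the paired run is one base shorter than the 32-nucleotide A-Af, and it covers all of A-NNr, leaving the two N positions of A-NNf unpaired.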
Example 6.3 Formation and cloning of high molecular weight concatamers A 30 μl ligation reaction was set up with 30 pmol of tags and 0.5 pmol of BsiWI-Adaptor (formed by the annealing of B-Af 5'-GAG TGT GGC GCG CCT TGT AGA C-3' and B-Ar 5'-GTA CGT CTA CAA GGC GCG CCA CAC TC-3') and incubated with 400 U T4 DNA ligase (NEB) overnight at 16°C. 5 μl of the ligation was subsequently used in a 50 μl PCR reaction with 20 pmol B-Af primers, 100 μM dNTPs and 1 U Taq polymerase, which was cycled twenty times with the following conditions: 94°C for 20 s, 60°C for 20 s and 72°C for 1 min. The DNA was precipitated and resuspended in a 25 μl volume of 1 x NEB4 buffer containing 10 U AscI and digested at 37°C for 2 hours before being separated on a 1.5% agarose/TBE gel. All concatamers of tags greater than 500 bp in size were isolated using a Qiagen Gel Extraction kit and eluted in 50 μl EB. 10 μl of the eluted DNA was used in an overnight ligation into pGEM5z cut with MluI.
Example 6.4 The HSC Algorithm We describe the basic algorithm of the HSCA. Assume a binomial model B(p,L) for DNAse cut site distribution, where p is the probability of a single DNAse cut in an HS site, and L is the total library size. Let μ denote the expected number of cuts in a region of HS size in genome G and σ its standard deviation.
Input: Window range (W_L, W_R), a set of density weights for each window d_i ∈ (d_L, d_R), minimum standard deviation threshold γ, and minimum required density δ_min. Sort the library L, removing duplicates and repeat regions.
Output: Cluster centroids in each chromosome, the standard deviations of clones assigned to the clusters and distribution by window size of the average number of points per window contained in the cluster. Note: This last item is not calculated in the pseudo-code displayed.
Step 1. Identify Hot Spots
For a fixed input range w_i ∈ (W_L, W_R) do {
  For each library tag position l_j ∈ L {
    Form the window W_i = (l_j − w_i/2, l_j + w_i/2) of length w_i.
    Let C_j denote the number of library clones in W_i.
    If C_j > μ + γσ {                  // anomaly
      σ_j = (C_j − 1 − μ) / σ          // Observed standard deviation
      δ_j += δ_w                       // Increment density count
      σ̄_j += σ_j δ_w                   // Average standard deviation
      c_j = (1/C_j) Σ_{l_h ∈ W_i} l_h  // The average position of clones falling in W_i
    }
    w_max = w_i
  }
}
σ̄_j /= δ_j                             // Average standard deviation
Step 2. Merge Clusters
Find the minimum l_min ∈ L with δ_j > δ_min and set C_1 = l_min.
For each l_j > l_min do {
  If δ_j > δ_min {
    If |C_{k−1} − c_j| < w_max {       // high density within maximum window
      C_{k−1} += c_j                   // Average the position and standard deviations
    } Else {                           // start a new cluster
      C_{k−1} /= |C_{k−1}|             // Adjust previous cluster centroid
    }
  }
}
After the cluster merge step basic statistics are calculated and the output is readied for print out.
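The two steps above can be sketched in runnable form. This is an interpretation under stated assumptions rather than the HSCA implementation: it works on tag positions from a single chromosome, assumes strictly positive window weights, keeps the centroid from the largest window that triggered an anomaly, and merges consecutive hot spots whose centroid lies within the maximum window of the running cluster mean. The function names `find_hot_spots` and `merge_clusters` are hypothetical.

```python
import bisect

def find_hot_spots(positions, windows, weights, mu, sigma, gamma):
    """Step 1: for each tag, accumulate a weighted density over all window
    sizes in which the local tag count exceeds mu + gamma * sigma.

    Returns {tag position: (density, weighted mean z-score, centroid)}.
    """
    pos = sorted(set(positions))                  # duplicates removed
    hot = {}
    for w, d_w in zip(windows, weights):          # windows in increasing size
        for l in pos:
            lo = bisect.bisect_left(pos, l - w / 2)
            hi = bisect.bisect_right(pos, l + w / 2)
            count = hi - lo                       # tags in the window, including l
            if count > mu + gamma * sigma:        # anomaly
                z = (count - 1 - mu) / sigma      # observed standard deviations
                entry = hot.setdefault(l, [0.0, 0.0, float(l)])
                entry[0] += d_w                   # density count
                entry[1] += z * d_w               # weighted z-score sum
                entry[2] = sum(pos[lo:hi]) / count  # mean position in this window
    return {l: (d, zsum / d, c) for l, (d, zsum, c) in hot.items()}

def merge_clusters(hot, min_density, max_window):
    """Step 2: merge consecutive high-density tags into cluster centroids."""
    clusters = []
    for l in sorted(hot):
        density, _, centre = hot[l]
        if density <= min_density:
            continue
        if clusters and abs(sum(clusters[-1]) / len(clusters[-1]) - centre) < max_window:
            clusters[-1].append(centre)           # extend the current cluster
        else:
            clusters.append([centre])             # start a new cluster
    return [sum(c) / len(c) for c in clusters]
```

With window sizes of 100 and 250 bp and weights 1.0 and 2.0 (all assumed values), a tag that is anomalous in both windows reaches a density of 3.0, so a minimum density between 1.0 and 3.0 selects tags supported at the larger window size.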
6.5 Production of concatamer libraries by Tsc/Bst-mediated amplification
T4 DNA polymerase treatment
DNaseI-digested DNA
+ 10 μl DNA (@ 1 μg/μl)
+ 10 μl 5 x T4 DNA polymerase buffer (Roche)
+ 1 μl 10 mM dNTPs (Roche)
+ 2 μl T4 DNA polymerase (? U/μl)
+ 27 μl water
Incubate at 37°C for 15 min
Followed by 75°C for 15 min
Phenol-chloroform extract and precipitate
Resuspend in 10 μl water
Ligation of PBNTl linker
Prepare PBNTl linker at a concentration of 50 pmol/μl (consisting of PNBTl 5'Phosphate-GGC TGG CGG CCG CAT TAT AGT GCA G-3' and BNTBl 5'-CTG CAC TAT AAT GCG GCC GCC AGC C-Biotin-3')
+ 2 μl 10 x T4 DNA ligase (Roche)
+ 10 μl DNA sample
+ 2 μl PBNTl linker
+ 2 μl T4 DNA ligase (Roche; 5 U/μl)
+ 4 μl water
Incubate at 16°C overnight
Clean-up on Qiagen DNeasy column
To ligation mixture add
+ 80 μl water
+ 17 μl Qiagen Proteinase K
+ 200 μl AL buffer
Incubate at 65°C for 5 min
+ 200 μl Ethanol
Spin through DNeasy column
Wash twice with 500 μl of AW1 and AW2 buffers
Elute with 180 μl AE buffer
Digestion with NlaIII
To DNA sample add
+ 20 μl 10 x NEB Buffer 4
+ 3 μl NlaIII (NEB; 10 U/μl)
Incubate overnight at 37°C
Capture on Dynal beads (M-270)
Wash 10 μl Dynal beads in 200 μl 1 x Binding buffer
To recaptured beads add 200 μl digestion mix and 200 μl 2 x Binding buffer
Mix gently
Incubate at 37°C for 30 min with occasional gentle agitation
Capture beads and wash in 200 μl 1 x NEB Buffer 4
Digestion with BsgI
Recaptured beads resuspended by the addition of
+ 77.5 μl water
+ 10 μl 10 x NEB Buffer 4
+ 10 μl 1/40 dilution S-Adenosylmethionine (NEB)
+ 2.5 μl BsgI (NEB; 10 U/μl)
Incubate at 37°C for 3 hours
Capture beads and wash in two changes of 40 μl 2 x Binding buffer
Resuspend beads in 8 μl 0.1 M NaOH
Incubate with gentle agitation at room temperature for 5 min
Capture beads and reserve supernatant
Precipitate single stranded DNA by addition of
+ 1 μl 10 mg/ml glycogen
+ 1 μl 3 M NaOAc (pH 5.2)
+ 25 μl Ethanol
Mix and incubate at -80°C for 30 min
Precipitate and wash
Resuspend in 10 μl water
Addition of single stranded adaptor by Tsc ligase
The following oligonucleotides were prepared: NotAd (5'Phosphate-TAT GCG GCC GCT TAG TAC-3'); 5JCoR4 (5'-CAG CCG TAC TAA GC-3'); 3J (5'-NNN NAT ATG CGC-3'). In this example they are used at concentrations of 10 pmol/μl, but they have also been used at concentrations as low as 10 fmol/μl.
To 3 μl DNA sample add
+ 2.5 μl Tsc Incubation buffer (Roche)
+ 1.2 μl NotAd (@ 10 pmol/μl)
+ 1.2 μl 5JCoR4 (@ 10 pmol/μl)
+ 1.2 μl 3J (@ 10 pmol/μl)
+ 1 μl Tsc ligase (Roche)
+ 14.9 μl water
Incubate using the following programme: 94°C for 5 min; 94°C for 30 s followed by 30°C for 3 min, this step repeated 32 times; 99°C for 15 min; hold at 4°C.
Exonuclease I treatment
To 25 μl DNA sample add
+ 3 μl 10 x Exonuclease I buffer (USB)
+ 1 μl Exonuclease I (USB; 10 U/μl)
+ 1 μl water
Incubate at 37°C for 1 hour
+ 1 μl 0.5 M EDTA
Incubate at 65°C for 5 min
+ 3.1 μl 3 M NaOAc (pH 5.2)
+ 68 μl Ethanol
+ 1 μl 10 mg/ml glycogen
Mix and incubate at -80°C for 30 min
Precipitate and wash
Resuspend in 15 μl water
Bst polymerase mediated rolling circle amplification
Prepare Not oligo (5'-GCG GCC GC-3')
Incubate the 15 μl DNA sample at 95°C for 1 min and then add
+ 5 μl Thermo Pol buffer (NEB)
+ 2 μl Not oligo (@ 200 pmol/μl)
+ 1 μl 10 mM dNTPs
+ 1 μl Bst polymerase (NEB; 8 U/μl)
+ 25 μl water
Incubate overnight (up to 20 hours) at 60°C
Recovery of fragment and concatamerisation
NotI digest of Bst-amplified products
To 1 μl amplified DNA add
+ 2 μl 10 x Buffer H (Roche)
+ 15.5 μl water
+ 1.5 μl NotI (Roche; 10 U/μl)
Incubate at 37°C for 2 hours
Isolation of 40 bp 'tag'
DNA sample mixed with 3 μl 6 x Loading Buffer (Promega)
Loaded onto PAG and run at 300 V in 1 x TBE until the yellow dye front had left the gel
Band of interest identified on a transilluminator and excised
Incubated overnight at room temperature in 1 ml of TE
Recover TE buffer
Spin gel slice through siliconized glass wool and add liquid to TE
+ 100 μl 3 M NaOAc (pH 5.2)
+ 1 μl 10 mg/ml glycogen
+ 2.5 ml Ethanol
Mix and incubate at -80°C for 30 min
Precipitate and wash
Resuspend in 6 μl water
Concatamerization of 40 bp fragment
To 5 μl DNA sample add
+ 2 μl 10 x T4 DNA ligase buffer (Roche)
+ 1 μl T4 DNA ligase (Roche; 5 U/μl)
+ 15 μl water
Incubate overnight at 16°C
Cloning
Cloning into NotI-digested and alkaline phosphatase-treated pGEM5z
To 1-10 μl DNA sample prepare a ligation reaction in a total volume of 20 μl 1 x T4 DNA ligase buffer (Roche) supplemented with 1 μl T4 DNA ligase (Roche; 5 U/μl) and 50 ng NotI-digested and alkaline phosphatase-treated pGEM5z
Incubate at 16°C overnight
7. Equivalents Those skilled in the art will recognize, or be able to ascertain using no more than routine experimentation, many equivalents to the specific embodiments of the invention described herein. Such equivalents are intended to be encompassed by the following claims.
All publications, patents and patent applications mentioned in this specification are herein incorporated by reference into the specification to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference.

Claims

What is claimed is:
1. A method for the large-scale isolation of regulatory sites in chromatin comprising treating chromatin with an agent that modifies DNA at regulatory sites and fragmenting the modified chromatin, and isolating sub-fragments that contain or are adjacent to DNA modifications.
2. The method of claim 1 wherein the modified genomic DNA is prepared by treating chromatin with an enzyme, a chemical agent, radiation, a shearing device, or a combination thereof.
3. The method of claim wherein the chromatin is obtained from cell nuclei of a biological sample.
4. The method of claim 4 wherein the biological sample is selected from the group consisting of samples of primary cells, mammalian cells, human cells, murine cells, plant cells, fly cells, worm cells, fish cells, diseased cells, cancerous cells, yeast cells, transformed cells and cell lines, embryonic cells, stem cells, yeast artificial chromosomes containing mammalian DNA sequences, plant artificial chromosomes containing eukaryotic DNA sequences, and nuclear extracts and combinations thereof.
5. The method of claim 1 wherein the agent is selected from the group consisting of enzymes, radiation, chemical agents, shearing, centrifugation, or electrophoretic devices, and combinations thereof.
6. The method of claim 5 wherein the enzyme is a nuclease, a non-specific endonuclease, a sequence-specific endonuclease, a topoisomerase, a methylase, a histone acetylase, a histone deacetylase, or a combination thereof.
7. The method of claim 6 wherein the enzyme is endogenous.
8. The method of claim 6 wherein the enzyme is exogenous.
9. The method of claim 6 wherein the non-specific endonuclease is DNase I.
10. The method of claim 6 wherein the sequence-specific endonuclease is a restriction endonuclease.
11. The method of claim 6 wherein the topoisomerase is a topoisomerase II.
12. The method of claim 5 wherein the radiation is one or more of the group consisting of UV light, lasers, and ionizing radiation.
13. The method of claim 5 wherein the chemical agent is a clastogen.
14. The method of claim 5 wherein the chemical agent is a crosslinker.
15. The method of claim 1 wherein one or more oligonucleotide linkers are ligated to fragments containing modifications.
16. The method of claim 15 where oligonucleotide linkers contain one member of a binding pair.
17. The method of claim 16 where one member of the binding pair is biotin.
18. The method of claim 16 where the other member of the binding pair is streptavidin-coated paramagnetic beads.
19. The method of claim 16 where the fragments are isolated by binding the biotinylated oligonucleotides to streptavidin-coated paramagnetic beads.
20. The method of claim 1 wherein the fragments are amplified prior to isolation.
21. The method of claim 20 wherein one or more oligonucleotide linker adapters are ligated to the isolated fragments and amplification reaction utilizes primers complementary to said oligonucleotide linkers.
22. The method of claim 1 further comprising the step of releasing a sub-fragment from the ends of the isolated fragments.
23. A method where the sub-fragments of any of claims 1-22 are self-ligated to form higher-molecular weight concatamers.
24. A method for mapping such sub-fragments to genomic locations of origin.
25. A method for identifying genomic locations where the incidence of two or more sub-fragments occurs within a 250 bp window.
26. The method of claim 7 where two or more sub-fragments occurs within a 50 bp window.
27. The method of claim 7 where two or more sub-fragments occurs within a 1000 bp window.
28. The method of any of claims 1-22 where the tag sequences are obtained by Massively Parallel Signature Sequencing (MPSS).
29. A method of ascertaining the effect of an agent or other environmental perturbation on the composition of a concatamerised library as generated by any of claims 1-22 comprising: (a) obtaining a first concatamerised library from a biological sample unexposed to the agent or perturbation; (b) obtaining a second concatamerised library from a biological sample exposed to the agent or perturbation; and (c) comparing the composition of the first library with that of the second to determine the regulatory sites affected by the agent or perturbation.
30. The method of claim 29, wherein the perturbation occurs before obtaining the sample from a tissue, wherein the environmental perturbation is selected from the group consisting of an infection of the eukaryotic organism from a microorganism, loss in immune function of the eukaryotic organism, exposure of the tissue to high temperature, exposure of the tissue to low temperature, cancer of the tissue, cancer of another tissue in the eukaryotic organism, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound; and aging.
31. The method of claim 29, wherein the perturbation occurs after obtaining the sample from a tissue, wherein the perturbation is selected from the group consisting of exposure of the tissue to high temperature, exposure of the tissue to low temperature, irradiation of the tissue, exposure of the tissue to a chemical or other pharmaceutical compound, and aging.
32. A computer readable medium comprising the genomic locations of genomic sub-fragments associated with a particular treatment of a sample.
33. A computer readable medium comprising the genomic locations of co-incidences of genomic sub-fragments associated with a particular treatment of a sample.
34. The computer readable medium of any of claims 32-33 wherein the disease or disorder is cancer.
35. The computer readable medium of any of claims 32-33 associated with a specific cell.
36. The computer readable medium of claim 35, wherein the cell is a mammalian cell.
37. The computer readable medium of claim 35, wherein the cell is a diseased cell.
38. The computer readable medium of claim 35, wherein the cell has been treated with a drug.
39. The computer readable medium of claim 35, wherein the cell has been treated with an agent.
40. A method of detecting a disease or disorder in a subject, comprising: a. creating a computer readable medium of any of claims 32-33 associated with a diseased state; and b. comparing with a computer readable medium of any of claims 32-33 associated with a non-diseased state to identify locations of genomic sub-fragments specific to one state but not the other.
41. A method of qualifying a patient for a clinical trial, comprising: a. creating a computer readable medium of claims 32-33 associated with a patient treated with an agent; and b. comparing with a computer readable medium of claims 32-33 associated with a suitable patient treated with same agent.
42. A method of qualifying a patient for a therapy, comprising: a. creating a computer readable medium of any of claims 32-33 associated with a patient treated with an agent; and b. comparing with a computer readable medium of any of claims 32-33 associated with a suitable patient treated with same agent.
43. A method for detecting genomic regions of co-incidence of positions of genomic sub-fragments, comprising detecting genomic regions of co-incidence of the positions of genomic sub-fragments within a collection of sub-fragments that is greater than the co-incidence expected if the sub-fragments were distributed uniformly within the genome.
44. A method for detecting genomic positions that show increased sensitivity to a DNA modifying agent, comprising: applying the method of claim 43 to a collection of sub-fragments generated from a sample treated with said modifying agent.
45. A method for obtaining a library of concatamerized genomic sub-fragments, comprising: (a) treating a sample with a DNA modifying agent; (b) isolating genomic subfragments associated with the site of modification (c) self-ligating said genomic sub-fragments to form concatamers, each concatamer comprising two or more sub-fragments; and (d) creating a library containing the concatamers.
46. A method for detecting genomic regions of co-incidence of positions of genomic sub-fragments, comprising: (a) treating a sample with a DNA modifying agent; (b) isolating genomic subfragments associated with the site of modification (c) self-ligating said genomic sub-fragments to form concatamers, each concatamer comprising two or more sub-fragments; (d) creating a library containing the concatamers; (d) sequencing the genomic sub-fragments; and (e) mapping the genomic sub-fragments to unique positions within the genome.
PCT/US2004/042172 2003-12-15 2004-12-14 Methods and algorithms for identifying genomic regulatory sites WO2005058931A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US53032003P 2003-12-15 2003-12-15
US60/530,320 2003-12-15

Publications (2)

Publication Number Publication Date
WO2005058931A2 true WO2005058931A2 (en) 2005-06-30
WO2005058931A3 WO2005058931A3 (en) 2005-10-13

Family

ID=34700122

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/042172 WO2005058931A2 (en) 2003-12-15 2004-12-14 Methods and algorithms for identifying genomic regulatory sites

Country Status (1)

Country Link
WO (1) WO2005058931A2 (en)

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase