WO2016156469A1 - Genome architecture mapping on chromatin - Google Patents

Genome architecture mapping on chromatin

Publication number
WO2016156469A1
Authority
WO
WIPO (PCT)
Prior art keywords
loci
gam
dna
segregation
chromatin
Prior art date
Application number
PCT/EP2016/057025
Other languages
French (fr)
Inventor
Ana Pombo
Paul Dear
Miguel BRANCO
Original Assignee
Max-Delbrück-Centrum für Molekulare Medizin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Max-Delbrück-Centrum für Molekulare Medizin
Publication of WO2016156469A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813: Hybridisation assays
    • C12Q1/6816: Hybridisation assays characterised by the detection means
    • C12Q1/6827: Hybridisation assays for detection of mutation or polymorphism
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/156: Polymorphic or mutational markers

Definitions

  • The present invention relates to the analysis of the three-dimensional structure of the genome, i.e., to genome architecture mapping on chromatin (GAM-ch).
  • the invention provides a method of determining interaction of a plurality of nucleic acid loci in a compartment comprising nucleic acids, such as the cell nucleus, comprising separating nucleic acids from each other depending on their interaction in the compartment by crosslinking nucleic acids with each other directly or indirectly, fragmenting the nucleic acids of the compartment to obtain fragments and/or cross-linked complexes of fragments, and dividing the fragmented nucleic acids to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus; determining the presence or absence of the plurality of loci in said fractions; and determining the co-segregation of said plurality of loci in the fractions.
  • Co-segregation may then be analysed with statistical methods to determine interactions.
  • The method can be used, e.g., for identifying the frequency of interactions across a cell population between a plurality of loci; mapping loci and/or genome architecture, e.g., in the nucleus, an organelle, a microorganism or a virus; identifying regulatory regions (enhancers) that direct expression of a specific gene through spatial contacts; identifying the spatial contacts between loci that depend on their co-association with specific protein(s) or RNA; and/or diagnosing a disease associated with a disturbed co-segregation of loci.
  • Chromatin immunoprecipitation (ChIP) can be combined with the method of the invention.
  • Information about the three-dimensional structure of chromatin is also of high interest, in particular, to discover contacts between regulatory regions (e.g. enhancers) and gene promoters which may be disrupted in disease due to genetic mutations in the non-coding part of the genome (e.g. Uslu V.V. et al. 2014 Long-range enhancers regulating Myc expression are required for normal facial morphogenesis. Nature Genetics 46: 753).
  • chromosomes Studying the structural properties and spatial organization of chromosomes is important for the understanding and evaluation of the regulation of gene expression, DNA replication and repair, and recombination.
  • the folding of chromosomes and their contacts has important implications for disease mechanisms and elucidation of targets for therapeutic approaches, e.g., in cancer or congenital diseases.
  • Chromatin exists in interacting and non-interacting states. Interacting states have different properties depending on the characteristics of the genomic sites, or binding sites, involved in the interactions, namely (a) their number, distance and distribution, (b) their specificity and affinity for binders, and (c) the concentration and specificity of binders. Chromatin interactions can also involve different numbers of loci associating simultaneously (multiplicity of interaction).
  • Fluorescence in situ hybridization uses microscopy to directly measure spatial distances between genomic loci, but it can only be applied to the study of a small number of genomic regions at a time in the same nucleus (e.g., Pombo A. 2003. Cellular genomics: which genes are transcribed when and where? Trends Biochem. Sci. 28, 6). It is theoretically possible to re-probe the same cells or tissue sections with different sets of probes, but there are concerns that repeated re-probing causes structural artefacts, e.g. due to DNA denaturation necessary to dissociate subsequent sets of probes, that e.g. induce artificial aggregation (contacts) of loci (i.e.
  • RNA-FISH is a milder FISH approach that does not involve DNA denaturation, but it can only be used to determine the nuclear position of actively transcribed genes (not silent genes). Samples from cells in the interphase stage of the cell cycle, where functional chromatin contacts are most often mapped, can be re-probed for RNA-FISH only about three times, although the preservation of structure has not been measured in detail.
  • the number of probes which can be simultaneously applied in either DNA- or RNA-FISH is limited by distinguishable fluorescent markers, e.g. 181 barcodes can in principle be obtained by combining five colours, four colour ratios and two different levels of intensity (Pombo A. 2003. Cellular genomics: which genes are transcribed when and where? Trends Biochem. Sci. 28, 6).
  • this approach fails when the loci analysed are so close in space that the combination of fluorochromes in one probe is not distinguishable from the combination in another, and is therefore not amenable to the identification of loci that are spatially proximal at very short distances.
  • FISH can only be applied to analyse interactions of known loci of interest, and not to discover e.g. the presence of an exogenous DNA sequence in an interaction with the host's DNA.
  • the approach fails e.g. in the detection of endogenous or exogenous DNA sequences, unless they are known a priori, e.g. viral subtype integration positions and the exact sequences of exogenous DNA.
  • FISH is also confounded by a priori assumptions of linear genome organisation, which are not acceptable to study chromatin positioning features, e.g. chromatin contacts, when e.g. the influence of natural variation in genomic sequence in organism populations is of interest, e.g. in studying human samples, due to the fact that FISH does not inherently detect sequence variations such as copy number variations, or genomic rearrangements, without a priori probe design or a priori whole genome sequencing of the sample followed by probe design.
  • 3C-based methods generally start with chemical crosslinking of proteins that mediate genomic contacts. After chromatin extraction, pieces of DNA bound by the crosslinked proteins and RNAs are treated with a restriction enzyme for fragmentation. Addition of a ligase then connects (ligates) two pieces of DNA.
  • 3C uses different methods of detecting such ligation events: a popular one is paired-end sequencing (Hi-C, 4C-seq, ChIA-PET), and in one embodiment the DNA bound by a specific protein (or molecule) is purified before the ligation step.
  • the present inventors addressed the problem of providing an improved method for determining the interaction of nucleic acids, which avoids bias based on ligation of fragmented nucleic acids for detection of nucleic acids interactions, and which allows for simultaneous analysis of several high multiplicity interactions (each involving more than two loci), in particular, more than two interactions.
  • the method allows for simultaneous analysis of substantially all nucleic acid interactions in the genome, in another, the method allows for simultaneous analyses of all nucleic acid interactions of fragments bound by a given protein or molecule of interest such as protein or RNA.
  • This problem is solved by the method of the invention, as described below and in the claims. This method is designated Genome Architecture Mapping on Chromatin (GAM-ch).
  • the present invention provides a method of determining interaction of a plurality of nucleic acid loci in a compartment comprising nucleic acids, comprising steps of
  • nucleic acids from each other depending on their interaction in the compartment by (i) crosslinking nucleic acids with each other directly or indirectly, (ii) fragmenting the nucleic acids of the compartment to obtain fragments and/or cross-linked complexes of fragments, e.g. by the use of sonication, mechanical shearing or restriction enzyme digestion, and (iii) dividing the fragmented nucleic acids to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus (e.g. about 0.5 copies or one copy in every other fraction), wherein steps (i) and (ii) can be carried out simultaneously or in any order;
  • fragments bound by a given molecule of interest are selected, e.g. by chromatin immunoprecipitation (ChIP), as described in more detail below.
  • a locus is the specific location of a gene, DNA sequence, or position on a chromosome (Wikipedia). Each chromosome carries many genes; the number of protein coding genes in the haploid human genome is estimated to be 20,000-25,000, on the 23 different chromosomes; there are as many transcription units which produce RNA species that do not encode for proteins. A variant of the similar DNA sequence located at a given locus is called an allele.
  • the nucleic acid may be DNA or RNA or a combination of both, e.g., if interactions between genes being actively transcribed and other genomic regions are to be analysed. Usually, the method of the invention is used to analyse co-segregation of DNA.
  • the co-segregation of loci may be analysed in any compartment comprising nucleic acids, such as the nucleus of a eukaryotic cell, a mitochondrion, a chloroplast, a prokaryotic cell or a virus.
  • co-segregation of nucleic acid loci, in particular DNA loci, in the nucleus of a eukaryotic cell may be analysed.
  • the method of the invention thus constitutes a solution for analysing the proximity or interaction of loci in the nucleus by measuring their frequency of co-segregation in cross-linked DNA complexes extracted from nuclei.
  • the cell or particle from which the compartment is derived may be a virus, a bacterium, a protozoan, a plant cell, a fungal cell or an animal cell, e.g., a mammalian cell, such as a cell from a patient (preferably, a human patient) having a disease or a disorder, or being diagnosed for a disorder, or a healthy subject.
  • the cell may be a tumor cell or a stem cell, such as an induced pluripotent stem cell generated, e.g., through reprogramming of human tissues.
  • Such cells can advantageously be used to apply GAM-ch to study human developmental disorders or congenital disease.
  • the cell is an embryonic stem cell, it is preferably not generated in a method involving destruction of a human embryo. A plurality of cells/compartments or single cells may be analysed with the method of the invention.
  • the mammal preferably is a human, but it may also be of interest to investigate and, optionally, compare the genomic architecture of other organisms, such as E. coli, yeast, A. thaliana, C. elegans, X. laevis, D. rerio, D. melanogaster, mouse, rat or primate, or possibly parasitic interactions, e.g. the proximity of parasitic nucleic acids relative to the host genome, such as the chromatin contacts a virus (e.g. HIV, HSV) makes with the host DNA, or the contacts of an artificially inserted nucleic acid (e.g. in the context of gene therapy).
  • Cells can be derived from cell culture or analysed ex vivo from a specific tissue from a living organism or a dead organism, i.e., post-mortem, or from a whole experimental organism (e.g. a whole D. melanogaster embryo or C. elegans embryo), or from a mixture of microorganisms.
  • Cells used in the analysis can be selected, e.g., by synchronizing the cells in a particular stage of the cell cycle, or sorting the cells e.g. by fluorescence activated cell sorting to capture a particular cell type expressing a specific marker, e.g., using an antibody specific for a protein uniquely expressed in the cell type or cell stage of interest, or detected by in situ hybridization e.g.
  • a nucleic acid probe that detects a specific e.g. mRNA, or other RNA, expressed specifically in the cell type of interest, or a fluorescent marker such as GFP showing expression of a specific gene or characteristic of a specific stage.
  • a GFP transgene under the control of the promoter of the Pitx3 transcription factor can be used to mark dopamine-expressing neurons (Maxwell S. et al. 2005. Pitx3 regulates tyrosine hydroxylase expression in the substantia nigra and identifies a subgroup of mesencephalic dopaminergic progenitor neurons during mouse development. Dev. Biol. 282 (2): 467-479).
  • Cells can be pre-treated with an agent, e.g., to test the effect of drugs on co-segregation or positioning of loci, or be studied during the lifetime of an organism to understand development, ageing and degeneration.
  • a suspension of single cells is prepared before step (a), depending on the species and type of tissue, e.g., a single cell suspension of mammalian solid tissues may be prepared.
  • Preparation of a single cell suspension may be carried out by any procedure that is also compatible with 3C technologies. Detailed descriptions of several single cell preparations compatible with the production of a chromatin sample that preserves crosslinked chromatin contacts can be found in, e.g., Hagege H. et al. 2007. Quantitative analysis of chromosome conformation capture assays (3C-qPCR). Nature Protocols 2, 1722.
  • the preparation of a single cell suspension may start by tissue dissection, followed by treatment with collagenase, or, for soft tissues (e.g. mouse thymus or fetal liver), by passage of tissue through a cell strainer (e.g. 40 micrometer mesh), or in the case of cells grown in in vitro culture or microorganism cultures, through centrifugation of the culture at appropriate force for the cell type, followed by resuspension at appropriate strength to yield a single cell suspension with minimal cell damage or death.
  • Application to post-mortem samples is also possible using published protocols or developments thereafter (Mitchell A.C. et al. 2014. The genome in three dimensions: a new frontier in human brain disease. Biol. Psychiatry 75, 961).
  • the separation of nucleic acids from each other in step (a) is carried out by (i) crosslinking nucleic acids with each other directly or indirectly, i.e., DNA and/or RNA may be cross-linked directly or through proteins interacting with the nucleic acid, using e.g. chemical crosslinking agents such as formaldehyde, (ii) fragmenting the nucleic acids of the compartment to obtain fragments and/or complexes of cross-linked fragments of nucleic acids, e.g.
  • nucleic acids by sonication, and (iii) dividing the nucleic acids into fractions to obtain a collection of fractions each containing a plurality of fragments and/or complexes of cross-linked fragments, such that every fraction contains, on average, less than one copy of every locus.
  • Nuclei, cells, tissues or whole organisms are treated with a crosslinking agent, e.g. a chemical crosslinking agent in step (a) (i).
  • the crosslinking agent induces linkage of proteins with each other and between nucleic acids (DNA and/or RNA) and proteins.
  • the method of the invention is compatible with cross-linking conditions that are also compatible with current 3C-based methods.
  • the crosslinking agent comprises formaldehyde or another crosslinking agent compatible with DNA extraction.
  • Formaldehyde will preferably be used, at a concentration of 0.5-4%, preferably about 1%-2% (all w/w), e.g., in a buffered solution, e.g., of PBS pH 7.0-8.0, or by addition of a concentrated solution of the cross-linking agent directly to the cell medium, preferably for 5-120 min, more preferably 10-20 min.
  • Alternative cross-linkers are, e.g., disuccinimidylglutarate, dithiobis-succinimidyl propionate, glutaraldehyde.
  • Crosslinking may also be performed by UV radiation.
  • fixed nuclei or cells can be pelleted and stored frozen, e.g., at -20°C, -70°C or -80°C, e.g. in 1% formaldehyde.
  • Steps (i) and (ii) may be carried out at the same time or in any order.
  • crosslinking is performed as soon as possible to maintain the structure of chromatin intact as well as possible, i.e., it is usually performed first.
  • Step (a) of the method may further comprise, e.g., permeabilisation of cells by a lysis buffer and/or freezing.
  • the crosslinking can, e.g., be done directly on cells and then followed by permeabilisation, e.g., lysis with a suitable lysis buffer, and/or, freezing, and then fragmentation, e.g., by restriction (see Hagege et al. 2007).
  • crosslinking and permeabilising can be performed at the same time.
  • the fragmenting in step (a)(ii) can be carried out by any method, preferably one which leads to the formation of fragments of homogeneous length, or to random and evenly spaced breaks in the nucleic acids.
  • fragmentation can be done by ultrasound, by mechanical shearing, by Dounce homogenisation, vortexing with glass beads, or by restriction digest, or a combination of two or more of these methods.
  • Physical methods such as ultrasound or shearing can be adapted to yield fragments or complexes of fragments of a desired fragment size, which may vary depending on the tissue and/or cell analysed.
  • The preferred average fragment size depends on the resolution at which chromatin interactions are to be mapped (which depends on the organism and on the aims) and is about 100 bp-5 Mbp, preferably 200 bp-500 kbp or 1 kbp-5 kbp.
  • the average chromatin loop size is about 100 kbp.
  • Promoter contacts with regulatory regions are often local, below 50 kbp, so an appropriate resolution needs to be chosen.
  • Dounce homogenisation can be performed using e.g. 100 mg tissue in (a) 2 mL 1X PBS (phosphate buffered saline) or another suitable buffer, and (b) 200 µL protease inhibitor (Mitchell A.C. et al. 2014. The genome in three dimensions: a new frontier in human brain disease. Biol. Psychiatry 75, 961).
  • vitrification i.e. rapid freezing
  • restriction digestion may be considered to introduce some bias into the formation of fragments, it may be acceptable if it is taken into account in the analysis of results.
  • frequently cutting restriction enzymes may be used, or a combination of enzymes recognizing different restriction sites e.g., two, three or four different restriction enzymes, may be used.
  • a restriction digest with the enzymes HindIII, NcoI, EcoRI or BglII (6-base cutters) or DpnII or NlaIII (4-base cutters) may be carried out, e.g., for 60 min or overnight at 37°C, and will provide different fragment sizes depending on the genomic distribution of the restriction sequence.
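  • The dependence of fragment size on the distribution of the restriction sequence can be illustrated with a small in-silico digest. This is a minimal sketch, not part of the disclosed method: the function names and the toy sequence are hypothetical, the cut is placed at the start of each recognition site for simplicity (the true cut position within the site is ignored), and the expected spacing assumes a random sequence with equal base frequencies.

```python
import re

# Recognition sequences of a 6-base cutter and a 4-base cutter.
SITES = {"HindIII": "AAGCTT", "DpnII": "GATC"}


def fragment_lengths(sequence, site):
    """Lengths of fragments obtained by cutting at the start of every site.

    Simplification (assumption): the real enzymes cut at a defined offset
    inside the site; here the cut is placed at the first base of the site.
    """
    # The lookahead makes overlapping occurrences visible to finditer.
    cuts = [m.start() for m in re.finditer(f"(?={site})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def expected_spacing(site):
    """Expected distance between sites in a random, equal-base-frequency sequence."""
    return 4 ** len(site)


toy = "CCGATCAAGCTTGGGATCTTAAGCTTAGATCC"  # hypothetical toy sequence
for enzyme, site in SITES.items():
    print(enzyme, fragment_lengths(toy, site), expected_spacing(site))
```

  Under these assumptions a 4-base cutter cuts roughly every 256 bp and a 6-base cutter roughly every 4096 bp, which is why the choice of enzyme sets the achievable resolution.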
  • step (a) (iii) can be preceded by an additional step (a) (iii.0) comprising selection of fragments/complexes of fragments that are bound by a given molecule of interest, in particular a given protein, a given protein post-translational modification, a given RNA (if fragments are DNA) or a given DNA (if fragments are RNA), or a chemical modification of DNA (e.g. DNA methylation) or RNA, or a given protein/nucleic acid complex, or, after targeting a locus with Cas9 complex with guide RNAs.
  • the given molecule of interest is a protein that is bound to chromatin at the time that chromatin forms contacts.
  • Said selection may be carried out by an affinity-based method such as affinity precipitation, e.g. by performing a chromatin immunoprecipitation or pull-down using antibodies or other affinity molecules (e.g. an aptamer), followed by dividing/aliquoting e.g. the 'beads' used for the pull-down.
  • affinity precipitation with antibodies is preferred, other affinity based selection methods, e.g.
  • biotin binding to avidin or derivatives such as streptavidin e.g., after labelling of chromatin using in vivo biotinylation, or incorporation of biotin to specific nucleic acid sequences, e.g. after in situ incorporation of Biotin-UTP or Biotin-dUTP into nascent RNA or nascent DNA, respectively, can also be employed.
  • Specific nucleic acids may also be selected by use of hybridizing nucleic acids for selection, e.g., by affinity precipitation. Affinity precipitation can be substituted for by passage over columns comprising a ligand specific for the molecule of interest.
  • Chromatin Immunoprecipitation can be employed (e.g., Collas, 2010. The current state of chromatin immunoprecipitation. Molecular Biotechnology 45(1):87-100; Stock et al. 2007; Brookes et al. 2012). Suitable conditions for specific interaction with the molecule of interest are employed, e.g., conditions for stringent hybridization. Methods disclosed in WO 2014/14152397 A2 may be employed.
  • In step (a)(iii), the nucleic acids in the preparation resulting from the previous steps, e.g., directly from step (a)(ii) or from step (a)(iii.0), are divided (or aliquoted) into fractions to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus (e.g. 0.0001-0.9, 0.01-0.7, 0.1-0.6, 0.4-0.5, preferably about 0.5 copies, i.e. one copy in every other fraction).
  • one locus is seen in every other fraction (i.e. in 50% of the fractions), or in 40% or 30% or 10% or 5% of fractions.
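  • The relation between the mean number of copies per fraction and the share of fractions in which a locus is seen can be sketched as follows. The Poisson loading model is an assumption made here for illustration and is not stated in the source; the function names are hypothetical.

```python
import math


def fraction_with_locus(mean_copies):
    """Share of fractions expected to contain at least one copy of a locus.

    Assumption: copies are distributed over fractions independently,
    so the copy number per fraction follows a Poisson distribution.
    """
    return 1.0 - math.exp(-mean_copies)


def mean_copies_for_detection(target_share):
    """Mean copies per fraction needed for a locus to appear in target_share of fractions."""
    return -math.log(1.0 - target_share)


for mean in (0.1, 0.5, 0.9):
    print(f"mean copies {mean}: locus present in {fraction_with_locus(mean):.1%} of fractions")
print(f"mean copies for 50% presence: {mean_copies_for_detection(0.5):.3f}")
```

  Under this assumption, a mean of 0.5 copies per fraction corresponds to the locus being present in roughly 39% of fractions, while presence in 50% of fractions corresponds to a mean of about 0.69 copies. The difference arises because some fractions receive more than one copy.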
  • the number of fractions depends on the approximate number of loci and the genomic resolution at which the assay will be carried out (i.e. it depends on the total genome length of the organism under study and the length of the loci for which contacts are measured, in other words on the resolution).
  • the nucleic acids are separated into many fractions.
  • the number of fractions depends on whether only pairwise or also multiple contacts are to be found between loci, and on whether only the most frequent contacts (interactions) (e.g. frequency above 50% across the cell population) or also the least frequent contacts (e.g. 5%) are to be identified.
  • step (a) (iii.0) is used to reduce the complexity of the sample. If step (a) (iii.0) is used, analysis of about 180 fractions (or more) already provides meaningful results.
  • the nucleic acid (often DNA) content of the fractions should be homogenous for the whole analysis, but non-homogenous fractions (e.g.
  • fractions that have excessive DNA content may be excluded a posteriori once nucleic acid content is mapped; e.g., if using fractions that are supposed to contain approximately 30% of genomic DNA coverage on average, any tubes that contain more than 40% or less than 20% coverage can be excluded, or analysed separately, upon DNA detection.
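  • The a posteriori exclusion of fractions by coverage can be sketched as a simple filter. The function name, the fraction labels and the coverage values are hypothetical; the 20% and 40% thresholds are the example values given above.

```python
def split_by_coverage(coverage_by_fraction, low=0.20, high=0.40):
    """Split fractions into those kept and those set aside.

    coverage_by_fraction maps a fraction label to its genomic coverage (0..1).
    Fractions with coverage below `low` or above `high` are set aside so that
    they can be excluded or analysed separately.
    """
    kept, set_aside = {}, {}
    for name, coverage in coverage_by_fraction.items():
        target = kept if low <= coverage <= high else set_aside
        target[name] = coverage
    return kept, set_aside


# Hypothetical coverage values for four tubes.
example = {"tube_1": 0.31, "tube_2": 0.45, "tube_3": 0.18, "tube_4": 0.25}
kept, set_aside = split_by_coverage(example)
print("kept:", sorted(kept))
print("set aside:", sorted(set_aside))
```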
  • These fractions may be obtained from a plurality of cells (or nucleic acid containing cellular compartments) or from single cells.
  • the separation into fractions is preferably done after homogenous division of the fragments and/or cross-linked complexes of fragments.
  • some fractions will, statistically, contain one or more copies of all possible loci that cover the given genome. This may occur in different situations. Firstly, it may occur when the preparation of fractions of the compartment leads to fractions with very heterogeneous content in terms of the number of fragments (e.g. a large chunk of chromatin; Gavrilov A. A. et al. 2013. Disclosure of a structural milieu for the proximity ligation reveals the elusive nature of an active chromatin hub. Nucleic Acids Res. 41, 3563-75). This is an artefact, which can be detected and disregarded in the analysis. Furthermore, it may occur when the two alleles in a cell interact so closely that they appear in the same fraction. When loci are identified by sequencing, this is not a problem, as it can be detected from sequence differences due to SNP variation between alleles.
  • the presence or absence of the plurality of loci may be determined by e.g., polymerase-chain reaction (PCR), or preferably, by sequencing, preferably, by next generation sequencing and eventually by the developing single molecule sequencing techniques.
  • the nucleic acids of loci in the fraction are sequenced substantially or completely. This is of particular interest if the method is carried out to detect possible interactions between different loci in a research setting, and a "normal" co-segregation pattern has not yet been established for the cell type of interest in the physiological conditions used.
  • the method of the invention may thus be used to analyse spatial proximity (and, consequently, interactions) of unknown and/or unspecified loci, or of transgenic loci inserted in the genome (e.g. in gene therapy) to study their effects of chromatin contacts.
  • the method can be used to detect specific (and new) species, as the DNA in cells of each species crosslinks with DNA from each species, and is more often found co-segregated.
  • nucleic acids such as DNA may be analysed by crosslinking, nuclear fractionation (optional), fragmentation (i.e. chromatin preparation or preparation of nucleic acid complexes), dilution and separation into fractions or sub-samples, followed by amplification using single-cell whole genome amplification (WGA; Baslan, T. et al. 2012. Genome-wide copy number analysis of single cells. Nat. Protoc. 7: 1024) (Fig. 4A).
  • WGA-amplified DNA may be sequenced, e.g., using Illumina HiSeq technology. Visual inspection of tracks from single fractions shows that each contains a different complement of sub-chromosomal regions of expected size (Fig. 6, Fig. 14), as expected from sequencing a sub-cellular fraction of chromatin containing fragment lengths of a given genomic length.
  • each fraction contains only a restricted subset of sequences from each chromosome (Fig. 15B).
  • presence or absence of a specific interaction has previously been investigated, so the interacting loci of interest are already known.
  • a significant difference in the frequency with which two loci interact may have been found between different patient groups (e.g., healthy subjects and subjects having a disease, such as a tumor or a congenital disease).
  • presence or absence of the two (or more) loci of interest can also be determined by specific PCR, or by otherwise specifically checking for their presence, e.g., by Southern blot or by Illumina HiSeq technology, after selection of nucleic acids covering locus of interest, e.g.
  • GAM-ch thus preferably combines single copy locus fractionation of a crosslinked chromatin preparation with DNA detection (e.g. by whole genome amplification and next generation sequencing).
  • chromatin is crosslinked, loci that are closer to each other in the nuclear space (but not necessarily on the linear genome) are found together in the single molecule fraction more frequently than distant loci (i.e. they co-segregate more frequently, Fig. 2).
  • the frequency of contacts between genomic loci can then be inferred by scoring the presence or absence of loci among a number of aliquots containing a sub-genome sample of fragments (Fig. 2).
  • the resulting table can be used to compute the co-segregation frequency of each locus against every other locus to create a matrix of inferred contact frequencies between loci. Therefore, GAM allows for the calculation of chromatin contacts genome-wide without the need for end-to-end ligation between the interacting fragments.
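  • The step from the presence/absence table to the co-segregation matrix can be sketched as follows. This is a minimal illustration with a hypothetical function name and toy data; each row of the table is one fraction and each column one locus.

```python
def cosegregation_matrix(table):
    """Co-segregation frequency of every locus against every other locus.

    table: list of fractions, each a list of 0/1 flags (1 = locus detected).
    Entry [i][j] is the share of fractions containing both locus i and locus j;
    the diagonal holds the detection frequency of each single locus.
    """
    n_fractions = len(table)
    n_loci = len(table[0])
    counts = [[0] * n_loci for _ in range(n_loci)]
    for row in table:
        present = [i for i, flag in enumerate(row) if flag]
        for i in present:
            for j in present:
                counts[i][j] += 1
    return [[c / n_fractions for c in line] for line in counts]


# Toy table: 4 fractions (rows) x 3 loci (columns).
toy_table = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
for line in cosegregation_matrix(toy_table):
    print(line)
```

  In this toy table loci 0 and 1 are found together in half of the fractions, whereas loci 1 and 2 never co-segregate.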
  • Co-segregation may be analysed with a statistical method to determine chromatin contacts. Close spatial proximity can be a sign of specific interaction of loci. Specific interaction of loci may thus also be determined by analysing co-segregation with a statistical method.
  • Statistical methods used in the method of the invention may be, e.g., inferential statistical methods.
  • Statistical methods used in the examples may also be used in the method of the invention to analyse samples of different origin and/or for different loci of interest, e.g., as mentioned herein.
  • the loci are determined to interact specifically when they co-segregate at a frequency higher than expected from their linear genomic distance on a chromosome. If all possible pairs of loci in the genome at a given genomic (linear) distance are considered, pairs of loci that do NOT interact will be found distributed around an average frequency of chromatin contacts (i.e., co-segregation across the collection of fractions) that depends on the genomic distance between the two loci and the degree of chromatin compaction.
  • the term "contact" is used herein to describe co-segregation across the collection of fractions, i.e., a quantitative measure of interaction. For example, loci that do not interact are considered to have a contact value of zero.
  • interacting pairs will have higher frequencies of chromatin contacts (i.e., co-segregation in the fractions) than the average for that genomic distance, to a degree that depends on their physical distance in the nucleus of that particular cell type.
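The comparison against a distance-dependent average can be sketched as an observed/expected ratio. This is only one simple way to express the idea, not the statistical method of the examples; the window positions, detection vectors and function names below are hypothetical.

```python
from collections import defaultdict


def coseg(a, b):
    """Share of fractions in which both windows are detected."""
    return sum(1 for x, y in zip(a, b) if x and y) / len(a)


def observed_over_expected(positions, detections):
    """Ratio of each pair's co-segregation to the mean at the same distance.

    positions:  genomic start coordinate of each window (same chromosome).
    detections: one 0/1 list per window, one flag per fraction.
    """
    pairs = {}
    by_distance = defaultdict(list)
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            distance = abs(positions[j] - positions[i])
            freq = coseg(detections[i], detections[j])
            pairs[(i, j)] = (distance, freq)
            by_distance[distance].append(freq)
    # Expected value = average co-segregation of all pairs at that distance.
    expected = {d: sum(v) / len(v) for d, v in by_distance.items()}
    return {
        pair: (freq / expected[d] if expected[d] > 0 else 0.0)
        for pair, (d, freq) in pairs.items()
    }


# Hypothetical toy data: four 10 kb windows scored across four fractions.
positions = [0, 10_000, 20_000, 30_000]
detections = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
]
ratios = observed_over_expected(positions, detections)
```

A ratio above 1 marks a pair that co-segregates more often than the average pair at the same linear distance. A real analysis would need many more fractions and a test of significance before calling an interaction.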
  • More complex arguments can also be considered, but an interaction can be most simply defined as a deviation from the random (three-dimensional) arrangement of the chromatin fibre taking into consideration any additional contributing factor(s) to a non-random behaviour.
  • GAM-ch measures the frequency with which two loci co-segregate in the same fraction, and can measure the co-segregation of all genomic loci simultaneously, producing quantitative information that is amenable to (a) the identification of genomic coordinates that more frequently interact with other genomic regions, but also (b) to a wide range of mathematical treatments that calculate the probability of loci interacting above some random (expected) behaviour.
  • a plurality of loci means two or more loci, optionally, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least 12, at least 13, at least 15, at least 20, at least 30, at least 40, at least 50, at least 75, at least 100, at least 200, at least 500 or at least 1000 loci and up to several million or billion loci, which are analysed simultaneously. For example, allele-specific analysis of a human cell at 5 kb resolution requires simultaneous analysis of 1.3 million loci.
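The figure of about 1.3 million loci can be checked with a short calculation. The haploid human genome size of 3.2 Gbp used below is an assumption introduced for this sketch (the text does not state it), and the function name is hypothetical.

```python
def count_loci(genome_bp, window_bp, alleles=1):
    """Number of fixed-size windows needed to tile a genome, per allele count."""
    windows = -(-genome_bp // window_bp)  # ceiling division on integers
    return windows * alleles


# Assumption: haploid human genome of roughly 3.2 Gbp.
HAPLOID_HUMAN_BP = 3_200_000_000

# Allele-specific analysis of a diploid cell at 5 kb resolution.
n_loci = count_loci(HAPLOID_HUMAN_BP, 5_000, alleles=2)
print(n_loci)
```

Under this assumption the result is 1,280,000 windows, in line with the roughly 1.3 million loci quoted above.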
  • substantially all loci or all loci in a compartment are analysed with the method of the invention, e.g., by sequencing substantially all nucleic acids, preferably, all DNA, in the compartment.
  • the loci to be analysed may be determined in a biased way (e.g. by choosing to analyse all 23000 protein coding genes in a human cell, or all gene promoters or all non-coding regulatory regions, or all enhancers), or in an unbiased way, e.g. by dividing the genome into windows of a certain size, e.g., windows of 100 bp to 10 Mbp, preferably, 1 kbp to 1 Mbp, 5 kbp-50 kbp, or 10 kbp-30 kbp windows.
  • the method of the invention can be applied in a way which does not distinguish between different alleles (e.g. the two homologous copies of a gene present in a normal human cell), or, alternatively, it can be used to distinguish the two (or more, in the case of e.g. polyploid amphibian cells) alleles of a locus in the same cell.
  • the method of the invention allows for the detection of multiple co-segregating loci, in particular, more than two co-segregating loci, preferably, more than three, more than four, more than 8, or more than 20, co-segregating loci.
  • identification of multiple interactions using 3C-based methods has been attempted and shown to be both inefficient and highly biased (Sexton et al, 2012, Cell 148:458-72).
  • There is mathematical evidence showing that these experimental limitations of 3C-based methods will remain insurmountable, irrespective of incremental improvements (O'Sullivan J.M. et al., 2013, Nucleus 4:390-8).
  • restriction sites are not randomly distributed in the genome, leading to a bias in detection.
  • the efficiency of ligation is affected by the different length of DNA fragments, which adds further bias to 3C-based results.
  • the method of the invention is preferably not or not substantially affected by these biases.
  • step (b) no ligation occurs between nucleic acids originally present in the compartment, in particular, no ligation has to be performed prior to step (b).
  • ligation e.g., with external linkers is possible in the context of detection of the presence or absence of nucleic acid loci, e.g., for amplification or sequencing.
  • the avoidance of ligation of nucleic acids derived from the compartment with each other overcomes the structural bias of 3C-based methods.
  • GAM-ch is unique compared with competing technologies, as it can detect the multiplicity of loci interacting simultaneously, where there are more than three loci interacting at once (such detection being impossible or inefficient by ligation-based 3C methods), and it can also detect all loci present in the compartment and their copy number, irrespective of whether they are found to participate in an interaction, which allows important corrections to be made in the contact maps. It is also one of the advantages of the method of the invention that it can be used to identify spatial proximity of loci which were not known before the method was carried out, i.e., interactions can be identified between newly discovered or non-defined loci. The present invention also provides the use of the method of the invention for
  • the method of the invention may be used to determine specific interactions, and is capable of differentiating leading interactions from bystander interactions;
  • (b) mapping loci and/or genome architecture in the compartment.
  • a map, in particular, a matrix, can be drawn up for specific loci or the chromosomal architecture based on the co-segregation frequencies determined;
  • Chromosomal insertion of a nucleic acid due to gene therapy or other genetic engineering approaches may affect genome architecture, e.g., it may enhance or prevent interaction of regulatory regions with specific promoters and thus affect transcription of "unrelated" genes.
  • the expression pattern of the introduced nucleic acid may itself depend on, or be disrupted by, its interactions with endogenous regulatory regions.
  • the method of the invention allows for assessment of the effects of gene therapy or genetic engineering on the level of interaction between different loci;
  • mapping chromosomal rearrangements e.g., in cancer, including in specific sub-tissue cell populations, e.g. to study clonal evolution of rearrangements;
  • identifying a species in a mixture of species e.g., identifying a potentially novel microorganism species in a mixture of species
  • the method of the invention may be used in identification of species in microbial communities, e.g. as described for Hi-C in Burton et al. (2014, G3 4, 1339-1346).
  • step (a)(ii) specifically mapping contacts mediated by a defined factor (or molecule of interest, e.g., protein, RNA, DNA and/or their modifications), e.g., by extracting said factor and associated complexes of fragments after step (a)(ii) is carried out, e.g., by immunoprecipitation of the defined protein and associated complexes of fragments (step (a) (iii.O)).
  • Option (1) may be of specific interest, as it reduces the complexity of the sample.
  • the present invention thus also provides a method of diagnosing a disease associated with a disturbed co-segregation of loci in a patient, comprising, in a sample taken from said patient, analysing co-segregation of a plurality of loci in the patient, and comparing said co-segregation with co-segregation of said loci in a subject already diagnosed with said disease, wherein the co-segregation is preferably also compared with co-segregation in a healthy subject.
  • co-segregation of loci may be compared between specific sub-groups of cells, which may be derived from the same patient, e.g., tumor cells and normal tissue.
  • Co-segregation can also be analysed in different cell types upon derivation of pluripotent stem cells from the patient, or model organism, and their experimental differentiation into specific cell types through laboratory culture in appropriate conditions, e.g. in the presence of the appropriate factors, in a suitable container, at the appropriate temperature, e.g. 37°C for human samples.
  • "a" is meant to refer to "at least one", if not specifically mentioned otherwise.
  • Because the present invention may be used to investigate a disturbed co-segregation of loci in a patient, i.e., chromatin misfolding, it may also inform or guide the treatment of patients having a disease associated with chromatin misfolding, since such patients may, after diagnosis with a method of the invention, be treated to correct chromatin misfolding (Deng W., Blobel G., 2014, Curr Op Genet Dev. 25: 1-7). The present invention may then be used to monitor the effects of such treatments on chromatin misfolding.
  • Fig. 1 Limitations of current 3C-based methods due to dependency on ligation of DNA ends for capturing contacts between nucleic acids.
  • In 3C-based methods, the presence of multiple loci in a single interaction may dilute the measured ligation frequency between any two loci that are members of the interaction.
  • In GAM-ch, the measured interaction is not affected by multiplicity.
  • Fig. 2 Outline of the GAM-ch method. Chromatin is prepared from mildly-fixed cells and randomly fragmented (1). Crosslinked chromatin is divided (aliquoted) across tubes to have <1 haploid genome equivalent per tube (2). The DNA content of each tube is determined to assess the co-segregation of genomic sequences across tubes (3). Co-segregation of genomic sequences reflects chromatin contacts of genomic sequences in the cell nucleus dependent on protein-protein and protein-RNA bridged interactions and is used to measure long-range chromatin interactions.
  • A Schematic presentation of the mouse β-globin gene cluster (adapted from Tolhuis, B. et al. (2002). Looping and interaction between hypersensitive sites in the active beta-globin locus. Molecular Cell 10, 1453). Arrows and circles depict the individual hypersensitive sites.
  • the β-globin genes are indicated by triangles, with active genes (βmaj and βmin) in grey and inactive genes (εy and βh1) in black.
  • the olfactory receptor (OR) genes are indicated by white boxes, of which some were shown to interact with the β-globin gene cluster. Grey boxes also indicate other gene loci (3' olfactory receptor genes, Uros and Eraf), which were shown to interact with the β-globin gene cluster in embryonic liver tissue.
  • LCR Locus Control Region.
  • B A hypothetical 3D model of the active chromatin hub (ACH) based on population-based 3C data from Tolhuis et al. (2002). Neither the size of the ACH nor the actual position of the elements relative to each other is to scale. Hypersensitive sites and active genes of the locus form a hub of hyper-accessible chromatin (ACH). The inactive regions of the locus, having a more compact chromatin structure, are indicated in grey, with the inactive εy and βh1 genes in lighter grey. The olfactory genes are not shown. The interactions in the ACH would be dynamic in nature, in particular with the active genes (βmaj and βmin), which are alternately transcribed.
  • Crosslinking frequency with value 1 arbitrarily corresponds to the crosslinking frequency between two neighbouring control fragments within the Calreticulin (CALR) gene locus, which is expressed at similar levels in the two tissues.
  • a schematic illustration of the mouse β-globin gene cluster is depicted; the grey shading represents the position and size of fragments generated by HindIII restriction.
  • the quality of the chromatin preparation produced was validated by 3C at four regions of the murine β-globin gene cluster in fetal liver and brain cells. Fetal liver and brain cells from E14.5 mouse embryos were fixed (5 or 10 min) in 2% formaldehyde, digested with HindIII and ligated under highly diluted conditions. Ligation products were quantified by qPCR using the 3' end hypersensitive site (3'HS1) as bait. Means and SEM are shown. The black vertical line indicates the position and size of the 3C-bait fragment containing 3'HS1.
  • Crosslinking frequency with a value of 1 arbitrarily corresponds to the crosslinking frequency between two neighbouring control fragments (with analyzed restriction sites being 8.3 kb apart) within the Ercc3 gene locus (on chromosome 18), which is expressed at similar levels in fetal liver and brain. Black bars indicate the position of primer pairs used for 3C.
  • PK proteinase K
  • GAM-ch samples marked with an asterisk were used for library preparation.
  • WGA-amplified DNA was fragmented to ~400 bp using Covaris, and amplified using the Illumina library mate-pair kit. DNA fragments were excised (350-650 bp for ~0.2 and ~10 genomes, 200-650 bp for ~0.7 genomes), quantified and sequenced.
  • GAM-ch is also designated xGAM.
  • Fig. 6 Mapping of GAM-ch-seq datasets corresponding to ~0.2 and ~10 genomes in comparison with linear DNA.
  • Gaps are defined as regions which are not covered by reads.
  • the sequencing depth is calculated by dividing the genome into identical windows and counting the number of nucleotides covered by reads, which fall into each window.
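The two measures defined above, gap size and per-window sequencing depth, can be sketched as follows. Reads are represented as half-open (start, end) coordinate pairs on a single chromosome, which is an assumption of this sketch, as are the function names and the toy reads. Depth counts every sequenced nucleotide that falls in a window, so overlapping reads contribute more than once.

```python
def covered_intervals(reads):
    """Merge overlapping or touching reads into covered intervals."""
    merged = []
    for start, end in sorted(reads):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def gap_sizes(reads):
    """Sizes of uncovered regions between adjacent covered intervals."""
    merged = covered_intervals(reads)
    return [nxt[0] - cur[1] for cur, nxt in zip(merged, merged[1:])]


def depth_per_window(reads, chrom_length, window):
    """Sequenced nucleotides falling into each fixed-size window."""
    n_windows = -(-chrom_length // window)  # ceiling division
    depth = [0] * n_windows
    for start, end in reads:
        for w in range(start // window, (end - 1) // window + 1):
            lo = max(start, w * window)
            hi = min(end, (w + 1) * window)
            depth[w] += hi - lo
    return depth


# Hypothetical toy reads (36 nt and 72 nt) on a 3 kb chromosome.
reads = [(0, 36), (20, 92), (1500, 1572)]
gaps = gap_sizes(reads)
depth = depth_per_window(reads, chrom_length=3000, window=1000)
```

Here the first two reads overlap and form one covered area, leaving a single gap before the third read. The first window accumulates the nucleotides of both overlapping reads.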
  • Fig. 8 Gap-size (A) and sequencing depth (B) distributions for 10 ng of linear DNA and GAM-ch (xGAM) samples at ~0.2 and ~10 genomes.
  • X axes represent the gap-sizes and sequencing depth at 1 kb windows (bp) in log10 scale.
  • Y axes represent Kernel probability densities.
  • Graphs are plotted using the density function in R.
  • Fig. 9 Thresholds from Gaussian fitting to GAM-ch fractions with ~0.2 genomes.
  • the threshold is defined as the number of reads for which the height of the Gaussian fit ( ⁇ in dotted thick line) equals the height of the entire sequencing depth distribution (Ay in thin grey line).
  • X-axes represent the sequencing depth at 1 kb windows in the log10 scale.
  • Y-axes represent the Kernel probability densities.
  • Fig. 10 Number of "positive windows" detected by randomly sampling the original datasets of ~0.2 genomes (10 to 100%, 12 pM).
  • Erosion of reads from the GAM-ch ~0.2 genome dataset shows only a mild change in detected "positive windows" when randomly sampling ~60% of reads. Information is markedly lost when <30% of reads are considered.
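The read-erosion test described for Fig. 10 can be sketched as random subsampling followed by re-counting of positive windows. The read positions, the 4 kb window, the threshold of 3 reads and the function names are hypothetical values chosen for illustration, not those of the example.

```python
import random
from collections import Counter


def positive_windows(read_starts, window, threshold):
    """Count windows holding at least `threshold` reads."""
    counts = Counter(pos // window for pos in read_starts)
    return sum(1 for c in counts.values() if c >= threshold)


def erosion_curve(read_starts, window, threshold, fractions, seed=0):
    """Positive-window count after keeping a random fraction of reads."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    curve = {}
    for frac in fractions:
        k = round(len(read_starts) * frac)
        subset = rng.sample(read_starts, k)
        curve[frac] = positive_windows(subset, window, threshold)
    return curve


# Hypothetical toy data: 5 windows of 4 kb, 20 reads in each.
read_starts = [w * 4000 + i * 10 for w in range(5) for i in range(20)]
curve = erosion_curve(read_starts, window=4000, threshold=3,
                      fractions=[0.1, 0.3, 0.6, 1.0])
```

Because each subsample is a subset of the full read set, the number of positive windows can only stay equal or fall as reads are removed. Plotting the curve shows how much sequencing depth can be lost before detection degrades.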
  • the threshold used here for the detection of 4 kb windows is based on the residual analysis in Fig. 9.
  • Fig. 11 Outline of the GAM-ch method in combination with immunoprecipitation of chromatin bound by, e.g., RNA polymerase II.
  • Chromatin crosslinking and fragmentation e.g., chromatin is prepared from mildly fixed cells and randomly fragmented, e.g. by sonication (1).
  • fragment enrichment e.g., by immunoprecipitation of a specific chromatin-bound protein such as RNA polymerase II (2).
  • Division of the fragmented nucleic acids to obtain a collection of fractions (every fraction contains ⁇ 1 copy of every locus, typically ⁇ 0.5 copies).
  • crosslinked chromatin is either directly divided (aliquoted) across tubes to have <1 haploid genome equivalent per tube (3a), or (optionally) first enriched for chromatin occupied by a given protein (or other bound molecule of interest), e.g. by chromatin immunoprecipitation (3b).
  • Extract and detect nucleic acids, e.g., the DNA content of each tube is extracted and identified to assess the co-segregation of genomic sequences across tubes (4). Co-segregation of genomic sequences reflects chromatin contacts of genomic sequences in the cell nucleus dependent on protein-protein and protein-RNA bridged interactions and is used to measure long-range chromatin interactions. Boxes: Enhancers, thick black line: active gene, medium thick line: inactive gene, arrows: promoters.
  • RNA polymerase II occupies active gene promoters, coding regions and enhancers.
  • RNAPII-S5p ChIP-seq signal at promoters also called transcription start sites (TSS)
  • TSS transcription start sites
  • TES transcription end/termination sites
  • Transcriptionally silent genes are not occupied by RNAPII-S5p.
  • the average occupancy profiles are represented at ±5 kb windows centered at the transcription start site (TSS) or transcription end site (TES). All mouse genes were ranked by their expression levels determined by mRNA-seq in mouse ESCs (Brookes et al. 2012), then the top 25% of genes were selected as the most actively transcribed genes and the bottom 25% of genes were selected as the most transcriptionally silent.
  • RNA polymerase II co-associates with enhancers.
  • RNAPII-S5p is present at enhancers defined in murine ESCs according to Whyte et al. (2013). Background levels of ChIP signal were determined by a control ChIP experiment using a non-specific antibody against the plant steroid digoxigenin.
  • RNAPII-S5p occupancy determined by ChIP combined with quantitative PCR at active, Polycomb-repressed and inactive genes. Quantitative PCR confirms the expected enrichment of RNAPII-S5p at active (Oct4) and Polycomb-repressed (Nkx2.2, HoxA7) genes, and its absence at an inactive (Myf5) gene (Stock et al. 2007). Background levels (mean enrichment after ChIP with a non-specific antibody against the plant steroid digoxigenin) at promoter and coding regions are shown in black bars. Means and standard deviations from three biological replicates are shown.
  • ChIP-enriched positive windows for different starting amounts of chromatin immunoprecipitated DNA.
  • the percentage of positive windows for GAM-chIP dataset is higher for GAM-chIP samples with larger amounts of input DNA.
  • ChIP-enriched positive windows were determined by the number of reads in each 5 kb window from published ChIP-seq RNAPII-S5p data obtained in mESC (Brookes et al. 2012). The top 2% of 5 kb windows were taken as the genomic windows most enriched for RNAPII-S5p.
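Selecting the most enriched windows by read count can be sketched as a simple ranking. The window identifiers, counts and the function name `top_fraction` are hypothetical; how ties at the cut-off were handled in the example is not stated, so this sketch simply keeps the first windows in ranked order.

```python
def top_fraction(window_counts, fraction):
    """Return the set of windows in the top `fraction` by read count.

    window_counts maps a window identifier to its read count.
    """
    n_keep = max(1, int(len(window_counts) * fraction))
    ranked = sorted(window_counts, key=window_counts.get, reverse=True)
    return set(ranked[:n_keep])


# Hypothetical toy data: 100 windows whose read count equals their index.
window_counts = {index: index for index in range(100)}
enriched = top_fraction(window_counts, 0.02)
```

The same function with a fraction of 0.25 would give a top quartile, as used above for ranking genes by expression level.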
  • Fig. 14 GAM-chIP raw data and detection of positive genomic windows.
  • GAM-chIP profiles of raw sequencing data across two genes show that more positive windows are detected across an actively transcribed gene than an inactive gene. Represented tracks from top to bottom: 1 - RNAPII-S5p ChIP-seq in mESC; 2 - cumulative window detection frequency across 182 GAM-chIP datasets; 3-7 - raw sequencing data for five randomly chosen GAM-chIP datasets together with representation of positive windows defined by fitting binomial distributions (black horizontal bars) or by JAMM peak-finder approach (striped horizontal bars); 8 - raw sequencing data for a control sample containing no chromatin immunoprecipitated material (water control). Images were obtained from UCSC Genome Browser using mean as windowing function. Schematic representation of the genes present in the selected regions is shown underneath.
  • Fig. 15 Quality controls of GAM-chIP dataset.
  • Each GAM-chIP sample contains only a restricted subset of sequences from each chromosome. Each mouse chromosome was divided into 5 kb windows, and the percentage of positive 5kb windows was plotted for each chromosome and for each GAM-chIP sample. No GAM-chIP sample contains more than 12% of any given chromosome, and all chromosomes are comparable in coverage except for chromosome X, which is present in only a single copy (whereas autosomal chromosomes are present in two copies), as expected in the male ESC line used.
  • RNAPII-S5p occupancy in ChIP-seq datasets (from published data; Brookes et al. 2012).
  • the TSS-overlapping 5 kb windows with the least binding of RNAPII-S5p are detected in 4.4% of GAM-chIP samples on average, whereas those with the most abundant binding are detected in an average of 12.5% of GAM-chIP samples.
  • the percentage of 5 kb positive windows overlapping transcriptionally active genes and enhancers is higher than the percentage of 5 kb positive windows overlapping transcriptionally silent genes.
  • the percentage of positive windows is shown for gene body (gene), promoters (transcription start site, TSS) and transcription end site (TES).
  • the sets of most actively transcribed and of most silent genes were chosen based on their expression levels, as determined by mRNA-seq in a published dataset (Brookes et al. 2012). Positive 5 kb windows overlap gene promoters with high RNAPII-S5p levels (as determined from published ChIP-seq dataset; Brookes et al. 2012) more often than gene promoters with low RNAPII-S5p levels.
  • Fig. 16 Co-segregation of genomic windows within actively transcribed genes in GAM-chIP samples. GAM-chIP samples containing multiple positive windows from the same actively transcribed gene occur more frequently than GAM-chIP samples containing multiple positive windows from the same silent genes and more often than would be expected by chance, confirming that chromatin contacts can be formed within actively transcribed genes during transcription (as schematized in Fig. 11).
  • Fig. 17 Co-segregation of genic regions of actively transcribed genes coincides with preferential co-segregation of nearby enhancers in GAM-chIP samples.
  • For active (but not for silent) genes, the nearest enhancer was more frequently observed in the GAM-chIP samples with the highest number of positively detected intragenic windows.
  • co-segregation of nearby enhancers in the same GAM-chIP samples as actively transcribed genes is indicative of a chromatin interaction between the enhancer and gene during transcription.
  • HAPPY Mapping is based on the co-segregation and detection of nearby DNA markers in the genome and uses limiting dilutions of fragmented DNA to single molecule contents.
  • LOD logarithm of the odds
  • GAM-ch applies the basic principle of HAPPY Mapping to a different purpose: instead of measuring linear genomic distances, it measures long-range chromatin interactions between any genomic regions within the three-dimensional cell.
  • Cells are first treated with a crosslinking agent which, for example, chemically crosslinks proximal genomic regions in the same or different chromosomes, before chromatin fractionation.
  • GAM-ch detects chromatin proximity but does not require ligation of crosslinked DNA fragments.
  • In GAM-ch, chromatin preparations similar to those used for 3C are prepared and diluted as for HAPPY Mapping, before quantification of co-segregation frequency; genomic regions that are bridged by proteins and crosslinked during the chromatin preparation will co-segregate more frequently than genomic regions that do not interact (Fig. 2).
  • GAM-ch can provide single allele information about multiplicity of interactions, i.e. multiple genomic regions interacting at the same time with a given allele.
  • In 3C, a given DNA fragment in a high multiplicity chromatin interaction can only ligate with one or two (at high restriction and ligation efficiency) other DNA fragments.
  • This limitation of 3C makes it difficult to distinguish, for example, between a low-frequency chromatin interaction involving only two fragments and an interaction that involves many genomic partners at high frequency across the cell population (Fig. 1).
  • the same 3C signal, e.g. a measured contact of 50%, can be due to an interaction that occurs for half the alleles in the cell population if the multiplicity of interaction is only two (or possibly three), or be due to an interaction that occurs in all alleles (real contact frequency is 100%) but is underestimated to only 50% due to competition with other bound DNA fragments that co-bind at high multiplicity, thereby diluting the probability of ligation between any single fragment and all others.
  • each GAM fraction was subjected first to WGA fragmentation, primer ligation and PCR amplification. WGA-amplified GAM-ch samples were then further amplified using the Illumina library preparation, which adds new sets of primers at each end of the DNA fragments. GAM-ch-seq samples were sequenced using the Illumina sequencing platform (Table 1). As recent 3C-based genome-wide mapping approaches use HindIII digestion, instead of sonication, this approach was also adopted here. Validation of HindIII-digested chromatin preparations was performed by 3C analyses (Fig. 3D). Linear DNA was used in parallel to test the effects of WGA and high-throughput sequencing on sequence representation, and as a positive control.
  • GAM-ch samples were prepared for Illumina sequencing as described for 3C and validated by 3C-qPCR using published primer sequences (Fig. 3D). Nuclei from fetal liver cells, fixed for 5 min, were extracted, counted using a haemocytometer, subjected to digestion with HindIII (digestion efficiency of ~77%), and aliquots of ~100 genomes/µL were prepared and frozen for further use. Different genome numbers of 3C-like chromatin were first subjected to WGA fragmentation (1 h at 50°C with PK and 4 min at 99°C) and amplification (~0.2, ~0.7 and ~10 genomes/tube; Fig. 4B). Linear human DNA (2 ng; provided with the WGA kit) was used as a positive control for the WGA reaction.
  • Fragment sizes of crosslinked chromatin range from ~0.3-2 kb, whereas linear DNA is less fragmented upon WGA, probably due to lower-sized DNA fragments present in HindIII-digested chromatin (average distance between HindIII restriction sites is ~4 kb in the mouse genome).
  • GAM-ch samples of ~0.2 genomes did not show visible products on ethidium bromide gels after WGA amplification (Fig. 4B), but yielded visible products upon preparation of the sequencing library (Fig. 6).
  • GAM-ch samples were subjected to Illumina library preparation and DNA fragments were size-selected (350-650 bp for ~0.2 and ~10 genomes, 200-650 bp for ~0.7 genomes) and sequenced. Since the ~0.7 genome GAM-ch sample showed less-intense WGA products, DNA fragments from a wider size range were excised and sequenced. Linear mouse DNA was also amplified by WGA and Illumina library kits (not shown) and sequenced in parallel.
  • Each unmappable read is trimmed at its 5' end by 36 nts and mapped back to the genome. For the remaining reads that still do not align, 36 nts are instead trimmed at the 3' end of the read, and the resulting 36 nt read is realigned to the genome.
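The two-stage trimming rescue can be sketched as below. The aligner is replaced by a stand-in callable (`aligns`) that only checks for an exact substring match in a toy reference; a real pipeline would call a read mapper instead. It is assumed here that the 3' trim is applied to the original read rather than to the already 5'-trimmed read, since only that yields a 36 nt read from a 72 nt one. All names and sequences are hypothetical, and a trim length of 4 is used so the toy reads stay short.

```python
def rescue_alignment(read, aligns, trim=36):
    """Try the full read, then a 5'-trimmed copy, then a 3'-trimmed copy.

    aligns: callable taking a sequence and returning True if it maps.
    Returns a (label, sequence) tuple; sequence is None if nothing maps.
    """
    if aligns(read):
        return ("full", read)
    five_trimmed = read[trim:]  # drop the first `trim` nucleotides
    if five_trimmed and aligns(five_trimmed):
        return ("5prime_trimmed", five_trimmed)
    three_trimmed = read[:len(read) - trim] if len(read) > trim else ""
    if three_trimmed and aligns(three_trimmed):
        return ("3prime_trimmed", three_trimmed)
    return ("unmapped", None)


# Stand-in aligner: exact substring match against a toy reference.
REFERENCE = "AAAACCCCGGGGTTTT"


def toy_aligns(seq):
    return seq in REFERENCE


results = [
    rescue_alignment(r, toy_aligns, trim=4)
    for r in ["CCCCGGGG", "TATACCCC", "GGGGTATA", "TATATATA"]
]
```

The four toy reads exercise each branch in turn: a read that maps in full, one rescued after 5' trimming, one rescued after 3' trimming, and one that stays unmapped.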
  • This trimming strategy increased the overall percentage of alignment to ~54 ± 6% (Fig. 5B). This trimming pipeline is not necessary for libraries produced using Illumina Nextera library kits, as the library production relies on tagmentation.
  • GAM-ch libraries obtained from ~0.2 genomes show a more clustered distribution of sequencing reads with higher enrichment, as expected due to lower genomic content. This is consistent with a lower diversity of DNA fragments in the ~0.2 genome libraries. The higher enrichment suggests that the amount of sequence obtained may already be sufficient to over-represent this diversity.
  • the first step in the analysis of GAM-ch samples is to detect DNA fragments that are present or absent in each GAM-ch sample analysed with subgenomic content. This requires the definition of background read distribution, and a decision about an appropriate window size.
  • the window size should reflect the average size of the DNA fragments present in 3C-like chromatin. For HindIII restriction, this corresponds to ~4 kb fragments.
  • Two different statistical approaches were performed to analyse and to compare sequencing results from multiple libraries. First, the distribution of the gap-size between adjacent covered areas of the genome was analysed, and second the sequencing depth at different window sizes was studied (Fig. 7). Both approaches were used to analyse the sequencing results from linear DNA and GAM-ch samples (Fig. 8).
  • the content of GAM-ch samples with ~10 genomes also shows an even distribution across the genome, meaning that the whole genome is covered, which suggests that DNA extraction from 3C-like chromatin is efficient.
  • the average gap size peaks at ~1 kb (Fig. 8A) and displays a second population of gap-sizes of ~100 bp. This may reflect the fact that not all genomic regions are represented in this low DNA content sample; it can be the result of interacting DNA sequences within short range distances (as seen in 4C results) being frequently brought together due to crosslinking; further sequencing experiments and analyses are currently ongoing to investigate the significance of the different gap distributions.
  • sequenced reads in the ~10 genomes sample are sequenced multiple times, such that each 1 kb window is covered by more reads, with the distribution of sequencing depths peaking at ~500 nts per 1 kb window (Fig. 8B). Since sequences are a mix of 36 and 72 nt reads, which will appear as single spikes representing a multiple unit of 36 nts in the sequencing depth distribution, each average read would contain about 50 nts ((36+72)/2). Therefore, each 1 kb window of the ~10 genomes GAM-ch sample would be covered by ~10 reads. In addition, many windows with <10 reads exist, which are visualized by the left spiky tail in the sequencing depth curve and are hardly distinguishable from the main population of windows with 10 reads.
  • GAM-ch samples with ~0.2 genomes contain only a fraction of the genome, as seen in wider gaps in the read distribution and a gap-size peaking at ~50 kb, with additional shoulders reflecting non-random spacing between DNA fragments; this is consistent with the presence of chromatin interactions in these few GAM-ch samples.
  • the less diverse set of fragments that is sequenced in the ~0.2 genomes sample is sequenced more frequently than fragments in the GAM-ch ~10 genomes sample, resulting in about 5000 sequenced nucleotides in each 1 kb window, corresponding to ~100 reads per 1 kb window.
  • the GAM-ch sample with ~10 genomes did not have enough sequencing depth to sufficiently resolve the signal from the noise distribution.
  • the threshold is 790 nts (~11 reads with 72 nts), of the same order of magnitude as that from the residual distribution approach.
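The conversions between sequenced nucleotides per window and approximate read counts quoted in this passage can be reproduced with a few lines. Note that the exact mean of 36 and 72 is 54 nts, which the text rounds to about 50; both values are shown. The function name is hypothetical.

```python
def reads_per_window(nucleotides, read_length):
    """Approximate number of reads that account for a nucleotide count."""
    return nucleotides / read_length


mean_read_length = (36 + 72) / 2  # exact mean of the two read lengths

# ~10 genomes sample: depth peak of about 500 nts per 1 kb window.
reads_10_genomes_exact = reads_per_window(500, mean_read_length)
reads_10_genomes_rounded = reads_per_window(500, 50)

# ~0.2 genomes sample: about 5000 nts per 1 kb window.
reads_02_genomes_rounded = reads_per_window(5000, 50)

# Threshold of 790 nts expressed in 72 nt reads.
threshold_reads = reads_per_window(790, 72)
```

With the rounded read length of 50 nts the figures come out at 10 and 100 reads per window, matching the text; using the exact mean of 54 nts gives slightly lower values, and the 790 nt threshold corresponds to just under 11 reads of 72 nts.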
  • GAM-chIP combining GAM-ch with immunoprecipitation to capture fragments co-occupied by RNA polymerase II phosphorylated on Serine 5.
  • the DNA fragments bound by a specific protein are selected from the bulk chromatin, e.g. by chromatin immunoprecipitation (ChIP), a strategy called GAM-chIP.
  • GAM-chIP is performed with an additional step in which crosslinked chromatin fragments, containing a given protein or protein post-translational modification, are first selected prior to their dilution between tubes, e.g. to enrich for fragments containing genes and regulatory regions (enhancers) (Fig. 11). Including this additional selection step has two advantages: first it allows for detection of chromatin contacts which are formed in the presence of the given protein or protein post-translational modification.
  • RNAPII-S5p DNA fragments bound by RNA polymerase II phosphorylated on the Serine-5 residue of the CTD, which we abbreviate to RNAPII-S5p.
  • RNAPII-S5p was chosen because it has high occupancy at active genes, especially at promoters, throughout coding regions and transcription termination sites, and enhancers (Fig. 12A,B). Combining GAM-ch with ChIP for RNAPII-S5p therefore has the potential of increasing the power of GAM-ch to detect contacts between enhancers and their target genes.
  • chromatin was crosslinked using formaldehyde and fragmented by sonication, then chromatin fragments bound by RNAPII-S5p were selected by immunoprecipitation using a specific antibody coupled to beads (CTD-4H8, Covance; according to Brookes et al 2012).
  • CTD-4H8, Covance a specific antibody coupled to beads
  • fragments resulting from ChIP were eluted from beads, and fractionated/diluted into a multitude of fractions and WGA amplified.
  • RNAPII-S5p bound DNA fragments ChIP of RNAPII-S5p bound DNA fragments was performed as described previously (Stock et al. 2007; Brookes et al. 2012).
  • Mouse embryonic stem cells (ESCs) were fixed in 1% formaldehyde for 10 min. Nuclei were then extracted, counted using a haemocytometer, and chromatin was extracted using sonication. Sonicated chromatin fragments bound by RNAPII-S5p were selected by immunoprecipitation.
  • RNAPII-S5p was validated using quantitative PCR of DNA fragments known to be bound by RNAPII-S5p in mouse ESCs, namely promoters and coding regions of active and Polycomb-repressed genes (Fig. 12C); inactive gene promoter and coding region were used as negative control.
  • a control ChIP experiment was performed with nonspecific antibody against plant steroid digoxigenin, which showed no DNA fragment enrichment, as expected (Stock et al. 2007; Brookes et al. 2012). This analysis demonstrated that the antibody immunoprecipitation step had successfully and efficiently selected RNAPII-S5p-bound chromatin fragments (Fig. 12C).
  • the immunoprecipitated chromatin material was divided (aliquoted) into multiple tubes at the chosen dilution factor based on the measured DNA concentration.
  • GAM-chIP samples show a fragment size distribution of ~100 bp to ~1200 bp following WGA amplification (Fig. 13A; slightly smaller than for GAM-ch samples prepared by HindIII digestion without chromatin immunoprecipitation, Fig. 4B).
  • the fragment size distributions and the amount of DNA after amplification were comparable between different samples prepared from the same concentration of input DNA (Fig. 13B).
  • GAM-chIP samples from the first two exploratory experiments were subjected to Illumina TruSeq Nano library preparation (Table 2).
  • GAM-chIP An exploratory GAM-chIP dataset was collected consisting of 182 GAM-chIP samples (Table 2. GAM-chIP Exp003), each generated from 1 pg of chromatin after ChIP for RNAPII-S5p, plus four positive controls containing 500 pg of the same chromatin, and four negative controls where no chromatin was added (water control).
  • GAM-chIP samples in this exploratory collection were WGA amplified and subjected to Illumina Nextera XT library preparation. DNA fragments from 300-500 bp were size-selected and sequenced.
  • the mouse genome was divided into 5 kb windows and the number of sequencing reads mapping to each window was calculated.
  • a two-curve fitting strategy was applied to distinguish signal from noise in GAM-chIP datasets.
  • the distribution of sequencing depth over 5 kb windows was fit with a negative binomial distribution (representing sequencing noise) and a lognormal distribution (representing true signal).
  • a threshold number of reads x was determined, where the probability of observing more than x "noise" reads mapping to a single genomic window was less than 0.001. Such a threshold was thus independently determined for each sample, and windows were scored as positive if the number of sequenced reads was greater than the determined threshold.
  • GAM-ch and GAM-chIP experiments have the greatest statistical power when the chance of a given tube containing a given locus of interest is ~0.5.
  • the loci of interest are those which are bound by the protein targeted for enrichment, which can be identified by sequencing the bulk immunoprecipitated chromatin (ChIP-seq) without dilution and WGA amplification.
  • RNAPII-S5p As an estimation of the complexity of the datasets produced in the second experiment (Exp002), we determined the number of sequencing reads mapping to each 5 kb window by ChIP-seq of RNAPII-S5p using a published ChIP-seq dataset obtained in mouse ESCs (Brookes et al. 2012). The top 2% of 5 kb windows were taken as the genomic windows "most enriched for RNAPII-S5p". The percentage of "RNAPII-S5p most enriched windows" identified as positive in each GAM-chIP sample was determined (Fig. 13C). The percentage of most enriched windows identified as positive in each GAM-chIP dataset was highest for GAM-chIP samples with larger amount of input DNA, but was 2-16%, i.e.
  • the exploratory GAM-chIP RNAPII-S5p dataset consisted of 182 samples containing 1 pg of ChIP DNA, four samples with 500 pg DNA (positive controls) and four samples without DNA (negative controls). Positive windows were identified for each of these 190 samples as outlined above for the other GAM-chIP datasets. Positive windows were examined in the UCSC Genome Browser and compared to the raw sequencing data, confirming that the window-calling approach was performing sensibly (Fig. 14). We confirmed that each GAM-chIP sample contained only a subset of 5 kb windows, whilst very few positive windows were identified for the negative control samples, in support of the feasibility of the approach.
  • the 182 GAM-chIP samples were collected in two batches, each of which was further divided into four pools for independent sequencing to achieve sufficient sequencing depth.
  • the first four pools were WGA amplified immediately after ChIP, whereas the second four pools were WGA amplified from the same ChIP material following storage at -20°C after the aliquoting step but before WGA amplification.
  • This collection of GAM-chIP samples gave a total of eight pools, each containing around 24 GAM-chIP samples.
  • quality control of purity of the amplified material from very small amounts of mouse DNA fragments, i.e. 1 pg
  • the percentage of sequencing reads from each library that could be successfully mapped back to the mouse genome was plotted by library pool number (Fig. 15A).
  • the negative control samples yielded very low percentages of mapped reads to the mouse genome, indicating that they were not contaminated by mouse DNA (e.g. from the GAM-chIP samples processed in parallel) during the WGA amplification or library preparation steps.
  • Positive control samples (each with 500 pg of DNA) yielded the highest percentage of mapped reads (85% on average), whilst 178 out of 182 GAM-chIP libraries showed robust read mapping rates to the mouse genome of >70%.
  • the distribution of the percentage of mapped reads was highly reproducible between samples and between sequencing pools. In particular, pools 5 to 8 did not yield a smaller percentage of mapped reads than pools 1 to 4, indicating that they were not affected by the addition of the freezing step (Fig. 15A).
  • each GAM-chIP sample contains only a restricted subset of sequences from each chromosome (Fig. 15B).
  • No GAM-chIP sample contains more than 12% of any given chromosome, and all chromosomes are comparable in coverage except for chromosome X, which is present in only a single copy (whereas autosomal chromosomes are present in two copies), as expected in the male ESC line used.
  • RNAPII-S5p antibodies shows abundant detection of DNA fragments co-occupied by RNA polymerase II phosphorylated on Serine-5
  • RNAPII-S5p is most abundant at actively transcribed genes, and in particular at their promoters (Fig. 12A). To confirm that the promoters of genes more highly bound by RNAPII-S5p are also more frequently detected in GAM-chIP samples, 5 kb windows overlapping gene promoters were identified and sorted into five equal groups (quantiles) according to the occupancy of RNAPII-S5p (as determined by ChIP-seq, published dataset from Brookes et al. 2012; Fig. 15C). As expected, the detection frequency of 5 kb windows that overlap gene promoters (also called transcription start sites or TSSes) increases with increased chromatin occupancy of RNAPII-S5p.
  • TSSes transcription start sites
  • RNAPII-S5p The TSS-overlapping 5 kb windows with the lowest binding of RNAPII-S5p are detected in 4.4% of GAM-chIP samples on average, whereas those windows with the highest binding are detected in an average of 12.5% of GAM-chIP samples (Fig. 15C).
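The grouping of promoter windows by RNAPII-S5p occupancy can be sketched as follows. This is a minimal illustration, not the code used for Fig. 15C: it assumes a hypothetical boolean matrix `presence` (samples x windows) of positive-window calls and a hypothetical vector `occupancy` of ChIP-seq signal per window.

```python
import numpy as np


def detection_frequency(presence):
    """Fraction of samples in which each window was called positive."""
    return np.asarray(presence, dtype=float).mean(axis=0)


def mean_detection_by_quantile(presence, occupancy, n_groups=5):
    """Mean detection frequency of windows in each occupancy group.

    Windows are sorted by occupancy and split into n_groups groups of
    (nearly) equal size, from lowest to highest occupancy.
    """
    freqs = detection_frequency(presence)
    order = np.argsort(occupancy)
    groups = np.array_split(order, n_groups)
    return [float(freqs[idx].mean()) for idx in groups]


# Toy example: 2 samples, 10 windows; only the more highly occupied windows are detected.
presence = np.array([[0] * 10,
                     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]], dtype=bool)
occupancy = np.arange(10)
group_means = mean_detection_by_quantile(presence, occupancy)
```

In the toy example the mean detection frequency rises from the lowest to the highest occupancy group, which is the trend reported for the real data.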
  • Future experiments will include the use of larger DNA fragment amounts per sample, to reach detection of genomic windows most abundantly occupied by RNAPII-S5p closer to the optimal 0.5 frequency of detection of each fragment, which will provide optimal chromatin contact information from the least number of samples (as expected from linear HAPPY Mapping).
  • GAM-chIP One possible use for GAM-chIP is to identify enhancers regulating the expression of given genes.
  • RNAPII-S5p is expected to be found at transcriptionally expressed genes and enhancers but not transcriptionally silent genes (Fig. 12A,B), and was therefore chosen as a suitable target for the exploratory GAM-chIP experiment in order to increase the potential to identify interactions within and between enhancers and active genes.
  • the use of different proteins for immunoprecipitation may yield optimal co-segregation of promoters and their target enhancers.
  • mouse genes were ranked according to their expression level, as determined by mRNA-seq. The top 25% of genes were selected as most actively transcribed genes, whilst the bottom 25% of genes were selected as transcriptionally silent genes. 5 kb windows were identified that overlapped the gene body, transcription start site (TSS) or transcription end site (TES) of genes in the top or bottom 25% by expression.
  • TSS transcription start site
  • TES transcription end site
  • the percentage of 5 kb windows overlapping each feature that were identified as positive was plotted for each of the 182 GAM-chIP samples and compared to the percentage of all 5 kb windows or of 5 kb windows overlapping enhancers detected as positive in each sample (Fig. 15D).
  • 5 kb windows overlapping the gene body, TSS or TES of a silent gene were detected slightly less frequently than the average for all 5 kb windows.
  • chromatin contacts can form within the bodies of actively transcribed genes (Larkin, Cook & Papantonis, 2012). This means that distant regions within the same gene should be crosslinked both to each other and to RNAPII-S5p.
  • GAM-chIP identifies the presence or absence of genomic loci across a collection of tubes. If actively transcribed genes interact with themselves during transcription, some tubes will contain many chromatin fragments derived from the same gene, which were crosslinked to each other during the fixation step. Alternatively, if actively transcribed genes do not interact with themselves, a smaller number of tubes will contain multiple windows from the same gene by chance alone.
  • GAM-chIP detects co-association of actively transcribed genes with nearest candidate enhancer regions
  • Genomic windows overlapping enhancers should therefore co-segregate in the same GAM-chIP samples as the genomic windows overlapping their target genes. Furthermore, since different parts of each gene also contact themselves during transcription, GAM-chIP samples containing multiple positive windows from the same gene are the most likely to have originated from the gene during its transcription cycle and therefore likely to additionally co-segregate with the enhancer.
  • GAM-chIP samples For each gene, we ordered the GAM-chIP samples according to the proportion of intragenic windows detected. GAM-chIP samples which contain many positive windows from the same active gene often also contain a nearby enhancer, whereas GAM-chIP samples containing few positive windows from the same gene are often less likely to additionally contain the enhancer (Fig. 17A). In contrast, this behaviour is not expected for silent genes, since these genes are not expected to contact nearby regions classified as enhancers in mouse ESCs. For silent genes, the detection of a nearby enhancer is often uncorrelated with the detection of the gene itself (Fig. 17B). With a larger collection of GAM-chIP samples each produced from fragment frequencies closer to 0.5, it should be possible to assign enhancers to their target genes based on the correlation of detection of the enhancer with detection of the gene across the collection of samples.
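One way such a gene-enhancer correlation could be computed is sketched below. This is an assumption-laden illustration rather than the analysis behind Fig. 17: the Pearson correlation is a swapped-in choice of statistic, and `gene_presence` (samples x intragenic windows) and `enhancer_presence` (one value per sample) are hypothetical inputs.

```python
import numpy as np


def intragenic_fraction(gene_presence):
    """Proportion of a gene's windows detected in each sample."""
    return np.asarray(gene_presence, dtype=float).mean(axis=1)


def enhancer_gene_correlation(gene_presence, enhancer_presence):
    """Pearson correlation between gene detection and enhancer detection across samples.

    Returns NaN when either quantity is constant across samples, because the
    correlation is undefined in that case.
    """
    fraction = intragenic_fraction(gene_presence)
    enhancer = np.asarray(enhancer_presence, dtype=float)
    if fraction.std() == 0 or enhancer.std() == 0:
        return float("nan")
    return float(np.corrcoef(fraction, enhancer)[0, 1])


# Toy example: the enhancer is found only in samples that contain much of the gene.
gene_presence = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 0]]
enhancer_presence = [1, 1, 0, 0]
correlation = enhancer_gene_correlation(gene_presence, enhancer_presence)
```

A value close to 1 would suggest co-segregation of the enhancer with the gene, whereas a value near 0 is what the text describes for silent genes.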
  • GAM-ch samples with ~0.2 and 10 genomes were subjected to WGA and detected by next-generation sequencing.
  • the sequencing profile of the GAM-ch-0.2 sample has distinct islands across the genome whereas linear DNA at high concentration is evenly distributed (Fig. 6).
  • the sequencing profile of ~0.2 genomes suggests that only a sub-fraction of the genome is captured, which is then frequently sequenced, as expected (Fig. 8B).
  • the threshold of signal detection of positive windows above background was 13 reads (~940 nts) for 4 kb windows, resulting in 45×10³-50×10³ windows of 4 kb passing the threshold (Fig. 10).
  • 45×10³-50×10³ windows of 4 kb correspond to a total of 1.8×10⁸-2×10⁸ nts (out of 2.6×10⁹ bp in the total mouse genome including repetitive sequences). If ~0.2 genomes are dispensed across tubes, each molecule has a probability of 0.18 to be present in each tube assuming a Poisson distribution, which would correspond to ~4.7×10⁸ bp.
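The Poisson estimate above can be reproduced with a short calculation. The sketch assumes that the number of copies of a locus per tube is Poisson distributed with mean 0.2, so the probability of a locus being present is 1 - exp(-0.2); the constant and function names are illustrative.

```python
import math

GENOME_SIZE_BP = 2.6e9        # mouse genome size quoted in the text
MEAN_GENOMES_PER_TUBE = 0.2   # approximate genome equivalents per tube
WINDOW_SIZE_BP = 4_000


def poisson_presence_probability(mean_copies: float) -> float:
    """P(at least one copy) for a Poisson-distributed copy number."""
    return 1.0 - math.exp(-mean_copies)


p_present = poisson_presence_probability(MEAN_GENOMES_PER_TUBE)  # about 0.18
expected_bp = p_present * GENOME_SIZE_BP                          # about 4.7e8 bp

# Sequence covered by the 45,000 to 50,000 windows of 4 kb that passed the threshold.
observed_bp_range = (45e3 * WINDOW_SIZE_BP, 50e3 * WINDOW_SIZE_BP)
```

Comparing `observed_bp_range` (1.8×10⁸ to 2×10⁸ bp) with `expected_bp` shows that the windows passing the threshold cover somewhat less than half of the sequence expected under this simple model.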
  • Identifying contacts between active genes and their regulatory regions is a major current challenge, especially as there is evidence for complex interactions between clustered enhancers and their target genes (Fig. 11 A).
  • 3C-based technologies underestimate contacting partners of most complex interactions (i.e. interactions involving three or more fragments; O'Sullivan et al. 2013; Fig. 1).
  • FISH in interphase nuclei is limited by sensitivity of detection which requires that probes cover several kilobase pairs of genomic sequence, and by spatial resolution, which is limited to detect interactions between genomic sequences separated by several tens of kilobase pairs.
  • Novel ligation-free technologies should help detect enhancers that participate in the most complex interactions (Fig. 11B).
  • GAM-chIP after RNAPII-S5p ChIP can be performed reliably for different amounts of DNA, especially for 1 pg of DNA yielding GAM-chIP libraries with low complexity (2-10% of detection of 5 kb genomic windows; Fig. 13, 14, 15).
  • the GAM-chIP libraries produced were enriched for genomic windows containing active genes, including windows covering the gene promoters (TSS) and the gene termination sites (TES) (Fig. 15C,D). 5kb genomic windows containing candidate enhancers were also more likely to be detected in the pool of positive windows in each GAM-chIP dataset (Fig. 15D), consistent with the presence of RNAPII-S5p at these regulatory regions.
  • Murine fetal liver and fetal brain were dissected from E14.5 wildtype mouse embryos as described previously (Hagege et al. 2007) and processed in parallel for 3C and GAM-ch. The quality of the resulting 'chromatin' preparation was determined using a chromosome conformation capture (3C)-qPCR assay, performed as previously described (Hagege et al. 2007), on the mouse β-globin gene cluster as a reference locus.
  • 3C chromosome conformation capture
  • Mouse fetal liver and brain tissue from 14.5 dpc embryos were dissected and processed into a single cell suspension as previously described (Hagege et al. 2007), resulting in a single-cell sample containing approximately 2×10⁷ cells/mL in 10% (v/v) heat inactivated fetal calf serum in PBS.
  • Cells were fixed by addition of 2% formaldehyde/10% FCS/PBS and incubated for 5 or 10 min at room temperature. The crosslinking reaction was then quenched by addition of 1 M glycine solution to give 0.14 M final concentration.
  • restriction enzyme buffer 500 µL; NEB2 buffer
  • 7.5 µL of 20% (w/v) SDS solution was added to a final concentration of 0.3%, and incubated (1 h shaking at 900 rpm) to increase chromatin accessibility for restriction enzyme digestion.
  • 50 µL of 20% Triton X-100 solution were added (2% final concentration) and incubated at 37°C (1 h shaking) to sequester SDS.
  • HindIII 400 units; BioLab
  • digestion was performed overnight (37°C, shaking) followed by addition of 40 µL of 20% SDS solution (1.6% final concentration) and incubation at 65°C (20 min) to inactivate HindIII.
  • Aliquots of undigested and digested chromatin were taken for subsequent analysis of digestion efficiency.
  • the digested nuclei were transferred to a 50 mL Falcon tube and diluted in 6.125 mL of ligation buffer (66 mM Tris-HCl, pH 7.5; 5 mM DTT; 5 mM MgCl2; 1 mM ATP). After addition of 375 µL of 20% (v/v) Triton X-100 solution (1% final concentration), nuclei were incubated (1 h shaking at 37°C). T4 DNA ligase (Promega) was added (100 Units) and ligation was performed at 16°C for 4 h.
  • Reversal of crosslinks was performed by addition of 30 µL of 10 mg/mL proteinase K (300 µg total; Sigma) and incubation at 65°C overnight followed by RNase incubation (300 µg total; Roche) at 37°C (1 h), and by phenol-chloroform extraction and ethanol precipitation (Sigma).
  • the 3C material was desalted using Micro Bio-Spin P-30 chromatography columns (BioRad) before qPCR. Each qPCR reaction was performed with ~120 ng of 3C material.
  • Quantitative real-time PCR (MJ MiniOpticon, BioRad) was performed with Platinum Taq DNA Polymerase (Invitrogen) and double-dye oligonucleotides (5'FAM + 3'TAMRA) as TaqMan probes, using the following concentrations: 0.1 µL Taq polymerase from kit; 2.5 µL 10x Taq buffer from kit; 0.75 µL MgCl2 (final 1.5 mM) from kit; 0.5 µL (final 200 µM); 0.25 µL of each primer (from stock solution of 0.29 µg/µL); 0.025 µL Taq-probe (final 2.5 pmol); 1-2 µL DNA template and adjusting to 25 µL with H2O.
  • a real-time qPCR (95°C for 10 min, 40 cycles with 95°C for 30 seconds, 58°C for 15 seconds and 72°C for 15 seconds) with SYBR Green was performed with the undigested (UND) and digested (D) samples using 2x PCR mix (Promega) on the MJ MiniOpticon PCR engine (BioRad).
  • primer sets that amplify across each restriction site of interest (R) were used.
  • internal primers (C) not containing a restriction site were used.
  • Preparation of crosslinked nuclei from mouse fetal liver cells for GAM-ch is similar to that for 3C. Briefly, fetal liver cells were resuspended in 2% formaldehyde/10% FCS/PBS and the reaction was quenched with glycine after 10 min. Fixed cells were lysed in cold lysis buffer, and nuclei were spun as for 3C (as described above).
  • sonication buffer 50 mM HEPES pH 7.9, 140 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% Na-deoxycholate, 0.1% SDS
  • Nuclei were sonicated in 2.5 mL aliquots using a Bioruptor (Diagenode) for 30 min at 30 s on/off intervals at medium energy.
  • mouse fetal liver cells were embedded into DNA agarose strings at a density of ~1×10⁷ cells/mL (~2×10⁵ genomes/cm; prepared according to Dear P.H. et al. 1998. A high-resolution metric HAPPY map of human chromosome 14. Genomics 48:232). Agarose strings of distinct length were melted in 0.5x PCR buffer II (68°C, 10 min) and DNA was diluted in molecular biology-grade H2O (Sigma) into aliquots of ~100 genomes/µL and stored at -20°C.
  • ESCs Mouse embryonic stem cells (ESCs; 46C cell line, male) were grown in ESGRO medium (Merck, SF001-500P) supplemented by 1000 units/ml LIF (Merck), and chromatin prepared as previously described (Stock et al, 2007). Briefly, cells were treated with 1% formaldehyde (37°C, 10 min) and the reaction stopped with addition of glycine to a final concentration of 0.125 M. Cells were washed in ice-cold PBS, before "swelling" buffer (25 mM HEPES pH 7.9, 1.5 mM MgCl2, 10 mM KCl and 0.1% NP-40) was added to lyse the cells (10 min, 4°C).
  • ESGRO medium Merck, SF001-500P
  • LIF Merck
  • chromatin prepared as previously described (Stock et al, 2007). Briefly, cells were treated with 1% formaldehyde (37°C, 10 min) and the reaction stopped
  • Protein-G-magnetic beads were first incubated with rabbit anti-mouse (IgG+IgM) bridging antibodies (Jackson Immunoresearch; 10 µg per 50 µL beads) for 1 h at 4°C and washed with sonication buffer. Seven hundred µg of chromatin was immunoprecipitated (4°C, overnight) with 10 µg of RNAPII-S5p antibody (clone CTD-4H8, Covance) and 50 µL magnetic beads. ChIP washes and elutions after immunoprecipitation were performed as described previously (Stock et al, 2007).
  • crosslinked DNA-protein complexes were eluted twice from beads (65°C, 5 min; and room temperature, 15 min) with 50 mM Tris-HCl pH 8.0, 1 mM EDTA and 1% SDS.
  • Half of the eluted immunoprecipitated chromatin was diluted into multiple tubes (based on the measured DNA concentration in the other half of eluted chromatin).
  • To measure DNA concentration, half of the eluted chromatin was incubated overnight at 65°C with addition of NaCl (160 mM final concentration) and RNase A (20 µg/ml; Sigma) to reverse cross-linking.
  • Oct4 promoter F GGCTCTCCAGAGGATGGCTGAG (SEQ ID NO: 1)
  • Oct4 promoter R TCGGATGCCCCATCGCA (SEQ ID NO: 2)
  • Nkx2.2 promoter F CAGGTTCGTGAGTGGAGCCC (SEQ ID NO: 5)
  • Nkx2.2 promoter R GCGCGGCCTCAGTTTGTAAC (SEQ ID NO: 6)
  • HoxA7 promoter R CCGACAACCTCATACCTATTCCTG (SEQ ID NO: 10)
  • Illumina libraries were prepared for HT sequencing from WGA-amplified GAM-ch DNA.
  • WGA-amplified GAM-ch samples were fragmented using a Covaris shearing system before library preparation.
  • Illumina libraries were size selected on agarose gels, enabling visualisation of the amplified DNA fragments, and therefore more careful extraction of appropriate sized fragments.
  • QIAgen Gel Extraction kit libraries were quantified by QuBit (Invitrogen) and qPCR, and library size was analysed by Bioanalyser (Agilent). Fragment sizes were within the expected size distribution of 210-600 bp (including adapters) for all libraries.
  • RNAPII-S5p Chromatin precipitated with antibodies against RNAPII-S5p was quantified fluorimetrically with PicoGreen (Molecular Probes, Invitrogen) and diluted into multiple tubes (see Table 2 for amounts).
  • DNA was extracted by WGA, first by incubation in WGA fragmentation buffer containing PK for 2 h (Exp.001 and Exp.002) or 8 h (Exp.003); subsequent steps were carried out according to the manufacturer's specifications.
  • Amplified DNA was purified with MinElute 96 UF PCR Purification Kit (Qiagen) according to manufacturer's instructions.
  • DNA fragments from 300-500 bp were size-selected with Agencourt AMPure XP (Beckman Coulter) and the final DNA concentration was determined by PicoGreen fluorimetry (Molecular Probes, Invitrogen) and subjected to Illumina TruSeq Nano library preparation (GAM-chIP Exp001, GAM-chIP Exp002; Table 2) or to Illumina Nextera XT library preparation (GAM-chIP Exp003; Table 2).
  • GAM-ch libraries (4-12 pM) were loaded onto the Genome Analyser flow cell.
  • the single-stranded DNA fragments bind randomly across the surface of the flow cell due to hybridisation between the adaptor sequences added to DNA ends during library preparation, and the oligonucleotides that coat the flow cell.
  • Polymerase-based extension converts each fragment to a cluster of approximately 1000 identical fragments.
  • the amount and size of DNA fragments loaded on to the flow cell was optimised to obtain the highest number of non-overlapping clusters following cluster generation.
  • Clusters were then sequenced by synthesis, using adaptor-specific primers and incorporation of fluorescent nucleotides. Digital images were taken at each round of nucleotide incorporation and the unique fluorescent signal assigned to each nucleotide enables its correct identification. Sequential images of a given cluster therefore represent the fragment sequence.
  • GAM-chIP libraries sequenced on the HiSeq or MiSeq were not imaged for the first thirty sequencing cycles (known as dark cycles) in order to avoid issues relating to low sequence diversity in the WGA adaptor. This step avoids the need for trimming reads after sequencing used in earlier GAM-ch datasets (Fig. 5A).
  • DNA reads were first aligned to the reference mouse genome (assembly mm9) using Illumina Extended software (pipeline 1.6), allowing at most two mismatches and unique matches only. Un-aligned reads were then trimmed at their 5' or 3' end and aligned to the mm9 genome using Bowtie software, version 0.9.8.1 (Langmead B. et al. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10, R25).
  • DNA reads were first aligned to the reference mouse genome (assembly mm10) using Bowtie2 and enforcing a minimum mapping quality of 20. Read depth of coverage was calculated using bedtools multibamcov (Quinlan & Hall 2010, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:6). Curve fitting was performed in Python using the fmin function from scipy. A combination of two distributions was fitted to the histogram of the number of reads per window.
  • a negative binomial distribution represents sequencing noise, and the parameters of the fit for this distribution were used to determine a threshold number of reads X where the probability of observing more than X reads mapping to a single genomic window by chance was less than 0.001. Such a threshold was thus independently determined for each sample, and windows were scored as positive if the number of sequenced reads was greater than the determined threshold.
  • a lognormal distribution representing true signal
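The two-distribution fit and threshold described above might look roughly like the following sketch. It is not the original analysis code: the least-squares loss, the parameterisation, the penalty for out-of-range parameters, the starting values and the synthetic data are all assumptions made for illustration. Only the use of `fmin` from scipy, the negative binomial (noise) and lognormal (signal) components, and the 0.001 cutoff come from the text.

```python
import numpy as np
from scipy import optimize, stats


def mixture_density(x, params):
    """Weighted sum of a negative binomial (noise) and a lognormal (signal)."""
    nb_n, nb_p, ln_sigma, ln_scale, noise_weight = params
    x = np.asarray(x)
    noise = stats.nbinom.pmf(x, nb_n, nb_p)
    signal = np.zeros(len(x), dtype=float)
    positive_x = x > 0  # the lognormal density is zero at x = 0
    signal[positive_x] = stats.lognorm.pdf(x[positive_x], ln_sigma, scale=ln_scale)
    return noise_weight * noise + (1.0 - noise_weight) * signal


def fit_mixture(read_counts, initial_params):
    """Least-squares fit of the mixture to the normalised read-count histogram."""
    counts = np.asarray(read_counts, dtype=int)
    x = np.arange(counts.max() + 1)
    observed = np.bincount(counts, minlength=len(x)) / len(counts)

    def loss(params):
        nb_n, nb_p, ln_sigma, ln_scale, weight = params
        valid = (nb_n > 0 and 0 < nb_p < 1 and ln_sigma > 0
                 and ln_scale > 0 and 0 <= weight <= 1)
        if not valid:
            return 1e9  # penalty keeps the unconstrained simplex search in range
        return float(np.sum((observed - mixture_density(x, params)) ** 2))

    return optimize.fmin(loss, initial_params, disp=False)


def noise_threshold(nb_n, nb_p, alpha=0.001):
    """Smallest read count x with P(noise reads > x) < alpha."""
    x = int(stats.nbinom.isf(alpha, nb_n, nb_p))
    while stats.nbinom.sf(x, nb_n, nb_p) >= alpha:
        x += 1
    return x


# Synthetic example: many low-count noise windows plus a minority of signal windows.
rng = np.random.default_rng(0)
noise_counts = rng.negative_binomial(2, 0.5, size=9000)
signal_counts = np.rint(rng.lognormal(np.log(60), 0.4, size=1000)).astype(int)
read_counts = np.concatenate([noise_counts, signal_counts])

fitted = fit_mixture(read_counts, initial_params=[2.0, 0.5, 0.5, 50.0, 0.9])
threshold = noise_threshold(fitted[0], fitted[1])
positive = read_counts > threshold  # windows scored as positive
```

Mixing a discrete probability mass function with a continuous density is an approximation that is adequate for locating the noise component. Because the threshold depends only on the fitted negative binomial parameters, it is recomputed independently for each sample, as the text describes.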
  • positive windows were also called using JAMM (Ibrahim et al, 2015) in the peak mode with default settings.
  • ChIP-seq libraries for RNAPII-S5p and control (using non-specific antibody against plant steroid digoxigenin) were prepared from 10 ng of immunoprecipitated DNA (as measured by PicoGreen quantification) with corresponding antibodies using the NEBNext ChIP-Seq Library Prep Master Mix Set for Illumina (NEB, #E6240) following the NEB protocol, with some modifications.
  • the intermediate products from the different steps of the NEB protocol were purified using MinElute PCR purification kit (Qiagen, #28004).
  • Adaptors, PCR amplification primers and indexing primers were from the Multiplexing Sample Preparation Oligonucleotide Kit (Illumina, # PE-400-1001).
  • Samples were PCR amplified prior to size selection of DNA fragments (250-600 bp) on an agarose gel. After purification by QIAquick Gel Extraction kit (Qiagen, #28704), libraries were quantified by qPCR using Kapa Library Quantification Universal Kit (Kapa Biosystems, #KK4824). Library size distribution was assessed by 2100 Bioanalyzer (Agilent) with High Sensitivity DNA analysis Kit (Agilent, #5067-4626) before high-throughput sequencing. Libraries were quantified by Qubit and sequenced on Illumina HiSeq2000 (single-end sequencing, 51 nucleotides), according to the manufacturer's instructions.
  • Sequenced reads were aligned to the mouse genome (assembly mm10, December 2011) using Bowtie2 version 2.0.5 (Langmead and Salzberg, 2012), with default parameters. Duplicated reads (i.e. identical reads, aligned to the same genomic location) occurring more often than a threshold were removed. The threshold is computed for each dataset as the 95th percentile of the frequency distribution of reads.
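The percentile-based duplicate filter can be sketched as below. The text does not say whether over-represented reads are dropped entirely or capped at the threshold; this sketch assumes they are dropped, and it represents each aligned read by a hypothetical hashable key such as (chromosome, position, strand, sequence).

```python
from collections import Counter

import numpy as np


def duplicate_threshold(read_keys, percentile=95):
    """Percentile of the distribution of occurrence counts per distinct read."""
    occurrences = list(Counter(read_keys).values())
    return float(np.percentile(occurrences, percentile))


def filter_duplicates(read_keys, percentile=95):
    """Keep only reads whose occurrence count does not exceed the threshold.

    Assumption: reads occurring more often than the threshold are removed
    entirely rather than being capped at the threshold.
    """
    counts = Counter(read_keys)
    threshold = duplicate_threshold(read_keys, percentile)
    return [key for key in read_keys if counts[key] <= threshold]


# Toy example: 19 reads seen once and one read seen 50 times.
unique_reads = [("chr1", 1000 + i, "+", "ACGT") for i in range(19)]
amplified_read = [("chr2", 5000, "-", "TTGA")] * 50
kept = filter_duplicates(unique_reads + amplified_read)
```

In the toy example the 95th percentile of the occurrence counts falls between 1 and 50, so only the heavily duplicated read is removed.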
  • RNAPII-S5p and control ChIP enrichment at enhancers the list of enhancers from Whyte et al. 2013 was used.
  • TPM Transcripts per Million
  • Genes in the top 25% by expression were classified as active, whilst genes in the bottom 25% by expression were classified as silent.
  • paired-end (2×100 bp) reads from mRNA-seq were aligned against the mouse genome using STAR (Spliced Transcripts Alignment to a Reference, v2.4.2a; Dobin et al, 2013) and expression levels were estimated in TPM with RSEM (RNA-Seq by Expectation-Maximization, v1.2.25; Li and Dewey, 2011).
  • the reference for STAR and RSEM was produced from the Mouse Genome version mm10, providing the gtf annotation from UCSC Known Genes (mm10, version 6) and associated isoform-gene relationship information from the Known Isoforms table. Both tables were downloaded from the UCSC Table browser (http://genome.ucsc.edu/cgi-bin/hgTables).
  • the detection frequency of each window overlapped by the gene, ± one window upstream/downstream, was calculated as the number of GAM-chIP samples in which the window was detected divided by the total number of GAM-chIP samples. Since each window is detected with a different frequency, each window can be described by its own binomial distribution.
  • the expected distribution of the number of positive windows from the same gene detected simultaneously in a single GAM-chIP sample was calculated as the convolution of the binomial distributions for each component window.
  • the average expected number of positive windows per GAM-chIP sample was calculated as the sum of the window detection frequencies. For each gene, the number of tubes with more than double this average was counted and compared to the expected number of tubes with more than double the average. The distribution of observed vs. expected values was plotted and compared between active genes and silent genes.
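The expected distribution of positive windows per sample can be built by convolving one two-outcome distribution per window, as in the sketch below. It assumes windows are detected independently of each other under the null model; the function names and the toy inputs are illustrative and not taken from the original analysis.

```python
import numpy as np


def positive_window_count_pmf(detection_freqs):
    """PMF of the number of positive windows in one sample.

    Each window contributes a single-trial binomial with its own detection
    frequency; the PMF of the sum is the convolution of these distributions.
    """
    pmf = np.array([1.0])
    for p in detection_freqs:
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf


def expected_tubes_above_double_mean(detection_freqs, n_samples):
    """Expected number of samples with more than twice the mean positive count."""
    pmf = positive_window_count_pmf(detection_freqs)
    mean_count = float(np.sum(detection_freqs))
    counts = np.arange(len(pmf))
    return float(n_samples * pmf[counts > 2 * mean_count].sum())


def observed_tubes_above_double_mean(presence):
    """Observed number of samples with more than twice the mean positive count.

    presence is a samples x windows boolean matrix for the windows of one gene.
    """
    presence = np.asarray(presence, dtype=bool)
    mean_count = presence.mean(axis=0).sum()
    return int((presence.sum(axis=1) > 2 * mean_count).sum())


# Toy example: three windows, each detected in 10% of 100 samples.
expected = expected_tubes_above_double_mean([0.1, 0.1, 0.1], n_samples=100)
observed = observed_tubes_above_double_mean(
    [[1, 1, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]])
```

An excess of observed over expected tubes for a gene would be consistent with its windows being crosslinked to each other, which is the comparison made between active and silent genes.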
  • Enhancer loops appear stable during development and are associated with paused polymerase. Nature. 2014 Aug 7;512(7512):96-100.
  • Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) sequencing technology and application. BMC Genomics. 2014;15 Suppl 12:S11.
  • CTCF mediates long-range chromatin looping and local histone modification in the beta-globin locus.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Microbiology (AREA)
  • Molecular Biology (AREA)
  • Physics & Mathematics (AREA)
  • Immunology (AREA)
  • Biotechnology (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Pathology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention relates to the field of analysis of the three-dimensional structure of the genome, i.e., for genome architecture mapping on chromatin (GAM-ch). The invention provides a method of determining interaction of a plurality of nucleic acid loci in a compartment comprising nucleic acids, such as the cell nucleus, comprising separating nucleic acids from each other depending on their interaction in the compartment by crosslinking nucleic acids with each other directly or indirectly, fragmenting the nucleic acids of the compartment to obtain fragments and/or cross-linked complexes of fragments, and dividing the fragmented nucleic acids to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus; determining the presence or absence of the plurality of loci in said fractions; and determining the co-segregation of said plurality of loci in the fractions. Co-segregation may then be analysed with statistical methods to determine interactions. The method can be used e.g., for identifying the frequency of interactions across a cell population between a plurality of loci; and mapping loci and/or genome architecture, e.g., in the nucleus, an organelle, a microorganism or a virus; identification of regulatory regions directing expression of a specific gene through spatial contacts; identifying the spatial contacts between loci that depend on their co-association with specific protein(s) or RNA and/or diagnosing a disease associated with a disturbed co-segregation of loci. Chromatin immunoprecipitation (ChIP) can be combined with the method of the invention.

Description

Genome Architecture Mapping on Chromatin
The present invention relates to the field of analysis of the three-dimensional structure of the genome, i.e., for genome architecture mapping on chromatin (GAM-ch). The invention provides a method of determining interaction of a plurality of nucleic acid loci in a compartment comprising nucleic acids, such as the cell nucleus, comprising separating nucleic acids from each other depending on their interaction in the compartment by crosslinking nucleic acids with each other directly or indirectly, fragmenting the nucleic acids of the compartment to obtain fragments and/or cross-linked complexes of fragments, and dividing the fragmented nucleic acids to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus; determining the presence or absence of the plurality of loci in said fractions; and determining the co-segregation of said plurality of loci in the fractions. Co-segregation may then be analysed with statistical methods to determine interactions. The method can be used e.g., for identifying the frequency of interactions across a cell population between a plurality of loci; and mapping loci and/or genome architecture, e.g., in the nucleus, an organelle, a microorganism or a virus; identification of regulatory regions (enhancers) directing expression of a specific gene through spatial contacts; identifying the spatial contacts between loci that depend on their co-association with specific protein(s), or RNA, and/or diagnosing a disease associated with a disturbed co-segregation of loci. Chromatin immunoprecipitation (ChIP) can be combined with the method of the invention.
Several approaches have been taken to analyse the structure of the genome and chromatin interactions. Linear genomic distances are often analysed by sequencing, for example shotgun sequencing. Problems with localizing sequences in the genome, in particular, in the case of repetitive sequences, can be approached, e.g., by HAPPY mapping (Dear P.H., Cook P.R. (1989) Happy mapping: a proposal for linkage mapping the human genome. Nucleic Acids Res. 17, 6795), which measures linear genomic distances between loci based on the frequency of locus co-segregation after random fragmentation and dilution of genomic DNA.
Information about the three-dimensional structure of chromatin is also of high interest, in particular, to discover contacts between regulatory regions (e.g. enhancers) and gene promoters which may be disrupted in disease due to genetic mutations in the non-coding part of the genome (e.g. Uslu V.V. et al. 2014 Long-range enhancers regulating Myc expression are required for normal facial morphogenesis. Nature Genetics 46: 753). One example of chromosomal interactions influencing gene expression is a chromosomal region which can fold in order to bring an enhancer and associated transcription factors within close proximity of a gene. Studying the structural properties and spatial organization of chromosomes is important for the understanding and evaluation of the regulation of gene expression, DNA replication and repair, and recombination. The folding of chromosomes and their contacts has important implications for disease mechanisms and elucidation of targets for therapeutic approaches, e.g., in cancer or congenital diseases.
Chromatin exists in interacting and non-interacting states. Interacting states have different properties depending on the characteristics of the genomic sites, or binding sites, involved in the interactions, namely (a) their number, distance and distribution, (b) their specificity and affinity for binders, and (c) the concentration and specificity of binders. Chromatin interactions can also involve different numbers of loci associating simultaneously (multiplicity of interaction).
Fluorescence in situ hybridization (FISH) uses microscopy to directly measure spatial distances between genomic loci, but it can only be applied to the study of a small number of genomic regions at a time in the same nucleus (e.g., Pombo A. 2003. Cellular genomics: which genes are transcribed when and where? Trends Biochem. Sci. 28, 6). It is theoretically possible to re-probe the same cells or tissue sections with different sets of probes, but there are concerns that repeated re-probing causes structural artefacts, e.g. due to the DNA denaturation necessary to dissociate successive sets of probes, which may e.g. induce artificial aggregation (contacts) of loci (i.e. repositioning of subgenomic regions relative to each other and relative to nuclear landmarks, e.g. the nuclear lamina). In the case of metaphase chromosomes, which represent more condensed (and expectedly more stable) chromatin, the re-probing can be repeated up to six times (Pauciullo A et al. 2014, Development of a sequential multicolor-FISH approach with 13 chromosome-specific painting probes for the rapid identification of river buffalo (Bubalus bubalis, 2n = 50) chromosomes. J Appl Genet. 55(3): 397-401), but degradation of chromosome morphology can already be apparent after the second probing, which can lead to loss of chromosomes or nuclei (Heslop-Harrison JS, Harrison GE, Leitch IJ (1992) Reprobing of DNA: DNA in situ hybridization preparations. Trends Genet 8: 372-373). RNA-FISH is a milder FISH approach that does not involve DNA denaturation but that can only be used to determine the nuclear position of actively transcribed genes (not silent genes). Samples from cells in the interphase stage of the cell cycle, where functional chromatin contacts are most often mapped, can be re-probed for RNA-FISH only about three times, although the preservation of structure has not been measured in detail.
The number of probes which can be simultaneously applied in either DNA- or RNA-FISH is limited by distinguishable fluorescent markers, e.g. 181 barcodes can in principle be obtained by combining five colours, four colour ratios and two different levels of intensity (Pombo A. 2003. Cellular genomics: which genes are transcribed when and where? Trends Biochem. Sci. 28, 6). However, this approach (multiplexing fluorochromes) fails when the loci analysed are so close in space that the combination of fluorochromes in one probe is not distinguishable from the combination in another, and is therefore not amenable to the identification of loci that are spatially proximal at very short distances. Furthermore, due to the need for a labelled probe for each specific locus, FISH can only be applied to analyse interactions of known loci of interest, and not to discover e.g. the presence of an exogenous DNA sequence in an interaction with the host's DNA. The approach fails e.g. in the detection of endogenous or exogenous DNA sequences, unless they are known a priori, e.g. viral subtype integration positions and the exact sequences of exogenous DNA. FISH is also confounded by a priori assumptions of linear genome organisation, which are not acceptable to study chromatin positioning features, e.g. chromatin contacts, when e.g. the influence of natural variation in genomic sequence in organism populations is of interest, e.g. in studying human samples, due to the fact that FISH does not inherently detect sequence variations such as copy number variations, or genomic rearrangements, without a priori probe design or a priori whole genome sequencing of the sample followed by probe design.
In a different approach, designated INGRID (IN-Gel Replication of Interacting DNA segments; Gavrilov, A.A. et al., 2014 Quantitative analysis of genomic element interactions by molecular colony technique. Nucl. Acids Res. 42 (5):e36), cross-linked chromatin fragments are spread out over a large area of a polyacrylamide gel layer, followed by visualization of separate and associated elements in the form of, respectively, mono- and multicomponent molecular colonies generated during in-gel amplification of selected DNA fragments, which are visualised by molecular beacon technology (Chetverin AB, Chetverina HV. 2008 Molecular colony technique: a new tool for biomedical research and clinical practice. Prog. Nucleic Acid Res. Mol. Biol. 82:219-255). This technology also relies on a priori knowledge of genome organisation, and does not inherently discover DNA sequence variation, its spatial organisation and how it influences the spatial organisation of the whole genome.
Alternative current approaches to analysis of the three-dimensional structure of the genome are mainly based on chromosome conformation capture (3C) techniques, of which there are many current versions and adaptations (Pombo A., Dillon N. 2015. Three-dimensional genome architecture: players and mechanisms. Nat Rev Mol. Cell Biol. 16: 245). 3C-based methods generally start with chemical crosslinking of proteins that mediate genomic contacts. After chromatin extraction, pieces of DNA bound by the crosslinked proteins and RNAs are treated with a restriction enzyme for fragmentation. Addition of a ligase then connects (ligates) two pieces of DNA. Different variants of 3C use different methods of detecting such ligation events: a popular one is paired-end sequencing (Hi-C, 4C-seq, ChIA-PET), and in one embodiment the DNA bound by a specific protein (or molecule) is purified before the ligation step.
A major limitation of these technologies, due to their dependency on ligation of DNA ends, is described in Fig. 1, and has also been discussed in the literature (e.g., in O'Sullivan J.M. et al, 2013. The statistical-mechanics of chromosome conformation capture. Nucleus 4, 390; Belmont A.S. 2014. Large scale chromatin organization: the good, the surprising, and the still perplexing. Curr Op Cell Biol 26, 69; Pombo A., Dillon N. 2015. Nat Rev Mol Cell Biol 16, 245; Williamson, I. et al. 2014. Spatial genome organization: contrasting views from chromosome conformation capture and fluorescence in situ hybridization. Genes Dev. 28, 2778-2791).
Another method for the detection of DNA-RNA proximity, depending on proximity ligation of RNA and DNA, is taught by US2016040212 A1.
At present, it is not possible to identify chromatin contact sites genome-wide in an unbiased manner, an essential target for unravelling genetic diseases that affect genomic sequences which are regulatory and that do not code for proteins. As a result, we also fail to understand which nuclear components establish different aspects of chromosome architecture, and how long-range chromatin contacts help maintain genome stability and influence genome function (e.g. gene expression). Therefore, the identification of binding sites and measurement of the frequency with which binding sites interact with each other are major current challenges.
The present inventors addressed the problem of providing an improved method for determining the interaction of nucleic acids, which avoids bias based on ligation of fragmented nucleic acids for detection of nucleic acids interactions, and which allows for simultaneous analysis of several high multiplicity interactions (each involving more than two loci), in particular, more than two interactions. In one embodiment, the method allows for simultaneous analysis of substantially all nucleic acid interactions in the genome, in another, the method allows for simultaneous analyses of all nucleic acid interactions of fragments bound by a given protein or molecule of interest such as protein or RNA. This problem is solved by the method of the invention, as described below and in the claims. This method is designated Genome Architecture Mapping on Chromatin (GAM-ch).
The present invention provides a method of determining interaction of a plurality of nucleic acid loci in a compartment comprising nucleic acids, comprising steps of
(a) separating nucleic acids from each other depending on their interaction in the compartment by (i) crosslinking nucleic acids with each other directly or indirectly, (ii) fragmenting the nucleic acids of the compartment to obtain fragments and/or cross-linked complexes of fragments, e.g. by the use of sonication, mechanical shearing or restriction enzyme digestion, and (iii) dividing the fragmented nucleic acids to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus (e.g. about 0.5 copies or one copy in every other fraction), wherein steps (i) and (ii) can be carried out simultaneously or in any order;
(b) determining the presence or absence of the plurality of loci in said fractions; and
(c) determining the co-segregation of said plurality of loci in the fractions.
In one embodiment, between steps (ii) and (iii), fragments bound by a given molecule of interest, e.g., protein or RNA, are selected, e.g. by chromatin immunoprecipitation (ChIP), as described in more detail below.
A locus (plural loci) is the specific location of a gene, DNA sequence, or position on a chromosome (Wikipedia). Each chromosome carries many genes; the number of protein coding genes in the haploid human genome is estimated to be 20,000-25,000, on the 23 different chromosomes; there are as many transcription units which produce RNA species that do not code for proteins. A variant of the DNA sequence located at a given locus is called an allele. In the context of the invention, the nucleic acid may be DNA or RNA or a combination of both, e.g., if interactions between genes being actively transcribed and other genomic regions are to be analysed. Usually, the method of the invention is used to analyse co-segregation of DNA. The co-segregation of loci may be analysed in any compartment comprising nucleic acids, such as the nucleus of a eukaryotic cell, a mitochondrion, a chloroplast, a prokaryotic cell or a virus. For example, co-segregation of nucleic acid, in particular, DNA loci in the nucleus of a eukaryotic cell may be analysed. The method of the invention thus constitutes a solution to analyse locus proximity or interaction in the nucleus, through measuring their frequency of co-segregation in cross-linked DNA complexes extracted from nuclei.
The cell or particle from which the compartment is derived may be a virus, a bacterium, a protozoan, a plant cell, a fungal cell or an animal cell, e.g., a mammalian cell, such as a cell from a patient (preferably, a human patient) having a disease or a disorder, or being diagnosed for a disorder, or a healthy subject. For example, the cell may be a tumor cell or a stem cell, such as an induced pluripotent stem cell generated, e.g., through reprogramming of human tissues. Such cells can advantageously be used to apply GAM-ch to study human developmental disorders or congenital disease. If the cell is an embryonic stem cell, it is preferably not generated in a method involving destruction of a human embryo. A plurality of cells/compartments or single cells may be analysed with the method of the invention.
The mammal preferably is a human, but it may also be of interest to investigate, and, optionally, compare the genomic architecture of other organisms, such as E. coli, yeast, A. thaliana, C. elegans, X. laevis, D. rerio, D. melanogaster, mouse, rat or primate, or possibly parasitic interactions, e.g. the proximity of parasitic nucleic acids relative to the host genome, such as the chromatin contacts a virus (e.g. HIV, HSV) makes with the host DNA, or of an artificially inserted nucleic acid (e.g. in the context of gene therapy).
Cells can be derived from cell culture or analysed ex vivo from a specific tissue from a living organism or a dead organism, i.e., post-mortem, or from a whole experimental organism (e.g. a whole D. melanogaster embryo or C. elegans embryo), or from a mixture of microorganisms. Cells used in the analysis can be selected, e.g., by synchronizing the cells in a particular stage of the cell cycle, or sorting the cells e.g. by fluorescence activated cell sorting to capture a particular cell type expressing a specific marker, e.g., using an antibody specific for a protein uniquely expressed in the cell type or cell stage of interest, or detected by in situ hybridization e.g. with a nucleic acid probe that detects a specific e.g. mRNA, or other RNA, expressed specifically in the cell type of interest, or a fluorescent marker such as GFP showing expression of a specific gene or characteristic of a specific stage. For example, a GFP transgene under the control of the promoter of the Pitx3 transcription factor can be used to mark dopamine-expressing neurons (Maxwell S. et al, 2005, Pitx3 regulates tyrosine hydroxylase expression in the substantia nigra and identifies a subgroup of mesencephalic dopaminergic progenitor neurons during mouse development. Dev. Biol. 282 (2): 467-479). Cells can be pre-treated with an agent, e.g., to test the effect of drugs on co-segregation or positioning of loci, or be studied during the lifetime of an organism to understand development, ageing and degeneration.
Optionally, a suspension of single cells is prepared before step (a), depending on the species and type of tissue, e.g., a single cell suspension of mammalian solid tissues may be prepared. Preparation of a single cell suspension may be carried out by any procedure that is also compatible with 3C-technologies. Detailed descriptions of several single cell preparations compatible with the production of a chromatin sample that preserves crosslinked chromatin contacts can be found in e.g. Hagege H. et al. 2007. Quantitative analysis of chromosome conformation capture assays (3C-qPCR). Nature Protocols 2, 1722. In the case of multicellular organisms, the preparation of a single cell suspension may start by tissue dissection, followed by treatment with collagenase, or, for soft tissues (e.g. mouse thymus or fetal liver), by passage of tissue through a cell strainer (e.g. 40 micrometer mesh), or in the case of cells grown in in vitro culture or microorganism cultures, through centrifugation of the culture at appropriate force for the cell type, followed by resuspension at appropriate strength to yield a single cell suspension with minimal cell damage or death. Application to post-mortem samples is also possible using published protocols or developments thereafter (Mitchell A.C. et al. 2014. The genome in three dimensions: a new frontier in human brain disease. Biol. Psychiatry 75, 961). Producing chromatin containing crosslinked DNA fragments that reflect chromatin contacts has also been possible using plant materials (Grob S. et al. 2013. Characterization of chromosomal architecture in Arabidopsis by chromosome conformation capture. Genome Biology 14:R129) and insect tissues (Ghavi-Helm Y. et al. 2014. Enhancer loops appear stable during development and are associated with paused polymerase. Nature 512, 96).
The separation of nucleic acids from each other in step (a) is carried out by (i) crosslinking nucleic acids with each other directly or indirectly, i.e., DNA and/or RNA may be cross-linked directly or through proteins interacting with the nucleic acid, using e.g. chemical crosslinking agents such as formaldehyde, (ii) fragmenting the nucleic acids of the compartment to obtain fragments and/or complexes of cross-linked fragments of nucleic acids, e.g. by sonication, and (iii) dividing the nucleic acids into fractions to obtain a collection of fractions each containing a plurality of fragments and/or complexes of cross-linked fragments, such that every fraction contains, on average, less than one copy of every locus.
Nuclei, cells, tissues or whole organisms are treated with a crosslinking agent, e.g. a chemical crosslinking agent in step (a) (i). The crosslinking agent induces linkage of proteins with each other and between nucleic acids (DNA and/or RNA) and proteins. The method of the invention is compatible with cross-linking conditions that are also compatible with current 3C-based methods. Preferably, the crosslinking agent comprises formaldehyde or another crosslinking agent compatible with DNA extraction. Formaldehyde will preferably be used, at a concentration of 0.5-4%, preferably, about 1%-2% (all w/w), e.g., in a buffered solution, e.g., of PBS pH 7.0-8.0, or by addition of a concentrated solution of the cross-linking agent directly to cell medium, preferably for 5-120 min, preferably 10-20 min. Alternative cross-linkers are, e.g., disuccinimidylglutarate, dithiobis-succinimidyl propionate, glutaraldehyde. Crosslinking may also be performed by UV radiation.
For transportation or storage, fixed nuclei or cells can be pelleted and stored frozen, e.g., at -20°C, or -70°C or -80°C, e.g. in 1% formaldehyde.
Steps (i) and (ii) may be carried out at the same time or in any order. Usually, crosslinking is performed as soon as possible to maintain the structure of chromatin intact as well as possible, i.e., it is usually performed first. Step (a) of the method may further comprise, e.g., permeabilisation of cells by a lysis buffer and/or freezing. The crosslinking can, e.g., be done directly on cells and then followed by permeabilisation, e.g., lysis with a suitable lysis buffer, and/or, freezing, and then fragmentation, e.g., by restriction (see Hagege et al. 2007). Alternatively, crosslinking and permeabilising can be performed at the same time.
The fragmenting in step (a)(ii) can be carried out by any method, which preferably leads to formation of fragments of homogenous length, or randomly and evenly-spaced breaks in the nucleic acids. For example, fragmentation can be done by ultrasound, by mechanical shearing, by Dounce homogenisation, vortexing with glass beads, or by restriction digest, or a combination of two or more of these methods. Physical methods such as ultrasound or shearing can be adapted to yield fragments or complexes of fragments of a desired fragment size, which may vary depending on the tissue and/or cell analysed. Preferred average fragment size depends on the resolution with which chromatin interactions are aimed to be mapped (which depends on organism and on aims) and is about 100bp-5 Mbp, or preferably, 200bp-500kbp or 1kbp-5kbp. For example, in mammals the average chromatin loop-size is about 100 kbp. Promoter contacts with regulatory regions are often local, below 50 kbp, so an appropriate resolution needs to be chosen. For example, Dounce homogenisation can be performed using e.g. 100 mg tissue in (a) 2 mL 1X PBS (phosphate buffered saline) or another suitable buffer, and (b) 200 μL protease inhibitor (Mitchell A.C. et al. 2014. The genome in three dimensions: a new frontier in human brain disease. Biol. Psychiatry 75, 961).
The method is also theoretically possible without a crosslinking step through cryo milling of vitrified cells (Oeffinger M, Wei KE, Rogers R, DeGrasse JA, Chait BT, Aitchison JD, Rout MP, 2007 Comprehensive analysis of diverse ribonucleoprotein complexes. Nat Methods 4, 951- 6; Hakhverdyan et al. 2015. Rapid, optimized interactomic screening. Nature Methods 12, 553). In either case, the use of vitrification (i.e. rapid freezing) to preserve cellular ultrastructure would avoid any artefacts that may potentially be introduced by the application of chemical crosslinking agents (e.g. formaldehyde). For example, treatment with chemical crosslinking agents e.g. formaldehyde could lead to repositioning of subgenomic regions relative to each other and/or relative to nuclear landmarks, e.g. the nuclear lamina, whilst this potential for repositioning would not occur in vitrified samples. Cell or tissue samples structurally preserved using vitrification cannot be assayed by either 3C-based methods or FISH methods.
While restriction digestion may be considered to introduce some bias into the formation of fragments, it may be acceptable if it is taken into account in the analysis of results. Furthermore, frequently cutting restriction enzymes may be used, or a combination of enzymes recognizing different restriction sites e.g., two, three or four different restriction enzymes, may be used. For example, a restriction digest with the enzymes HindIII, NcoI, EcoRI or BglII (6-base cutters) or DpnII or NlaIII (4-base cutters) may be carried out e.g. for 60 min, or overnight at 37°C and will provide different fragment sizes depending on the genomic distribution of the restriction sequence.
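The dependence of fragment size on the genomic distribution of the restriction sequence can be illustrated with a small in-silico digest. The following is a minimal sketch and not part of the described method: the function names are hypothetical, the cut is placed at the start of each recognition site (the enzyme-specific cut offset is ignored), and the input is a toy sequence rather than a genome.

```python
import re

# Recognition sequences of the enzymes named in the text.
SITES = {
    "HindIII": "AAGCTT",
    "NcoI": "CCATGG",
    "EcoRI": "GAATTC",
    "BglII": "AGATCT",
    "DpnII": "GATC",
    "NlaIII": "CATG",
}


def fragment_lengths(sequence, enzymes):
    """Return the fragment lengths obtained by cutting `sequence` at every
    recognition site of the given enzymes.

    Simplifying assumption: each cut is placed at the first base of the
    recognition site, not at the enzyme's true cut position.
    """
    seq = sequence.upper()
    cuts = set()
    for name in enzymes:
        site = SITES[name]
        # A lookahead pattern reports overlapping occurrences as well.
        for match in re.finditer("(?=" + site + ")", seq):
            cuts.add(match.start())
    cuts.discard(0)  # a cut at position 0 would create an empty fragment
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def expected_fragment_length(site_length):
    """Expected site spacing in a random sequence with equal base frequencies."""
    return 4 ** site_length
```

Under the equal-base-frequency assumption, a 6-base cutter is expected to cut roughly every 4096 bp and a 4-base cutter roughly every 256 bp; real genomes deviate from this, which is the bias the paragraph above refers to.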
The separation into fractions in step (a) (iii) can be preceded by an additional step (a) (iii.0) comprising selection of fragments/complexes of fragments that are bound by a given molecule of interest, in particular a given protein, a given protein post-translational modification, a given RNA (if fragments are DNA) or a given DNA (if fragments are RNA), or a chemical modification of DNA (e.g. DNA methylation) or RNA, or a given protein/nucleic acid complex, or, after targeting a locus with Cas9 complex with guide RNAs. Preferably, the given molecule of interest is a protein that is bound to chromatin at the time that chromatin forms contacts.
Said selection may be carried out by an affinity-based method such as affinity precipitation, e.g. by performing a chromatin immunoprecipitation or pull down using antibodies or other affinity molecules (e.g. aptamer), followed by dividing/aliquoting e.g. the 'beads' used for pull down. Such an approach has been combined with 3C-technologies as Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) (G. Li et al. 2014. Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) sequencing technology and application. BMC Genomics 15(Suppl 12):S11). While affinity precipitation with antibodies is preferred, other affinity based selection methods, e.g. based on biotin binding to avidin or derivatives such as streptavidin, e.g., after labelling of chromatin using in vivo biotinylation, or incorporation of biotin to specific nucleic acid sequences, e.g. after in situ incorporation of Biotin-UTP or Biotin-dUTP into nascent RNA or nascent DNA, respectively, can also be employed. Specific nucleic acids may also be selected by use of hybridizing nucleic acids for selection, e.g., by affinity precipitation. Affinity precipitation can be substituted for by passage over columns comprising a ligand specific for the molecule of interest. Methods known for Chromatin Immunoprecipitation can be employed (e.g., Collas, 2010. The current state of chromatin immunoprecipitation. Molecular Biotechnology 45(1):87-100; Stock et al. 2007; Brookes et al. 2012). Suitable conditions for specific interaction with the molecule of interest are employed, e.g., conditions for stringent hybridization. Methods disclosed in WO 2014/14152397 A2 may be employed.
Then, in step (a) (iii), the nucleic acids in the preparation resulting from the previous steps, e.g., directly from step (a)(ii) or from step (a)(iii.0), are divided (or aliquoted) into fractions to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus (e.g. 0.0001-0.9, 0.01-0.7, 0.1-0.6, 0.4-0.5, preferably, about 0.5 copies, i.e. one copy in every other fraction). Typically, one locus is seen in every other fraction (i.e. in 50% of the fractions), or in 40% or 30% or 10% or 5% of fractions. The number of fractions depends on the approximate number of loci and the genomic resolution at which the assay will be carried out (i.e. it depends on the total genome length of the organism under study and the length of the loci for which contacts are measured, in other words on the resolution). Preferably, the nucleic acids are separated into many fractions. The number of fractions depends on whether only pairwise or multiple contacts are to be found between loci, and on whether only the most highly frequent contacts (interactions) (e.g. frequency above 50% across the cell population), or also the least frequent contacts (e.g. 5%), are to be identified. Typically, for a genome of similar size to the human or mouse genomes, 1000 fractions, or 3000 fractions, or 5000 fractions, will be analysed to discover the least abundant contacts (5%), but a lower number of fractions will also provide contact information for contacts that are more frequent across the cell population (e.g. 750 fractions to detect interactions that occur with frequency 25%), or if step (a) (iii.0) is used to reduce the complexity of the sample. If step (a) (iii.0) is used, analysis of about 180 fractions (or more) already provides meaningful results. The nucleic acid (often DNA) content of the fractions should be homogenous for the whole analysis, but non-homogenous fractions (e.g. fractions that have excessive DNA content) may be excluded a posteriori once nucleic acid content is mapped; e.g., if using fractions that are supposed to contain approximately 30% of genomic DNA coverage on average, any tubes that contain more than 40% or less than 20% coverage can be excluded, or analysed separately, upon DNA detection. These fractions may be obtained from a plurality of cells (or nucleic acid containing cellular compartments) or from single cells.
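The a posteriori exclusion of non-homogenous fractions described above can be sketched as a simple coverage filter. This is an illustrative sketch only: the function name and the input format (a mapping from a fraction identifier to the share of the genome detected in that fraction) are assumptions, and the default thresholds are taken from the 20%/40% example in the text.

```python
def filter_fractions_by_coverage(coverage, low=0.20, high=0.40):
    """Split fractions into kept and excluded sets by genomic coverage.

    coverage: dict mapping a fraction id to the share of the genome
              detected in that fraction (a value between 0 and 1).
    low/high: inclusive bounds; the defaults follow the 20%-40% example
              given for fractions expected to average about 30% coverage.
    """
    kept, excluded = {}, {}
    for fraction_id, value in coverage.items():
        target = kept if low <= value <= high else excluded
        target[fraction_id] = value
    return kept, excluded
```

Excluded fractions are returned rather than discarded, so that they can be analysed separately as the text allows.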
The separation into fractions is preferably done after homogenous division of the fragments and/or cross-linked complexes of fragments. However, some fractions will, statistically, contain one or more copies of all possible loci that cover the given genome. This may be found in different situations, firstly, when the preparation of fractions of the compartment leads to fractions with very heterogeneous content in terms of number of fragments (e.g. a large chunk of chromatin; Gavrilov A. A. et al. 2013. Disclosure of a structural milieu for the proximity ligation reveals the elusive nature of an active chromatin hub. Nucleic Acids Res. 41, 3563-75). This is an artefact, which can be detected and disregarded in the analysis according to the invention. Furthermore, this may happen when the two alleles in a cell interact so closely that they appear in the same fraction. When loci are identified with sequencing, this is not a problem, as it can be measured based on sequence difference due to SNP variation between alleles.
In step (b), the presence or absence of the plurality of loci may be determined by e.g., polymerase-chain reaction (PCR), or preferably, by sequencing, preferably, by next generation sequencing and eventually by the developing single molecule sequencing techniques. For example, single cell whole genome amplification (WGA) may be used to extract nuclei acids from the protein-DNA-R A complexes fractionated from the compartment. Preferably, the nucleic acids of loci in the fraction are sequenced substantially or completely. This is of particular interest if the method is carried out to detect possible interactions between different loci in a research setting, and a "normal" co-segregation pattern has not yet been established for the cell type of interest in the physiological conditions used. The method of the invention may thus be used to analyse spatial proximity (and, consequently, interactions) of unknown and/or unspecified loci, or of transgenic loci inserted in the genome (e.g. in gene therapy) to study their effects of chromatin contacts. In a mixture of organisms with different genetic composition, the method can be used to detect specific (and new) species, as the DNA in cells of each species crosslinks with DNA from each species, and is more often found co-segregated.
For example, nucleic acids such as DNA may be analysed by crosslinking, nuclear fractionation (optional), fragmentation (i.e. chromatin preparation or preparation of nuclei acid complexes), dilution and separation into fractions or sub-samples, followed by amplification using single-cell whole genome amplification (WGA; Baslan, T. et al. 2012. Genome-wide copy number analysis of single cells. Nat. Protoc. 7: 1024) (Fig. 4A). WGA-amplified DNA may be sequenced, e.g., using Illumina HiSeq technology. Visual inspection of tracks from single fractions shows that each contains a different complement of sub-chromosomal regions of expected size (Fig. 6, Fig. 14), as expected from sequencing a sub-cellular fraction of chromatin containing fragment lengths of a given genomic length. Furthermore, each fraction contains only a restricted subset of sequences from each chromosome (Fig. 15B).
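One simple way to turn sequencing data from a single fraction into presence or absence calls is to count mapped reads in fixed genomic windows. The sketch below is only an assumption about how such scoring could be done; the window size, the read-count threshold and the function name are hypothetical and are not specified in the text.

```python
from collections import Counter


def call_presence(read_positions, chrom_length, window=50_000, min_reads=2):
    """Return one 0/1 flag per genomic window of a single chromosome.

    read_positions: 0-based start coordinates of reads mapped to the
                    chromosome in one fraction.
    window:         window (locus) size in bp; hypothetical default.
    min_reads:      reads required to call a window present; hypothetical
                    default chosen to ignore isolated stray reads.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = Counter(
        position // window
        for position in read_positions
        if 0 <= position < chrom_length
    )
    return [1 if counts[index] >= min_reads else 0 for index in range(n_windows)]
```

Applying this to every fraction gives one row of flags per fraction, i.e. a presence/absence table of fractions by loci.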
An exemplary method for parallel sequencing of RNA from samples from multiple sources while maintaining source identification, which may be employed for determining the presence or absence of the plurality of loci in the method of the invention, is provided by US 2016024572 A1.
However, there may be situations where the presence or absence of a specific interaction (co-segregation) has previously been investigated, so the interacting loci of interest are already known. In particular in diagnostic settings, a significant difference in the frequency with which two loci interact may have been found between different patient groups (e.g., healthy subjects and subjects having a disease, such as a tumor or a congenital disease). In such situations, presence or absence of the two (or more) loci of interest can also be determined by specific PCR, or by otherwise specifically checking for their presence, e.g., by Southern blot or by Illumina HiSeq technology, after selection of nucleic acids covering the locus of interest, e.g. via IDT target capture for next generation sequencing (IDT, Coralville, Iowa, USA) or Agilent SureSelect technology, as used e.g. for Capture-C (Hughes J.R. et al. 2014. Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment. Nature Genetics 46, 205-12) or Capture-Hi-C (Schoenfelder S. et al. 2015. The pluripotent regulatory circuitry connecting promoters to their long-range interacting elements. Genome Res. 25, 582-97).
GAM-ch thus preferably combines single copy locus fractionation of a crosslinked chromatin preparation with DNA detection (e.g. by whole genome amplification and next generation sequencing). When chromatin is crosslinked, loci that are closer to each other in the nuclear space (but not necessarily on the linear genome) are found together in the single molecule fraction more frequently than distant loci (i.e. they co-segregate more frequently, Fig. 2). The frequency of contacts between genomic loci can then be inferred by scoring the presence or absence of loci among a number of aliquots containing a sub-genome sample of fragments (Fig. 2). The resulting table can be used to compute the co-segregation frequency of each locus against every other locus to create a matrix of inferred contact frequencies between loci. Therefore, GAM allows for the calculation of chromatin contacts genome wide without the need for end-to-end ligation between the interacting fragments.
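As an illustrative sketch only, the scoring step described above can be expressed in Python with NumPy. The table layout (one row per locus, one column per fraction, 1 for detected and 0 for not detected), the function name `cosegregation_matrix` and the example values are assumptions made for illustration and are not taken from the examples of the invention.

```python
import numpy as np

def cosegregation_matrix(seg):
    """Fraction of aliquots in which each pair of loci is detected together.

    seg: array of shape (n_loci, n_fractions) holding 1 (detected) or 0 (absent).
    Returns an (n_loci, n_loci) symmetric matrix; the diagonal holds the
    detection frequency of each single locus.
    """
    seg = np.asarray(seg, dtype=float)
    n_fractions = seg.shape[1]
    # seg @ seg.T counts, for each pair of loci, the fractions containing both.
    return seg @ seg.T / n_fractions

# Hypothetical segregation table: 3 loci scored in 6 fractions.
seg_table = np.array([
    [1, 1, 0, 1, 0, 0],  # locus A
    [1, 1, 0, 0, 0, 0],  # locus B
    [0, 0, 1, 0, 1, 0],  # locus C
])
coseg = cosegregation_matrix(seg_table)
```

In this hypothetical table, loci A and B are found together in two of six fractions, whereas A and C are never found together.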
Co-segregation may be analysed with a statistical method to determine chromatin contacts. Close spatial proximity can be a sign of specific interaction of loci. Specific interaction of loci may thus also be determined by analysing co-segregation with a statistical method. Statistical methods used in the method of the invention may be, e.g., inferential statistical methods. Statistical methods used in the examples may also be used in the method of the invention to analyse samples of different origin and/or for different loci of interest, e.g., as mentioned herein.
Preferably, the loci are determined to interact specifically when they co-segregate at a frequency higher than expected from their linear genomic distance on a chromosome. If all possible pairs of loci in the genome at a given genomic (linear) distance are considered, pairs of loci that do NOT interact will be found distributed around an average frequency of chromatin contacts (i.e., co-segregation across the collection of fractions) that depends on the genomic distance between the two loci and the degree of chromatin compaction. The term "contact" is used herein to describe co-segregation across the collection of fractions, i.e., a quantitative measure of interaction. Loci that do not interact are, e.g., considered to have a contact value of zero. In contrast, interacting pairs will have higher frequencies of chromatin contacts (i.e., co-segregation in the fractions) than the average for that genomic distance, depending on their physical distance in the nucleus of that particular cell type. More complex arguments can also be considered, but an interaction can be most simply defined as a deviation from the random (three-dimensional) arrangement of the chromatin fibre, taking into consideration any additional contributing factor(s) to a non-random behaviour.
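One simple way to make the comparison with the distance-dependent average concrete is sketched below in Python. It assumes that the loci are equally sized, consecutive windows on one chromosome, and it takes the mean co-segregation of all window pairs at the same linear separation as the "expected" value. This choice of expectation, the function name `observed_over_expected` and the example matrix are illustrative assumptions; other statistical treatments may equally be used.

```python
import numpy as np

def observed_over_expected(coseg):
    """Ratio of observed co-segregation to the mean at the same linear distance.

    coseg: symmetric (n, n) matrix for n consecutive, equally sized windows.
    Returns an (n, n) matrix; the diagonal is left as NaN, as are distances
    at which the mean co-segregation is zero.
    """
    coseg = np.asarray(coseg, dtype=float)
    n = coseg.shape[0]
    ratio = np.full((n, n), np.nan)
    for d in range(1, n):
        observed = np.diagonal(coseg, offset=d)  # all pairs separated by d windows
        expected = observed.mean()
        if expected > 0:
            rows = np.arange(n - d)
            ratio[rows, rows + d] = observed / expected
            ratio[rows + d, rows] = observed / expected
    return ratio

# Hypothetical co-segregation matrix for five consecutive windows.
coseg = np.array([
    [0.50, 0.30, 0.10, 0.05, 0.05],
    [0.30, 0.50, 0.30, 0.10, 0.05],
    [0.10, 0.30, 0.50, 0.30, 0.40],
    [0.05, 0.10, 0.30, 0.50, 0.30],
    [0.05, 0.05, 0.40, 0.30, 0.50],
])
ratio = observed_over_expected(coseg)
```

In this hypothetical matrix, windows 2 and 4 co-segregate at twice the average for pairs two windows apart, which would flag them as a candidate specific interaction.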
GAM-ch measures the frequency with which two loci co-segregate in the same fraction, and can measure the co-segregation of all genomic loci simultaneously, producing quantitative information that is amenable (a) to the identification of genomic coordinates that more frequently interact with other genomic regions, and (b) to a wide range of mathematical treatments that calculate the probability of loci interacting above some random (expected) behaviour.
A plurality of loci means two or more loci, optionally, at least three, at least four, at least five, at least six, at least seven, at least eight, at least nine, at least ten, at least eleven, at least 12, at least 13, at least 15, at least 20, at least 30, at least 40, at least 50, at least 75, at least 100, at least 200, at least 500 or at least 1000 loci and up to several million or billion loci, which are analysed simultaneously. For example, allele-specific analysis of a human cell at 5 kb resolution requires simultaneous analysis of 1.3 million loci. In one option, substantially all loci or all loci in a compartment are analysed with the method of the invention, e.g., by sequencing substantially all nucleic acids, preferably, all DNA, in the compartment. The loci to be analysed may be determined in a biased way (e.g. by choosing to analyse all 23000 protein coding genes in a human cell, or all gene promoters or all non-coding regulatory regions, or all enhancers), or in an unbiased way, e.g. by dividing the genome into windows of a certain size, e.g., windows of 100 bp to 10 Mbp, preferably, 1 kbp to 1 Mbp, 5 kbp-50 kbp, or 10 kbp-30 kbp windows. Further, the method of the invention can be applied in a way which does not distinguish between different alleles (e.g. the two homologous copies of a gene present in a normal human cell), or, alternatively, it can be used to distinguish the two (or more, in the case of e.g. polyploid amphibian cells) alleles of a locus in the same cell.
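The figure of 1.3 million loci can be reproduced with a short calculation, sketched here in Python. The haploid human genome size of roughly 3.2 Gbp is an assumption introduced for the arithmetic, the function name `n_windows` is hypothetical, and chromosome boundaries are ignored for simplicity.

```python
import math

def n_windows(genome_bp, window_bp, copies=1):
    """Number of fixed-size windows needed to tile a genome, times allele copies."""
    return math.ceil(genome_bp / window_bp) * copies

# Assumed haploid human genome size of about 3.2 Gbp, 5 kbp windows, two alleles.
allele_specific_loci = n_windows(3_200_000_000, 5_000, copies=2)
print(allele_specific_loci)
```

Under these assumptions the result is 1.28 million windows, consistent with the approximately 1.3 million loci stated above.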
The method of the invention allows for the detection of multiple co-segregating loci, in particular, more than two co-segregating loci, preferably, more than three, more than four, more than 8, or more than 20, co-segregating loci. In contrast, the identification of multiple interactions using 3C-based methods has been attempted and shown to be both inefficient and highly biased (Sexton et al., 2012, Cell 148:458-72). There is mathematical evidence showing that these experimental limitations of 3C-based methods will remain insurmountable, irrespective of incremental improvements (O'Sullivan J.M. et al., 2013, Nucleus 4:390-8). In particular, the ligation of fragmented DNA molecules - having only two ends - with each other as a basis for identifying interactions in 3C-based methods leads to a phenomenon whereby the detection of higher multiplicity interactions becomes more difficult as the number of loci simultaneously interacting increases above three interacting loci. However, it is known that active genes often interact with three or even more enhancers (Markenscoff-Papadimitriou E. et al., 2014. Enhancer Interaction Networks as a Means for Singular Olfactory Receptor Expression. Cell 159: 543-557), and that active genes interact with each other (Schoenfelder et al., 2010. Preferential associations between co-regulated genes reveal a transcriptional interactome in erythroid cells. Nat. Genet. 42: 53-61). Furthermore, restriction sites are not randomly distributed in the genome, leading to a bias in detection. The efficiency of ligation is affected by the different length of DNA fragments, which adds further bias to 3C-based results. The method of the invention is preferably not or not substantially affected by these biases.
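Because each fraction is scored for all loci at once, co-segregation of more than two loci can be read directly from the same presence/absence table. The following Python sketch illustrates this under assumed conventions: the table layout (rows are loci, columns are fractions), the function name `multiway_cosegregation` and the example values are hypothetical.

```python
import numpy as np

def multiway_cosegregation(seg, loci):
    """Fraction of aliquots in which ALL of the given loci are detected together.

    seg:  array of shape (n_loci, n_fractions) holding 1 (detected) or 0 (absent).
    loci: iterable of row indices, of any multiplicity (two, three or more loci).
    """
    seg = np.asarray(seg, dtype=bool)
    together = seg[list(loci)].all(axis=0)  # True where every listed locus is present
    return together.mean()

# Hypothetical segregation table: 3 loci scored in 6 fractions.
seg_table = np.array([
    [1, 1, 0, 1, 0, 0],  # locus A
    [1, 1, 0, 0, 0, 0],  # locus B
    [1, 0, 1, 0, 1, 0],  # locus C
])
pair_ab = multiway_cosegregation(seg_table, [0, 1])
triplet_abc = multiway_cosegregation(seg_table, [0, 1, 2])
```

The same call handles any multiplicity, so no separate procedure is needed for three-way or higher-order co-segregation.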
For all steps of methods of the invention, no ligation occurs between nucleic acids originally present in the compartment, in particular, no ligation has to be performed prior to step (b). However, ligation, e.g., with external linkers is possible in the context of detection of the presence or absence of nucleic acid loci, e.g., for amplification or sequencing. The avoidance of ligation of nucleic acids derived from the compartment with each other overcomes the structural bias of 3C-based methods.
In this sense, GAM-ch is unique compared with competing technologies, as it can detect the multiplicity of loci interacting simultaneously, where there are more than three loci interacting at once (such detection being impossible or inefficient by ligation-based 3C methods), and it can also detect all loci present in the compartment and their copy number, irrespective of whether they are found to participate in an interaction, which allows important corrections to be made in the contact maps. It is also one of the advantages of the method of the invention that it can be used to identify spatial proximity of loci which were not known before the method was carried out, i.e., interactions can be identified between newly discovered or non-defined loci. The present invention also provides the use of the method of the invention for
(a) determining the probability of interaction between a plurality of loci. As described, the method of the invention may be used to determine specific interactions, and is capable of differentiating leading interactions from bystander interactions;
(b) mapping loci and/or genome architecture in the compartment. A map, in particular, a matrix, can be drawn up for specific loci or the chromosomal architecture based on the co-segregation frequencies determined;
(c) analysing interactions of different functional elements selected from the group comprising promoters, enhancers, enzymes, e.g., involved in transcription, transposable elements, transcription factor binding sites, repressors, gene bodies, splicing signals, or RNA;
(d) identification of regulatory regions regulating expression of a specific gene;
(e) identification of targets for and/or effects of a drug capable of influencing co-segregation of loci;
(f) analysing effects of a gene therapy or genetic engineering on co-segregation of loci.
Chromosomal insertion of a nucleic acid due to gene therapy or other genetic engineering approaches may affect genome architecture, e.g., it may enhance or prevent interaction of regulatory regions with specific promoters and thus affect transcription of "unrelated" genes. In addition, the expression pattern of the introduced nucleic acid may itself depend on, or be disrupted by, its interactions with endogenous regulatory regions. The method of the invention allows for assessment of the effects of gene therapy or genetic engineering on the level of interaction between different loci;
(g) mapping chromosomal rearrangements, e.g., in cancer, including in specific sub-tissue cell populations, e.g. to study clonal evolution of rearrangements;
(h) analysing a disturbed co-segregation of loci in a disease; and/or (i) diagnosing a disease associated with a disturbed co-segregation of loci.
(j) stratifying patients with a specific disease into sub-groups that are more or less likely to respond to a particular drug treatment depending on the proximity or position of certain loci or chromosomes.
(k) identifying a species in a mixture of species, e.g., identifying a potentially novel microorganism species in a mixture of species, e.g. the method of the invention may be used in identification of species in microbial communities, e.g. as described for Hi-C in Burton et al. (2014, G3 4, 1339-1346).
(l) specifically mapping contacts mediated by a defined factor (or molecule of interest, e.g., protein, RNA, DNA and/or their modifications), e.g., by extracting said factor and associated complexes of fragments after step (a)(ii) is carried out, e.g., by immunoprecipitation of the defined protein and associated complexes of fragments (step (a) (iii.O)).
Option (l) may be of specific interest, as it reduces the complexity of the sample.
The present invention thus also provides a method of diagnosing a disease associated with a disturbed co-segregation of loci in a patient, comprising, in a sample taken from said patient, analysing co-segregation of a plurality of loci in the patient, and comparing said co-segregation with co-segregation of said loci in a subject already diagnosed with said disease, wherein the co-segregation is preferably also compared with co-segregation in a healthy subject. Alternatively, co-segregation of loci may be compared between specific sub-groups of cells, which may be derived from the same patient, e.g., tumor cells and normal tissue. Co-segregation can also be analysed in different cell types upon derivation of pluripotent stem cells from the patient, or model organism, and their experimental differentiation into specific cell types through laboratory culture in appropriate conditions, e.g. in the presence of the appropriate factors, in a suitable container, at the appropriate temperature, e.g. 37°C for human samples. In the context of the invention, "a" is meant to refer to "at least one", if not specifically mentioned otherwise.
As the present invention may be used to investigate a disturbed co-segregation of loci in a patient, i.e., chromatin misfolding, it may also inform or guide the treatment of patients having a disease associated with chromatin misfolding, as such patients may, after diagnosis with a method of the invention, be treated to correct chromatin misfolding (Deng W., Blobel G., 2014, Curr Op Genet Dev. 25: 1-7). The present invention may then be used to monitor the effects of such treatments on chromatin misfolding.
The following examples are included to illustrate the invention, not to limit its scope. Methods of sample preparation and analysis and/or statistical methods used in the examples may also be used in the method of the invention to analyse samples of different origin and/or for different loci of interest, e.g., as mentioned herein. Literature cited is herewith fully incorporated herein for all purposes.
Figure legends:
Fig. 1. Limitations of current 3C-based methods due to dependency on ligation of DNA ends for capturing contacts between nucleic acids. In 3C-based methods, the presence of multiple loci in a single interaction may dilute the measured ligation frequency between any two loci that are members of the interaction. In GAM-ch, the measured interaction is not affected by multiplicity. For full statistical treatment, see O'Sullivan J.M. et al., 2013. The statistical mechanics of chromosome conformation capture. Nucleus 4, 390.
Fig. 2. Outline of the GAM-ch method. Chromatin is prepared from mildly-fixed cells and randomly fragmented (1). Crosslinked chromatin is divided (aliquoted) across tubes to have <1 haploid genome equivalent per tube (2). The DNA content of each tube is determined to assess the co-segregation of genomic sequences across tubes (3). Co-segregation of genomic sequences reflects chromatin contacts of genomic sequences in the cell nucleus dependent on protein-protein and protein-RNA bridged interactions and is used to measure long-range chromatin interactions.
Fig. 3. Validation of chromatin by 3C.
A. Schematic presentation of the mouse β-globin gene cluster (adapted from Tolhuis, B. et al. (2002). Looping and interaction between hypersensitive sites in the active beta-globin locus. Molecular Cell 10, 1453). Arrows and circles depict the individual hypersensitive sites. The β-globin genes are indicated by triangles, with active genes (βmaj and βmin) in grey and inactive genes (εy and βh1) in black. The olfactory receptor (OR) genes are indicated by white boxes, of which some were shown to interact with the β-globin gene cluster. Grey boxes also indicate other gene loci (3' olfactory receptor genes, Uros and Eraf), which were shown to interact with the β-globin gene cluster in embryonic liver tissue. LCR, Locus Control Region.
B. A hypothetical 3D model of the active chromatin hub (ACH) based on population-based 3C data from Tolhuis et al. (2002). Neither the size of the ACH nor the actual position of the elements relative to each other is to scale. Hypersensitive sites and active genes of the locus form a hub of hyper-accessible chromatin (ACH). The inactive regions of the locus, having a more compact chromatin structure, are indicated in grey, with the inactive εy and βh1 genes in lighter grey. The olfactory genes are not shown. The interactions in the ACH would be dynamic in nature, in particular with the active genes (βmaj and βmin), which are alternately transcribed.
C. Previously published 3C analyses of the β-globin cluster (Tolhuis et al., 2002), using the 3' end hypersensitive site (3'HS1) as bait. Fetal liver and brain cells from E14.5 mouse embryos were fixed (10 min) in 2% formaldehyde and digested with HindIII before ligation under highly diluted conditions. Ligation products were quantified by PCR.
Crosslinking frequency with a value of 1 arbitrarily corresponds to the crosslinking frequency between two neighbouring control fragments within the Calreticulin (CALR) gene locus, which is expressed at similar levels in the two tissues. A schematic illustration of the mouse β-globin gene cluster is depicted; the grey shading represents the position and size of fragments generated by HindIII restriction.
D. The quality of the chromatin preparation was validated by 3C at four regions of the murine β-globin gene cluster in fetal liver and brain cells. Fetal liver and brain cells from E14.5 mouse embryos were fixed (5 or 10 min) in 2% formaldehyde, digested with HindIII and ligated under highly diluted conditions. Ligation products were quantified by qPCR using the 3' end hypersensitive site (3'HS1) as bait. Means and SEM are shown. The black vertical line indicates the position and size of the 3C-bait fragment containing 3'HS1. Crosslinking frequency with a value of 1 arbitrarily corresponds to the crosslinking frequency between two neighbouring control fragments (with the analyzed restriction sites being 8.3 kb apart) within the Ercc3 gene locus (on chromosome 18), which is expressed at similar levels in fetal liver and brain. Black bars indicate the position of primer pairs used for 3C.
Fig. 4. Library preparation of GAM-ch samples.
A. Representation of the WGA procedure (Sigma). This procedure is adapted according to the present invention, e.g., to include a cross-linking step before fragmentation and a step of division into fractions after fragmentation. DNA is randomly fragmented by proteinase K (PK) digestion/cleavage. 30 bp adapters are ligated to fragment ends and adapter-specific primers are used to amplify fragments with similar efficiency (Image from Sigma's WGA kit advisor; sigma.com/wga).
B. Agarose gel electrophoresis of WGA-amplified GAM-ch fractions (top) containing ~0.2, ~0.7 and ~10 genomes, followed by preparation of Illumina libraries (bottom). GAM-ch samples marked with an asterisk were used for library preparation. WGA-amplified DNA was fragmented to ~400 bp using Covaris, and amplified using the Illumina library mate-pair kit. DNA fragments were excised (350-650 bp for ~0.2 and ~10 genomes, 200-650 bp for ~0.7 genomes), quantified and sequenced.
Fig. 5. Alignment of sequencing reads to the genome.
A. Strategy to align unmapped reads to the genome by trimming either the first or the last 36 nucleotides of each 72 nucleotides read. This strategy is not necessary when sequencing using dark cycles or tagmentation based library preparation.
B. Low percentage of alignment of sequenced reads to the genome due to WGA adapters. Using the trimming strategy strongly increases the percentage of alignment of sequenced reads to the genome. GAM-ch is also designated xGAM.
Fig. 6. Mapping of GAM-ch-seq datasets corresponding to ~0.2 and ~10 genomes in comparison with linear DNA.
Images of sequencing profiles were generated by mapping sequenced reads to the mouse genome and visualising in the UCSC Genome Browser (mouse genome assembly mm9). Read distributions for different GAM-ch (xGAM) samples at resolutions of 1 Mbp (A), 50 kbp (B) and 5 kbp (C) are shown for chromosome 2. Homogeneous coverage of the genome is apparent for libraries produced from WGA-amplified linear DNA (10 ng) and the GAM-ch sample containing ~10 genomes (10 G), in contrast with the GAM-ch sample containing ~0.2 genomes (0.2 G). Low background and higher accumulation of reads per cluster are detected in GAM-ch-0.2 G samples.
Fig. 7. Gaps are defined as regions which are not covered by reads.
The sequencing depth is calculated by dividing the genome into identical windows and counting the number of nucleotides covered by reads, which fall into each window.
Fig. 8. Gap-size (A) and sequencing depth (B) distributions for 10 ng of linear DNA and GAM-ch (xGAM) samples at ~0.2 and ~10 genomes.
X axes represent the gap-sizes and sequencing depth at 1 kb windows (bp) in log10 scale. Y axes represent Kernel probability densities. Graphs are plotted using the density function in R.
Fig. 9. Thresholds from Gaussian fitting to GAM-ch fractions with ~0.2 genomes.
The threshold is defined as the number of reads for which the height of the Gaussian fit (Δx, thick dotted line) equals the height of the entire sequencing depth distribution (Δy, thin grey line). X-axes represent the sequencing depth at 1 kb windows in log10 scale. Y-axes represent the Kernel probability densities.
Fig. 10. Number of "positive windows" detected from random sampling of the original datasets of ~0.2 genomes (10 to 100%, 12 pM).
Erosion of reads from the GAM-ch-0.2 genome dataset shows only a mild change in detected "positive windows" when randomly sampling ~60% of reads. Information is markedly lost when <30% of reads are considered. The threshold used here for the detection of 4 kb windows is based on the residual analysis in Fig. 9.
Fig. 11. Outline of the GAM-ch method in combination with immunoprecipitation of chromatin bound by, e.g., RNA polymerase II.
A. Single gene analyses show that enhancers are known to contact gene promoters and coding regions, and promoter regions can contact intra-genic coding regions.
B. Chromatin crosslinking and fragmentation, e.g., chromatin is prepared from mildly fixed cells and randomly fragmented, e.g. by sonication (1). Optional: fragment enrichment, e.g., by immunoprecipitation of a specific chromatin-bound protein such as RNA polymerase II (2). Division of the fragmented nucleic acids to obtain a collection of fractions (every fraction contains <1 copy of every locus, typically <0.5 copies). For example, crosslinked chromatin is either directly divided (aliquoted) across tubes to have <1 haploid genome equivalent per tube (3a), or (optionally) first enriched for chromatin occupied by a given protein (or other bound molecule of interest), e.g. by chromatin immunoprecipitation (3b). Extract and detect nucleic acids, e.g., the DNA content of each tube is extracted and identified to assess the co-segregation of genomic sequences across tubes (4). Co-segregation of genomic sequences reflects chromatin contacts of genomic sequences in the cell nucleus dependent on protein-protein and protein-RNA bridged interactions and is used to measure long-range chromatin interactions. Boxes: enhancers; thick black line: active gene; medium thick line: inactive gene; arrows: promoters.
Fig. 12. RNA polymerase II occupies active gene promoters, coding regions and enhancers.
A. Occupancy of RNA polymerase II at active and inactive genes. Density plots show the distribution of RNAPII-S5p ChIP-seq signal at promoters (also called transcription start sites (TSS)) of active genes, through coding regions, downstream of the polyadenylation site, and at transcription end/termination sites (TES). Transcriptionally silent genes are not occupied by RNAPII-S5p. The average occupancy profiles are represented at ±5 kb windows centered at the transcription start site (TSS) or transcription end site (TES). All mouse genes were ranked by their expression levels determined by mRNA-seq in mouse ESCs (Brookes et al. 2012); the top 25% of genes were then selected as the most actively transcribed genes and the bottom 25% as the most transcriptionally silent.
B. RNA polymerase II co-associates with enhancers. RNAPII-S5p is present at enhancers defined in murine ESCs according to Whyte et al. (2013). Background levels of ChIP signal were determined by a control ChIP experiment using a non-specific antibody against the plant steroid digoxigenin.
C. Confirmation of RNAPII-S5p occupancy determined by ChIP combined with quantitative PCR at active, Polycomb-repressed and inactive genes. Quantitative PCR confirms the expected enrichment of RNAPII-S5p at active (Oct4) and Polycomb-repressed (Nkx2.2, HoxA7) genes, and its absence at the inactive (Myf5) gene (Stock et al. 2007). Background levels (mean enrichment after ChIP with a non-specific antibody against the plant steroid digoxigenin) at promoter and coding regions are shown in black bars. Means and standard deviations from three biological replicates are shown.
Fig. 13. GAM-chIP optimisation of DNA extraction and amplification by WGA.
A. Titration of amount of DNA used for WGA reactions. Different amounts of immunoprecipitated DNA (2.5 pg to 2.5 fg) were subjected to WGA in parallel with a sample containing no DNA ('water control') and resolved by agarose gel electrophoresis. The total amount of DNA amplified from each sample is shown underneath (as measured by Picogreen quantitation; in ng).
B. Reliability of DNA extraction by WGA. Equal amounts of chromatin immunoprecipitated DNA were amplified by WGA (in triplicate) and resolved by agarose gel electrophoresis. The total amount of DNA amplified is shown underneath (as measured by Picogreen quantitation; in ng).
C. Percent of ChIP-enriched positive windows for different starting amounts of chromatin immunoprecipitated DNA. The percentage of positive windows is higher for GAM-chIP samples with larger amounts of input DNA. ChIP-enriched positive windows were determined by the number of reads in each 5 kb window from published RNAPII-S5p ChIP-seq data obtained in mESC (Brookes et al. 2012). The top 2% of 5 kb windows were taken as the genomic windows most enriched for RNAPII-S5p.
Fig. 14. GAM-chIP raw data and detection of positive genomic windows.
A. GAM-chIP profiles of raw sequencing data across two genes show that more positive windows are detected across an actively transcribed gene than an inactive gene. Represented tracks from top to bottom: 1 - RNAPII-S5p ChIP-seq in mESC; 2 - cumulative window detection frequency across 182 GAM-chIP datasets; 3-7 - raw sequencing data for five randomly chosen GAM-chIP datasets together with representation of positive windows defined by fitting binomial distributions (black horizontal bars) or by the JAMM peak-finder approach (striped horizontal bars); 8 - raw sequencing data for a control sample containing no chromatin immunoprecipitated material (water control). Images were obtained from the UCSC Genome Browser using mean as the windowing function. A schematic representation of the genes present in the selected regions is shown underneath.
Fig. 15. Quality controls of GAM-chIP dataset.
A. Percentage of mapped reads to the mouse genome for the exploratory GAM-chIP dataset (Table 2, GAM-chIP. Exp003). The distribution of the percentage of mapped reads was highly reproducible between GAM-chIP samples within each pool and between sequencing pools. Negative control samples yielded a very low percentage of mapped reads, whilst positive control samples show the highest percentage of mapped reads to the mouse genome.
B. Each GAM-chIP sample contains only a restricted subset of sequences from each chromosome. Each mouse chromosome was divided into 5 kb windows, and the percentage of positive 5 kb windows was plotted for each chromosome and for each GAM-chIP sample. No GAM-chIP sample contains more than 12% of any given chromosome, and all chromosomes are comparable in coverage except for chromosome X, which is present in only a single copy (whereas autosomal chromosomes are present in two copies), as expected in the male ESC line used.
C. Detection of 5 kb windows overlapping gene promoters in the GAM-chIP dataset is proportional to the abundance of RNAPII-S5p occupancy in ChIP-seq datasets (from published data; Brookes et al. 2012). The TSS-overlapping 5 kb windows with the least binding of RNAPII-S5p are detected in 4.4% of GAM-chIP samples on average, whereas those with the most abundant binding are detected in an average of 12.5% of GAM-chIP samples.
D. The percentage of 5 kb positive windows overlapping transcriptionally active genes and enhancers is higher than the percentage of 5 kb positive windows overlapping transcriptionally silent genes. The percentage of positive windows is shown for gene body (gene), promoters (transcription start site, TSS) and transcription end site (TES). The sets of most actively transcribed and of most silent genes were chosen based on their expression levels, as determined by mRNA-seq in a published dataset (Brookes et al. 2012). Positive 5 kb windows overlap gene promoters with high RNAPII-S5p levels (as determined from a published ChIP-seq dataset; Brookes et al. 2012) more often than gene promoters with low RNAPII-S5p levels.
Fig. 16. Co-segregation of genomic windows within actively transcribed genes in GAM-chIP samples. GAM-chIP samples containing multiple positive windows from the same actively transcribed gene occur more frequently than GAM-chIP samples containing multiple positive windows from the same silent genes and more often than would be expected by chance, confirming that chromatin contacts can be formed within actively transcribed genes during transcription (as schematized in Fig. 11).
Fig. 17. Co-segregation of genic regions of actively transcribed genes coincides with preferential co-segregation of nearby enhancers in GAM-chIP samples.
The number of positive genomic windows detected within active genes, Osbpl9 and Birc6 (A), and within silent genes, Ofcc1 and Ros1 (B), was used to order the tracks representing window detection in the 182 GAM-chIP samples. For active (but not for silent) genes, the nearest enhancer was more frequently observed in the GAM-chIP samples with the highest number of positively detected intragenic windows. The co-segregation of nearby enhancers in the same GAM-chIP samples as actively transcribed genes is indicative of a chromatin interaction between the enhancer and gene during transcription.
Examples
Classical genome mapping makes use of different in-vivo mapping approaches such as linkage analysis or radiation hybrid mapping, which are limited by biological complications such as fluctuations of meiosis or cloning artifacts of various sources. To overcome the limitations of in-vivo mapping approaches, HAPPY Mapping was introduced by Dear and Cook (Dear P.H. and Cook P.R. (1989). Happy mapping: a proposal for linkage mapping the human genome. Nucleic acids research 17, 6795-6807; Dear P.H. and Cook P.R. (1993). Happy mapping: linkage mapping using a physical analogue of meiosis. Nucleic acids research 21, 13-20), and involves an in-vitro single-molecule technique to accurately map the linear sequence of genomes of any species by using haploid equivalents of DNA and the polymerase chain reaction. HAPPY Mapping is based on the co-segregation and detection of nearby DNA markers in the genome and uses limiting dilutions of fragmented DNA to single molecule contents.
The presence of unlinked DNA markers in a set of samples with subgenomic content follows a Poisson distribution, whilst linked markers are more likely to be co-detected in the same tube the closer they lie to one another on the linear genome. For the purpose of HAPPY Mapping, the most informative frequency of detection of DNA markers among tubes is 50% (Dear, P.H. 2005. HAPPY Mapping. Encyclopedia of Life Sciences), which corresponds to a genomic content of ~0.7 haploid genomes per sample, from the Poisson distribution, equivalent to ~2 pg of DNA from a murine cell. The presence of DNA markers in each sample can be detected by two-phase hemi-nested PCR and the extent of co-segregation measured using LOD (logarithm of the odds) scores (Dear P.H., Cook P.R. (1993). Happy mapping: linkage mapping using a physical analogue of meiosis. Nucleic acids research 21, 13-20). LOD scores measure the likelihood that two DNA markers are linked, and can be converted into a one-dimensional genetic map using the technique of distance geometry (Newell WR et al. 1995. Construction of genetic maps using distance geometry. Genomics 30, 59-70).
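By way of illustration only, the relationship between the detection frequency and the genomic content per sample can be sketched as follows. Under Poisson sampling, the probability that a sample contains at least one copy of a marker is 1 - exp(-m), where m is the mean number of haploid genome equivalents per sample; a detection frequency of 50% therefore gives m = ln 2, i.e. about 0.7. The function names and the approximate haploid mouse genome mass of 3 pg are assumptions introduced for this sketch and are not taken from the cited references.

```python
import math


def detection_probability(mean_genomes: float) -> float:
    """P(at least one copy of a marker in a sample) under Poisson sampling."""
    return 1.0 - math.exp(-mean_genomes)


def genomes_for_detection(p: float) -> float:
    """Mean haploid genome equivalents per sample giving detection frequency p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1.0 - p)


# Assumed, approximate mass of one haploid mouse genome (illustrative value only).
HAPLOID_MOUSE_GENOME_PG = 3.0

m_half = genomes_for_detection(0.5)        # ln 2, about 0.69 haploid genomes
dna_pg = m_half * HAPLOID_MOUSE_GENOME_PG  # about 2 pg of DNA
print(f"{m_half:.2f} haploid genomes per sample, ~{dna_pg:.1f} pg DNA")
```

The same functions may be used to estimate the dilution required for any other target detection frequency.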
GAM-ch applies the basic principle of HAPPY Mapping to a different purpose: instead of measuring linear genomic distances, it measures long-range chromatin interactions between any genomic regions within the three-dimensional cell. Cells are first treated with a crosslinking agent which, for example, chemically crosslinks proximal genomic regions in the same or different chromosomes, before chromatin fractionation. Unlike current 3C methods, GAM-ch detects chromatin proximity but does not require ligation of crosslinked DNA fragments. In GAM-ch, chromatin preparations similar to 3C are prepared and diluted as for HAPPY Mapping, before quantification of co-segregation frequency; genomic regions that are bridged by proteins and crosslinked during the chromatin preparation will co-segregate more frequently than genomic regions that do not interact (Fig. 2). GAM-ch can provide single allele information about multiplicity of interactions, i.e. multiple genomic regions interacting at the same time with a given allele.
In 3C, a given DNA fragment in a high multiplicity chromatin interaction can only ligate with one or two (at high restriction and ligation efficiency) other DNA fragments. This limitation of 3C makes it difficult to distinguish, for example, between a low-frequency chromatin interaction involving only two fragments and an interaction that involves many genomic partners at high frequency across the cell population (Fig. 1). For example, the same 3C signal (e.g. a measured contact of 50%) can be due to an interaction that occurs for half the alleles in the cell population if the multiplicity of interaction is only two (or possibly three), or be due to an interaction that occurs in all alleles (real contact frequency is 100%) but is underestimated to only 50% due to competition with other bound DNA fragments that co-bind at high multiplicity, thereby diluting the probability of ligation between any single fragment and all others.
The first 3C-based study to show the formation of an active chromatin conformation was at the β-globin gene cluster in fetal liver (Fig. 3A-C) (Tolhuis et al. 2002). This study suggested that different genomic regions form an "active chromatin hub". This active chromatin hub contained the two active adult β-globin genes and multiple hypersensitive sites (HSs) of the Locus Control Region (LCR) as well as upstream and downstream of the β-globin gene cluster. However, 3C analyses cannot unequivocally determine whether active β-globin genes and HSs come together at the same time or whether the active chromatin hub is just an average reflection of all chromatin interactions that occur at different alleles within the same cell or different cells in the cell population. GAM-ch has the potential to circumvent this limitation, providing information about multiplicity of interactions at single-allele resolution.
One possible approach to perform DNA detection in the context of GAM-ch is to perform single-molecule multiplex PCR to identify DNA in GAM-ch samples, in similar conditions to HAPPY Mapping. The best DNA extraction approach with whole genome amplification was subsequently combined with high-throughput sequencing to assess the extent of DNA recovery, crucial to assess the feasibility of GAM methods genome-wide.
To allow for a more uniform and efficient amplification of the GAM-ch samples with unbiased sequence representation, single-cell whole-genome-amplification (WGA, Sigma) was used on cross-linked chromatin after the Proteinase K (PK) treatment in WGA fragmentation buffer followed by amplification of DNA fragments with universal primers (Fig. 4A).
Part 1. Genome-wide GAM-ch
To perform genome-wide GAM-ch combined with high-throughput sequencing (GAM-ch-seq), each GAM fraction was subjected first to WGA fragmentation, primer ligation and PCR amplification. WGA-amplified GAM-ch samples were then further amplified using the Illumina library preparation, which adds new sets of primers at each end of the DNA fragments. GAM-ch-seq samples were sequenced using the Illumina sequencing platform (Table 1). As recent 3C-based genome-wide mapping approaches use HindIII digestion, instead of sonication, this approach was also adopted here. Validation of HindIII-digested chromatin preparations was performed by 3C analyses (Fig. 3D). Linear DNA was used in parallel to test the effects of WGA and high-throughput sequencing on sequence representation, and as a positive control.
Table 1. Summary of libraries sequenced on the Illumina platform.
Figure imgf000028_0001
Preparation of sequencing libraries for Illumina sequencing
GAM-ch samples were prepared for Illumina sequencing as described for 3C and validated by 3C-qPCR using published primer sequences (Fig. 3D). Nuclei from fetal liver cells, fixed for 5 min, were extracted, counted using a haemocytometer, subjected to digestion with HindIII (digestion efficiency of ~77%), and aliquots of ~100 genomes/µL were prepared and frozen for further use. Different genome numbers of 3C-like chromatin were first subjected to WGA fragmentation (1 h at 50°C with PK and 4 min at 99°C) and amplification (~0.2, ~0.7 and ~10 genomes/tube; Fig. 4B). Linear human DNA (2 ng; provided with the WGA kit) was used as a positive control for the WGA reaction.
Fragment sizes of crosslinked chromatin range from ~0.3-2 kb, whereas linear DNA is less fragmented upon WGA, probably due to lower-sized DNA fragments present in HindIII-digested chromatin (average distance between HindIII restriction sites is ~4 kb in the mouse genome). GAM-ch samples of ~0.2 genomes did not show visible products on ethidium bromide gels after WGA amplification (Fig. 4B), but yielded visible products upon preparation of the sequencing library (Fig. 6).
GAM-ch samples were subjected to Illumina library preparation and DNA fragments were size-selected (350-650 bp for ~0.2 and ~10 genomes, 200-650 bp for ~0.7 genomes) and sequenced. Since the ~0.7 genome GAM-ch sample showed less-intense WGA products, DNA fragments from a wider size range were excised and sequenced. Linear mouse DNA was also amplified by WGA and Illumina library kits (not shown) and sequenced in parallel.
Alignment of reads to the genome
Sequencing yielded ~1×10⁶ to 10×10⁶ filtered reads of 72 nucleotides (nts) across the four libraries sequenced (linear DNA, GAM-ch-0.2, GAM-ch-0.7 and GAM-ch-10). When trying to align reads sequenced from Illumina TruSeq libraries made from WGA-amplified DNA to the genome, the percentage of alignment was relatively low for both linear DNA and GAM-ch samples (~24±6%), probably due to the presence of the WGA primers (~30 nts) at a proportion of DNA ends (Fig. 5B). To circumvent this problem, a trimming strategy was developed (Fig. 5A), which involves distal trimming of unmappable reads. Each unmappable read is trimmed at its 5' end by 36 nts and mapped back to the genome. For the remaining reads that still do not align, 36 nts are trimmed at the 3' end of the read, and the resulting 36 nt read is realigned to the genome. This trimming strategy increased the overall percentage of alignment to ~54±6% (Fig. 5B). This trimming pipeline is not necessary for libraries produced using Illumina Nextera library kits, as the library production relies on tagmentation.
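The trimming strategy described above can be sketched, purely as an illustration, in the following way. The sketch assumes that the 3' trimming step is applied to the original 72 nt read (so that a 36 nt read results, as stated above). The aligner is represented by a hypothetical callable, here a toy substring search against an invented random sequence; it stands in for whichever read aligner is actually used, which is not specified here.

```python
import random
from typing import Callable, Optional

TRIM = 36  # nucleotides removed per trimming step, as described above


def align_with_trimming(read: str, aligns: Callable[[str], bool]) -> Optional[str]:
    """Return the first version of the read that aligns, or None.

    Order of attempts: full read, read with 36 nt removed from the 5' end,
    read with 36 nt removed from the 3' end.
    """
    candidates = [read, read[TRIM:], read[:-TRIM]]
    for candidate in candidates:
        if candidate and aligns(candidate):
            return candidate
    return None


# Toy reference and aligner (hypothetical stand-ins for a real genome and aligner).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))


def toy_aligner(seq: str) -> bool:
    return seq in genome


adaptor = "N" * TRIM                       # stands in for unmappable primer sequence
read_5p = adaptor + genome[100:136]        # primer at the 5' end of the read
read_3p = genome[500:536] + adaptor        # primer at the 3' end of the read

print(align_with_trimming(read_5p, toy_aligner) == genome[100:136])
print(align_with_trimming(read_3p, toy_aligner) == genome[500:536])
```

Reads that align in full are returned unchanged, so the two trimming steps only affect reads that would otherwise be discarded.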
Sequencing profiles of GAM-ch samples
Sequencing reads from 10 ng of linear DNA, and from ~0.2 and ~10 genomes of GAM-ch DNA, were visualized at different resolutions (1 Mbp, 50 kbp and 5 kbp) using the UCSC Browser tool (Fig. 6). The GAM-ch sample of ~0.7 genomes revealed several over-represented peaks (not shown) in addition to inefficient library amplification (Fig. 4B, bottom), and was therefore not analysed further.
A homogeneous coverage of genomic sequence is observed for both 10 ng of linear DNA and ~10 genomes of GAM-ch, suggesting that the WGA amplification combined with Illumina library sequencing does not induce visible biases in genome amplification and that DNA can be efficiently extracted from single molecule 3C-like chromatin and detected by next-generation sequencing. GAM-ch libraries obtained from ~0.2 genomes show a more clustered distribution of sequencing reads with higher enrichment, as expected due to lower genomic content. This is consistent with a lower diversity of DNA fragments in the ~0.2 genome libraries. The higher enrichment suggests that the amount of sequence obtained may already be sufficient to over-represent this diversity. Inspection of genomic coverage at higher resolution shows that reads in the subgenomic GAM-ch-0.2 dataset are clustered within specific DNA fragments. The very low detection of reads at intervening genomic regions shows the quality of our GAM-ch samples and is consistent with undetectable DNA contamination throughout the various procedures involved in single DNA molecule detection.
Evaluation of different sequencing libraries obtained for GAM-ch
The first step in the analysis of GAM-ch samples is to detect DNA fragments that are present or absent in each GAM-ch sample analysed with subgenomic content. This requires the definition of background read distribution, and a decision about an appropriate window size. In the case of GAM-ch, the window size should reflect the average size of the DNA fragments present in 3C-like chromatin. For HindIII restriction, this corresponds to ~4 kb fragments. Two different statistical approaches were used to analyse and compare sequencing results from multiple libraries. First, the distribution of the gap-size between adjacent covered areas of the genome was analysed, and second, the sequencing depth at different window sizes was studied (Fig. 7). Both approaches were used to analyse the sequencing results from linear DNA and GAM-ch samples (Fig. 8).
For 10 ng of linear DNA, an even distribution of reads is observed with an average gap-size of ~5 kb (Fig. 8A). Sequencing libraries obtained from samples with high amounts of linear DNA result in a high diversity of reads that are distributed across the genome, but which have little chance of being sequenced multiple times. This effect can also be seen when plotting the distribution of sequencing depth in 1 kb windows, where on average one single read is present at each 1 kb window (Fig. 8B). This shows that the WGA overall provides good genomic coverage, although further analyses are necessary to study more detailed biases, such as GC content.
The content of GAM-ch samples with ~10 genomes also shows an even distribution across the genome, meaning that the whole genome is covered, which suggests that DNA extraction from 3C-like chromatin is efficient. The average gap size peaks at ~1 kb (Fig. 8A) and displays a second population of gap-sizes of ~100 bp. This may reflect the fact that not all genomic regions are represented in this low DNA content sample; it can be the result of interacting DNA sequences within short range distances (as seen in 4C results) being frequently brought together due to crosslinking; further sequencing experiments and analyses are currently ongoing to investigate the significance of the different gap distributions. Due to the lower diversity of DNA fragments, sequenced reads in the ~10 genomes sample are sequenced multiple times, such that each 1 kb window is covered by more reads, with the distribution of sequencing depths peaking at ~500 nts per 1 kb window (Fig. 8B). Since sequences are a mix of 36 and 72 nt reads, which will appear as single spikes representing a multiple unit of 36 nts in the sequencing depth distribution, each average read would contain about 50 nts ((36+72)/2 = 54). Therefore, each 1 kb window of the ~10 genomes GAM-ch sample would be covered by ~10 reads. In addition, many windows with <10 reads exist, which are visualized by the left spiky tail in the sequencing depth curve and are hardly distinguishable from the main population of windows with 10 reads.
GAM-ch samples with ~0.2 genomes contain only a fraction of the genome, as seen in wider gaps in the read distribution and a gap-size peaking at ~50 kb, with additional shoulders reflecting non-random spacing between DNA fragments; this is consistent with the presence of chromatin interactions in these few GAM-ch samples. The less diverse set of fragments in the ~0.2 genomes sample is sequenced more frequently than fragments in the GAM-ch-10 genomes sample, resulting in about 5000 sequenced nucleotides in each 1 kb window, corresponding to ~100 reads per 1 kb window. This suggests that the depth of sequencing as performed here, for a standard Illumina GAII lane, is likely to be sufficient to saturate the detection of DNA fragments in the lowest complexity 0.2-genome fractions. In contrast to GAM-ch samples with ~10 genomes, a clear separation of two populations of windows, those with fewer reads (<10 reads) and those with many reads (>10 reads), is observed, and a threshold of read number can separate both populations to distinguish signal from noise.
Detection of positive and negative DNA fragments in GAM-ch samples
To distinguish signal from noise in GAM-ch subgenomic datasets, we devised two strategies. One strategy consisted of comparisons between two replicate sequencing datasets of the same ~0.2 genome GAM-ch sample (these were loaded at 12 and 16 pM; named hereafter 0.2 G-12 pM and 0.2 G-16 pM). The distribution of their residuals was used to calculate the thresholds based on the quantiles of the data (not shown). For a window size of 4 kb, the threshold corresponds to 940 nts (~13 reads each with 72 nts) at the 99th percentile. The second, preferred strategy was based on fitting two curves to the sequencing depth distribution for the GAM-ch sample with ~0.2 genomes (Fig. 9). The GAM-ch sample with ~10 genomes did not have enough sequencing depth to sufficiently resolve the signal from the noise distribution. The threshold value was defined as the point where the height of the distribution of the "true signal" (=Gaussian fit) equals that of the "noise" (see sequencing depth distribution, Fig. 8). For the GAM-ch sample with ~0.2 genomes, the threshold is 790 nts (~11 reads with 72 nts), of the same order of magnitude as the threshold obtained with the residual distribution approach.
Estimation of sequencing depths required for GAM-ch analyses
As part of the optimisation of GAM-ch-seq, it was important to estimate the optimal sequencing depth required to achieve good signal/noise ratios at a minimum sequencing cost. Analysing the sequencing depth from GAM-ch samples at ~0.2 genomes (12 pM; Fig. 8), it became clear that the number of reads per window could be reduced. To estimate the minimal optimal sequencing depth, we determined the number of 4 kb windows that are positively detected when the number of reads is decreased from 100% to 10%. To do so, a random subset of reads corresponding to 10-90% of total reads was sampled from the original dataset and used to calculate the sequencing depth at each 4 kb window. The number of "positive windows" (windows with a sequencing depth above threshold) increases with the percentage of sampled reads, as expected, but reaches a plateau consistent with saturation (Fig. 10). Mathematical calculations suggest that ~60% of the reads produced (equivalent to 4.8×10⁶) would be sufficient to achieve enough information for a GAM-ch sample with genomic sequence complexity equivalent to ~0.2 genomes. This corresponds to 1/5th of a typical lane of sequencing on the more recent Illumina platform, HiSeq, suggesting that the cost of GAM-ch-seq is likely to become accessible as further developments in sequencing take place.
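A minimal sketch of such a subsampling analysis is given below for illustration. It uses the 4 kb window size, 72 nt read length and 790 nt threshold mentioned above, but the function names, the random seed, the toy read positions and the simplification of assigning each read wholly to the window containing its start are assumptions made for this sketch only.

```python
import random
from collections import Counter


def count_positive_windows(read_starts, window_size, read_length, threshold_nt):
    """Count windows whose summed sequenced nucleotides exceed the threshold.

    Simplification: each read is assigned entirely to the window holding its start.
    """
    depth = Counter()
    for start in read_starts:
        depth[start // window_size] += read_length
    return sum(1 for nt in depth.values() if nt > threshold_nt)


def saturation_curve(read_starts, fractions, window_size=4000,
                     read_length=72, threshold_nt=790, seed=0):
    """Number of positive windows for each sampled fraction of the reads."""
    rng = random.Random(seed)
    curve = {}
    for fraction in fractions:
        n_sampled = round(fraction * len(read_starts))
        subsample = rng.sample(read_starts, n_sampled)
        curve[fraction] = count_positive_windows(
            subsample, window_size, read_length, threshold_nt)
    return curve


# Toy data (invented): 50 "signal" windows with 100 reads each,
# plus 2000 "noise" windows with a single read each.
signal = [w * 4000 + i for w in range(50) for i in range(100)]
noise = [w * 4000 for w in range(1000, 3000)]
reads = signal + noise

curve = saturation_curve(reads, [0.1, 0.3, 0.6, 1.0])
for fraction, n_positive in curve.items():
    print(f"{fraction:.0%} of reads: {n_positive} positive windows")
```

In this toy example the number of positive windows stops increasing once every signal window is reliably above threshold, which is the plateau behaviour referred to above.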
Part 2. GAM-chIP: combining GAM-ch with immunoprecipitation to capture fragments co-occupied by RNA polymerase II phosphorylated on Serine-5.
In one embodiment of GAM-ch, the DNA fragments bound by a specific protein (or other molecule of interest) are selected from the bulk chromatin, e.g. by chromatin immunoprecipitation (ChIP), a strategy called GAM-chIP. GAM-chIP is performed with an additional step in which crosslinked chromatin fragments, containing a given protein or protein post-translational modification, are first selected prior to their dilution between tubes, e.g. to enrich for fragments containing genes and regulatory regions (enhancers) (Fig. 11). Including this additional selection step has two advantages. First, it allows for detection of chromatin contacts which are formed in the presence of the given protein or protein post-translational modification. Second, it reduces the complexity of the fragments present in the chromatin material, which is partitioned (aliquoted) in a collection of samples, and therefore reduces the sequencing cost per tube. A diagram describing GAM-chIP in comparison with GAM-ch is presented in Fig. 11B.
To test the feasibility of GAM-chIP, we chose to select DNA fragments bound by RNA polymerase II phosphorylated on the Serine-5 residue of the CTD, which we abbreviate to RNAPII-S5p. RNAPII-S5p was chosen because it has high occupancy at active genes, especially at promoters, throughout coding regions and transcription termination sites, and enhancers (Fig. 12A,B). Combining GAM-ch with ChIP for RNAPII-S5p therefore has the potential of increasing the power of GAM-ch to detect contacts between enhancers and their target genes.
To perform genome-wide GAM-chIP, chromatin was crosslinked using formaldehyde and fragmented by sonication, then chromatin fragments bound by RNAPII-S5p were selected by immunoprecipitation using a specific antibody coupled to beads (CTD-4H8, Covance; according to Brookes et al. 2012). To implement GAM-chIP, fragments resulting from ChIP were eluted from beads, fractionated/diluted into a multitude of fractions and WGA amplified.
For the initial test experiments of GAM-chIP (Exp001, Exp002; Table 2), WGA-amplified GAM-chIP samples containing different amounts of DNA (from 0.25 fg to 1.25 pg DNA, Fig. 13A,B) were further amplified by Illumina TruSeq Nano preparation kit and sequenced using the Illumina MiSeq or HiSeq 2000 sequencing platform. In the follow up experiment (Exp003), we produced 180 GAM-chIP samples (1 pg DNA), which were first WGA-amplified, before the second amplification using the Illumina Nextera library preparation kit. The Illumina Nextera library preparation kit uses tagmentation to create libraries in which WGA-amplified DNA fragments are surrounded by two sequencing adaptors (one at each end). The GAM-chIP samples in Experiment 3 were sequenced using the Illumina NextSeq sequencing platform (Table 2).
Table 2. Summary of GAM-chIP libraries.
Figure imgf000034_0001
Preparation of GAM-chIP libraries
ChIP of RNAPII-S5p bound DNA fragments was performed as described previously (Stock et al. 2007; Brookes et al. 2012). Mouse embryonic stem cells (ESCs) were fixed in 1% formaldehyde for 10 min. Nuclei were then extracted, counted using a haemocytometer, and chromatin was extracted using sonication. Sonicated chromatin fragments bound by RNAPII-S5p were selected by immunoprecipitation.
Before fractionation of immunoprecipitated DNA fragments for GAM-chIP, the quality of the fragments enriched by ChIP of RNAPII-S5p was validated using quantitative PCR of DNA fragments known to be bound by RNAPII-S5p in mouse ESCs, namely promoters and coding regions of active and Polycomb-repressed genes (Fig. 12C); inactive gene promoter and coding region were used as negative control. A control ChIP experiment was performed with nonspecific antibody against plant steroid digoxigenin, which showed no DNA fragment enrichment, as expected (Stock et al. 2007; Brookes et al. 2012). This analysis demonstrated that the antibody immunoprecipitation step had successfully and efficiently selected RNAPII-S5p-bound chromatin fragments (Fig. 12C).
Following successful quality control of the DNA fragment enrichment by ChIP, the immunoprecipitated chromatin material was divided (aliquoted) into multiple tubes at the chosen dilution factor based on the measured DNA concentration.
In the first two feasibility experiments (Exp001 and Exp002), different amounts of ChIP chromatin (from 0.25 fg to 2.5 pg) were subjected to WGA fragmentation and amplification. As negative control, we performed WGA using water alone without the addition of immunoprecipitated chromatin (Fig. 13A).
GAM-chIP samples show a fragment size distribution of ~100 bp to ~1200 bp following WGA amplification (Fig. 13A), slightly smaller than for GAM-ch samples prepared by HindIII digestion without chromatin immunoprecipitation (Fig. 4B). The fragment size distributions and the amount of DNA after amplification were comparable between different samples prepared from the same concentration of input DNA (Fig. 13B). GAM-chIP samples from the first two exploratory experiments were subjected to Illumina TruSeq Nano library preparation (Table 2).
An exploratory GAM-chIP dataset was collected consisting of 182 GAM-chIP samples (Table 2, GAM-chIP Exp003), each generated from 1 pg of chromatin after ChIP for RNAPII-S5p, plus four positive controls containing 500 pg of the same chromatin, and four negative controls where no chromatin was added (water control). GAM-chIP samples in this exploratory collection were WGA amplified and subjected to Illumina Nextera XT library preparation. DNA fragments from 300-500 bp were size-selected and sequenced.
Alignment of reads to the genome
In the first experiment (Exp001), samples were sequenced on the Illumina MiSeq, which provides a limited sequencing depth suitable only for assessing quality of samples. In the second experiment (Exp002), samples were sequenced by Illumina HiSeq to determine the replicability of test library quality with different amounts of input chromatin and additionally to determine optimal sequencing depth. Sequencing yielded between 5.9 and 9.8 million filtered reads of 75 nucleotides (nts) for GAM-chIP libraries and 3.6 million reads for the positive control sample. In the third experiment (Exp003), Illumina Nextera libraries were sequenced on the Illumina NextSeq platform to a depth exceeding that predicted to be necessary to achieve saturation. Sequencing and alignment yielded between 8.2 and 41.7 million 75 nt reads for each of the 182 GAM-chIP libraries and between 18.6 and 26.2 million reads for each of the four positive control samples. Nextera tagmentation inserts sequencing adaptors randomly within input DNA, rather than strictly adding adaptors at DNA fragment ends; therefore no trimming strategy (Fig. 5A) was required to remove WGA adaptor sequences.
Detection of positive and negative DNA fragments in GAM-chIP samples
The mouse genome was divided into 5 kb windows and the number of sequencing reads mapping to each window was calculated. To distinguish signal from noise in GAM-chIP datasets, a two-curve fitting strategy was applied. The distribution of sequencing depth over 5 kb windows was fitted with a negative binomial distribution (representing sequencing noise) and a lognormal distribution (representing true signal). A threshold number of reads x was determined, where the probability of observing more than x "noise" reads mapping to a single genomic window was less than 0.001. Such a threshold was thus independently determined for each sample, and windows were scored as positive if the number of sequenced reads was greater than the determined threshold.
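For illustration, the thresholding step alone can be sketched as follows, assuming that the parameters of the negative binomial noise component have already been obtained by fitting (the fitting of the two-component mixture itself is not shown). The parameterisation by r and p, the function names and the example values are assumptions introduced for this sketch.

```python
import math


def nbinom_pmf(k: int, r: float, p: float) -> float:
    """Negative binomial P(X = k), with r > 0 and success probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def noise_threshold(r: float, p: float, alpha: float = 0.001,
                    max_reads: int = 100000) -> int:
    """Smallest x such that P(noise reads > x) < alpha."""
    cdf = 0.0
    for x in range(max_reads + 1):
        cdf += nbinom_pmf(x, r, p)
        if 1.0 - cdf < alpha:
            return x
    raise ValueError("threshold not reached; check the noise parameters")


def call_positive(read_counts, threshold):
    """A window is positive if its read count is greater than the threshold."""
    return [count > threshold for count in read_counts]


# Hypothetical noise parameters for one sample (r = 1 gives a geometric distribution).
threshold = noise_threshold(r=1.0, p=0.5)
print(threshold, call_positive([3, 9, 10, 250], threshold))
```

Because the noise parameters differ between samples, the function would be called once per sample, giving a sample-specific threshold as described above.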
GAM-ch and GAM-chIP experiments have the greatest statistical power when the chance of a given tube containing a given locus of interest is < 0.5. In the case of GAM-chIP, the loci of interest are those which are bound by the protein targeted for enrichment, which can be identified by sequencing the bulk immunoprecipitated chromatin (ChIP-seq) without dilution and WGA amplification.
To test how frequently the DNA fragments most highly enriched for RNAPII-S5p are detected, and to ensure that their detection in each GAM-chIP sample remains <50%, we determined the frequency of detection of the most enriched genomic windows for RNAPII-S5p occupancy.
As an estimation of the complexity of the datasets produced in the second experiment (Exp002), we determined the number of sequencing reads mapping to each 5 kb window by ChIP-seq of RNAPII-S5p using a published ChIP-seq dataset obtained in mouse ESCs (Brookes et al. 2012). The top 2% of 5 kb windows were taken as the genomic windows "most enriched for RNAPII-S5p". The percentage of "RNAPII-S5p most enriched windows" identified as positive in each GAM-chIP sample was determined (Fig. 13C). This percentage was highest for GAM-chIP samples with larger amounts of input DNA, but ranged from 2% to 16%, i.e. it was less than 50% in all cases. Approximately one picogram of DNA obtained after immunoprecipitation using RNAPII-S5p antibodies was deemed suitable for the collection of further GAM-chIP samples as part of a larger-scale GAM-chIP experiment (Fig. 13C).
Analyses of exploratory GAM-chIP dataset of 182 samples
The exploratory GAM-chIP RNAPII-S5p dataset consisted of 182 samples containing 1 pg of ChIP DNA, four samples with 500 pg DNA (positive controls) and four samples without DNA (negative controls). Positive windows were identified for each of these 190 samples as outlined above for the other GAM-chIP datasets. Positive windows were examined in the UCSC Genome Browser and compared to the raw sequencing data, confirming that the window-calling approach was performing sensibly (Fig. 14). We confirmed that each GAM-chIP sample contained only a subset of 5 kb windows, whilst very few positive windows were identified for the negative control samples, in support of the feasibility of the approach. Closer visual inspection of UCSC Genome Browser tracks confirmed that windows enriched for RNAPII-S5p occupancy in a ChIP-seq dataset (e.g. at an active gene, Brd4) are detected more frequently in the GAM-chIP samples than windows not enriched for RNAPII-S5p (e.g. at a silent/inactive gene, E430016F) (Fig. 14). Positive windows called by the dual-curve fitting approach described above were highly comparable to those called by an alternative published peak-finding algorithm (JAMM; Ibrahim et al. 2015), further confirming that these windows represent signal above background, and supporting the suitability of the window detection approach.
The 182 GAM-chIP samples were collected in two batches, each of which was further divided into four pools for independent sequencing to achieve sufficient sequencing depth. The first four pools were WGA amplified immediately after ChIP; the second four pools were WGA amplified from the same ChIP material following storage at -20°C after the aliquoting step but before WGA amplification. This collection of GAM-chIP samples gave a total of eight pools, each containing around 24 GAM-chIP samples. As quality control of the purity of the amplified material (from very small amounts of mouse DNA fragments, i.e. 1 pg), the percentage of sequencing reads from each library that could be successfully mapped back to the mouse genome was plotted by library pool number (Fig. 15A). Consistent with excellent purity of the amplified samples, the negative control samples yielded very low percentages of reads mapped to the mouse genome, indicating that they were not contaminated by mouse DNA (e.g. from the GAM-chIP samples processed in parallel) during the WGA amplification or library preparation steps. Positive control samples (each with 500 pg of DNA) yielded the highest percentage of mapped reads (85% on average), whilst 178 out of 182 GAM-chIP libraries showed robust read mapping rates to the mouse genome of >70%. The distribution of the percentage of mapped reads was highly reproducible between samples and between sequencing pools. In particular, pools 5 to 8 did not yield a smaller percentage of mapped reads than pools 1 to 4, indicating that they were not affected by the addition of the freezing step (Fig. 15A).
Consistent with homogeneous sampling from the fragments obtained after ChIP, each GAM-chIP sample contains only a restricted subset of sequences from each chromosome (Fig. 15B). No GAM-chIP sample contains more than 12% of any given chromosome, and all chromosomes are comparable in coverage except for chromosome X, which is present in only a single copy (whereas autosomal chromosomes are present in two copies), as expected in the male ESC line used.
GAM-chIP with RNAPII-S5p antibodies shows abundant detection of DNA fragments co-occupied by RNA polymerase II phosphorylated on Serine-5
RNAPII-S5p is most abundant at actively transcribed genes, and in particular at their promoters (Fig. 12A). To confirm that the promoters of genes more highly bound by RNAPII-S5p are also more frequently detected in GAM-chIP samples, 5 kb windows overlapping gene promoters were identified and sorted into five equal groups (quantiles) according to the occupancy of RNAPII-S5p (as determined by ChIP-seq, published dataset from Brookes et al. 2012; Fig. 15C). As expected, the detection frequency of 5 kb windows that overlap gene promoters (also called transcription start sites or TSSes) increases with increased chromatin occupancy of RNAPII-S5p. The TSS-overlapping 5 kb windows with the lowest binding of RNAPII-S5p are detected in 4.4% of GAM-chIP samples on average, whereas those windows with the highest binding are detected in an average of 12.5% of GAM-chIP samples (Fig. 15C). Future experiments will include the use of larger DNA fragment amounts per sample, to bring the detection of genomic windows most abundantly occupied by RNAPII-S5p closer to the optimal 0.5 frequency of detection of each fragment, which will provide optimal chromatin contact information from the smallest number of samples (as expected from linear HAPPY Mapping).
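The grouping of TSS-overlapping windows by RNAPII-S5p occupancy described above can be illustrated with the following sketch, which sorts windows by an occupancy value, splits them into five groups of (near) equal size and reports the mean detection frequency of each group. The function name and the toy values are hypothetical and are not taken from the dataset described here.

```python
def detection_by_occupancy_group(occupancy, detection_freq, n_groups=5):
    """Mean detection frequency of windows in each occupancy group.

    occupancy[i] and detection_freq[i] refer to the same window; groups are
    ordered from lowest to highest occupancy, and the last group takes any
    remainder when the number of windows is not divisible by n_groups.
    """
    if len(occupancy) != len(detection_freq):
        raise ValueError("inputs must have the same length")
    if len(occupancy) < n_groups:
        raise ValueError("need at least one window per group")
    order = sorted(range(len(occupancy)), key=lambda i: occupancy[i])
    size = len(order) // n_groups
    means = []
    for g in range(n_groups):
        start = g * size
        end = (g + 1) * size if g < n_groups - 1 else len(order)
        members = order[start:end]
        means.append(sum(detection_freq[i] for i in members) / len(members))
    return means


# Toy values (invented): ten windows whose detection frequency rises with occupancy.
toy_occupancy = [5, 1, 9, 3, 7, 0, 8, 2, 6, 4]
toy_detection = [0.01 * value for value in toy_occupancy]
group_means = detection_by_occupancy_group(toy_occupancy, toy_detection)
print([round(mean, 3) for mean in group_means])
```

An increase in the mean detection frequency from the first to the last group corresponds to the trend reported for Fig. 15C.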
One possible use for GAM-chIP is to identify enhancers regulating the expression of given genes. RNAPII-S5p is expected to be found at transcriptionally expressed genes and enhancers but not transcriptionally silent genes (Fig. 12A,B), and was therefore chosen as a suitable target for the exploratory GAM-chIP experiment in order to increase the potential to identify interactions within and between enhancers and active genes. The use of different proteins for immunoprecipitation may yield optimal co-segregation of promoters and their target enhancers.
To confirm that 5 kb windows overlapping enhancers and transcriptionally active genes are detected more frequently than 5 kb windows overlapping silent genes in the GAM-chIP exploratory dataset, mouse genes were ranked according to their expression level, as determined by mRNA-seq. The top 25% of genes were selected as the most actively transcribed genes, whilst the bottom 25% of genes were selected as transcriptionally silent genes. 5 kb windows were identified that overlapped the gene body, transcription start site (TSS) or transcription end site (TES) of genes in the top or bottom 25% by expression.
The percentage of 5 kb windows overlapping each feature that were identified as positive was plotted for each of the 182 GAM-chIP samples and compared to the percentage of all 5 kb windows or of 5 kb windows overlapping enhancers detected as positive in each sample (Fig. 15D). 5 kb windows overlapping the gene body, TSS or TES of a silent gene were detected slightly less frequently than the average for all 5 kb windows. In contrast, 5 kb windows that overlapped the gene body, TSS or TES of a gene in the top 25% by expression were detected more frequently than the genome-wide average, as were 5 kb windows overlapping enhancers, thus confirming that GAM-chIP with RNAPII-S5p antibodies indeed enriches for both actively transcribed genes and enhancers (Fig. 15D).
GAM-chIP with RNAPII-S5p antibodies shows preferential co-segregation of genes with themselves
Previous work has shown that chromatin contacts can form within the bodies of actively transcribed genes (Larkin, Cook & Papantonis, 2012). This means that distant regions within the same gene should be crosslinked both to each other and to RNAPII-S5p. GAM-chIP identifies the presence or absence of genomic loci across a collection of tubes. If actively transcribed genes interact with themselves during transcription, some tubes will contain many chromatin fragments derived from the same gene, which were crosslinked to each other during the fixation step. Alternatively, if actively transcribed genes do not interact with themselves, a smaller number of tubes will contain multiple windows from the same gene by chance alone.
To determine whether GAM-chIP detects the co-segregation of intra-genic windows, the number of positive windows from each gene (top 25% most expressed and bottom 25% least expressed) was counted for each tube and compared to the number of positive windows that would be expected by chance if windows were detected independently of one another. The average number of tubes containing more or fewer windows than expected by chance was calculated over all genes (Fig. 16). The resulting plot shows that windows within actively transcribed genes are not detected independently of one another, with many GAM-chIP samples containing more positive windows from the same gene than would be expected by chance. In contrast, windows within silent genes (those in the bottom 25% by expression) are better described by the assumption that window detection is independent, with very few GAM-chIP samples containing more windows from the same silent gene than expected by chance (Fig. 16).
GAM-chIP detects co-association of actively transcribed genes with nearest candidate enhancer regions
As well as contacting themselves, previous work has shown that genes can also make contacts to nearby enhancers during the process of transcription (Lee et al. 2015). Genomic windows overlapping enhancers should therefore co-segregate in the same GAM-chIP samples as the genomic windows overlapping their target genes. Furthermore, since different parts of each gene also contact themselves during transcription, GAM-chIP samples containing multiple positive windows from the same gene are the most likely to have originated from the gene during its transcription cycle and therefore likely to additionally co-segregate with the enhancer.
For each gene, we ordered the GAM-chIP samples according to the proportion of intragenic windows detected. GAM-chIP samples which contain many positive windows from the same active gene often also contain a nearby enhancer, whereas GAM-chIP samples containing few positive windows from the same gene are less likely to additionally contain the enhancer (Fig. 17A). In contrast, this behaviour is not expected for silent genes, since these genes are not expected to contact nearby regions classified as enhancers in mouse ESCs. For silent genes, the detection of a nearby enhancer is often uncorrelated with the detection of the gene itself (Fig. 17B). With a larger collection of GAM-chIP samples, each produced from fragment frequencies closer to 0.5, it should be possible to assign enhancers to their target genes based on the correlation of detection of the enhancer with detection of the gene across the collection of samples.
Discussion
Long-range chromatin interactions are known to play a role in various important nuclear processes such as gene regulation, genomic stability or replication. Methods such as FISH or 3C-based approaches exist to study long-range chromatin interactions, but they are limited in resolution and universality.
GAM-ch samples with ~0.2 and ~10 genomes were subjected to WGA and detected by next-generation sequencing. The sequencing profile of the GAM-ch ~0.2 genome sample has distinct islands across the genome whereas linear DNA at high concentration is evenly distributed (Fig. 6). The sequencing profile of ~0.2 genomes suggests that only a sub-fraction of the genome is captured, which is then frequently sequenced, as expected (Fig. 8B).
The threshold of signal detection of positive windows above background (Fig. 9) was 13 reads (~940 nt) for 4 kb windows, resulting in 45x10^3-50x10^3 windows of 4 kb passing the threshold (Fig. 10). 45x10^3-50x10^3 windows of 4 kb correspond to a total of 1.8x10^8-2x10^8 nt (out of 2.6x10^9 bp in the total mouse genome including repetitive sequences). If ~0.2 genomes are dispensed across tubes, each molecule has a probability of 0.18 to be present in each tube assuming a Poisson distribution, which would correspond to ~4.7x10^8 bp. Out of these ~4.7x10^8 bp, we currently detect 1.8-2x10^8 nt (38-42%) by WGA and sequencing of ~0.2 genomes of crosslinked DNA. The detection efficiency of ~40% might be an underestimate, because genomes may have been lost upon digestion and dilution across tubes, and genomic regions with highly repetitive sequences might be present in the tubes but not captured by sequencing (Rozowsky J. et al. 2009. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotechnology 27, 66-75). The contribution of repetitive sequences can be measured by allowing mapping of multi-hit reads (e.g. using Bowtie2). In summary, DNA detection by WGA and sequencing is a promising solution for the detection of single DNA molecules from 3C-like crosslinked chromatin, which will enable ligation-free detection of DNA locus co-segregation and measurement of chromatin interactions.
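The detection-efficiency estimate above can be retraced with a short calculation. The following is only a minimal sketch under the figures quoted in the text (about 0.2 genomes per tube, a 2.6x10^9 bp mouse genome, and 45,000 to 50,000 positive windows of 4 kb); the function and variable names are illustrative and are not taken from the original analysis.

```python
import math


def poisson_presence_probability(mean_copies: float) -> float:
    """Probability that a tube holds at least one copy of a given molecule,
    assuming the copy number per tube is Poisson-distributed."""
    return 1.0 - math.exp(-mean_copies)


# Values quoted in the text.
GENOME_BP = 2.6e9            # mouse genome, including repeats
MEAN_GENOMES_PER_TUBE = 0.2  # approximate DNA input per tube
WINDOW_BP = 4000             # 4 kb windows

p_present = poisson_presence_probability(MEAN_GENOMES_PER_TUBE)
expected_bp = p_present * GENOME_BP  # sequence expected to be present per tube

# Detected sequence for 45,000 and 50,000 positive windows.
detected_low_bp = 45e3 * WINDOW_BP
detected_high_bp = 50e3 * WINDOW_BP

efficiency_low = detected_low_bp / expected_bp
efficiency_high = detected_high_bp / expected_bp

print(f"P(present) = {p_present:.2f}")
print(f"expected bp per tube = {expected_bp:.2e}")
print(f"detection efficiency = {efficiency_low:.0%} to {efficiency_high:.0%}")
```

Under these assumptions the calculation gives a presence probability of about 0.18 and a detection efficiency of roughly 38% to 42%, in line with the values stated above.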
Identifying contacts between active genes and their regulatory regions (enhancers) is a major current challenge, especially as there is evidence for complex interactions between clustered enhancers and their target genes (Fig. 11A). 3C-based technologies underestimate contacting partners of most complex interactions (i.e. interactions involving three or more fragments; O'Sullivan et al. 2013; Fig. 1). FISH in interphase nuclei is limited by sensitivity of detection which requires that probes cover several kilobase pairs of genomic sequence, and by spatial resolution, which is limited to detect interactions between genomic sequences separated by several tens of kilobase pairs. Novel ligation-free technologies should help detect enhancers that participate in the most complex interactions (Fig. 11B). Combining GAM-ch with ChIP for a protein or nuclear factor involved in enhancer-gene contacts holds the power to identify contacts between genes and their active enhancers without the disadvantages of ligation (Fig. 11B). RNAPII-S5p is present at both active genes and at enhancers as determined by ChIP and is therefore a suitable candidate factor for this purpose (Fig. 12).
GAM-chIP after RNAPII-S5p ChIP can be performed reliably for different amounts of DNA, especially for 1 pg of DNA yielding GAM-chIP libraries with low complexity (2-10% of detection of 5 kb genomic windows; Fig. 13, 14, 15). The GAM-chIP libraries produced were enriched for genomic windows containing active genes, including windows covering the gene promoters (TSS) and the gene termination sites (TES) (Fig. 15C,D). 5 kb genomic windows containing candidate enhancers were also more likely to be detected in the pool of positive windows in each GAM-chIP dataset (Fig. 15D), consistent with the presence of RNAPII-S5p at these regulatory regions. In support of the detection of chromatin interactions in GAM-chIP, we find over-representation of intragenic windows in only a proportion of GAM-chIP datasets for each gene (Fig. 16). Finally, in GAM-chIP samples where each given active gene is most detected (where many of its intragenic windows co-segregate the most), the nearest enhancer is also more likely to be found (Fig. 17A); conversely, increased detection of intragenic 5 kb windows within silent genes does not coincide with increased detection of the nearest enhancer (Fig. 17B).
Methods
Murine fetal liver and fetal brain were dissected from E14.5 wildtype mouse embryos as described previously (Hagege et al. 2007) and processed in parallel for 3C and GAM-ch. The quality of the resulting 'chromatin' preparation was determined using a chromosome conformation capture (3C)-qPCR assay, performed as previously described (Hagege et al. 2007), on the mouse β-globin gene cluster as a reference locus.
Chromatin preparation for 3C and GAM-ch
Mouse fetal liver and brain tissue from 14.5 dpc embryos were dissected and processed into a single-cell suspension as previously described (Hagege et al. 2007), resulting in a single-cell sample containing approximately 2x10^7 cells/mL in 10% (v/v) heat-inactivated fetal calf serum in PBS. Cells were fixed by addition of 2% formaldehyde/10% FCS/PBS and incubated for 5 or 10 min at room temperature. The crosslinking reaction was then quenched by addition of 1 M glycine solution to give 0.14 M final concentration. Fixed cells were then lysed with 10 mM Tris (pH 7.5), 10 mM NaCl, 5 mM MgCl2, 0.1 mM EGTA and 1x complete protease inhibitor (Roche), and pelleted by centrifugation for 5 min at 400xg at 4°C. Pelleted nuclei were frozen in liquid nitrogen for long-term storage at -80°C.
For immediate sample processing by enzyme restriction, 1x10^7 nuclei were resuspended in restriction enzyme buffer (500 μL; NEB2 buffer), 20% (w/v) SDS solution (7.5 μL) was added to a final concentration of 0.3%, and the sample was incubated (1 h shaking at 900 rpm) to increase chromatin accessibility for restriction enzyme digestion. 50 μL of 20% Triton X-100 solution were added (2% final concentration) and incubated at 37°C (1 h shaking) to sequester SDS. HindIII (400 units; BioLab) was added directly, and digestion was performed overnight (37°C, shaking) followed by addition of 40 μL of 20% SDS solution (1.6% final concentration) and incubation at 65°C (20 min) to inactivate HindIII. Aliquots of undigested and digested chromatin were taken for subsequent analysis of digestion efficiency.
Chromosome conformation capture (3C)
The digested nuclei were transferred to a 50 mL Falcon tube and diluted in 6.125 mL of ligation buffer (66 mM Tris-HCl, pH 7.5; 5 mM DTT; 5 mM MgCl2; 1 mM ATP). After addition of 375 μL of 20% (v/v) Triton X-100 solution (1% final concentration), nuclei were incubated (1 h shaking at 37°C). T4 DNA ligase (Promega) was added (100 Units) and ligation was performed at 16°C for 4 h. Reversal of crosslinks was performed by addition of 30 μL of 10 mg/mL proteinase K (300 μg total; Sigma) and incubation at 65°C overnight followed by RNase incubation (300 μg total; Roche) at 37°C (1 h), and by phenol-chloroform extraction and ethanol precipitation (Sigma). The 3C material was desalted using Micro Bio-Spin P-30 chromatography columns (BioRad) before qPCR. Each qPCR reaction was performed with ~120 ng of 3C material. Quantitative real-time PCR (MJ MiniOpticon, BioRad) was performed with Platinum Taq DNA Polymerase (Invitrogen) and double-dye oligonucleotides (5'FAM + 3'TAMRA) as TaqMan probes, using the following amounts per reaction: 0.1 μL Taq-polymerase from kit; 2.5 μL 10x Taq-buffer from kit; 0.75 μL MgCl2 (final 1.5 mM) from kit; 0.5 μL dNTPs (final 200 μM); 0.25 μL of each primer (from stock solution of 0.29 μg/μL); 0.025 μL Taq-probe (final 2.5 pmol); 1-2 μL DNA template, adjusted to 25 μL with H2O. Cycling conditions were: 90°C for 10 min, 44 cycles of 10 seconds at 95°C and 1 min at 60°C. Primer sequences and TaqMan probes were obtained from published literature (Splinter E. et al. 2006. CTCF mediates long-range chromatin looping and local histone modification in the beta-globin locus. Genes & Development 20, 2349).
The efficiency of DNA digestion was assessed as previously described (Hagege et al. 2007). 5 μL aliquots were taken before (UND) and after digestion (D). 500 μL of 1x PK buffer (5 mM EDTA, pH 8.0; 10 mM Tris-HCl, pH 8.0; 0.5% SDS) and 1 μL of 20 mg/mL PK (20 μg final) were added and incubated for 30 min at 65°C (or overnight). After equilibrating at 37°C, 1 μL of 1 mg/mL RNase A (1 μg final, Sigma) was added and incubated for 2 h at 37°C. Then 500 μL of phenol-chloroform was added, mixed vigorously and centrifuged for 5 min at 16100xg at 4°C. The supernatant was transferred into a new tube and 50 μL of 2 M sodium acetate pH 5.6 and 1.5 mL of ethanol were added, mixed and placed at -80°C for 1 h. After centrifugation for 20 min at 16000xg at 4°C, the supernatant was mixed with 500 μL of 70% ethanol and centrifuged for 4 min at 16000xg at room temperature. The pellet was dried and resuspended in 60 μL of H2O. A real-time qPCR (95°C for 10 min, 40 cycles with 95°C for 30 seconds, 58°C for 15 seconds and 72°C for 15 seconds) with SYBR Green was performed with the undigested (UND) and digested (D) samples using 2xPCR mix (Promega) on the MJ MiniOpticon PCR engine (BioRad). To check for restriction efficiencies, primer sets that amplify across each restriction site of interest (R) were used. To correct for differences in the amount of template, internal primers (C) not containing a restriction site were used. The cycle thresholds were used to calculate the restriction efficiency according to the following formula: Percentage of restriction = 100 - 100/2^((CtR - CtC)D - (CtR - CtC)UND), where CtR and CtC are the cycle thresholds obtained with the restriction-site (R) and internal control (C) primer sets, measured in the digested (D) and undigested (UND) samples. Digested samples with a restriction efficiency of >70% were used for further analyses.
To correct for different crosslinking and ligation efficiencies between fetal liver and brain tissues, we used a control gene locus, the Ercc3 gene, which is expressed ubiquitously. To correct for different primer efficiencies, a control template was generated that contains all possible ligation products in equimolar amounts. Two BACs spanning the β-globin gene cluster and the Ercc3 locus were mixed in equimolar amounts, digested with HindIII and religated. This random control template was diluted and a standard curve was produced for each primer. Genomic DNA was added to the random template to mimic 3C sample conditions. The relative interaction frequency values for both fetal liver and fetal brain samples with each specific 3C primer pair were extrapolated using the standard curve of the random control template with the corresponding 3C primer pair.
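As an illustration of the restriction-efficiency calculation, the sketch below assumes the formula has the form 100 - 100/2^(dCt_D - dCt_UND), where dCt is the Ct of the restriction-site primer set (R) minus the Ct of the internal control primer set (C), measured in the digested (D) and undigested (UND) samples. The function name and the example Ct values are hypothetical and are not measurements from this work.

```python
def restriction_efficiency(ct_r_dig: float, ct_c_dig: float,
                           ct_r_und: float, ct_c_und: float) -> float:
    """Percentage of restriction from qPCR cycle thresholds.

    ct_r_*: Ct of primers amplifying across the restriction site (R)
    ct_c_*: Ct of internal control primers without a restriction site (C)
    *_dig / *_und: digested and undigested samples
    """
    delta_dig = ct_r_dig - ct_c_dig   # template-normalised Ct, digested
    delta_und = ct_r_und - ct_c_und   # template-normalised Ct, undigested
    return 100.0 - 100.0 / 2.0 ** (delta_dig - delta_und)


# Hypothetical example: digestion delays the R amplicon by 4 cycles
# relative to the undigested sample after normalising to C.
example = restriction_efficiency(ct_r_dig=27.0, ct_c_dig=22.0,
                                 ct_r_und=23.0, ct_c_und=22.0)
print(f"{example:.2f}% restriction")   # 93.75% restriction
print(example > 70.0)                  # passes the >70% cut-off used in the text
```

A sample in which digestion leaves the normalised Ct unchanged returns 0%, and each additional cycle of delay halves the uncut fraction.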
Preparation of chromatin for GAM-ch
Preparation of crosslinked nuclei from mouse fetal liver cells for GAM-ch is similar to that for 3C. Briefly, fetal liver cells were resuspended in 2% formaldehyde/10% FCS/PBS and the reaction was quenched with glycine after 10 min. Fixed cells were lysed in cold lysis buffer, and nuclei were spun as for 3C (as described above). Frozen or fresh nuclei were resuspended in sonication buffer (50 mM HEPES pH 7.9, 140 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% Na-deoxycholate, 0.1% SDS), counted using a haemocytometer and diluted to ~1x10^7 nuclei/mL to keep sonication efficiency optimal. Nuclei were sonicated in 2.5 mL aliquots using a Bioruptor (Diagenode) for 30 min at 30 s on/off intervals at medium energy. After sonication and centrifugation at 14000 rpm for 10 min at 4°C, the supernatant was diluted in Tris-EDTA buffer (pH 7.5) into aliquots of ~100 genomes/μL and stored at -20°C.
For preparation of linear DNA, mouse fetal liver cells were embedded into DNA agarose strings at a density of ~1x10^7 cells/mL (~2x10^5 genomes/cm; prepared according to Dear P.H. et al. 1998. A high-resolution metric HAPPY map of human chromosome 14. Genomics 48:232). Agarose strings of distinct length were melted in 0.5x PCR buffer II (68°C, 10 min) and DNA was diluted in molecular biology-grade H2O (Sigma) into aliquots of ~100 genomes/μL and stored at -20°C.
Preparation of chromatin for GAM-chIP
Mouse embryonic stem cells (ESCs; 46C cell line, male) were grown in ESGRO medium (Merck, SF001-500P) supplemented with 1000 units/ml LIF (Merck), and chromatin was prepared as previously described (Stock et al. 2007). Briefly, cells were treated with 1% formaldehyde (37°C, 10 min) and the reaction was stopped by addition of glycine to a final concentration of 0.125 M. Cells were washed in ice-cold PBS, before "swelling" buffer (25 mM HEPES pH 7.9, 1.5 mM MgCl2, 10 mM KCl and 0.1% NP-40) was added to lyse the cells (10 min, 4°C). Cells were scraped from flasks, and nuclei isolated by Dounce homogenization (50 strokes, "Tight" pestle) and centrifugation. After resuspension in "sonication" buffer (50 mM HEPES pH 7.9, 140 mM NaCl, 1 mM EDTA, 1% Triton X-100, 0.1% Na-deoxycholate and 0.1% SDS), nuclei were sonicated using a Diagenode Bioruptor (full power; 30 min: 30 s 'on', 30 s 'off'; 4°C). The resulting material was centrifuged twice (4°C, 15 min) at 14,000 rpm. Swelling and sonication buffers were supplemented with 5 mM NaF, 2 mM Na3VO4, 1 mM PMSF, and protease inhibitor cocktail (Roche).
Chromatin immunoprecipitation for GAM-chIP
Protein-G-magnetic beads (Active Motif) were first incubated with rabbit anti-mouse (IgG+IgM) bridging antibodies (Jackson Immunoresearch; 10 μg per 50 μL beads) for 1 h at 4°C and washed with sonication buffer. Seven hundred μg of chromatin was immunoprecipitated (4°C, overnight) with 10 μg of RNAPII-S5p antibody (clone CTD-4H8, Covance) and 50 μL magnetic beads. ChIP washes and elutions after immunoprecipitation were performed as described previously (Stock et al. 2007). Briefly, crosslinked DNA-protein complexes were eluted twice from beads (65°C, 5 min; and room temperature, 15 min) with 50 mM Tris-HCl pH 8.0, 1 mM EDTA and 1% SDS. Half of the eluted immunoprecipitated chromatin was diluted into multiple tubes (based on the measured DNA concentration in the other half of eluted chromatin). To measure DNA concentration, half of the eluted chromatin was incubated overnight at 65°C with addition of NaCl (160 mM final concentration) and RNase A (20 μg/ml; Sigma) to reverse cross-linking. After reverse cross-linking, samples were incubated with 200 μg/ml proteinase K for 2 h at 50°C and DNA was recovered by phenol-chloroform extraction and ethanol precipitation with addition of 30 μg glycogen. The final DNA concentration was determined by PicoGreen fluorimetry (Molecular Probes, Invitrogen). For single gene analysis by quantitative PCR, immunoprecipitated and input chromatin DNA were diluted to the same concentration (0.2 ng/μL) and an equal amount (0.5 ng) of DNA was analyzed by quantitative real-time PCR (qPCR) using SensiMix SYBR (BioLine). Primer sequences (in 5' to 3' orientation) are listed below:
Oct4 promoter F GGCTCTCCAGAGGATGGCTGAG (SEQ ID NO: 1)
Oct4 promoter R TCGGATGCCCCATCGCA (SEQ ID NO: 2)
Oct4 coding F CCTGCAGAAGGAGCTAGAACA (SEQ ID NO: 3)
Oct4 coding R TGTGGAGAAGCAGCTCCTAAG (SEQ ID NO: 4)
Nkx2.2 promoter F CAGGTTCGTGAGTGGAGCCC (SEQ ID NO: 5)
Nkx2.2 promoter R GCGCGGCCTCAGTTTGTAAC (SEQ ID NO: 6)
Nkx2.2 coding F AGAGCCCTCGGCTGACGAGT (SEQ ID NO: 7)
Nkx2.2 coding R CGTGAGACGGATGAGGCTGG (SEQ ID NO: 8)
HoxA7 promoter F GAGAGGTGGGCAAAGAGTGG (SEQ ID NO: 9)
HoxA7 promoter R CCGACAACCTCATACCTATTCCTG (SEQ ID NO: 10)
HoxA7 coding F CTGGACCTTGATGCTTCTAACT (SEQ ID NO: 11)
HoxA7 coding R AGCCAGAGAAAGAGGGATTCTA (SEQ ID NO: 12)
Myf5 promoter F GGAGATCCGTGCGTTAAGAATCC (SEQ ID NO: 13)
Myf5 promoter R CGGTAGCAAGACATTAAAGTTCCGTA (SEQ ID NO: 14)
Myf5 coding F GATTGCTTGTCCAGCATTGT (SEQ ID NO: 15)
Myf5 coding R AGTGATCATCGGGAGAGAGTT (SEQ ID NO: 16)
"Cycle over threshold" values for the immunoprecipitated DNA (IP Ct) from the quantitative PCR were subtracted from the input Ct values (input Ct). This figure was converted into the fold enrichment using the equation 2(input ct " IP ct).
Procedure of GAM-ch in combination with high-throughput sequencing
To prepare GAM-ch samples for high-throughput sequencing, aliquots of DNA with ~100 genomes/μL were diluted in Sigma-H2O to ~0.2, ~0.7 or ~10 genomes per tube and immediately processed for whole genome amplification (WGA, Sigma) according to the manufacturer's instructions. Incubation of DNA in WGA fragmentation buffer and proteinase K (PK, Sigma) was followed by PK heat inactivation for 4 min at 99°C. Finally, WGA adapters were ligated to DNA fragment ends followed by whole genome amplification and gel electrophoresis.
Illumina library preparation for GAM-ch high-throughput sequencing
Illumina libraries were prepared for HT sequencing from WGA-amplified GAM-ch DNA. WGA-amplified GAM-ch samples were fragmented using a Covaris shearing system before library preparation. Illumina libraries were size-selected on agarose gels, enabling visualisation of the amplified DNA fragments, and therefore more careful extraction of appropriately sized fragments. After purification by QIAgen Gel Extraction kit, libraries were quantified by Qubit (Invitrogen) and qPCR, and library size was analysed by Bioanalyser (Agilent). Fragment sizes were within the expected size distribution of 210-600 bp (including adapters) for all libraries.
Procedure of GAM-chIP in combination with high-throughput sequencing
Chromatin precipitated with antibodies against RNAPII-S5p was quantified fluorimetrically with PicoGreen (Molecular Probes, Invitrogen) and diluted into multiple tubes (see Table 2 for amounts). DNA was extracted by WGA, first by incubation in WGA fragmentation buffer containing PK for 2 h (Exp.001 and Exp.002) or 8 h (Exp.003); subsequent steps were carried out according to the manufacturer's specifications. Amplified DNA was purified with MinElute 96 UF PCR Purification Kit (Qiagen) according to the manufacturer's instructions. DNA fragments from 300-500 bp were size-selected with Agencourt AMPure XP (Beckman Coulter) and the final DNA concentration was determined by PicoGreen fluorimetry (Molecular Probes, Invitrogen) and subjected to Illumina TruSeq Nano library preparation (GAM-chIP Exp001, GAM-chIP Exp002; Table 2) or to Illumina Nextera XT library preparation (GAM-chIP Exp003; Table 2).
High-throughput sequencing for GAM-ch
Libraries were sequenced using Illumina Sequencing Technology at the MRC-CSC Genomics laboratory using a Genome Analyzer II according to the manufacturer's instructions.
GAM-ch libraries (4-12 pM) were loaded onto the Genome Analyser flow cell. The single-stranded DNA fragments bind randomly across the surface of the flow cell due to hybridisation between the adaptor sequences added to DNA ends during library preparation, and the oligonucleotides that coat the flow cell.
Polymerase-based extension converts each fragment to a cluster of approximately 1000 identical fragments. The amount and size of DNA fragments loaded on to the flow cell was optimised to obtain the highest number of non-overlapping clusters following cluster generation. Clusters were then sequenced by synthesis, using adaptor-specific primers and incorporation of fluorescent nucleotides. Digital images were taken at each round of nucleotide incorporation and the unique fluorescent signal assigned to each nucleotide enables its correct identification. Sequential images of a given cluster therefore represent the fragment sequence.
High-throughput sequencing for GAM-chIP
Libraries were sequenced with Illumina MiSeq (GAM-chIP Exp001, Table 2), Illumina HiSeq 2000 (GAM-chIP Exp002; Table 2) or Illumina NextSeq 500/550 (GAM-chIP Exp003, Table 2), according to the manufacturer's instructions. Briefly, all library pools were analysed on the 2100 Bioanalyzer (Agilent) with the High Sensitivity DNA kit, followed by quantification on Qubit fluorimeter (ThermoFisher Scientific), diluted to 4 nM concentration, denatured and loaded into the HiSeq flow cell, MiSeq flow cell or NextSeq reagent cartridge at 1.8 pM concentration. GAM-chIP libraries sequenced on the HiSeq or MiSeq were not imaged for the first thirty sequencing cycles (known as dark cycles) in order to avoid issues relating to low sequence diversity in the WGA adaptor. This step avoids the need for trimming reads after sequencing used in earlier GAM-ch datasets (Fig. 5A).
Analysis of high-throughput sequencing data from GAM-ch samples
DNA reads were first aligned to the reference mouse genome (assembly mm9) using Illumina Extended software (pipeline 1.6) allowing at most two mismatches and unique matches only. Unaligned reads were then trimmed at their 5' or 3' end and aligned to the mm9 genome using Bowtie software, version 0.9.8.1 (Langmead B. et al. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10, R25). To trim reads at their 5' or 3' end, the option "--trim5" or "--trim3" was used, respectively; to allow direct comparison with Illumina data (allowing 2 mismatches and unique reads only), the option "-m 1" was used. Read "depth of coverage" was computed by calculating the number of sequenced nucleotides that overlap the window of interest. Python scripts were used to generate "depth of coverage" counts. The density histograms were plotted using R software function "density" (R Development Core Team. 2010. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria).
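The "depth of coverage" count described above (the number of sequenced nucleotides overlapping a window) can be sketched as follows. The Python scripts actually used are not reproduced in this document, so this is only an illustrative reimplementation; it assumes reads are given as half-open (start, end) intervals on the same chromosome as the window, and all names and coordinates are hypothetical.

```python
from typing import Iterable, Tuple


def depth_of_coverage(reads: Iterable[Tuple[int, int]],
                      window_start: int, window_end: int) -> int:
    """Number of sequenced nucleotides overlapping [window_start, window_end).

    Each read is a half-open interval (start, end) of aligned positions.
    Reads that only partly overlap the window contribute only the
    overlapping nucleotides.
    """
    total = 0
    for read_start, read_end in reads:
        overlap = min(read_end, window_end) - max(read_start, window_start)
        if overlap > 0:
            total += overlap
    return total


# Hypothetical aligned reads around a 4 kb window spanning 0-4000.
example_reads = [
    (100, 150),    # fully inside the window: 50 nt
    (3990, 4040),  # straddles the window end: 10 nt
    (5000, 5050),  # outside the window: 0 nt
]
print(depth_of_coverage(example_reads, 0, 4000))  # 60
```

Counting nucleotides rather than reads means that a read lying across a window boundary is shared between the two neighbouring windows in proportion to its overlap.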
Analysis of high-throughput sequencing data from GAM-chIP samples
DNA reads were first aligned to the reference mouse genome (assembly mm10) using Bowtie2 and enforcing a minimum mapping quality of 20. Read depth of coverage was calculated using bedtools multibamcov (Quinlan & Hall 2010, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:6). Curve fitting was performed in python using the fmin function from scipy. A combination of two distributions was fitted to the histogram of the number of reads per window. A negative binomial distribution represents sequencing noise, and the parameters of the fit for this distribution were used to determine a threshold number of reads X where the probability of observing more than X reads mapping to a single genomic window by chance was less than 0.001. Such a threshold was thus independently determined for each sample, and windows were scored as positive if the number of sequenced reads was greater than the determined threshold. To obtain a robust estimate of the sequencing noise, we fit a lognormal distribution (representing true signal) simultaneously with the negative binomial, although the parameters of the lognormal are not used in determining the threshold. As a quality check, positive windows were also called using JAMM (Ibrahim et al. 2015) in the peak mode with default settings.
RNAPII-S5p and control DIG ChIP-seq datasets and average profiles
ChIP-seq libraries for RNAPII-S5p and control (using non-specific antibody against plant steroid digoxigenin) were prepared from 10 ng of immunoprecipitated DNA (as measured by PicoGreen quantification) with corresponding antibodies using the NEBNext ChIP-Seq Library Prep Master Mix Set for Illumina (NEB, #E6240) following the NEB protocol, with some modifications. The intermediate products from the different steps of the NEB protocol were purified using MinElute PCR purification kit (Qiagen, #28004). Adaptors, PCR amplification primers and indexing primers were from the Multiplexing Sample Preparation Oligonucleotide Kit (Illumina, #PE-400-1001). Samples were PCR amplified prior to size selection of DNA fragments (250-600 bp) on an agarose gel. After purification by QIAquick Gel Extraction kit (Qiagen, #28704), libraries were quantified by qPCR using Kapa Library Quantification Universal Kit (Kapa Biosystems, #KK4824). Library size distribution was assessed by 2100 Bioanalyzer (Agilent) with High Sensitivity DNA analysis Kit (Agilent, #5067-4626) before high-throughput sequencing. Libraries were quantified by Qubit and sequenced on Illumina HiSeq2000 (single-end sequencing, 51 nucleotides), according to the manufacturer's instructions.
Sequenced reads were aligned to the mouse genome (assembly mm10, December 2011) using Bowtie2 version 2.0.5 (Langmead and Salzberg, 2012), with default parameters. Duplicated reads (i.e. identical reads, aligned to the same genomic location) occurring more often than a threshold were removed. The threshold is computed for each dataset as the 95th percentile of the frequency distribution of reads.
To produce average ChIP-seq profiles (Fig. 12A,B), the depth of coverage was calculated for non-overlapping windows of 10 bp, covering each region of interest (e.g. 5 kb windows centred at transcription start site or transcription end site or 1 kb window relative to enhancer center). To represent RNAPII-S5p and control ChIP enrichment at enhancers, the list of enhancers from Whyte et al. 2013 was used.
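The duplicate filter described above can be illustrated with a small sketch. Several details are not specified in the text and are assumptions here: reads are keyed by (chromosome, start, strand, sequence), the 95th percentile is taken with the nearest-rank method over the per-read occurrence counts, and reads exceeding the threshold are dropped entirely rather than capped at the threshold. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def percentile_nearest_rank(values: Sequence[int], pct: float) -> int:
    """Nearest-rank percentile (an assumed definition; others exist)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def filter_duplicates(reads: List[Hashable],
                      pct: float = 95.0) -> Tuple[List[Hashable], int]:
    """Drop reads whose occurrence count exceeds the pct-th percentile
    of the occurrence counts of all distinct reads."""
    counts = Counter(reads)
    threshold = percentile_nearest_rank(list(counts.values()), pct)
    kept = [read for read in reads if counts[read] <= threshold]
    return kept, threshold


# Hypothetical dataset: 19 distinct reads seen once, one read seen 50 times.
singletons = [("chr1", 1000 + 100 * i, "+", "ACGT") for i in range(19)]
pileup = [("chr1", 5000, "+", "TTTT")] * 50
kept_reads, dup_threshold = filter_duplicates(singletons + pileup)
print(dup_threshold, len(kept_reads))  # 1 19
```

If capping rather than removal is intended, the list comprehension would instead keep up to `threshold` copies of each over-represented read.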
Gene classification in Active and Silent groups
Genes were sorted according to their expression in TPM (Transcripts per Million). Genes in the top 25% by expression were classified as active, whilst genes in the bottom 25% by expression were classified as silent. To calculate TPMs, paired-end (2x100 bp) reads from mRNA-seq (published dataset in Brookes et al. 2012) were aligned against the mouse genome using STAR (Spliced Transcripts Alignment to a Reference, v2.4.2a; Dobin et al. 2013) and expression levels were estimated in TPM with RSEM (RNA-Seq by Expectation-Maximization, v1.2.25; Li and Dewey, 2011). The reference for STAR and RSEM was produced from the Mouse Genome version mm10, providing the gtf annotation from UCSC Known Genes (mm10, version 6) and associated isoform-gene relationship information from the Known Isoforms table. Both tables were downloaded from the UCSC Table browser (http://genome.ucsc.edu/cgi-bin/hgTables).
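The active/silent classification can be sketched as a simple ranking by TPM. This is an illustrative example only: the gene identifiers and TPM values below are invented placeholders, and the handling of ties (resolved here by input order) is an assumption, since the text does not say how ties were treated.

```python
from typing import Dict, Set, Tuple


def classify_by_expression(tpm_by_gene: Dict[str, float],
                           fraction: float = 0.25) -> Tuple[Set[str], Set[str]]:
    """Return (active, silent) gene sets: the top and bottom `fraction`
    of genes ranked by TPM."""
    ranked = sorted(tpm_by_gene, key=tpm_by_gene.get, reverse=True)
    n_group = int(len(ranked) * fraction)
    active = set(ranked[:n_group])
    silent = set(ranked[-n_group:]) if n_group else set()
    return active, silent


# Placeholder expression table (hypothetical genes and TPM values).
example_tpm = {
    "geneA": 250.0, "geneB": 120.0, "geneC": 40.0, "geneD": 12.0,
    "geneE": 3.0, "geneF": 0.8, "geneG": 0.1, "geneH": 0.0,
}
active_genes, silent_genes = classify_by_expression(example_tpm)
print(sorted(active_genes))  # ['geneA', 'geneB']
print(sorted(silent_genes))  # ['geneG', 'geneH']
```

Genes in the middle 50% fall into neither group and are not used in the active versus silent comparisons.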
Calculating the observed and expected number of positive windows per gene
Active (top 25% expression) and silent (bottom 25% expression) genes were selected that overlapped five or more 5 kb windows (n=1631 active genes, n=1335 silent genes). The detection frequency of each window overlapped by the gene, ± one window upstream/downstream, was calculated as the number of GAM-chIP samples in which the window was detected divided by the total number of GAM-chIP samples. Since each window is detected with a different frequency, each window can be described by its own binomial distribution. The expected distribution of the number of positive windows from the same gene detected simultaneously in a single GAM-chIP sample was calculated as the convolution of the binomial distributions for each component window. The average expected number of positive windows per GAM-chIP sample was calculated as the sum of the window detection frequencies. For each gene, the number of tubes with more than double this average was counted and compared to the expected number of tubes with more than double the average. The distribution of observed vs. expected values was plotted and compared between active genes and silent genes.
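The expected distribution described above can be computed by iteratively convolving one two-outcome distribution per window. The sketch below treats each window as a single-trial binomial (Bernoulli) variable with its own detection frequency, which is one reading of the text and is stated here as an assumption; the function names and the example frequencies are illustrative, and `n_tubes` stands for the number of GAM-chIP samples.

```python
from typing import List, Sequence


def positive_window_distribution(frequencies: Sequence[float]) -> List[float]:
    """dist[k] = probability that exactly k windows of a gene are positive
    in one sample, if windows are detected independently."""
    dist = [1.0]  # with zero windows, k = 0 with certainty
    for p in frequencies:
        new = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new[k] += prob * (1.0 - p)   # this window not detected
            new[k + 1] += prob * p       # this window detected
        dist = new
    return dist


def expected_tubes_above(frequencies: Sequence[float], n_tubes: int,
                         factor: float = 2.0) -> float:
    """Expected number of tubes with more than `factor` times the average
    number of positive windows (the average is the sum of frequencies)."""
    cutoff = factor * sum(frequencies)
    dist = positive_window_distribution(frequencies)
    tail = sum(prob for k, prob in enumerate(dist) if k > cutoff)
    return n_tubes * tail


def observed_tubes_above(detections: Sequence[Sequence[int]],
                         frequencies: Sequence[float],
                         factor: float = 2.0) -> int:
    """Count tubes (rows of 0/1 window calls) exceeding the same cutoff."""
    cutoff = factor * sum(frequencies)
    return sum(1 for tube in detections if sum(tube) > cutoff)


# Hypothetical gene covered by five windows, each detected in 10% of samples.
example_freqs = [0.1] * 5
print(positive_window_distribution([0.5, 0.5]))      # [0.25, 0.5, 0.25]
print(round(expected_tubes_above(example_freqs, n_tubes=182), 2))
```

Comparing `observed_tubes_above` with `expected_tubes_above` for each gene gives the observed versus expected values that are then contrasted between active and silent genes.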
References
1. Dear P.H., Cook P.R. (1989) Happy mapping: a proposal for linkage mapping the human genome. Nucleic Acids Res. 17, 6795
2. Uslu VV, Petretich M, Ruf S, Langenfeld K, Fonseca NA, Marioni JC, Spitz F. Long-range enhancers regulating Myc expression are required for normal facial morphogenesis. Nat Genet. 2014 Jul;46(7):753-8.
3. Pombo A. 2003. Cellular genomics: which genes are transcribed when and where? Trends Biochem. Sci. 28, 6
4. Pauciullo A, Perucatti A, Iannuzzi A, Incarnato D, Genualdo V, Di Berardino D, Iannuzzi L. Development of a sequential multicolor-FISH approach with 13 chromosome-specific painting probes for the rapid identification of river buffalo (Bubalus bubalis, 2n = 50) chromosomes. J Appl Genet. 2014 Aug;55(3):397-401.
5. Heslop-Harrison JS, Harrison GE, Leitch IJ. Reprobing of DNA:DNA in situ hybridization preparations. Trends Genet. 1992 Nov;8(11):372-3.
6. Gavrilov AA, Chetverina HV, Chermnykh ES, Razin SV & Chetverin AB, 2014. Quantitative analysis of genomic element interactions by molecular colony technique. Nucl. Acids Res. 42(5):e36
7. Chetverin AB, Chetverina HV, 2008. Molecular colony technique: a new tool for biomedical research and clinical practice. Prog. Nucleic Acid Res. Mol. Biol. 82:219-255.
8. Pombo A, Dillon N. Three-dimensional genome architecture: players and mechanisms. Nat Rev Mol Cell Biol. 2015 Apr;16(4):245-57
9. O'Sullivan JM, Hendy MD, Pichugina T, Wake GC, Langowski J. The statistical-mechanics of chromosome conformation capture. Nucleus. 2013 Sep-Oct;4(5):390-8.
Belmont A.S., 2014. Large scale chromatin organization: the good, the surprising, and the still perplexing. Curr Op Cell Biol 26, 69
Williamson I, Berlivet S, Eskeland R, Boyle S, Illingworth RS, Paquette D, Dostie J, Bickmore WA. Spatial genome organization: contrasting views from chromosome conformation capture and fluorescence in situ hybridization. Genes Dev. 2014 Dec 15;28(24):2778-91.
Maxwell S, Ho H, Kuehner E, Zhao S, Li M. 2005. Pitx3 regulates tyrosine hydroxylase expression in the substantia nigra and identifies a subgroup of mesencephalic dopaminergic progenitor neurons during mouse development. Dev. Biol. 282(2):467-479.
Hagege H, Klous P, Braem C, Splinter E, Dekker J, Cathala G, de Laat W, Forne T. Quantitative analysis of chromosome conformation capture assays (3C-qPCR). Nat Protoc. 2007;2(7):1722-33.
Mitchell AC, Bharadwaj R, Whittle C, Krueger W, Mirnics K, Hurd Y, Rasmussen T, Akbarian S. The genome in three dimensions: a new frontier in human brain research. Biol Psychiatry. 2014 Jun 15;75(12):961-9. doi: 10.1016/j.biopsych.2013.07.015.
Grob S, Schmid MW, Luedtke NW, Wicker T, Grossniklaus U. Characterization of chromosomal architecture in Arabidopsis by chromosome conformation capture. Genome Biol. 2013 Nov 24;14(11):R129.
Ghavi-Helm Y, Klein FA, Pakozdi T, Ciglar L, Noordermeer D, Huber W, Furlong EE. Enhancer loops appear stable during development and are associated with paused polymerase. Nature. 2014 Aug 7;512(7512):96-100.
Oeffinger M, Wei KE, Rogers R, DeGrasse JA, Chait BT, Aitchison JD, Rout MP. Comprehensive analysis of diverse ribonucleoprotein complexes. Nat Methods. 2007 Nov;4(11):951-6.
Hakhverdyan Z, Domanski M, Hough LE, Oroskar AA, Oroskar AR, Keegan S, Dilworth DJ, Molloy KR, Sherman V, Aitchison JD, Fenyo D, Chait BT, Jensen TH, Rout MP, LaCava J. Rapid, optimized interactomic screening. Nat Methods. 2015 Jun;12(6):553-60.
Li G, Cai L, Chang H, Hong P, Zhou Q, Kulakova EV, Kolchanov NA, Ruan Y. Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) sequencing technology and application. BMC Genomics. 2014;15 Suppl 12:S11.
Collas P. The current state of chromatin immunoprecipitation. Mol Biotechnol. 2010 May;45(1):87-100.
Stock JK, Giadrossi S, Casanova M, Brookes E, Vidal M, Koseki H, Brockdorff N, Fisher AG, Pombo A. Ring1-mediated ubiquitination of H2A restrains poised RNA polymerase II at bivalent genes in mouse ES cells. Nat Cell Biol. 2007 Dec;9(12):1428-35.
Brookes E, de Santiago I, Hebenstreit D, Morris KJ, Carroll T, Xie SQ, Stock JK, Heidemann M, Eick D, Nozaki N, Kimura H, Ragoussis J, Teichmann SA, Pombo A. Polycomb associates genome-wide with a specific RNA polymerase II variant, and regulates metabolic genes in ESCs. Cell Stem Cell. 2012 Feb 3;10(2):157-70.
Gavrilov AA, Gushchanskaya ES, Strelkova O, Zhironkina O, Kireev II, Iarovaia OV, Razin SV. Disclosure of a structural milieu for the proximity ligation reveals the elusive nature of an active chromatin hub. Nucleic Acids Res. 2013 Apr 1;41(6):3563-75.
Baslan T, Kendall J, Rodgers L, Cox H, Riggs M, Stepansky A, Troge J, Ravi K, Esposito D, Lakshmi B, Wigler M, Navin N, Hicks J. Genome-wide copy number analysis of single cells. Nat Protoc. 2012 May 3;7(6):1024-41.
Hughes JR, Roberts N, McGowan S, Hay D, Giannoulatou E, Lynch M, De Gobbi M, Taylor S, Gibbons R, Higgs DR. Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment. Nat Genet. 2014 Feb;46(2):205-12.
Schoenfelder S, Furlan-Magaril M, Mifsud B, Tavares-Cadete F, Sugar R, Javierre BM, Nagano T, Katsman Y, Sakthidevi M, Wingett SW, Dimitrova E, Dimond A, Edelman LB, Elderkin S, Tabbada K, Darbo E, Andrews S, Herman B, Higgs A, LeProust E, Osborne CS, Mitchell JA, Luscombe N, Fraser P. The pluripotent regulatory circuitry connecting promoters to their long-range interacting elements. Genome Res. 2015 Apr;25(4):582-97.
Sexton T, Yaffe E, Kenigsberg E, Bantignies F, Leblanc B, Hoichman M, Parrinello H, Tanay A, Cavalli G. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell. 2012 Feb 3;148(3):458-72.
Markenscoff-Papadimitriou E, Allen WE, Colquitt BM, Goh T, Murphy KK, Monahan K, Mosley CP, Ahituv N, Lomvardas S. Enhancer interaction networks as a means for singular olfactory receptor expression. Cell. 2014 Oct 23;159(3):543-57.
Schoenfelder S, Sexton T, Chakalova L, Cope NF, Horton A, Andrews S, Kurukuti S, Mitchell JA, Umlauf D, Dimitrova DS, Eskiw CH, Luo Y, Wei CL, Ruan Y, Bieker JJ, Fraser P. Preferential associations between co-regulated genes reveal a transcriptional interactome in erythroid cells. Nat Genet. 2010 Jan;42(1):53-61.
Deng W., Blobel G., 2014. Manipulating nuclear architecture. Curr Op Genet Dev. 25:1-7.
Tolhuis B, Palstra RJ, Splinter E, Grosveld F, de Laat W. Looping and interaction between hypersensitive sites in the active beta-globin locus. Mol Cell. 2002 Dec;10(6):1453-65.
Burton JN, Liachko I, Dunham MJ, Shendure J. 2014. Species-Level Deconvolution of Metagenome Assemblies with Hi-C-Based Contact Probability Maps. G3 (Genes Genomes Genetics) 4, 1339-1346.
Ibrahim MM, Lacadie SA, Ohler U. JAMM: a peak finder for joint analysis of NGS replicates. Bioinformatics. 2015 Jan 1;31(1):48-55.
Whyte WA, Orlando DA, Hnisz D, Abraham BJ, Lin CY, Kagey MH, Rahl PB, Lee TI, Young RA. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell. 2013;153:307-319.
Newell WR, Mott R, Beck S, Lehrach H. Construction of genetic maps using distance geometry. Genomics. 1995 Nov 1;30(1):59-70.
Lee, K., Hsiung, C. C.-S., Huang, P., Raj, A. & Blobel, G. A. Dynamic enhancer-gene body contacts during transcription elongation. Genes Dev. 29, 1992-1997 (2015).
Larkin, J. D., Cook, P. R., & Papantonis, A. (2012). Dynamic reconfiguration of long human genes during one transcription cycle. Molecular and Cellular Biology, 2012 July, 32 (14), 2738-2747.
Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009 Jan;27(1):66-75.
Splinter E, Heath H, Kooren J, Palstra RJ, Klous P, Grosveld F, Galjart N, de Laat W. CTCF mediates long-range chromatin looping and local histone modification in the beta-globin locus. Genes Dev. 2006 Sep 1;20(17):2349-54.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.
R Development Core Team. 2010. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan 1;29(1):15-21.
Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011 Aug 4;12:323.

Claims

1. A method of determining interaction of a plurality of nucleic acid loci in a compartment comprising nucleic acids, comprising steps of
(a) separating nucleic acids from each other depending on their interaction in the compartment by (i) crosslinking nucleic acids with each other directly or indirectly, (ii) fragmenting the nucleic acids of the compartment to obtain fragments and/or cross-linked complexes of fragments, and (iii) dividing the fragmented nucleic acids to obtain a collection of fractions such that every fraction contains, on average, less than one copy of every locus, wherein steps (i) and (ii) can be carried out simultaneously or in any order;
(b) determining the presence or absence of the plurality of loci in said fractions; and
(c) determining the co-segregation of said plurality of loci in said fractions.
2. The method of claim 1, wherein the nucleic acids are DNA and/or RNA, preferably, DNA.
3. The method of any of the preceding claims, wherein the compartment is a nucleus of a eukaryotic cell, a mitochondrion or a prokaryotic cell.
4. The method of any of the preceding claims, wherein step (a) is preceded by preparation of a single cell suspension.
5. The method of any of the preceding claims, wherein fragmenting is carried out by at least one method selected from ultrasound, shearing, Dounce homogenisation, vortexing with glass-beads and restriction digest.
6. The method of any of the preceding claims, wherein more than 750 fractions are analysed.
7. The method of any of the preceding claims, wherein each fraction comprises, on average, 0.4-0.6 copies of each locus.
8. The method of any of the preceding claims, wherein the collection of fractions is obtained from a plurality of compartments.
9. The method of any of the preceding claims, wherein the plurality of loci is two loci to all nucleic acid loci in the compartment.
10. The method of any of the preceding claims, wherein the method allows for the detection of at least three co-segregating loci.
11. The method of any of the preceding claims, wherein no ligation is carried out before step (b).
12. The method of any of the preceding claims, wherein the presence or absence of the plurality of loci is determined by sequencing, preferably, by next generation sequencing.
13. The method of any of the preceding claims, wherein interaction is determined by analysing co-segregation with a statistical method selected from the group comprising inferential statistical methods, wherein loci are preferably determined to interact when they co-segregate at a frequency higher than expected from their linear genomic distance on a chromosome.
14. The method of any of the preceding claims, wherein separation into fractions in step (a) (iii) is preceded by step (a) (iii.0) comprising selection of fragments/complexes of fragments that are bound by a given molecule of interest, wherein the molecule of interest is selected from the group comprising a protein, RNA, DNA or a chemical modification of DNA, RNA or protein, wherein said selection is preferably performed by an affinity-based method selected from the group comprising precipitation with an antibody directed against the molecule of interest.
15. Use of the method of any of the preceding claims for
(a) determining the probability of interaction between a plurality of loci;
(b) mapping loci and/or genome architecture in the compartment;
(c) analysing interactions of different functional elements selected from the group comprising promoters, enhancers, enzymes, e.g., involved in transcription, RNA, transposable elements, transcription factor binding sites, repressors, gene bodies, splicing signals;
(d) identification of regulatory regions regulating expression of a specific gene;
(e) identification of targets for and/or effects of a drug capable of influencing co-segregation of loci;
(f) analysing effects of a gene therapy on co-segregation of loci;
(g) analysing a disturbed co-segregation in a disease;
(h) mapping chromosomal rearrangements;
(i) diagnosing a disease associated with a disturbed co-segregation of loci;
(j) stratifying patients with a specific disease into subgroups that are more or less likely to respond to a particular drug treatment depending on the proximity of certain loci;
(k) identifying microbiological species in a mixture of species; and/or
(l) mapping contacts mediated by a defined factor.
16. A method of diagnosing a disease associated with a disturbed co-segregation of loci in a patient, comprising, in a sample taken from said patient, analysing co-segregation of a plurality of loci in the patient with the method of any of claims 1-14, and comparing said co-segregation with co-segregation of said loci in a subject already diagnosed with said disease, wherein the co-segregation is preferably also compared with co-segregation in a healthy subject.
PCT/EP2016/057025 2015-03-31 2016-03-31 Genome architecture mapping on chromatin WO2016156469A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP15161949 2015-03-31
EP15161949.1 2015-03-31

Publications (1)

Publication Number Publication Date
WO2016156469A1 (en) 2016-10-06

Family

ID=52811014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2016/057025 WO2016156469A1 (en) 2015-03-31 2016-03-31 Genome architecture mapping on chromatin

Country Status (1)

Country Link
WO (1) WO2016156469A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100081141A1 (en) * 2008-08-06 2010-04-01 University Of Southern California Genome-Wide Chromosome Conformation Capture
WO2012159025A2 (en) * 2011-05-18 2012-11-22 Life Technologies Corporation Chromosome conformation analysis

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ANA POMBO ET AL: "Three-dimensional genome architecture: players and mechanisms", NATURE REVIEWS MOLECULAR CELL BIOLOGY, vol. 16, no. 4, 11 March 2015 (2015-03-11), pages 245 - 257, XP055207128, ISSN: 1471-0072, DOI: 10.1038/nrm3965 *
DEAR P H ET AL: "HAPPY MAPPING: A PROPOSAL FOR LINKAGE MAPPING THE HUMAN GENOME", NUCLEIC ACIDS RESEARCH, INFORMATION RETRIEVAL LTD, vol. 17, no. 17, 12 September 1989 (1989-09-12), pages 6795 - 6807, XP000371654, ISSN: 0305-1048 *
JENNIFER L CRUTCHLEY ET AL: "Chromatin conformation signatures: ideal human disease biomarkers?", BIOMARKERS IN MEDICINE, vol. 4, no. 4, 1 August 2010 (2010-08-01), pages 611 - 629, XP055155789, ISSN: 1752-0363, DOI: 10.2217/bmm.10.68 *
PHILIPPE COLLAS: "The Current State of Chromatin Immunoprecipitation", MOLECULAR BIOTECHNOLOGY, vol. 45, no. 1, 1 May 2010 (2010-05-01), pages 87 - 100, XP055021496, ISSN: 1073-6085, DOI: 10.1007/s12033-009-9239-8 *
TOLHUIS B ET AL: "Looping and interaction between hypersensitive sites in the active beta-globin locus", MOLECULAR CELL, CELL PRESS, CAMBRIDGE, MA, US, vol. 10, no. 6, 1 December 2002 (2002-12-01), pages 1453 - 1465, XP002301469, ISSN: 1097-2765, DOI: 10.1016/S1097-2765(02)00781-5 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018045137A1 (en) * 2016-09-02 2018-03-08 Ludwig Institute For Cancer Research Ltd Genome-wide identification of chromatin interactions
CN111727248A (en) * 2017-09-25 2020-09-29 弗雷德哈钦森癌症研究中心 Efficient targeted in situ whole genome profiling
EP3988669A1 (en) 2020-10-22 2022-04-27 Max-Delbrück-Centrum für Molekulare Medizin in der Helmholtz-Gemeinschaft Method for nucleic acid detection by oligo hybridization and pcr-based amplification
WO2022084528A1 (en) 2020-10-22 2022-04-28 Max-Delbrück-Centrum Für Molekulare Medizin In Der Helmholtz-Gemeinschaft Method for nucleic acid detection by oligo hybridization and pcr-based amplification
CN112599189A (en) * 2020-12-29 2021-04-02 北京优迅医学检验实验室有限公司 Data quality evaluation method for whole genome sequencing and application thereof
CN114842914A (en) * 2022-04-24 2022-08-02 山东大学 Chromatin loop prediction method and system based on deep learning
CN114842914B (en) * 2022-04-24 2024-04-05 山东大学 Deep learning-based chromatin ring prediction method and system

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 16712365; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: PCT application non-entry in European phase
Ref document number: 16712365; Country of ref document: EP; Kind code of ref document: A1