WO2023012193A1 - Method for targeted sequencing - Google Patents

Info

Publication number
WO2023012193A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna
sequence
generated
target nucleotide
nucleotide sequence
Prior art date
Application number
PCT/EP2022/071758
Other languages
French (fr)
Inventor
Max Jan van Min
Original Assignee
Cergentis B.V.
Priority date
Filing date
Publication date
Application filed by Cergentis B.V.

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1082 Preparation or screening gene libraries by chromosomal integration of polynucleotide sequences, HR-, site-specific-recombination, transposons, viral vectors
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay

Definitions

  • the present invention relates to the field of molecular biology and more in particular to DNA technology.
  • the invention in more detail relates to the sequencing of DNA.
  • the invention relates to strategies for determining (part of) a DNA sequence of a genomic region of interest.
  • the invention relates to the determination of the sequence of parts of a genome that are in a spatial configuration with each other.
  • Gene therapy is the introduction of exogenous nucleic acids (e.g. transgenes) into cells or organisms to achieve a therapeutic effect.
  • gene therapy involves the integration of one or more copies of a transgene into the host cell genome.
  • Although CRISPR-Cas9 and other genome editing techniques hold great promise, the majority of current gene therapies use retroviral, lentiviral or non-viral vectors that are non-targeted and can integrate at multiple sites, with some predilection for open chromatin and transcriptionally active regions.
  • Replication-defective retrovirus-based gene transfer vectors, in particular lentiviruses, which efficiently target and integrate into both dividing and non- or slowly dividing cells (e.g. stem cells and neurons), are the vector of choice for many applications.
  • lentivirus-based gene therapy clinical trials are ongoing, including targeting hematopoietic stem cells and T cells for a variety of diseases and adoptive T cell therapy (e.g. CAR-T) for cancer treatment.
  • Several methods have been developed to investigate and report locations of newly integrated viral DNA using next generation sequencing (NGS).
  • NGS next generation sequencing
  • the analysis of vector integration sites involves primer binding within the vector genome, elongation into the flanking host genome sequences, ligation of a common adapter on the host part, and subsequent polymerase chain reaction (PCR) amplification of the flanking host genomic sequences, followed by sequencing of the PCR amplicons.
  • PCR polymerase chain reaction
  • LAM-PCR unidirectional linear amplification-mediated PCR
  • Target enrichment strategies for sequencing, in which target genomic regions from a DNA sample are selected and subsequently sequenced.
  • Some of the most widely used targeted sequencing approaches include capture-based enrichment and PCR-based amplification. For example, in the use of capture probes, on an array or in solution, probes of 60-120 bases in length are used to capture the genomic region of interest via hybridisation. Alternatively, performing a PCR reaction using a single primer pair will amplify, and thus enrich for, a genomic region.
  • sequence information throughout the genomic region of interest is required beforehand to design probes and/or primers to capture and/or amplify the region of interest;
  • the assays are biased by using sequence data for the probes and/or primers which largely cover the genomic region of interest;
  • Targeted sequencing approaches have also been developed that rely on the physical proximity of sequences as the basis of selection.
  • TLA Targeted Locus Amplification
  • TLA and other physical proximity protocols are based on the concept that, in general, the chance of different fragments being crosslinked correlates inversely with the linear distance, i.e. the frequency of intra-chromosomal crosslinking is on average always higher than that of sequences from physically distant positions in the linear genome sequence or from other DNA fragments (e.g.
  • DNA fragments that ligate to the DNA fragment comprising the target nucleotide sequence are representative of the genomic region of interest comprising the target nucleotide sequence.
  • the TLA approach involves crosslinking of DNA, fragmenting the crosslinked DNA (e.g. with a restriction enzyme), followed by ligation of the crosslinked DNA fragments.
  • the ligated DNA fragments comprising the target nucleotide sequence, and thus the genomic region of interest may be enriched, e.g. by PCR.
  • the sequence of the genomic region of interest on the linear chromosome template can subsequently be determined using (high throughput) sequencing technologies well known in the art.
  • the present invention provides an improved physical proximity based approach for sequencing of genomic regions of interest using the principles of physical proximity protocols and the deselection of ligation products which are less valuable.
  • the invention provides a method for making a targeted DNA sequencing library (i.e. a DNA library suitable for sequencing) of a genomic region of interest comprising a target nucleotide sequence and a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence (e.g. from a targeted DNA sequencing library).
  • sample DNA is crosslinked and then fragmented with a fragmenting strategy that results in known fragment end sequences (e.g. restriction enzymes) and the crosslinked DNA fragments are then ligated.
  • the resulting ligation products are then digested at least once with a nuclease which targets the fusion sequences resulting from one or more known ligation product(s) which are less valuable in terms of the sequence analysis, prior to an enrichment step and sequencing step.
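The fusion sequences that a nuclease could target can be listed in advance once the recognition sequence and a reference sequence are known. The following is a minimal sketch of that enumeration, not the method of the invention itself. It assumes a linear reference, a palindromic recognition site that is regenerated at every junction, head-to-tail ligation only, and a hypothetical flank length of 7 nucleotides on each side (giving a 20-nucleotide fusion sequence for a 6-nucleotide site). The function names and the toy sequence are illustrative.

```python
from itertools import product


def split_at_site(seq: str, site: str) -> list[str]:
    """Cut a linear sequence at every occurrence of the recognition site.

    Simplification: the site itself is dropped from the fragments and
    re-inserted at each junction, as if ligation regenerated it.
    """
    return seq.split(site)


def candidate_fusions(fragments: list[str], site: str, flank: int = 7) -> dict:
    """Return {(i, j): fusion} for fragment i ligated upstream of fragment j.

    Each fusion is: last `flank` nt of fragment i + site + first `flank` nt
    of fragment j. Reverse-orientation ligations and self-circularisation
    are ignored in this sketch.
    """
    n = len(fragments)
    fusions = {}
    # Only ends created by the cut can ligate: the last fragment has no
    # cut end on its right, and the first fragment has none on its left.
    for i, j in product(range(n - 1), range(1, n)):
        if i == j:
            continue
        fusions[(i, j)] = fragments[i][-flank:] + site + fragments[j][:flank]
    return fusions


# Toy reference with two sites, giving three fragments.
reference = "AAAACCCCTTTT" "GAATTC" "GGGGAAAACCCC" "GAATTC" "TTTTGGGGAAAA"
frags = split_at_site(reference, "GAATTC")
fusions = candidate_fusions(frags, "GAATTC")
# Junctions absent from the reference arise only through proximity ligation.
novel = {pair: seq for pair, seq in fusions.items() if seq not in reference}
```

In this toy case the junctions between neighbouring fragments reproduce the reference sequence, whereas the junction between the first and third fragment is new. A real design would also have to consider both fragment orientations and the overhang chemistry of the chosen enzyme.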
  • Ligation products which are less valuable may, for example, be of the DNA fragment containing the target nucleotide sequence and one or more DNA fragments that originally occurred in immediate or close physical proximity to the target nucleotide sequence on the linear chromosome template.
  • the targeted digestion of the methods of the invention results in the digestion of the relatively large number of ligation products that, without this step, would result in high sequencing coverage across sequences in the immediate vicinity of the target nucleotide sequence.
  • the ligations of all DNA fragments thus also result in ligation products that remain undigested following the digestion step using a nuclease which is directed to the specific fusion sequence.
  • the methods of the invention provide more efficient sequencing of an entire genomic region of interest. This is achieved by more preferentially sequencing those DNA sequences within the genomic region of interest with larger physical distances from the target nucleotide sequence in the linear chromosome template than in conventional physical proximity protocols.
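The shift in coverage towards more distant sequences can be illustrated with a small calculation. This is a sketch under a stated assumption: ligation frequency is taken to be proportional to 1/distance, which is only one simple way to model the inverse correlation with linear distance and is not a figure from this document. The function name, the distances and the `efficiency` parameter are hypothetical.

```python
def read_shares(distances_kb, depleted=frozenset(), efficiency=1.0):
    """Relative share of ligation products per partner fragment.

    distances_kb: linear distance of each partner fragment from the target.
    depleted:     distances whose ligation products are digested.
    efficiency:   fraction of targeted products actually removed (0..1).
    ASSUMPTION: ligation frequency is proportional to 1/distance.
    """
    weights = {}
    for d in distances_kb:
        w = 1.0 / d
        if d in depleted:
            w *= 1.0 - efficiency  # digested products drop out of the library
        weights[d] = w
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


partners = [1, 2, 4, 8, 16, 32]
before = read_shares(partners)
after = read_shares(partners, depleted={1, 2})
```

Under this toy model, removing the two nearest partners raises the share of the most distant partner more than fourfold, which is the qualitative effect described above.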
  • the methods of the invention therefore improve the sequencing efficiency of any genomic region of interest. This includes regions comprising repeats (e.g. concatemerized sequences) as the method of the invention improves the efficiency with which sequences at either end of such a repeat are sequenced. This also includes regions comprising large structural changes.
  • ligation products resulting from alleles in which structural changes have occurred e.g. an insertion or translocation
  • ligation products that will not be digested e.g. products of ligation events between the target nucleotide sequence and the inserted or translocated sequence.
  • These ligation products which are not digested will therefore be more preferentially sequenced compared to ligation events originating from wild-type alleles that are digested.
  • the methods of the invention also present particular advantages in the sequencing of integrated transgene sequences and vector integration sites in samples comprising integrated and episomal copies of a vector (which also comprise copies of the integrated transgene sequences).
  • Conventional targeted sequencing approaches and physical proximity approaches cannot distinguish between DNA fragments originating from integrated and episomal copies of a vector / transgene sequence and this limits the efficiency and quality with which integrated copies and their integration sites can be sequenced.
  • the method of the invention enables the targeted sequencing of integrated transgene copies by exploiting the differences between the ligation products originating from integrated and episomal copies of a vector / transgene sequence.
  • Episomal copies of the vector / transgene sequences occur in DNA molecules which are physically separated from each other and the host genome.
  • the invention is based upon the concept that, in a physical proximity approach, ligation products from episomal copies of a vector / transgene sequence frequently exclusively contain ligation events between DNA fragments from the episomal copies of the vector / transgene, e.g. between the DNA fragment containing the target nucleotide sequence and DNA fragments from the episomal copies of the vector / transgene sequence.
  • integrated copies of the vector sequence are located within a much longer stretch of DNA originating from the host genome such that ligation events are more likely to occur between sequences at either end of the fragmentation site (i.e. recognition sequence) in the vector sequence and sequences originating from its integration site.
  • ligation products from integrated copies of the vector / transgene sequence will much more frequently contain ligation events between the DNA fragment containing the target nucleotide sequence (e.g. transgene or portion thereof) and DNA fragments from the integration site in the host genome, i.e. sequences of the host genome flanking the transgene.
  • the selective nuclease digestion of possible combinations of DNA fragment ends from the vector / transgene sequence will preferentially result in the depletion of ligation events originating from episomal copies.
  • one or more of the potential fusion sequences resulting from the ligation of the DNA fragment containing the target nucleotide sequence and DNA fragments of the vector / transgene sequence is/are targeted by nuclease digestion.
  • ligation products which exclusively contain vector / transgene sequences e.g. vector-vector, vector-transgene and/or transgene-transgene ligation events
  • vector-vector, vector-transgene and/or transgene-transgene ligation events are targeted for digestion.
  • nuclease digestion of one or more of the potential fusion sequences resulting from the ligation of the DNA fragment containing the target nucleotide sequence and DNA fragments of the vector / transgene sequence results in the very preferential enrichment and sequencing of ligation events of the DNA fragment containing the target nucleotide sequence and DNA fragments from the host genome. This then results in complete sequence information across integrated vectors and integration sites in the host genome.
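The selective depletion of vector-vector junctions can be sketched as a simple filter over ligation products. This is an illustrative model only: it treats a product as digested whenever a targeted fusion sequence occurs on either strand, assumes complete digestion, and uses made-up sequences and hypothetical function names.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_digested(ligation_product: str, targeted_fusions) -> bool:
    """True if any targeted fusion sequence occurs on either strand."""
    return any(
        f in ligation_product or revcomp(f) in ligation_product
        for f in targeted_fusions
    )


def deplete(ligation_products, targeted_fusions):
    """Keep only ligation products that escape digestion."""
    return [p for p in ligation_products if not is_digested(p, targeted_fusions)]


# Hypothetical fragment ends: two from the vector, one from the host genome.
site = "GAATTC"
vector_end_1, vector_end_2, host_end = "ACGTACG", "TTGACCA", "GGCATCC"

targeted = [vector_end_1 + site + vector_end_2]       # vector-vector fusion
episomal = "CC" + vector_end_1 + site + vector_end_2 + "AA"
integrated = "CC" + vector_end_1 + site + host_end + "AA"

# The episomal product is also supplied as its opposite strand.
survivors = deplete([episomal, revcomp(episomal), integrated], targeted)
```

Only the vector-host product remains in `survivors`, so a subsequent enrichment for the target nucleotide sequence would draw on junctions with the host genome. Passing several fusion sequences in `targeted` corresponds to a multiplex digestion.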
  • An important advantage of the methods of the invention is the provision of complete sequence information across the genomic region of interest despite the deselection of ligation products which are less valuable.
  • this enables the quality control of integrated sequences and will also provide the breakpoint sequence between the integrated vector / transgene and the host genome at the exact position of the integration site.
  • “vector integration site” and “transgene integration site” are used interchangeably to refer to the genomic locus within which the vector / transgene has integrated.
  • the methods of the invention can also be applied to sequencing of targeted transgene integration (e.g. using CRISPR-Cas9 or other genome editing techniques) and to in vivo analysis of integration sites.
  • the methods of the invention facilitate safety evaluation of preclinical lentiviral vector gene therapies by providing vector integration site analysis with improved confidence.
  • the invention provides a method for making a DNA sequencing library of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e
  • (ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) optionally, determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • the invention provides a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an ex
  • the genomic region of interest comprises one or more further target nucleotide sequences.
  • a DNA sequencing library of a plurality of genomic regions of interest is made.
  • sequences of a plurality of genomic regions of interest are determined.
  • the fragmenting step b) comprises fragmenting with a restriction enzyme.
  • the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a CRISPR-Cas nuclease.
  • the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
  • the method further comprises the step of: i) (a) further fragmenting the crosslinked DNA provided in step a) or the ligated crosslinked DNA generated in step c); and
  • i) (b) ligating the fragmented DNA generated in step i) (a), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step d); or ii) (a) further fragmenting the undigested DNA generated in step f); and
  • ii) (b) optionally, circularising or ligating the fragmented DNA generated in step ii) (a), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
  • the further steps i) (a) and i) (b) are repeated at least once.
  • the further step i) (a) is performed using random fragmentation.
  • the further steps i) (a) and ii) (a) are performed using non-random fragmentation at a recognition sequence.
  • the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence used in the further fragmenting step i) (a) or step ii) (a).
  • the average length of the DNA fragments generated in fragmenting step b) is greater than the average length of the DNA fragments generated in further fragmenting step i) (a) or step ii) (a).
  • the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 6 to 8 nucleotides in length.
  • step e) is performed prior to step d).
  • the method further comprises the step of circularising the DNA of step d), prior to step e).
  • the fusion sequence of digesting step e) is of greater length than the recognition sequence of fragmenting step b).
  • the fusion sequence is of 15 to 25 nucleotides in length, preferably of 20 nucleotides in length.
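The lengths given above for the recognition sequence and the fusion sequence can be related to how often such sequences are expected to occur by chance. The following is a back-of-the-envelope sketch, assuming a random sequence with equal and independent base frequencies and counting one strand only. Real genomes deviate from this, and the genome size in the usage lines is an approximate figure supplied for illustration.

```python
def expected_spacing_bp(k: int) -> int:
    """Expected distance between chance occurrences of a specific k-mer.

    ASSUMPTION: uniform, independent bases, so each position matches with
    probability 4**-k and matches are on average 4**k bp apart.
    """
    return 4 ** k


def expected_chance_sites(k: int, genome_bp: float) -> float:
    """Expected number of chance occurrences of one k-mer in a genome."""
    return genome_bp / expected_spacing_bp(k)


# Approximate human genome size, an assumption for illustration only.
HUMAN_GENOME_BP = 3.1e9

spacing_6 = expected_spacing_bp(6)     # 6-nt recognition sequence
spacing_8 = expected_spacing_bp(8)     # 8-nt recognition sequence
chance_20 = expected_chance_sites(20, HUMAN_GENOME_BP)  # 20-nt fusion sequence
```

Under these assumptions a longer recognition sequence cuts less often and so gives longer fragments on average, while a 20-nucleotide fusion sequence is expected to occur by chance far less than once per genome. This is consistent with a fusion-sequence-specific nuclease leaving most other ligation products intact.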
  • the nuclease of digesting step e) is a restriction enzyme.
  • the nuclease of digesting step e) is a site-directed nuclease.
  • the site-directed nuclease is a CRISPR-Cas nuclease.
  • the digesting step e) uses a multiplex nuclease digestion specific for a plurality of specific fusion sequences.
  • the method further comprises the steps of:
  • A’) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using the nuclease of step e);
  • B’) ligating the mixture of digested and undigested DNA generated in step A’); prior to step d).
  • the further steps A’) and B’) are repeated at least once.
  • the enriching step f) (ii) comprises amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i).
  • amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) comprises using at least one primer which hybridises to the DNA fragment comprising the target nucleotide sequence generated in step b) and optionally further using a plurality of primers which each hybridises to the DNA sequence of one of the DNA fragments comprising the one or more further target nucleotide sequences generated in step b).
  • At least one primer directs amplification towards the recognition sequence of step b).
  • an identifier is included in the at least one primer.
  • the enriching step f) (ii) comprises capture-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
  • the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested DNA using PCR.
  • the PCR uses at least one primer facing a fragment end generated by the nuclease of step e).
  • the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested circularised DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested circularised DNA generated in step e) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested linearised DNA using PCR.
  • the method further comprises the step of size selection prior to or after the enrichment step f) (ii), preferably wherein the size selection step comprises using gel extraction chromatography, gel electrophoresis or density gradient centrifugation.
  • DNA is selected of a size between 20-20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs.
  • step g) is performed using whole genome sequencing.
  • step g) comprises determining at least part of the sequence of the undigested DNA comprising the target nucleotide sequence.
  • the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step g).
  • a contig is built for each ploidy.
  • the step of building a contig comprises the steps of:
  • 1) identifying the fragments of step b);
  • the step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step e) and coupling of the different ligation products to the identified fragments.
  • the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the genomic region of interest.
  • the genomic region of interest comprises a transgene integration site.
  • the target nucleotide sequence comprises a transgene.
  • the first and second flanking sequences are from two separate DNA fragments from the transgene and/or from the vector used to deliver the transgene.
  • i) the first and second flanking sequences are each from a separate DNA fragment from the vector; ii) the first flanking sequence is from a DNA fragment from the vector and the second flanking sequence is from a DNA sequence from the transgene; or iii) the first and second flanking sequences are each from a separate DNA fragment from the transgene.
  • the target nucleotide sequence comprises an allele of the genomic region of interest.
  • the first and second flanking sequences are from separate DNA fragments from the allele.
  • the invention provides a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a); c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence; f) circularising the fragmented DNA generated in step e); g) digesting the circularised DNA generated in step f) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step e) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA
  • Figure 1 Schematic of an illustrative approach for targeted sequencing of integrated viruses and integration sites in accordance with the invention.
  • a target nucleotide sequence (outlined with a black box) is chosen in proximity to one of the restriction sites. In episomal copies, ligation of the DNA fragment end in proximity to the target nucleotide sequence with other DNA fragments from the vector sequence will result in known fusion sequences which may be targeted with nucleases.
  • the DNA fragment end in proximity to the target nucleotide sequence will frequently ligate to DNA fragments from the integration site, i.e. the host genome.
  • Figure 2 Schematic of an illustrative approach for whole genome sequencing in accordance with the invention.
  • (A) A viral genome sequence with annotation of a restriction site which occurs twice in the viral genome sequence. The fragmentation of the viral genome with the restriction enzyme which recognizes this restriction site will result in three fragments, shown in different shades. Resulting fragment ends A, B, C and D generated in the fragmentation of the viral genome are shown.
  • (B) A portion of the human genome sequence showing the presence of multiple restriction sites (recognised by the same restriction enzyme as in Figure 2A) in the human genome in proximity to any integration site of the virus. The selective digestion of transgene-transgene, transgene-vector and vector-vector fusion sequences will help minimise or prevent the sequencing of episomal copies.
  • Figure 3 Plasmid map of pSUB201 showing the position of unique restriction sites within the plasmid sequence.
  • a method for isolating "a" DNA molecule includes isolating a plurality of molecules (e.g. 10's, 100's, 1000's, 10's of thousands, 100's of thousands, millions, or more molecules).
  • nucleic acid may include any polymer or oligomer of pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively (see Albert L. Lehninger, Principles of Biochemistry, at 793-800 (Worth Pub. 1982) which is herein incorporated by reference in its entirety for all purposes).
  • the present invention contemplates any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of these bases, and the like.
  • the polymers or oligomers may be heterogeneous or homogenous in composition, and may be isolated from naturally occurring sources or may be artificially or synthetically produced.
  • the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.
  • By “aligning” and “alignment” is meant the comparison of two or more nucleotide sequences based on the presence of short or long stretches of identical or similar nucleotides.
  • Methods and computer programs for alignment are well known in the art.
  • One computer program which may be used or adapted for aligning is "Align 2", authored by Genentech, Inc., which was filed with user documentation in the United States Copyright Office, Washington, D.C. 20559, on Dec. 10, 1991.
  • a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in
  • genomic region of interest refers to a DNA sequence of an organism of which it is desirable to determine, at least part of, the DNA sequence.
  • a genomic region which comprises, or is suspected of comprising, an allele associated with a disease may be a genomic region of interest.
  • a genomic region which comprises a vector insertion site is another example.
  • the whole genome sequence may be determined following deselection of episomal copies of the vector / transgene sequence.
  • the genomic region of interest is the whole genome.
  • target nucleotide sequence refers to a DNA sequence of interest within a genomic region of interest.
  • the target nucleotide sequence may be a transgene or a portion thereof.
  • the target nucleotide sequence may be an allele or a portion thereof.
  • the target nucleotide sequence may be used for the design of the non-random fragmenting strategies described herein as well as in the enrichment steps described herein.
  • DNA fragments that originate from a genomic region of interest remain in proximity of each other because they are crosslinked.
  • DNA fragments of the genomic region of interest which are in the proximity of each other due to the crosslinks, are ligated.
  • This type of ligation is also referred to as proximity ligation.
  • DNA fragments comprising the target nucleotide sequence may ligate with DNA fragments within a large linear distance at the sequence level.
  • Each individual target nucleotide sequence is likely to be crosslinked to multiple other DNA fragments.
  • often more than one DNA fragment may be ligated to a fragment comprising the target nucleotide sequence and, in a sample comprising multiple copies of a genomic region of interest, each individual DNA fragment comprising the target nucleotide sequence may ligate to different combinations of DNA fragments originating from the genomic region of interest.
  • a sequence of the genomic region of interest may be built.
  • a DNA fragment ligated with the fragment comprising the target nucleotide sequence includes any fragment which may be present in ligation products.
  • ligation product means a DNA sequence which is generated by ligating DNA fragments together.
  • a ligation product comprises at least two DNA fragments.
  • the DNA fragments which are subsequently ligated have been produced by a previous fragmentation step.
  • the methods of the invention have the advantages that extensive sequence information is not required to focus on the genomic region of interest and the method is not sequence-biased (i.e. bias by using oligonucleotides and/or probes which cover the transgene of interest, allelic sequence of interest, or flanking sequences surrounding the sequence of interest, is avoided).
  • the methods of the invention may be used in the analysis of the 3D folding of regions of interest.
  • Methods for the analysis of the 3D folding of regions of interest are known in the art (see, for example, Sungalee et al (2021) Nature Genetics 53: 650-662).
  • the methods of the invention can be applied to the analysis of the 3D folding of regions of interest.
  • step a) a sample of crosslinked DNA is provided.
  • sample DNA refers to a sample that is obtained from an organism or from a tissue of an organism, or from tissue and/or cell culture, which comprises DNA.
  • a sample DNA from an organism may be obtained from any type of organism, e.g. microorganisms, viruses, plants, fungi, animals, humans and bacteria, or combinations thereof.
  • a tissue sample from a human patient suspected of a bacterial and/or viral infection may comprise human cells, but also viruses and/or bacteria.
  • the sample may comprise cells and/or cell nuclei.
  • the sample may comprise or consist of isolated DNA.
  • the sample DNA is from a patient or a person who may be at risk of, suspected of having, or has a particular disease, for example cancer, a viral infection (e.g. HIV-1) or any other condition which warrants the investigation of their DNA.
  • the sample DNA is from a patient or a person who is undergoing or has undergone gene therapy, for example using a lentiviral vector.
  • Samples may be taken from a patient and/or from diseased tissue, and may also be derived from other organisms or from separate sections of the same organism, such as samples from one patient, one sample from healthy tissue and one sample from diseased tissue. Samples may thus be analysed according to the invention and compared with a reference sample, or different samples may be analysed and compared with each other. For example, for a patient being suspected of having cancer, a biopsy may be obtained from the suspected tumour. Another biopsy may be obtained from non-diseased tissue. Both tissue biopsies may be analysed according to the invention.
  • Genomic regions of interest may be those containing a gene associated with the cancer type (e.g. the BRCA1 and BRCA2 genes, which are 83 and 86 kb long, respectively (reviewed in Mazoyer, 2005, Human Mutation 25:415-422), for suspected breast cancer).
  • a reference gene sequence e.g. a reference BRCA gene sequence
  • crosslinking means reacting DNA at two different positions, such that these two different positions may be connected.
  • the connection between the two different positions may be direct, forming a covalent bond between DNA strands.
  • Two DNA strands may be crosslinked directly using UV-irradiation, forming covalent bonds directly between DNA strands.
  • the connection between the two different positions may be indirect, via an agent, e.g. a crosslinker molecule.
  • a first DNA section may be connected to a first reactive group of a crosslinker molecule comprising two reactive groups, and the second reactive group of the crosslinker molecule may be connected to a second DNA section, thereby crosslinking the first and second DNA sections indirectly via the crosslinker molecule.
  • a crosslink may also be formed indirectly between two DNA strands via more than one molecule.
  • a typical crosslinker molecule that may be used is formaldehyde.
  • Formaldehyde induces protein-protein and DNA-protein crosslinks.
  • Formaldehyde thus may crosslink different DNA strands to each other via their associated proteins.
  • formaldehyde can react with a protein and DNA, connecting a protein and DNA via the crosslinker molecule.
  • two DNA sections may be crosslinked using formaldehyde forming a connection between a first DNA section (DNA1) and a protein
  • the protein may form a second connection with another formaldehyde molecule that connects to a second DNA section (DNA2), thus forming a crosslink which may be depicted as DNA1-crosslinker-protein-crosslinker-DNA2.
  • crosslinking according to the invention involves forming connections (directly or indirectly) between strands of DNA that are in physical proximity of each other.
  • DNA strands may be in physical proximity of each other in the cell, as DNA is highly organised, while being separated from a linear sequence point of view e.g. by 100kb.
  • provided that the crosslinking method is compatible with the subsequent fragmenting and ligation steps, such crosslinking may be contemplated for the purpose of the invention.
  • sample of crosslinked DNA refers to sample DNA which has been subjected to crosslinking.
  • Crosslinking the sample DNA has the effect that the three-dimensional state of the DNA within the sample remains largely intact. This way, DNA strands that are in physical proximity of each other remain in each other’s vicinity.
  • crosslinking the sample DNA as it is present in the sample results in largely maintaining the three dimensional architecture of the DNA.
  • the sample of crosslinked DNA is fragmented in step b). By fragmenting the crosslinked DNA, DNA fragments are produced which are held together by the crosslinks.
  • the fragmenting step b) may comprise fragmenting with one or more restriction enzymes, or combinations thereof.
  • the fragmenting step b) may also comprise fragmenting with one or more site-directed nucleases, or combinations thereof.
  • Fragmenting with a restriction enzyme or site-directed nuclease is advantageous as it may allow control of the average fragment size.
  • the fragments that are formed may have compatible overhangs or blunt ends that allow ligation of the fragments in the subsequent step c).
  • restriction enzymes or site-directed nucleases with different recognition sites may be used. This is advantageous because by using different restriction enzymes or site-directed nucleases having different recognition sites, different DNA fragments can be obtained from each subsample.
  • the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
  • fragmenting DNA includes any technique that, when applied to DNA, which may be crosslinked DNA or not, or any other DNA, results in DNA fragments. Techniques well known in the art are sonication, shearing and/or enzymatic restriction, but other techniques can also be envisaged. Fragmenting techniques may result in random fragmentation of the DNA (e.g. sonication or shearing). Suitably, the fragmenting technique may result in non-random (i.e. targeted) fragmentation of the DNA (e.g. restriction enzymes or site-directed nucleases).
  • the methods of the invention may also be performed using DNA fragmentation techniques that preferentially fragment either methylated DNA or unmethylated DNA and may be used for the targeted sequencing of either methylated or unmethylated alleles of genomic regions of interest, respectively.
  • certain restriction enzymes preferentially fragment methylated DNA (as compared to unmethylated DNA) whilst other restriction enzymes preferentially fragment unmethylated DNA (as compared to methylated DNA).
  • the methods of the invention are applicable to the sequencing of alleles in which known sequences are either methylated or unmethylated.
  • the promoter sequences of actively transcribed genes are typically unmethylated, whereas the corresponding gene body sequences can contain enriched levels of methylation.
  • the digestion of unmethylated DNA or methylated DNA will result in the deselection of either the promoter or corresponding gene body sequence, respectively.
  • the methods of the invention permit the enrichment and sequencing of the promoter or corresponding gene body sequence.
  • the methods of the invention may be used in combination with bisulfite treatment.
  • the methods may be used for the sequencing and quantification of epigenetic changes in alleles in which known sequences are either methylated or unmethylated.
  • examples of nucleases which preferentially fragment methylated DNA include site-specific methyl-directed (MD) DNA endonucleases (e.g. GlaI). These enzymes recognise and cleave methylated DNA sequences only and do not cleave unmethylated DNA sequences (Tarasova et al. (2008) BMC Mol. Biol. 9: 7). Suitably, a restriction enzyme or site-directed nuclease may be used.
  • methylation-sensitive restriction enzymes that fragment unmethylated DNA include:
  • nucleases which preferentially fragment methylated and/or unmethylated DNA may be used.
  • By random fragmentation of the DNA it is meant that the fragmenting technique results in DNA fragments with unknown end sequences. Sonication results in the fragmenting of DNA at random sites, and the resulting ends can be either blunt or can have 3’- or 5’-overhangs. As these DNA breakage points occur randomly, the DNA may be repaired (enzymatically), filling in possible 3’- or 5’-overhangs, such that DNA fragments are obtained which have blunt ends that allow ligation of the fragments to adaptors and/or to each other in a subsequent step. Alternatively, the overhangs may also be made blunt ended by removing overhanging nucleotides, using e.g. exonucleases.
  • the sample of crosslinked DNA is fragmented in step b) by non-random fragmentation of the DNA at a recognition sequence.
  • By non-random fragmentation of the DNA it is meant that the fragmenting technique results in DNA fragments with known end sequences, i.e. that the fragmenting is targeted.
  • non-random fragmentation involves fragmenting at a specific recognition sequence.
  • the non-random fragmentation of the DNA may be performed using a site-directed nuclease or a restriction enzyme which targets a specific recognition sequence.
  • the specific recognition sequence is also referred to herein as a “restriction enzyme site” or “restriction site”.
  • the term “recognition sequence” means a specific nucleotide sequence which is recognised by a fragmenting technique (e.g. a site-directed nuclease or restriction enzyme) and directs cleavage of the DNA molecule at or near the recognition sequence.
  • the specific nucleotide sequence which is recognized may determine the frequency of cleaving, e.g. a nucleotide sequence of 6 nucleotides occurs on average every 4096 nucleotides, whereas a nucleotide sequence of 4 nucleotides occurs much more frequently, on average every 256 nucleotides.
  • the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 4 to 8 nucleotides in length, more preferably 4 to 6 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 3 (suitably, of 4, 5, 6, 7, 8, 9, 10, 11 or 12) nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 4 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 5 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 6 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 7 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 8 nucleotides in length.
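The relationship between recognition sequence length and cleavage frequency stated above (every 256 nucleotides for 4 nucleotides, every 4096 for 6) can be checked with a short calculation. This is a minimal sketch that assumes the four bases are equally frequent and independently distributed, which real genomes only approximate; the function name is hypothetical and not part of the described method.

```python
def expected_site_spacing(site_length: int, alphabet_size: int = 4) -> int:
    """Average number of nucleotides between occurrences of one specific
    recognition sequence, ASSUMING equal and independent base frequencies."""
    if site_length < 1:
        raise ValueError("site_length must be a positive integer")
    # Each position matches with probability 1/alphabet_size, so a specific
    # site of length n is expected once every alphabet_size ** n positions.
    return alphabet_size ** site_length


# Lengths taken from the ranges discussed in the text.
for length in (4, 6, 8):
    print(length, expected_site_spacing(length))
```

Under the same assumption, a longer recognition sequence gives larger average fragments, which is one way the choice of enzyme may control average fragment size.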
  • the fragmenting step b) comprises fragmenting with a restriction enzyme.
  • restriction endonuclease and “restriction enzyme” are used interchangeably to mean an enzyme that recognizes a specific nucleotide sequence (i.e. recognition sequence) in a double-stranded DNA molecule, and will cleave both strands of the DNA molecule at or near every recognition sequence, leaving a blunt end or a 3’- or 5’- overhanging end.
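Non-random fragmentation at a recognition sequence can be illustrated in silico. The sketch below cuts a single-strand string representation at every occurrence of a site; the example site, the cut offset and the function name are illustrative assumptions, and overhang geometry on the complementary strand is deliberately ignored.

```python
def digest(sequence: str, site: str, cut_offset: int = 0) -> list[str]:
    """Split `sequence` at every occurrence of `site`.

    The cut is placed `cut_offset` bases into the site (0 = immediately
    before the site). Only the given strand is modelled.
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        # Continue searching one base further along so that every
        # occurrence of the site is visited.
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


# Hypothetical 4-nucleotide recognition sequence, cut before the first base.
example = "AAGATCTTGATCCC"
pieces = digest(example, "GATC", cut_offset=0)
print(pieces)
```

Because the site is known, every internal fragment end carries a known sequence, which is what distinguishes this from random fragmentation such as sonication.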
  • the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated protein (Cas) nuclease.
  • site-directed nuclease means a DNA-cutting enzyme (nuclease) which is directed to recognize a predetermined specific nucleotide sequence (i.e. recognition sequence) and to cleave both strands of the DNA molecule at or near every recognition sequence.
  • the site-directed nuclease may be engineered to target a desired recognition sequence.
  • the site-directed nuclease may be a zinc-finger nuclease (ZFN), transcription activator-like effector nuclease (TALEN), Argonaute (Ago) protein (Song et al. (2020) Nucleic Acids Research 48(4): e19) or a CRISPR-Cas nuclease (e.g. Cas9).
  • the site-directed nuclease is a CRISPR-Cas nuclease.
  • CRISPR is a family of DNA sequences found in the genomes of prokaryotic organisms such as bacteria and archaea which play a key role in the anti-bacteriophage defence system of prokaryotes and provide a form of acquired immunity.
  • CRISPR loci also termed CRISPR arrays
  • CRISPR arrays comprise regularly spaced repeat sequences with each repeat separated by a unique spacer sequence. These spacer sequences are typically derived from the genome of invading bacteriophages or extrachromosomal DNA (e.g. plasmids) and are used to detect and destroy DNA having a similar sequence during subsequent invasions.
  • a CRISPR array is accompanied by a set of homologous genes that make up the CRISPR-associated system (cas) genes.
  • cas genes have been identified and grouped into 35 families based on sequence similarity of the encoded proteins.
  • the Cas proteins comprise helicase and/or nuclease motifs, and are involved in the maintenance and formation of the dynamic structure of the CRISPR loci as well as recognition and degradation of invading DNA.
  • CRISPR-Cas systems require a transactivating CRISPR RNA (tracrRNA) which plays a role in the maturation of CRISPR RNAs (crRNAs).
  • crRNAs are transcribed from the CRISPR locus, specifically, the spacer sequences are used to generate crRNA.
  • tracrRNA is a small trans-encoded RNA which is partially complementary to and base pairs with a pre-crRNA, thereby forming an RNA duplex. This duplex is cleaved by the RNA-specific ribonuclease RNase III to form a crRNA/tracrRNA hybrid which acts as a guide RNA (gRNA).
  • Type V CRISPR-Cas systems require the crRNA but not tracrRNA.
  • crRNAs are transcribed from the CRISPR locus and then incorporated into effector complexes comprising a CRISPR-Cas nuclease, where the crRNA acts as a gRNA.
  • a crRNA alone or a crRNA/tracrRNA hybrid forms the gRNA.
  • a gRNA guides the Cas enzyme to destroy DNA (e.g. bacteriophage DNA or plasmids) having a specific sequence (i.e. guides the Cas nuclease to provide immunity against repeat invasions).
  • the gRNA associates with a CRISPR-Cas nuclease and directs the enzyme to recognize and cleave the DNA target complementary to the gRNA sequence.
  • the CRISPR-Cas nuclease will typically bind to and cleave the DNA sequence adjacent to a protospacer adjacent motif (PAM).
  • a PAM is a 2-6-base pair DNA sequence. PAMs were initially identified as a consensus sequence found adjacent to the protospacer (the sequence in the bacteriophage or plasmid which will form the spacer in the CRISPR locus). There is a close correlation between the sequence identity of PAM and the CRISPR subtype.
  • the PAM is a component of the invading bacteriophage or plasmid, but is not found in the bacterial host genome and hence is not a component of the bacterial CRISPR locus.
  • the bacterial CRISPR locus does not contain a PAM sequence, and thus will not be cut by the CRISPR-Cas nuclease, but the protospacer in the invading virus or plasmid (or the target sequence in e.g. genome editing approaches) will contain the PAM sequence, and thus will be cleaved by the nuclease.
  • the PAM is a targeting component which prevents the CRISPR locus from being targeted and destroyed by the CRISPR-Cas nuclease.
  • the nucleases can be engineered to recognize different PAMs.
  • CRISPR-Cas nucleases of Type II and Type V systems have been exploited in gene editing technologies, e.g. CRISPR-associated protein 9 (Cas9) and CRISPR-associated protein 12a (Cas12a).
  • the native Cas9 endonuclease is a four-component system that includes the crRNA and tracrRNA.
  • the Cas9 endonuclease has been re-engineered into a two-component system by fusing the two RNA molecules into a single gRNA.
  • the synthetic Cas9 system can be directed to target any DNA sequence for cleavage.
  • Cas9 provides a “blunt” cut in the target DNA strand.
  • Cas9 cuts the DNA 3 base pairs upstream of the PAM site, such that the NHEJ pathway results in indel mutations that destroy the recognition sequence in a cell, thereby preventing further rounds of cutting.
  • the canonical Cas9 PAM is the sequence 5'-NGG-3', where "N" is any nucleobase followed by two guanine ("G") nucleobases.
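The Cas9 targeting rules described above (a 5'-NGG-3' PAM, with a blunt cut 3 base pairs upstream of the PAM) can be sketched as a simple scan of one strand. The 20-nucleotide protospacer length is an assumption used for illustration, the function name is hypothetical, and the reverse strand is not scanned.

```python
import re


def find_cas9_sites(sequence: str, protospacer_len: int = 20):
    """Return (protospacer, pam, cut_index) for each 5'-NGG-3' PAM found on
    the given strand. `cut_index` is the index of the first base after the
    blunt cut, i.e. 3 bases upstream of the PAM."""
    sites = []
    # A lookahead is used so that overlapping PAM candidates are all found.
    for match in re.finditer(r"(?=([ACGT]GG))", sequence):
        pam_start = match.start()
        if pam_start < protospacer_len:
            continue  # not enough upstream sequence for a full protospacer
        protospacer = sequence[pam_start - protospacer_len:pam_start]
        cut_index = pam_start - 3
        sites.append((protospacer, match.group(1), cut_index))
    return sites


# Hypothetical target: a 20-nt protospacer followed by a TGG PAM.
target = "ACGTACGTACGTACGTACGT" + "TGG" + "AAAA"
print(find_cas9_sites(target))
```

In practice the gRNA would be designed to be complementary to the protospacer returned here; off-target considerations are outside the scope of this sketch.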
  • the nuclease Cas12a (formerly known as Cpf1) requires only the crRNA for successful targeting and provides a 3’ or 5’ overhanging end in the double stranded target DNA following cleavage.
  • Cas12a may be more suitable for multiplexed CRISPR than Cas9, as more of the small crRNAs can be packaged in one vector than can Cas9's gRNAs.
  • Cas12a cleaves DNA 18-23 base pairs downstream from the PAM site. This means there is no disruption to the recognition sequence after repair in a cell, and so Cas12a enables multiple rounds of DNA cleavage.
  • Cas12a can be used for DNA assembly or ligation that is much more target-specific than traditional restriction enzyme cloning.
  • Cas12a can be directed to target any DNA sequence for cleavage by manipulating the nucleotide sequence of the gRNA.
  • step b) comprises fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence using (synthetic) Cas9 or Cas12a.
  • Methods for manipulating a gRNA sequence to direct a CRISPR-Cas nuclease to target a desired DNA sequence for cleavage are known in the art (see, for example, Jinek M, et al. (2012) Science 337: 816-821; and Kim H et al., (2017) Nature Communications 8: 14406).
  • designing a suitable gRNA and directing a CRISPR-Cas nuclease to target a desired DNA sequence for cleavage is within the ambit of the skilled person.
  • the fragments are ligated.
  • ligating involves the joining of separate DNA fragments.
  • the DNA fragments may be blunt ended, or may have compatible overhangs (also termed sticky overhangs or sticky ends) such that the overhangs can hybridise with each other.
  • the joining of the DNA fragments may be enzymatic, with a ligase enzyme, DNA ligase.
  • a non-enzymatic ligation may also be used, as long as DNA fragments are joined, i.e. forming a covalent bond.
  • a phosphodiester bond between the hydroxyl and phosphate group of the separate strands is formed.
  • because a fragment comprising a target nucleotide sequence may be crosslinked to multiple other DNA fragments, more than one DNA fragment may be ligated to the fragment comprising the target nucleotide sequence. This may result in combinations of DNA fragments which are in proximity of each other as they are held together by the crosslinks. Different combinations and/or orders of the DNA fragments in ligated DNA fragments may be formed.
  • the recognition sequence of the restriction enzyme or site-directed nuclease is known, which makes it possible to identify the fragments, as the remains of recognition sequences, or reconstituted recognition sequences, may indicate the separation between different DNA fragments.
  • the ligation step c) may be performed in the presence of an adaptor, ligating adaptor sequences in between fragments.
  • the adaptor may be ligated in a separate step. This is advantageous because the different fragments can be easily identified by identifying the adaptor sequences which are located in between the fragments. For example, in case DNA fragment ends were blunt ended, the adaptor sequence would be adjacent to each of the DNA fragment ends, indicating the boundary between separate DNA fragments.
  • the term “adaptor” refers to a short double-stranded oligonucleotide molecule with a limited number of base pairs, e.g. about 10 to about 30 base pairs in length, which are designed such that they can be ligated to the ends of fragments.
  • Adaptors are generally composed of two synthetic oligonucleotides which have nucleotide sequences which are partially complementary to each other. When mixing the two synthetic oligonucleotides in solution under appropriate conditions, they will anneal to each other forming a double-stranded structure.
  • one end of the adaptor molecule may be designed such that it is compatible with the end of a restriction fragment and can be ligated thereto; the other end of the adaptor can be designed so that it cannot be ligated, but this need not be the case, for instance when an adaptor is to be ligated in between DNA fragments.
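When adaptors are ligated in between fragments, the adaptor sequence marks the boundary between separate DNA fragments in a ligation product. A minimal sketch of that idea follows; the adaptor sequence and fragment sequences are invented for illustration, and the function ignores reverse-complement orientation and sequencing errors.

```python
def split_on_adaptor(ligation_product: str, adaptor: str) -> list[str]:
    """Split a ligation product into its constituent fragments, using exact
    matches to a known adaptor sequence as fragment boundaries."""
    # Empty strings (e.g. from an adaptor at the very end) are dropped.
    return [part for part in ligation_product.split(adaptor) if part]


# Hypothetical adaptor and fragments, for illustration only.
adaptor = "TTAGGC"
product = "ACGTAC" + adaptor + "GGATCA" + adaptor + "TTTT"
print(split_on_adaptor(product, adaptor))
```

The same approach could be applied with the recognition sequence itself as the delimiter when no adaptor is used, with the caveat that short recognition sequences may also occur by chance inside fragments produced by incomplete digestion.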
  • the crosslinking is reversed in step d), which results in a pool of ligated DNA fragments that comprise two or more fragments.
  • a subpopulation of the pool of ligated DNA fragments comprises a DNA fragment which comprises the target nucleotide sequence.
  • the structural/spatial fixation of the DNA is released and the DNA sequence becomes available for subsequent steps, e.g. amplification and/or sequencing, as crosslinked DNA may not be a suitable substrate for such steps.
  • the subsequent step e) may be performed after the reversal of the crosslinking, however, step e) may also be performed while the ligated DNA fragments are still in the crosslinked state.
  • step e) is performed prior to step d).
  • Step e) may be performed following step c) and prior to step d).
  • the method may comprise the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); e) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); d) reversing the crosslinking in the mixture of digested and undigested crosslinked DNA generated in step e); f) (i) degrading the digest
  • step d enriching the mixture of digested and undigested DNA generated in step d) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • reversing crosslinking comprises breaking the crosslinks such that the DNA that has been crosslinked is no longer crosslinked and is suitable for subsequent amplification and/or sequencing steps. For example, performing a proteinase K treatment on a sample DNA that has been crosslinked with formaldehyde will digest the protein present in the sample. Because the crosslinked DNA is connected indirectly via protein, the protease treatment in itself may reverse the crosslinking between the DNA. However, the protein fragments that remain connected to the DNA may hamper subsequent sequencing and/or amplification. Hence, reversing the connections between the DNA and the protein may also result in “reversing crosslinking”. The DNA-crosslinker-protein connection may be reversed through a heating step, for example by incubating at 70°C.
  • any “reversing crosslinking” method may be contemplated wherein the DNA strands that are connected in a crosslinked sample become suitable for sequencing and/or amplification.
  • the ligation products generated in step d) are then digested with at least one nuclease (e.g. a site-directed nuclease such as a CRISPR-Cas nuclease).
  • the term “digesting” means a process by which a polynucleotide chain is cleaved by a nuclease at a specific site or specific sites which are dictated by the nucleotide sequence.
  • the specific site(s) is/are fusion sequences as described herein.
  • the digesting step provides linearized DNA which has been cleaved by the nuclease. Said linearized DNA may subsequently be degraded using an exonuclease as described herein.
  • ligation products consisting of DNA fragments that were originally adjacent to each other in the linear sample DNA sequence are often over-represented in the sequenced material. Some ligation products may also comprise DNA fragments that were originally immediately adjacent to each other in the linear sample DNA sequence, i.e. a sequence which originally occurred in the linear sample DNA sequence may be reformed during the method steps. Such ligation products comprising DNA fragments that were originally (immediately) adjacent to each other are unlikely to contain sequences of the genomic region of interest originating from meaningful physical distances. The deselection of these less valuable ligation products will thus increase the efficiency with which an entire region of interest is enriched and can be sequenced.
  • the nuclease may be specifically selected or designed to target one or more known fusion sequences resulting from ligation events of the DNA fragment containing the target nucleotide sequence and DNA fragments that originally occurred in immediate proximity to the target nucleotide sequence in the original sequence (e.g. in the genomic region of interest or, in the context of transgene insertion site sequencing, in the vector sequence).
  • the enrichment and sequencing of ligation products containing these fusion sequences of ligation products which are less valuable can be minimised, or preferably prevented.
  • some partially digested DNA sequences comprising the recognition sequence may remain following the non-random fragmentation step b).
  • a partially digested DNA sequence comprising two DNA fragment ends and potential DNA fragments A and B separated by the recognition sequence may remain following step b).
  • the two DNA fragment ends will ligate to each other or to another DNA fragment end in ligation step c), resulting in ligation products which comprise a sequence that was originally present in the linear sample DNA sequence and was not cleaved during the fragmenting step b).
  • the nuclease may digest ligation products comprising DNA fragments that were originally adjacent to each other in the linear sample DNA sequence, ligation products comprising DNA fragments that were originally immediately adjacent to each other in the linear sample DNA sequence (i.e. reconstituted DNA sequences) and/or DNA sequences which were not cleaved in fragmenting step b).
  • This digestion step will result in the digestion of the relatively large number of ligation products that, without this step, result in high sequencing coverage across sequences in immediate vicinity to the target nucleotide sequence.
  • the methods of the invention thus provide more efficient sequencing of an entire genomic region of interest by enabling more preferential sequencing of DNA fragments that originated from larger physical distances from the target nucleotide sequence.
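The deselection logic described above can be sketched by numbering fragments in their original linear order and flagging ligation products that contain a junction between fragments that were adjacent in that order. The integer fragment indices, the `max_index_gap` parameter and the function names are hypothetical simplifications; the actual method targets such junctions by their fusion sequences rather than by index.

```python
def junctions(product: tuple[int, ...]) -> list[tuple[int, int]]:
    """Pairs of consecutive fragments within one ligation product."""
    return list(zip(product, product[1:]))


def has_targeted_junction(product: tuple[int, ...], max_index_gap: int = 1) -> bool:
    """True if any junction joins two different fragments that lay within
    `max_index_gap` positions of each other in the linear sequence."""
    return any(0 < abs(a - b) <= max_index_gap for a, b in junctions(product))


# Each tuple lists fragment indices in the order they were ligated.
products = [(5, 6), (5, 40), (4, 5, 90), (12, 77)]
kept = [p for p in products if not has_targeted_junction(p)]
print(kept)
```

Products that survive this filter join fragments from larger linear distances, which corresponds to the more informative ligation products that the digestion step is intended to leave undigested.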
  • By ligation event is meant the ligation of two or more DNA fragments to form a ligation product.
  • the term “fusion sequence” refers to the sequence of the site at which two DNA fragments are ligated, i.e. the bridging DNA sequence joining the two DNA fragments in a ligation product.
  • the fusion sequence comprises the recognition site of step b) and a first and second flanking sequence, each of which corresponds to a portion of one of the DNA fragments forming the ligation product (i.e. the portion of the DNA fragment which is immediately adjacent to the recognition sequence).
  • the first and second flanking sequences correspond to a portion of a different DNA fragment forming the ligation product.
  • each fusion sequence is specific for a pair of DNA fragments which have been ligated. In this way, particular ligation products can be targeted via the fusion sequence.
  • the recognition sequence is known.
  • the potential ligation products generated in step c) for a given genomic region of interest or a given viral vector sequence are known and the associated potential fusion sequences are also known.
  • specific ligation products which are less valuable can be targeted.
  • ligation products comprising DNA fragments from a corresponding allele which is less valuable can be targeted.
  • the known sequence of the genomic region of interest (comprising the allele which is less valuable, e.g. a wild type allele) is exploited in the targeting steps.
  • ligation products consisting (exclusively) of DNA fragments from the vector/transgene sequence can be targeted. This permits the targeting, predominantly, of ligation products formed from episomal copies of the viral vector and a small number of ligation products formed from integrated copies of the viral vector which are less valuable since they do not inform on the vector integration site.
  • the fusion sequence of digesting step e) is of greater length than the recognition sequence of fragmenting step b).
  • the fusion sequence of digesting step e) comprises the recognition sequence of fragmenting step b) and at least 1 (suitably, 2, 3, 4, 5, 6, 7, 8, 9, 10 or 11) additional nucleotide at each end of the recognition sequence.
  • the additional nucleotide(s) at each end of the recognition sequence are also termed “flanking sequences”.
  • the first and second flanking sequences may be of the same length or of two different lengths.
  • the first and second flanking sequences are of sufficient length that the fusion sequence is specific for a particular ligation product.
  • the fusion sequence is of 15 to 25 nucleotides in length, preferably of 20 nucleotides in length.
  • the fusion sequence is of 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 3 to 12 nucleotides in length and the fusion sequence is of 15 to 25 nucleotides in length.
  • the recognition sequence of fragmenting step b) is of 4 to 8 nucleotides in length and the fusion sequence is of 20 nucleotides in length.
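The structure of a fusion sequence (first flanking sequence, recognition sequence, second flanking sequence) can be sketched as a string construction. This is an illustration under stated assumptions: the ligation junction is taken to reconstitute the full recognition sequence, the fragment strings are taken to exclude the recognition sequence bases, and the flanks are split as evenly as possible, although the text allows flanks of different lengths. All names and sequences are hypothetical.

```python
def fusion_sequence(left_fragment: str, right_fragment: str,
                    site: str, total_length: int = 20) -> str:
    """Build first_flank + site + second_flank with the requested total length."""
    flank_total = total_length - len(site)
    if flank_total < 2:
        raise ValueError("need at least 1 flanking nucleotide on each side")
    first_len = flank_total // 2
    second_len = flank_total - first_len
    if len(left_fragment) < first_len or len(right_fragment) < second_len:
        raise ValueError("fragments are shorter than the required flanks")
    first_flank = left_fragment[-first_len:]    # end of the first fragment
    second_flank = right_fragment[:second_len]  # start of the second fragment
    return first_flank + site + second_flank


# Hypothetical fragments joined across a 4-nucleotide recognition sequence.
fusion = fusion_sequence("AAAAACCCCCTT", "GGTTTTTAAAAA", "GATC")
print(fusion, len(fusion))
```

Because the flanks come from two specific fragments, the resulting 20-nucleotide sequence identifies one particular pair of ligated fragments, whereas the 4-nucleotide recognition sequence alone occurs at every junction.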
  • the digesting step e) uses a multiplex nuclease digestion which recognises a plurality of specific fusion sequences. In some embodiments, the first and second flanking sequences of each fusion sequence are from a different pair of separate DNA fragments of step b). Suitably, the plurality of specific fusion sequences are each present within a different ligation product.
  • a multiplex nuclease digestion may be performed which recognises all possible fusion sequences resulting from the ligation of two DNA fragments from the genomic region of interest or, in the context of transgene integration site sequencing, from the ligation of two DNA fragments of viral vector origin.
  • a multiplex CRISPR-Cas nuclease digestion is performed specific for all possible fusion sequences resulting from the ligation of DNA fragment ends of viral origin.
  • a multiplex CRISPR-Cas nuclease digestion is performed specific for all possible fusion sequences of viral origin, i.e. for the ligation products AB, AC, AD, BC, BD and CD.
  • ligation products comprising a sequence of viral origin (e.g. DNA fragment B) and a sequence from the host genome would not be digested.
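
The set of targeted ligation products named above (AB, AC, AD, BC, BD and CD) is simply every unordered pair of the four vector-derived fragments. A minimal Python sketch, assuming hypothetical fragment labels and ignoring which end of each fragment takes part in the ligation, could enumerate the targets and check which ligation products would be digested:

```python
from itertools import combinations

vector_fragments = ["A", "B", "C", "D"]   # fragments of viral vector origin
host_fragments = ["H1", "H2"]             # hypothetical host genome fragments

# Every unordered pair of vector fragments gives one fusion to target.
targeted = {a + b for a, b in combinations(vector_fragments, 2)}


def is_digested(frag1, frag2):
    """A ligation product is cut only if its fusion is in the targeted set."""
    return frag1 + frag2 in targeted or frag2 + frag1 in targeted


print(sorted(targeted))
print(is_digested("B", "D"), is_digested("B", host_fragments[0]))
```

In this simplified model a vector-vector product such as BD is digested, whereas a vector-host product such as B ligated to H1 is left intact.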
  • the term “recognises a fusion sequence” means that the nuclease is directed to the fusion sequence and cleaves at or near the fusion sequence. Thus, the nuclease may be specific for a fusion sequence.
  • the nuclease preferentially binds to the fusion sequence as opposed to another DNA sequence of the same length.
  • the nuclease may be designed to specifically target a known fusion sequence.
  • the nuclease of digesting step e) is a restriction enzyme. In some embodiments, the nuclease of digesting step e) is a site-directed nuclease. Suitably, a site-directed nuclease as described herein is used.
  • the site-directed nuclease for use in the digesting step e) is a CRISPR-Cas nuclease.
  • a CRISPR-Cas nuclease as described herein is used.
  • step e) comprises digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a site-directed nuclease which is specific for a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b).
  • the site-directed nuclease is a CRISPR-Cas nuclease.
  • the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the genomic region of interest.
  • the first and second flanking sequences are from two separate DNA fragments which occur immediately adjacent to one another in the linear DNA sequence of the genomic region of interest.
  • the first and second flanking sequences are from two separate DNA fragments which occur within the linear DNA sequence of the genomic region of interest. Accordingly, the first and second flanking sequences are from two separate DNA fragments which occur within a base pair (bp) distance of the linear DNA sequence of the genomic region of interest which corresponds to the length of the entire genomic region of interest.
  • the first and second flanking sequences may be from two separate DNA fragments which occur within up to 10 kb of one another in the linear genomic DNA sequence.
  • the first and second flanking sequences may be from two separate DNA fragments which occur within about 250 bp, 500 bp, 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb, 8 kb, 9 kb or 10 kb of one another in the linear genomic DNA sequence, preferably within about 5 kb of one another in the linear genomic DNA sequence.
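
The distance criterion above can be sketched as a simple filter over fragment positions. In the Python below, the function name, the fragment labels, the coordinates and the start-to-start definition of distance are all assumptions chosen for illustration; the 5 kb default reflects the preferred value stated above.

```python
from itertools import combinations


def fragment_pairs_within(starts, max_distance_bp=5000):
    """Return fragment pairs whose start coordinates lie within max_distance_bp.

    starts: dict mapping a fragment name to its start coordinate (bp) in the
    linear sequence. Distance is measured start to start (an assumption).
    """
    ordered = sorted(starts.items(), key=lambda item: item[1])
    return [
        (name1, name2)
        for (name1, pos1), (name2, pos2) in combinations(ordered, 2)
        if pos2 - pos1 <= max_distance_bp
    ]


# Made-up fragment start positions along a 10 kb region.
starts = {"A": 0, "B": 1200, "C": 4800, "D": 9500}
print(fragment_pairs_within(starts))
print(len(fragment_pairs_within(starts, max_distance_bp=10_000)))
```

Raising the threshold to the full length of the region includes every pair, which corresponds to designing fusion sequences for all possible ligation products within the region.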
  • the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the vector.
  • the first and second flanking sequences are from two separate DNA fragments which occur immediately adjacent to one another in the linear DNA sequence of the vector.
  • the first and second flanking sequences are from two separate DNA fragments which occur within the linear DNA sequence of the vector.
  • the first and second flanking sequences are from two separate DNA fragments which occur within a base pair (bp) distance of the linear DNA sequence of the vector which corresponds to the length of the entire vector sequence. For example, if the entire vector sequence comprising a target nucleotide sequence is 10 kb in length, then the first and second flanking sequences may be from two separate DNA fragments which occur within up to 10 kb of one another in the linear vector sequence.
  • the first and second flanking sequences may be from two separate DNA fragments which occur within about 250 bp, 500 bp, 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb, 8 kb, 9 kb or 10 kb of one another in the linear vector sequence, preferably within about 5 kb of one another in the linear vector sequence.
  • the method further comprises the steps of:
  • A’) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using the nuclease of step e);
  • B’) ligating the mixture of digested and undigested DNA generated in step A’); prior to step d).
  • the further steps A’) and B’) are repeated at least once.
  • DNA fragments have multiple opportunities to ligate to DNA fragments originating from greater physical distances in the crosslinked DNA sequence and this will help increase the ratio of undigested vs digested ligation products resulting from the final nuclease digestion step.
  • the method comprises degrading the digested DNA generated in step e) using an exonuclease; and/or enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence.
  • exonuclease means an enzyme that cleaves successive nucleotides, one at a time, from the end of a polynucleotide chain. The cleavage can occur at either the 5’ or the 3’ end of the polynucleotide chain.
  • the term “degrading the digested DNA” means breaking the bonds between the nucleotides in the digested polynucleotide chain (i.e. in the digested DNA) to completely cleave the digested polynucleotide chain into nucleotides.
  • the digested DNA is linear. Therefore, the digested DNA has both a 5’ and a 3’ end such that the exonuclease can cleave the DNA as described herein.
  • step f) (i) leads to the degradation of the digested DNA generated in step e), such that the remaining undigested DNA is processed in the subsequent steps.
  • the method comprises degrading the digested DNA generated in step e) using an exonuclease.
  • universal adapters that prevent exonuclease based degradation of linear DNA molecules are ligated to ligated DNA sequences (e.g. ligated DNA sequences generated in step c)) prior to the digestion step e) and degradation step f).
  • the method comprises degrading the digested DNA generated in step e) using an exonuclease; and enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence.
  • the term “enriching ... for DNA comprising the target nucleotide sequence” means a process by which the (absolute) amount and/or proportion of the DNA comprising the target nucleotide sequence is increased compared to the amount and/or proportion of DNA comprising the target nucleotide sequence in the starting material (i.e. in the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i)).
  • enrichment by amplification increases the amount and proportion of DNA comprising the target nucleotide sequence. Both enrichment by degradation and capture-based enrichment increase the proportion of DNA comprising the target nucleotide sequence.
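
The distinction drawn above between the amount and the proportion of target DNA can be made concrete with a small worked example. The molecule counts, the 90% removal figure and the 1000-fold amplification figure below are invented for illustration and are not taken from the source.

```python
def proportion(target, other):
    """Fraction of molecules that carry the target nucleotide sequence."""
    return target / (target + other)


# Made-up starting material: 100 target molecules among 9,900 others.
target, other = 100, 9900
start = proportion(target, other)

# Degradation or capture: 90% of non-target DNA is removed (assumed figure),
# so the target amount is unchanged but its proportion rises.
after_removal = proportion(target, other // 10)

# Amplification: target copies are multiplied 1000-fold (assumed figure),
# so both the amount and the proportion rise.
amplified_target = target * 1000
after_amplification = proportion(amplified_target, other)

print(f"{start:.3f} {after_removal:.3f} {after_amplification:.3f}")
```

Under these assumed figures the proportion of target DNA rises from 1% to roughly 9% after removal of non-target DNA, and to roughly 91% after amplification, while only amplification changes the absolute amount of target DNA.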
  • the methods of the invention are compatible with a wide variety of enrichment approaches.
  • the DNA generated in step e) or step f) (i) comprising the target nucleotide sequence may be amplified using at least one oligonucleotide primer which hybridises to the target nucleotide sequence, and optionally at least one additional primer which hybridises to the at least one adaptor as the step of ligating an adaptor is optional.
  • the enriching step f) (ii) comprises amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i).
  • oligonucleotide primers or “primers” are used interchangeably, in general, to refer to strands of nucleotides which can prime the synthesis of DNA. DNA polymerase cannot synthesize DNA de novo without primers. A primer hybridises to the DNA, i.e. base pairs are formed. Nucleotides that can form base pairs, that are complementary to one another, are e.g. cytosine and guanine, thymine and adenine, adenine and uracil, guanine and uracil.
  • the complementarity between the primer and the existing DNA strand does not have to be 100%, i.e. not all bases of a primer need to base pair with the existing DNA strand.
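
To illustrate that primer complementarity need not be 100%, the following Python sketch scores the fraction of primer bases that pair with a template site, using the base pairs listed above. The function name and the example sequences are hypothetical, and the model ignores thermodynamics and the position of mismatches, both of which matter in practice.

```python
# Base pairs named in the text, listed in both directions.
PAIRS = {
    ("C", "G"), ("G", "C"),
    ("A", "T"), ("T", "A"),
    ("A", "U"), ("U", "A"),
    ("G", "U"), ("U", "G"),
}


def complementarity(primer, template_site):
    """Fraction of primer bases that pair with the template site.

    Both sequences are written 5'->3'. Strands pair antiparallel, so the
    template site is read in reverse when compared with the primer.
    """
    if len(primer) != len(template_site):
        raise ValueError("primer and template site must have equal length")
    paired = sum(
        (p, t) in PAIRS
        for p, t in zip(primer.upper(), reversed(template_site.upper()))
    )
    return paired / len(primer)


print(complementarity("ATGCGT", "ACGCAT"))  # fully complementary site
print(complementarity("ATGCGT", "ACGCAA"))  # one mismatched position
```

The first call scores a perfectly complementary site and the second a site with a single mismatch, which still pairs at five of six positions.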
  • nucleotides are incorporated using the existing strand as a template (template directed DNA synthesis).
  • the synthetic oligonucleotide molecules which are used in an amplification reaction may be referred to as “primers”.
  • amplifying refers to a polynucleotide amplification reaction, namely, a population of polynucleotides that are replicated from one or more starting sequences. Amplifying may refer to a variety of amplification reactions, including but not limited to polymerase chain reaction (PCR), linear polymerase reactions, nucleic acid sequence-based amplification, rolling circle amplification and like reactions.
  • amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) comprises using at least one primer which hybridises to the DNA fragment comprising the target nucleotide sequence generated in step b).
  • this step further comprises using a plurality of primers which each hybridises to the DNA sequence of one of the DNA fragments comprising the one or more further target nucleotide sequences generated in step b).
  • At least one primer directs amplification towards the recognition sequence of step b).
  • an identifier is included in the at least one primer.
  • identifier (also termed a sequence identifier or tag) refers to a short sequence that can be added to an adaptor or a primer, included in its sequence, or otherwise used as a label to provide a unique identifier.
  • Typical examples are ZIP sequences, known in the art as commonly used tags for unique detection by hybridization (Iannone et al. Cytometry 39:131-140, 2000). Identifiers are useful according to the invention, as by using such an identifier, the origin of a sample (e.g. a PCR sample) can be determined upon further processing.
  • the different nucleic acid samples may be identified using different identifiers. For instance, as according to the invention sequencing may be performed using high throughput sequencing, multiple samples may be combined. Identifiers may then assist in identifying the sequences corresponding to the different samples. Identifiers may also be included in adaptors for ligation to DNA fragments assisting in DNA fragment sequences identification. Identifiers preferably differ from each other by at least two base pairs and preferably do not contain two identical consecutive bases to prevent misreads. The identifier function can sometimes be combined with other functionalities such as adaptors or primers.
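
The two preferences stated above for identifiers (a difference of at least two bases between any two identifiers, and no two identical consecutive bases) can be checked mechanically. The Python below is a minimal sketch under the assumption that all identifiers have the same length; the function names and example identifiers are hypothetical.

```python
from itertools import combinations


def has_repeat(seq):
    """True if the sequence contains two identical consecutive bases."""
    return any(a == b for a, b in zip(seq, seq[1:]))


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("identifiers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def valid_identifier_set(identifiers, min_diff=2):
    """Check both preferences from the text for a set of identifiers."""
    if any(has_repeat(identifier) for identifier in identifiers):
        return False
    return all(
        hamming(a, b) >= min_diff for a, b in combinations(identifiers, 2)
    )


print(valid_identifier_set(["ACGT", "TGCA", "CATG"]))
print(valid_identifier_set(["ACGT", "ACGA"]))  # differ at one position only
print(valid_identifier_set(["AACG", "TGCA"]))  # contains the repeat "AA"
```

Requiring at least two differences means that a single sequencing error in an identifier cannot turn it into another valid identifier of the set.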
  • step f) primers are used carrying a moiety, e.g. biotin, for the optional purification of (amplified) ligated DNA fragments through binding to a solid support. Capture-based enrichment using the moiety may then be performed as described below in the context of a hybridisation probe.
  • the enriching step f) (ii) may comprise using inverse PCR on a circular template.
  • the method may comprise the steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) (i) reversing the crosslinking in the ligated crosslinked DNA generated in step c); (ii) circularising the ligated DNA generated in step d) (i); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a site-directed nuclease which is specific for a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) optionally, degrading the digested DNA generated in step e) using an exonuclease; and (ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence using PCR with inverse primers specific for the target nucleotide sequence; and g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • the enriching step f) (ii) may comprise using linear PCR.
  • a site-directed nuclease-based enrichment as described herein can be performed, after which the linear amplification is performed with primers at either end of the digestion site.
  • universal adapters are ligated to both ends of the ligated DNA fragments (e.g. ligated DNA fragments generated in step c)), followed by step e) and optionally step f) (i), before enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence using PCR with combinations of target nucleotide sequence-specific primers and primers specific for the adapters.
  • the enriching step f) (ii) comprises capture-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
  • the undigested DNA fragments comprising the target nucleotide sequence may be captured with a hybridisation probe (also termed a capture probe) that hybridises to a target nucleotide sequence.
  • the hybridisation probe may be attached directly to a solid support, or may comprise a moiety, e.g. biotin, to allow binding to a solid support suitable for capturing biotin moieties (e.g. beads coated with streptavidin).
  • the undigested DNA fragments comprising a target nucleotide sequence are captured thus allowing separation of ligation products comprising the target nucleotide sequence from ligation products not comprising the target nucleotide sequence.
  • such a capture step allows enrichment for ligation products comprising the target nucleotide sequence.
  • at least one capture probe for the target nucleotide sequence may be used.
  • more than one probe may be used for multiple target nucleotide sequences (e.g. at least one probe for each target nucleotide sequence may be used).
  • one primer corresponding to 1 of 5 target nucleotide sequences may be used as a capture probe (A, B, C, D or E).
  • the 5 primers may be used in a combined fashion (A, B, C, D and E) to capture the genomic region of interest.
  • a capture probe may be used that hybridises to an adaptor sequence comprised in (amplified) undigested DNA fragments.
  • an amplification step and capture step are combined, e.g. first performing a capture step and then an amplification step or vice versa.
  • an exonuclease degradation step and capture-based enrichment step are combined, i.e. first performing an exonuclease degradation step and then a capture-based enrichment step.
  • Site-directed nuclease digestion can also be used for the selective amplification of ligation products of interest. Site-directed digestion can be used to selectively add adaptors to and enable the amplification of linear ligation products comprising the target nucleotide sequence.
  • Site-directed nuclease digestion can also be used for the selective linearization of ligation products of interest. If a proximity ligation protocol (e.g. a TLA protocol as described herein) is used for the generation of circular DNA template (e.g. circular TLA template), a site-directed nuclease can be used to selectively linearize TLA ligation products of interest. Once linearized, these sequences can be selectively enriched using PCR or capture-based approaches as described above. They can also be selectively sequenced, for example, with nanopore sequencing approaches which will very preferentially sequence linearized DNA molecules.
  • the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested DNA using (inverse) PCR or capture-based enrichment.
  • an amplification step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then an amplification step or vice versa.
  • a capture step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then a capture step or vice versa.
  • the number of alleles of interest (for example, the total number of integrated copies of a transgene in a population of cells) can be limited.
  • the term “whole genome amplification” means the use of a non-specific amplification technique which generates an amplified product that is completely representative of the initial starting material. Thus, when a whole genome amplification step is applied to a mixture of unique ligation products resulting from the conventional TLA protocol, each unique ligation product is expected to be amplified.
  • the method further comprises the step of amplifying the whole genome using the digested DNA generated in step e), prior to step f).
  • the following enrichment step can include a new round of TLA (i.e. repeating steps a) to c) after step e), therefore crosslinking, fragmenting, ligating, reversing the crosslinking again) prior to degrading and/or enriching step f).
  • Such an approach can, for instance, be useful in populations of cells that contain episomal copies of a vector / transgene sequence and a limited number of integration events.
  • an additional round of TLA on a larger number of copies of integration events will help increase the efficiency and completeness with which the rare integrated copies and integration sites can be sequenced.
  • the method further comprises repeating steps a) to d) following step e) and prior to step f).
  • Repeating the TLA steps could, for example, be applicable in the further analysis of the whole genome amplified product described above.
  • the methods of the invention may further comprise the step of amplifying the whole genome using the digested DNA generated in step e) followed by repeating steps a)-d), prior to step f).
  • each target nucleotide sequence is located in proximity to an instance of the recognition sequence of step b), i.e. in proximity to a DNA fragment end, and the fusion sequence used in the digestion step is specific for the same fragment end.
  • the invention can be applied in order to enable the enrichment of ligation products wherein only one fusion sequence at one end of a target nucleotide sequence remains undigested following site-directed nuclease-based digestions targeted at less valuable fusion sequences, e.g. the fusion sequence at the other end of the restriction fragment that comprises the target nucleotide sequence.
  • Ligation products comprising DNA fragments used as target nucleotide sequence in which structural changes have occurred will thus remain unfragmented in the digestion step e) at the fragment end in which the sequence change has occurred.
  • the DNA fragment comprising the fusion sequence between the transgene and the host genome will not be digested by a digestion step targeting less valuable transgene-transgene ligation products due to the fact that the fusion with the host genome sequence will result in a novel fragment end not targeted in the digestion step e).
  • ligation products which inform on the integration site will not be digested in digesting step e) and can thus be preferentially sequenced using the method of the invention.
  • the method comprises the additional step of amplifying the ligated DNA generated in step d) by whole genome amplification following reversing the crosslinking in step d), followed by steps f) (ii), e), optionally f) (i) and (g) in that order.
  • the method comprises the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) (i) reversing the crosslinking in the ligated crosslinked DNA generated in step c); (ii) amplifying the ligated DNA generated in step d) by whole genome amplification; f) (ii) enriching the amplified DNA generated in step d) (ii) for DNA comprising the target nucleotide sequence; e) digesting the enriched DNA generated in step f) (ii) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) optionally, degrading the digested DNA generated in step e) using an exonuclease; and g) determining at least part of the sequence of the remaining undigested DNA, preferably using high throughput sequencing.
  • step f) (ii) comprises amplifying the amplified DNA comprising the target nucleotide sequence generated in step d) (ii) using a primer pair specific for one DNA fragment generated in step b) wherein the primers comprise adapters that prevent exonuclease digestion.
  • the subsequent digestion in step e) of the linear PCR products comprising less valuable ligation products which comprise the specific DNA fragment for which the primers were designed enables the selective removal of less valuable PCR products prior to sequencing.
  • steps d) (ii) to f) (i) are repeated at least once following the first instance of step f) (i) and prior to step g), wherein the further enrichment step f) (ii) enriches for a different target nucleotide sequence.
  • the sequence of at least part of the undigested DNA generated in step f) is determined.
  • the sequence of the undigested DNA may be determined.
  • the undigested DNA may be prepared as a DNA sequencing library and/or sequenced according to standard protocols. Conventional whole genome sequencing or high-throughput sequencing (e.g. NGS) approaches can be used. Determining the sequence is preferably performed using high throughput sequencing technology, as this is more convenient and allows a high number of sequences to be determined to cover the complete genomic region of interest.
  • step g) is performed using whole genome sequencing.
  • the genomic region of interest comprises a transgene integration site, it may be desirable to sequence the whole genome.
  • step g) comprises determining at least part of the sequence of the undigested DNA comprising the target nucleotide sequence.
  • step g) comprises determining the whole sequence of the undigested DNA comprising the target nucleotide sequence.
  • the sequence of at least part of the undigested DNA comprising the target nucleotide sequence may be determined.
  • the sequence of the undigested DNA comprising the target nucleotide sequence may be determined.
  • sequencing refers to determining the order of nucleotides (base sequences) in a nucleic acid sample, e.g. DNA or RNA.
  • Many techniques are available such as Sanger sequencing and High throughput sequencing technologies such as offered by Roche, Illumina and Thermo Fisher.
  • the step of determining the sequence of undigested DNA fragments preferably comprises high throughput sequencing.
  • High throughput sequencing methods are well known in the art, and in principle any method may be contemplated to be used in the invention. High throughput sequencing technologies may be performed according to the manufacturer’s instructions (as e.g. provided by Roche, Illumina or Thermo Fisher).
  • sequencing adaptors may be ligated to the (amplified) undigested DNA fragments.
  • when the linear or circularized fragment is amplified, for example by using PCR as described herein, the amplified product is linear, allowing the ligation of the adaptors.
  • Suitable ends may be provided for ligating adaptor sequences (e.g. blunt, complementary staggered ends).
  • primer(s) used for PCR or other amplification method may include adaptor sequences, such that amplified products with adaptor sequences are formed in the amplification step f) (ii).
  • the circularized fragment may be fragmented, preferably by using for example a restriction enzyme in between primer binding sites for the inverse PCR reaction, such that DNA fragments ligated with the DNA fragment comprising the target nucleotide sequence remain intact.
  • Sequencing adaptors may also be included in step c) and the steps i)-iii) of the methods of the invention.
  • long reads may be generated in the high throughput sequencing method used. Long reads may allow reading across multiple DNA fragments within undigested DNA fragments (which contain ligated DNA fragments). This way, DNA fragments of step b) may be identified. DNA fragment sequences may be compared to a reference sequence and/or compared with each other.
  • short reads may also be contemplated to read even shorter sequences, for instance, short reads of 50-100 nucleotides. In case a standard sequencing protocol would be used, this may mean that the information regarding the undigested DNA fragments may be lost. With short reads it may not be possible to identify a complete DNA fragment sequence. In case such short reads are contemplated, it may be envisioned to provide additional processing steps such that separate ligated DNA fragments when fragmented, are ligated or equipped with identifiers, such that from the short reads, contigs may be built for the ligated DNA fragments. Such high throughput sequencing technologies involving short sequence reads may involve paired end sequencing.
  • the short reads from both ends of a DNA molecule used for sequencing may allow coupling of DNA fragments that were ligated. This is because two sequence reads can be coupled spanning a relatively large DNA sequence relative to the sequence that was determined from both ends. This way, contigs may be built for the (amplified) undigested DNA fragments.
  • the step of determining at least part of the sequence of the (amplified) undigested DNA sequence may comprise short sequence reads, but preferably longer sequence reads are determined such that DNA fragment sequences may be identified.
  • the primer sequence may be removed prior to the sequencing step g) (e.g. the high throughput sequencing step).
  • the enrichment step f) (ii) comprises:
  • amplifying the undigested DNA fragments comprising the target nucleotide sequence generated in step e) or step f) (i) using at least one primer that preferably (1) contains a 5’ overhang carrying a type III restriction enzyme recognition site and (2) hybridises to the target nucleotide sequence; or
  • amplifying the undigested DNA fragments comprising the target nucleotide sequence generated in step e) or step f) (i) using at least one primer that preferably (1) contains a 5’ overhang carrying a type III restriction enzyme recognition site and (2) hybridises to the target nucleotide sequence and at least one primer which hybridises to the at least one adaptor when an adaptor is present;
  • optionally, ligating double-stranded adaptor sequences needed for next generation sequencing, prior to step g).
  • a contig may be built of the genomic region of interest.
  • overlapping reads may be obtained from which the genomic region of interest may be built.
  • the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step g).
  • a contig is used in connection with DNA sequence analysis, and refers to reassembled contiguous stretches of DNA derived from two or more DNA fragments having contiguous nucleotide sequences.
  • a contig may be a set of overlapping DNA fragments that provides a (partial) contiguous sequence of a genomic region of interest.
  • a contig may also be a set of DNA fragments that, when aligned to a reference sequence, may form a contiguous nucleotide sequence.
  • the term “contig” encompasses a series of (ligated) DNA fragment(s) which are ordered in such a way as to have sequence overlap of each (ligated) DNA fragment(s) with at least one of its neighbours.
  • the linked or coupled (ligated) DNA fragment(s) may be ordered either manually or, preferably, using appropriate computer programs such as FPC, PHRAP, CAP3 etc, and may also be grouped into separate contigs.
  • when in step b) a plurality of subsamples is generated, using different restriction enzymes or site-directed nucleases, overlapping reads will also be obtained. By increasing the plurality of subsamples, the number of overlapping fragments will increase, which may increase the reliability of the contig of the genomic region of interest that is built. From these determined sequences which may overlap, a contig may be built. Alternatively, if sequences do not overlap, e.g. when a single restriction enzyme may have been used in step b), alignment of (undigested) DNA fragments with a reference sequence may allow a contig of the genomic region of interest to be built.
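
As a rough sketch of reference-guided contig building, the Python below merges fragments that have already been aligned to a reference sequence into contiguous intervals. It is not one of the assembly programs named above (FPC, PHRAP, CAP3); the function name, the half-open coordinate convention and the example coordinates are assumptions for illustration, and the alignment step itself is not shown.

```python
def build_contigs(aligned_fragments):
    """Merge overlapping or abutting reference intervals into contigs.

    aligned_fragments: iterable of (start, end) reference coordinates,
    half-open, one per aligned DNA fragment.
    """
    contigs = []
    for start, end in sorted(aligned_fragments):
        if contigs and start <= contigs[-1][1]:
            # Fragment overlaps or touches the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # Gap in coverage: start a new contig.
            contigs.append([start, end])
    return [tuple(contig) for contig in contigs]


# Made-up alignments; the last fragment is separated by a coverage gap.
fragments = [(900, 1300), (0, 500), (2000, 2400), (450, 900)]
print(build_contigs(fragments))
```

Fragments that merely abut on the reference, as with a single restriction enzyme, are joined in this sketch, while a gap in coverage starts a separate contig.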
  • a contig is built for each ploidy.
  • the step of building a contig comprises the steps of:
  • 1) identifying the fragments of step b);
  • the step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step e) and coupling of the different ligation products to the identified fragments.
  • the invention may be used to provide for quality control of generated sequence information.
  • sequencing errors may occur.
  • a sequencing error may occur for example during the elongation of the DNA strand, wherein an incorrect (i.e. non-complementary to the template) base is incorporated in the DNA strand.
  • a sequencing error is different from a mutation, as the original DNA which is amplified and/or sequenced would not comprise that incorrect base.
  • DNA fragment sequences may be determined, with (at least part of) sequences of DNA fragments ligated thereto, which sequences may be unique. The uniqueness of the ligated DNA fragments as they are formed in step c) may provide for quality control of the determined sequence in step g).
  • the sequences of multiple genomic regions of interest may be determined.
  • the sequences of a plurality of genomic regions of interest are determined.
  • For each genomic region of interest, a target nucleotide sequence is provided. In the enrichment step, corresponding primer(s) may be designed for each target nucleotide sequence.
  • the multiple genomic regions of interest may be genomic regions of interest that may also overlap, thereby increasing the size of which the sequence may be determined. For instance, in case a sequence of a genomic region of interest comprising a target nucleotide sequence typically would comprise 1MB, combining partially overlapping genomic regions of interest, e.g.
  • Multiple target nucleotide sequences at defined distances within a genomic region of interest may also be used to increase the average coverage and/or the uniformity of coverage across the genomic region.
  • the genomic region of interest comprises one or more further target nucleotide sequences.
  • the genomic region of interest comprises 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 further target nucleotide sequences
  • a method for determining the sequence of a genomic region of interest comprising two target nucleotide sequences is provided.
  • the enrichment step now uses not one target nucleotide sequence, but two.
  • This method may involve the same steps as outlined above up until the enrichment step.
  • the enrichment step now uses not one target nucleotide sequence, but several.
  • different primers are used in a PCR reaction, one primer for each target nucleotide sequence.
  • the two primers will amplify the sequence in between the two primer binding sites provided that the primer binding sites have the right orientation.
  • Having a circularized DNA fragment may be advantageous as the chance for the two primer binding sites having the right orientation is higher as compared to a linear DNA fragment (two out of four orientations will amplify, as compared to one in four for a linear ligated DNA fragment).
  • the chance that combinations of primers will produce an amplicon is increased.
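  • By way of illustration only, the orientation argument above can be checked by enumerating the four possible orientation combinations of two primer binding sites. The sketch below assumes a simplified model that is not part of the original text: primer A binds upstream of primer B, "+" means the primer extends towards the right and "-" towards the left, and the function names are hypothetical.

```python
from itertools import product


def amplifies(orient_a, orient_b, circular):
    """Return True if primers A (upstream) and B (downstream) can yield an amplicon.

    '+' means the primer extends rightwards, '-' means it extends leftwards.
    """
    if orient_a == orient_b:
        # Both primers extend in the same direction: no exponential amplification.
        return False
    if orient_a == "+" and orient_b == "-":
        # Convergent primers amplify the segment between them.
        return True
    # Divergent primers ('-', '+') only face each other around a closed circle.
    return circular


def productive_fraction(circular):
    """Fraction of the four orientation combinations that can amplify."""
    combos = list(product("+-", repeat=2))
    return sum(amplifies(a, b, circular) for a, b in combos) / len(combos)
```

  • Under this model, `productive_fraction(False)` counts one productive combination out of four for a linear ligated DNA fragment, and `productive_fraction(True)` counts two out of four for a circularized DNA fragment.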
  • target nucleotides can be used for a gene of interest (e.g. a transgene).
  • a PCR may be performed by selecting a primer from one target nucleotide sequence (also referred to as viewpoint), e.g. target nucleotide sequence A with another, B.
  • a PCR may be performed using a primer from each target nucleotide sequence, A, B, C, D and E. As these target nucleotides are in physical proximity of each other, performing such an amplification will enrich for the genomic region of interest, provided that the primer binding sites are present in ligation products such that an amplicon can be generated.
  • determining the sequence of a genomic region of interest wherein the genomic region of interest comprises one or more further target nucleotide sequences, and wherein in the amplification step a primer is provided that hybridises with the target nucleotide sequence and one or more primers are provided for the corresponding one or more further target nucleotides, wherein the undigested DNA is amplified, linearized DNA is amplified or circularized DNA is amplified, using the primers.
  • an identifier may be included in at least one of the oligonucleotide primers of step f) (ii) when PCR is used. Identifiers may also be included in adaptor sequences, such as may be used for ligation in between fragments during the ligation steps c) and i)-iii). By including an identifier in the oligonucleotide primer, when analysing a plurality of samples or a plurality of subsamples of crosslinked DNA simultaneously, the origin of each sample may easily be determined. Samples or subsamples of crosslinked DNA may have been processed differently while the original sample of crosslinked DNA is the same, and/or samples of DNA may have been obtained for example from different organisms or patients. Identifiers allow the combination of differently processed samples when the processing of samples may converge, e.g. identical procedural steps are performed. Such convergence of processing may in particular be advantageous when the sequencing step g) involves high throughput sequencing.
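  • As a minimal sketch of how identifiers may be used when pooled samples are sequenced together, reads can be assigned to their sample of origin by the identifier at the start of each read. The identifier sequences, their length, their position at the 5' end of the read and the sample names below are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical identifier-to-sample table; the sequences are illustrative only.
IDENTIFIERS = {"ACGT": "sample_1", "TGCA": "sample_2"}
ID_LEN = 4  # assumed fixed identifier length at the start of each read


def demultiplex(reads):
    """Group reads by leading identifier and strip the identifier from each read."""
    bins = defaultdict(list)
    for read in reads:
        sample = IDENTIFIERS.get(read[:ID_LEN], "unassigned")
        bins[sample].append(read[ID_LEN:])
    return dict(bins)
```

  • Reads whose leading bases match no known identifier are kept in an "unassigned" group rather than discarded, so that they can be inspected separately.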
  • the methods of the invention are particularly advantageous in the sequencing of integrated transgene sequences and transgene integration sites in samples in which episomal copies of the vector sequence also occur.
  • Conventional targeted sequencing approaches and conventional physical proximity-based protocols cannot distinguish between integrated and episomal (i.e. non-integrated) copies of a vector/transgene sequence. This is a limitation in NGS-based analysis of gene-therapy products as the sequencing quality will often depend on the frequency with which integrations occur and whether these integrations result in the complete clean integration of vector sequences without structural changes.
  • the methods of the invention enable the targeted sequencing of integrated copies of the transgene sequence.
  • a target nucleotide sequence within the transgene is selected.
  • the site-directed nuclease digestion of (all) possible fusion sequences resulting from the ligation of the DNA fragment containing the target nucleotide sequence and other DNA fragments of the vector/transgene sequence results in the preferential enrichment and sequencing of ligation products comprising the DNA fragment containing the target nucleotide sequence and DNA fragments originating from the host genome. This can then result in complete sequence information across integrated vector sequences and vector integration sites.
  • the genomic region of interest comprises a transgene integration site.
  • the target nucleotide sequence may be any sequence of interest.
  • the target nucleotide sequence comprises a transgene or a portion thereof.
  • the target nucleotide sequence comprises a portion of the transgene which is adjacent to the recognition site of step b).
  • the portion of the transgene is of sufficient length to permit specific enrichment of sequences comprising the target nucleotide sequence.
  • the first and second flanking sequences are from two separate DNA fragments from the transgene and/or from the vector used to deliver the transgene.
  • the first and second flanking sequences are each from a separate DNA fragment from the vector; ii) the first flanking sequence is from a DNA fragment from the vector and the second flanking sequence is from a DNA sequence from the transgene; or iii) the first and second flanking sequences are each from a separate DNA fragment from the transgene.
  • the site-directed nuclease of digesting step e) may be specific for a vector-vector, vector-transgene or transgene-transgene ligation event, i.e. the site-directed nuclease of digesting step e) may be specific for ligation products consisting (exclusively) of DNA fragments from the vector sequence.
  • the recognition site of fragmenting step b) occurs once within the vector sequence.
  • the fragmenting step b) will then generate two DNA fragments from the vector sequence.
  • a site-directed nuclease which is specific for a single fusion sequence may then be used in step e).
  • the recognition site of fragmenting step b) may occur multiple times within the vector sequence.
  • the fragmenting step b) will then generate more than two DNA fragments from the vector sequence. For example, if the recognition site of fragmenting step b) occurs three times in the vector sequence, the fragmenting step b) will generate four DNA fragments from the vector sequence (e.g. A, B, C and D).
  • a multiplex site-directed nuclease approach which is specific for each fusion sequence within the ligation products consisting of DNA fragments of vector origin may then be used in step e).
  • a multiplex CRISPR digestion may be performed which is specific for all possible fusion sequences of viral origin, i.e. for the ligation products AB, AC, AD, BC, BD and CD.
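  • The enumeration of vector-derived ligation products to be targeted in the multiplex digestion can be sketched as follows. This is an illustrative sketch only: it assumes a linear vector sequence (so that n recognition sites give n + 1 fragments), it lists unordered fragment pairs without regard to which fragment end or orientation takes part in each ligation, and the function names are hypothetical.

```python
from itertools import combinations


def vector_fragments(n_sites):
    """Label the fragments of a linear vector sequence cut at n_sites recognition sites."""
    return [chr(ord("A") + i) for i in range(n_sites + 1)]


def fusion_pairs(fragments):
    """List every unordered pair of distinct fragments, e.g. 'AB' for A ligated to B."""
    return ["".join(pair) for pair in combinations(fragments, 2)]
```

  • For three recognition sites this gives the fragments A, B, C and D and the six pairs AB, AC, AD, BC, BD and CD. In practice the number of distinct fusion sequences may be larger, because each fragment has two ends and may be ligated in either orientation.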
  • the methods of the invention provide improved efficiency of detection of large structural changes. For example, if the site-directed nuclease digestions of step e) are designed to be specific for ligation events of DNA fragments spanning a 10kb region in the wild-type genome sequence, ligation products resulting from alleles in which structural changes have occurred (e.g. an insertion or translocation) within this 10kb region of interest will more likely contain ligation events that will not be digested. These will therefore be more preferentially sequenced compared to ligation events originating from wild-type alleles in the methods of the invention.
  • the genomic region of interest comprises a transgene integration site.
  • the target nucleotide sequence comprises an allele of the genomic region of interest or a portion thereof.
  • the portion of the allele is of sufficient length to permit specific enrichment of sequences comprising the target nucleotide sequence.
  • a target nucleotide sequence which is in proximity to, but not within, the sequence of the allele of interest may be selected. In this way, the methods of the invention can be performed without requiring sequence information of the allele of interest.
  • the first and second flanking sequences are from separate DNA fragments from the allele.
  • the site-directed nuclease of digesting step e) is specific for an allele-allele ligation event.
  • allele(s) means any of one or more alternative forms of a gene at a particular genomic locus.
  • alleles of a given gene are located at a specific location, or locus (loci plural) on a chromosome.
  • One allele is present on each chromosome of the pair of homologous chromosomes.
  • two alleles and thus two separate (different) genomic regions of interest may exist.
  • a size selection step may be performed prior to or after the enrichment step f) (ii), according to the methods of the invention.
  • a size selection step may be performed using gel extraction chromatography, gel electrophoresis or density gradient centrifugation, which are methods generally known in the art.
  • DNA is selected of a size between 20- 20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs.
  • a size separation step allows to select for (amplified) ligated DNA fragments in a size range that may be optimal for PCR amplification and/or optimal for the sequencing of long reads by next generation sequencing.
  • size selection involves techniques with which particular size ranges of molecules, e.g. (ligated) DNA fragments or amplified (ligated) DNA fragments, are selected. Techniques that can be used are for instance gel electrophoresis, size exclusion, gel extraction chromatography, but are not limited thereto, as long as molecules with a particular size can be selected, such a technique will suffice.
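  • In silico, the effect of a size selection step can be sketched as a simple filter on fragment lengths. The default limits below use the most preferred range of 100-3,000 base pairs; the function name and the choice of inclusive limits are assumptions made for illustration.

```python
def size_select(lengths_bp, min_bp=100, max_bp=3000):
    """Keep only fragment lengths (in base pairs) within the inclusive size window."""
    return [n for n in lengths_bp if min_bp <= n <= max_bp]
```

  • For example, `size_select([50, 100, 1500, 3000, 5000])` retains the fragments of 100, 1500 and 3000 base pairs.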
  • the ligated DNA fragments generated in step c) may be further fragmented prior to digestion and enrichment. Such additional fragmentation (and ligation) steps may be performed prior to step b).
  • the further fragmentation may be a random or a non-random fragmentation as described herein. In this manner, rarer cutters (i.e. enzymes that fragment less frequently) can be used in step b) and more frequent (including random) fragmentation strategies can be used in a further fragmentation step.
  • the digesting step e) is then based upon a lower (more manageable) number of possible ligation events from the non-random fragmenting step (e.g. step b)) and the further fragmenting step ensures the entire genomic region of interest can be enriched.
  • the non-random fragmenting step (e.g. step b)) and the optional further fragmenting step may be aimed at obtaining ligated DNA fragments of a size which is compatible with the subsequent enrichment step (e.g. amplification step) and/or sequence determination step.
  • a further fragmenting step preferably with an enzyme, may result in ligated fragment ends which are compatible with the optional ligation of an adaptor.
  • the further fragmenting step may be performed after reversing the crosslinking, however, it is also possible to perform the further fragmenting step and/or ligation step while the DNA fragments are still crosslinked. At least one adaptor may be ligated to the obtained ligated DNA fragments generated in the further fragmenting step.
  • the ends of the ligated DNA fragments need to be compatible with ligation of such an adaptor.
  • the ligated DNA fragments may be linear DNA
  • ligation of an adaptor may provide for a primer hybridisation sequence.
  • the adaptor sequence ligated with ligated DNA fragments comprising the target nucleotide sequence will provide for DNA molecules which may be amplified using PCR as described herein.
  • Ligated adapter sequences can also be used as described herein to prevent exonuclease based digestion.
  • the DNA is further fragmented with a restriction enzyme or site-directed nuclease as described herein.
  • the method further comprises the step of: i) (a) further fragmenting the crosslinked DNA provided in step a) or the ligated crosslinked DNA generated in step c); and
  • i) (b) ligating the fragmented DNA generated in step i) (a), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step b) or prior to step d); or ii) (a) further fragmenting the undigested DNA generated in step f); and
  • ii) (b) optionally, circularising or ligating the fragmented DNA generated in step ii) (a), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
  • the method further comprises the steps of:
  • (ii) ligating the fragmented DNA generated in step (i), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step b).
  • the method further comprises the steps of: (i) further fragmenting the ligated crosslinked DNA generated in step c); and
  • (ii) ligating the fragmented DNA generated in step (i), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step d).
  • the method further comprises the steps of:
  • (ii) optionally, circularising or ligating the fragmented DNA generated in step (i), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
  • the further steps i) (a) and i) (b), ii) (a) and ii) (b), or (i) and (ii) are repeated at least once (suitably, once, twice, three times, four times or five times). In this way, ligated DNA fragments of a size which is compatible with the subsequent steps may be obtained.
  • the further steps i) (a) and (i) are performed using random fragmentation.
  • the further steps i) (a), ii) (a) and (i) are performed using non-random fragmentation at a recognition sequence.
  • both the non-random and further fragmenting steps comprise the use of restriction enzymes or site-directed nucleases
  • the recognition sequence of non-random fragmentation step b) is longer than the recognition sequence of the further fragmentation step.
  • the enzyme of step b) thus cuts at a lower frequency than the further fragmentation step. This means that the average DNA fragment size of the further fragmentation step is smaller than the average fragment size generated in step b). This way, in the non-random fragmenting step b), relatively large fragments are formed, which are subsequently ligated and the second enzyme of the further fragmentation step cuts more frequently than the enzyme of step b).
  • the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence of the further fragmenting step i) (a) or step ii) (a).
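  • The relationship between recognition sequence length and cutting frequency can be illustrated with a standard approximation that is not stated in the original text: if the four bases are assumed to occur independently and with equal frequency, a specific recognition sequence of n base pairs is expected once every 4^n base pairs. The function name below is hypothetical, and real genomes deviate from this idealised model.

```python
def expected_fragment_size(recognition_length):
    """Expected spacing (bp) between sites of a given length, assuming uniform random bases."""
    return 4 ** recognition_length
```

  • Under this assumption a 6 bp recognition sequence in step b) gives fragments of about 4,096 bp on average, whereas a 4 bp recognition sequence in the further fragmentation step gives fragments of about 256 bp, consistent with the enzyme of step b) cutting at the lower frequency.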
  • the step d) of reversing the crosslinking may be performed after steps i) (a) and i) (b) and prior to the fragmenting step b).
  • the flanking sequences of the fusion sequence resulting from non-random fragmentation of ligated DNA fragments in step b) are known since these sequences correspond to the original wild-type sequence of the genomic region of interest.
  • the circularisation of the fragmented DNA generated in step b) or the ligated DNA generated in step c) will result in a known fusion sequence that can be targeted by nuclease digestion in step e) as described herein.
  • Any fusion sequence(s) resulting from the circularisation of the fragmented DNA generated in step b) or the ligated DNA generated in step c) in which reshuffling of DNA fragments has occurred will comprise different flanking sequences compared to the flanking sequences of a fusion sequence within ligation products in which no DNA reshuffling has occurred.
  • the method comprises the steps of (wherein the steps are preferably performed in the order presented in this paragraph): a) (i) providing a sample of crosslinked DNA;
  • a further fragmentation step using a random fragmentation strategy may result in the cleavage of a small number of copies of the target nucleotide sequence. However, sufficient copies of the target nucleotide sequence would remain in order to provide complete sequence information across the genomic region of interest comprising the target nucleotide sequence.
  • the method further comprises the step of circularising the DNA of step d), prior to step e).
  • the obtained ligated DNA fragments of step d), of which crosslinking has been reversed, are next circularized. It may be advantageous to reverse crosslinking before the circularization, because it may be unfavourable to circularize DNA while it is crosslinked. However, circularization may also be performed while the ligated DNA fragments are crosslinked. It may even be possible that an additional circularization step is not required, as during the ligation step, circularized ligated DNA fragments are already formed, and hence the circularization step would occur simultaneously with step c). However, it is preferred to perform an additional circularization step. Circularization involves the ligation of the ends of the ligated DNA fragments such that a closed circle is formed.
  • the circularized DNA comprising DNA fragments which comprise the target nucleotide sequence
  • reversing the crosslinking is required, as crosslinked DNA may hamper or prevent amplification.
  • two primers are used that hybridise to the target nucleotide sequence in an inverse PCR reaction. In this way, DNA fragments of the circularized DNA which are ligated to the DNA fragment comprising the target nucleotide sequence may be amplified.
  • a method for making a DNA sequencing library of a genomic region of interest comprising a target nucleotide sequence comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a site-directed nuclease which is specific for a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from
  • f) (ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence to enrich for undigested DNA comprising the target nucleotide sequence; and g) optionally, determining at least part of the sequence of the undigested DNA to provide at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • Steps a) to g) may be performed as described herein.
  • the method of the second aspect of the invention may be performed as described herein with respect to the method of the first aspect of the invention, except that step g) is optional.
  • the method of the second aspect of the invention may be performed as described herein with respect to the method of the third aspect of the invention, except that step i) is optional.
  • the methods of the invention further comprise the step of determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • DNA sequencing library means a sequencing-ready DNA library.
  • the methods of the invention generate a compatible library (e.g. an NGS compatible library) for sequencing applications.
  • a DNA sequencing library of a plurality of genomic regions of interest is made.
  • the invention provides a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a); c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence; f) circularising the fragmented DNA generated in step e); g) digesting the circularised DNA generated in step f) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step e) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA
  • h) (ii) enriching the mixture of digested and undigested DNA generated in step g) or the undigested DNA generated in step h) (i) for DNA comprising the target nucleotide sequence; and i) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • the method of the third aspect of the invention may be performed as described herein with respect to the first aspect of the invention.
  • the steps of the method of the third aspect of the invention can be carried out as described herein for the corresponding steps of the first aspect of the invention.
  • all embodiments of the invention described herein with respect to the first aspect of the invention are applicable to the third aspect of the invention.
  • methods are provided for identifying the presence or absence of a genetic mutation.
  • a method for identifying the presence or absence of a genetic mutation comprising the steps a)-g) of any of methods of the invention as described above, wherein contigs are built for a plurality of samples, comprising the further steps of: h) aligning the contigs of a plurality of samples; and i) identifying the presence or absence of a genetic mutation in the genomic regions of interest from the plurality of samples.
  • a method for identifying the presence or absence of a genetic mutation comprising the steps a)-g) of any of the methods of the invention as described above, comprising the further steps of: h) aligning the contig to a reference sequence; and i) identifying the presence or absence of a genetic mutation in the genomic region of interest.
  • Genetic mutations can be identified for instance by comparing the contigs of multiple samples. In case one (or more) of the samples comprises a genetic mutation, this may be observed as the sequence of the contig being different when compared to the sequence of the other samples, i.e. the presence of a genetic mutation is identified. In case no sequence differences between the contigs of the samples are observed, the absence of a genetic mutation is identified.
  • a reference sequence may also be used to which the sequence of a contig may be aligned. When the sequence of the contig of the sample is different from the reference sequence, a genetic mutation is observed, i.e. the presence of a genetic mutation is identified. In case no sequence differences between the contig of the sample or samples and the reference sequence are observed, the absence of a genetic mutation is identified.
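  • A minimal sketch of the comparison step is given below. It assumes that the contig and the reference sequence have already been aligned to equal length by a separate alignment program, reports only base substitutions, and uses 0-based positions; the function name is hypothetical.

```python
def find_differences(reference, contig):
    """Return (position, reference_base, contig_base) for each mismatch.

    Both sequences must already be aligned to the same length.
    """
    if len(reference) != len(contig):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, contig))
        if r != c
    ]
```

  • An empty result corresponds to the absence of a genetic mutation in the compared region, and a non-empty result to the presence of one or more sequence differences. The same function could be applied between the contigs of two samples instead of a contig and a reference sequence.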
  • a method is provided for identifying the presence or absence of a genetic mutation, according to any of the methods as described above, without the further step of building a contig.
  • Such a method comprises the steps a)-g) of any of the methods as described above and the further steps of: h) aligning the determined sequences of the (amplified) undigested DNA fragments generated in step g) to a reference sequence; and i) identifying the presence or absence of a genetic mutation in the determined sequences.
  • a method for identifying the presence or absence of a genetic mutation, wherein sequences of (amplified) undigested DNA fragments are determined for a plurality of samples, comprising the steps a)-g) of any of the methods as described above, comprising the further steps of: h) aligning the determined sequences (generated in step g)) of the (amplified) undigested DNA fragments of a plurality of samples; and i) identifying the presence or absence of a genetic mutation in the determined sequences.
  • Ratio of alleles or cells carrying a genetic mutation or transgene
  • a sample of crosslinked DNA is provided from heterogeneous cell populations (e.g. cells with different origin or cells from an organism which comprises normal cells and genetically mutated cells (e.g. cancer cells)).
  • for each genomic region of interest corresponding to a different genomic environment (e.g. different genomic environments from different alleles in a cell, or different genomic environments from different cells), contigs may be built.
  • the ratio of fragments or ligation products carrying an allele, transgene or genetic mutation may be determined, which may correlate to the ratio of alleles or cells carrying the genetic mutation or the transgene. Since the ligation of DNA fragments is a random process, the collection and order of DNA fragments that are part of the ligation products may be unique and represent a single cell and/or a single genomic region of interest from a cell.
  • identifying ligation products comprising the fragment with the allele, genetic mutation or transgene may also comprise identifying ligation products with a unique order and collection of DNA fragments.
  • the ratio of alleles or cells carrying a genetic mutation or transgene may be of importance in evaluation of therapies, e.g. in case patients are undergoing therapy for cancer, such as gene therapy. Cancer cells may carry a particular genetic mutation or cells may carry a particular transgene. The percentage of cells carrying such a mutation or the transgene may be a measure for the success or failure of a therapy.
  • methods are provided for determining the ratio of fragments carrying an allele, genetic mutation or transgene, and/or the ratio of ligation products carrying a genetic mutation.
  • a genetic mutation is defined as a particular genetic mutation or a selection of particular genetic mutations.
  • a method for determining the ratio of fragments carrying an allele, genetic mutation or transgene from a cell population suspected of being heterologous comprising the steps a)-g) of any of the methods as described above, comprising the further steps of: h) identifying the fragments of step b); i) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; j) determining the number of fragments carrying the allele, genetic mutation or transgene; k) determining the number of fragments not carrying the allele, genetic mutation or transgene;
  • a method for determining the ratio of ligation products carrying a fragment with an allele, genetic mutation or transgene from a cell population suspected of being heterologous comprising the steps a)-g) of any of the methods as described above, comprising the further steps of: h) identifying the fragments of step b); i) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; j) identifying the ligation products of step c) carrying the fragments with or without the allele, genetic mutation or transgene; k) determining the number of ligation products carrying the fragments with the allele, genetic mutation or transgene; l) determining the number of ligation products carrying the fragments without the allele, genetic mutation or transgene; m) calculating the ratio of ligation products carrying the allele, genetic mutation or transgene.
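  • The counting and calculating steps above can be sketched as follows. The text does not specify whether the ratio is expressed relative to the total number of ligation products or relative to the number of non-carrying ligation products; the sketch assumes the former, and the function name is hypothetical.

```python
def carrier_ratio(n_with, n_without):
    """Fraction of ligation products (or fragments) carrying the allele, mutation or transgene."""
    total = n_with + n_without
    if total == 0:
        raise ValueError("no ligation products were counted")
    return n_with / total
```

  • For example, 25 carrying and 75 non-carrying ligation products give `carrier_ratio(25, 75)`, i.e. a fraction of 0.25.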
  • the presence or absence of an allele, genetic mutation or transgene may be identified in step i) by aligning to a reference sequence and/or by comparing DNA fragment sequences of a plurality of samples.
  • an identified genetic mutation may be a single nucleotide polymorphism (SNP), an insertion, an inversion and/or a translocation.
  • the number of fragments and/or ligation products from a sample carrying the deletion and/or insertion may be compared with a reference sample in order to identify the deletion and/or insertion.
  • a deletion, insertion, inversion and/or translocation may also be identified based on the presence of chromosomal breakpoints in analyzed fragments.
  • the presence or absence of methylated nucleotides is determined in DNA fragments, ligated DNA fragments, and/or genomic regions of interest.
  • the DNA of steps a)-e) may be treated with bisulphite.
  • Treatment of DNA with bisulphite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected.
  • bisulphite treatment introduces specific changes in the DNA sequence that depend on the methylation status of individual cytosine residues, yielding single-nucleotide resolution information about the methylation status of a segment of DNA.
  • methylated nucleotides may be identified.
  • sequences from a plurality of samples treated with bisulphite may also be aligned, or a sequence from a sample treated with bisulphite may be aligned to a reference sequence.
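  • The principle of reading methylation status from bisulphite-treated DNA can be sketched in silico. The sketch assumes that converted uracil residues are read as thymine after amplification and sequencing, considers one strand only, and assumes that the treated read is already aligned base-for-base to the reference sequence; the function names are hypothetical.

```python
def bisulphite_convert(seq, methylated_positions):
    """Simulate bisulphite treatment followed by sequencing of one strand.

    Unmethylated C is read as T; C at a methylated position (0-based) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, converted):
    """Call the status of each reference cytosine from an aligned, converted read."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, converted)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = "methylated"
        elif read_base == "T":
            calls[i] = "unmethylated"
    return calls
```

  • A cytosine that remains C after treatment is called methylated, and a cytosine that reads as T is called unmethylated; any other base at a reference cytosine is left uncalled.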
  • Example 1 Illustrative example of targeted sequencing of integrated viruses and integration sites
  • Cultured cells are washed with PBS and fixated with PBS/10% FCS/2% formaldehyde for 10 minutes at RT. The cells are subsequently washed and collected, taken up in lysis buffer (50 mM Tris-HCl pH 7.5, 150 mM NaCl, 5 mM EDTA, 0.5% NP-40, 1% TX-100 and 1X Complete protease inhibitors (Roche #11245200)) and incubated for 10 minutes on ice. Cells are subsequently washed and taken up in Milli-Q water.
  • the fixated lysed cells are digested with a restriction enzyme of which, as shown in Figure 1A, the restriction site sequence occurs twice in the viral genome sequence. This means that the fragmentation of the viral genome with this restriction enzyme will result in three fragments (shown in different shades).
  • a target nucleotide sequence (outlined with a black box) is chosen in proximity to one of the restriction sites.
  • the chosen restriction enzyme will also fragment the human genome.
  • the restriction enzyme is heat-inactivated and subsequently a ligation step is performed using T4 DNA Ligase (Roche, #799009).
  • Fragmenting 2 and ligating 2
  • a second round of fragmentation and ligation is performed using a fragmentation approach which fragments more frequently than the restriction enzyme used in the first fragmentation step.
  • the digested and ligated sample is digested with a fragmenting strategy that, on average, will result in fragments of around 3000 bp in size (as a result of which the resulting circularised DNA can be amplified effectively).
  • the resulting linear DNA is circularised with a ligase enzyme.
  • the DNA fragment end containing the target nucleotide sequence generated in the first fragmentation will, given the physical proximity of only sequences originating from the same individual copy of the virus, have very preferentially ligated to fragment ends A, B and C as shown in Figure 1A.
  • the DNA fragment end containing the target nucleotide sequence generated in the first fragmentation will have much more frequently ligated to fragment ends originating from its integration site in the host genome, i.e. ligated to fragment ends from the host genome.
  • a multiplex CRISPR-Cas nuclease digestion is performed which is specific for the fusion sequences resulting from the ligation of the DNA fragment end containing the target nucleotide sequence with DNA fragment ends A, B, and C, i.e. the ligation products target nucleotide sequence-A, target nucleotide sequence-B and target nucleotide sequence-C.
  • An exonuclease treatment is performed to digest all linearized DNA generated in the selective digestion step.
  • the primers used for the PCR-enrichment are designed as inverted unique primers specific for the target nucleotide sequence.
  • the amplified undigested DNA can be library prepped and sequenced according to standard protocols.
  • Example 2 Illustrative example of whole genome sequencing
  • Cultured cells are washed with PBS and fixated with PBS/10% FCS/2% formaldehyde for 10 minutes at RT. The cells are subsequently washed and collected, taken up in lysis buffer (50mM Tris-HCl pH7.5, 150mM NaCl, 5mM EDTA, 0.5% NP-40, 1% TX-100 and 1X Complete protease inhibitors (Roche #11245200)) and incubated for 10 minutes on ice. Cells are subsequently washed and taken up in MilliQ.
  • the fixated lysed cells are digested with a restriction enzyme of which, as shown in Figure 2A, the restriction site sequence occurs twice in the viral genome sequence. This means that the fragmentation of the viral genome with this restriction enzyme will result in three fragments (shown in different shades).
  • a target nucleotide sequence (outlined in a black box) is chosen in proximity to one of the restriction sites.
  • the chosen restriction enzyme will also fragment the human genome. As shown in Figure 2B, multiple restriction sites will thus also occur in the human genome in proximity to any integration site of the virus.
  • the restriction enzyme is heat-inactivated and subsequently a ligation step is performed using T4 DNA Ligase (Roche, #799009).
  • Adapters are ligated to the ends of the ligation products.
  • the adapters used prevent exonuclease based degradation of these ends.
  • the viral vector / transgene fragment ends generated in the first fragmentation will, given the physical proximity of only sequences originating from the same individual copy of the virus, have very preferentially ligated to other viral vector / transgene fragment ends, i.e. have very preferentially ligated to fragment ends A, B, C and D as shown in Figure 2A.
  • the fragment end containing the target nucleotide sequence generated in the first fragmentation will have much more frequently ligated to fragment ends originating from its integration site in the host genome, i.e. ligated to fragment ends from the host genome.
  • a multiplex CRISPR-Cas nuclease digestion is performed which is specific for all possible sequences resulting from the religation of two fragment ends of viral origin; i.e. the ligation products AB, AC, AD, BC, BD and CD.
  • An exonuclease treatment is performed to digest all DNA fragmented in the selective digestion step.
  • the remaining DNA can be library prepped and sequenced according to standard protocols.
  • Example 3 Illustrative example of targeted sequencing of integrated AAV and integration sites. pSUB201 is a well-established AAV plasmid construct (Nicola J Philpott et al. (2002) J Virol. 76:5411-21).
  • Figure 3 provides a circular map of the construct and shows the positions of unique restriction sites within the construct (https://www.addgene.org/browse/sequence_vdb/4231/). Any of these restriction sites may be used in the methods of the invention.
  • the unique HindIII restriction site can be used as the recognition site in the non-random fragmentation step b).
  • the majority of ligations in crosslinked non-integrated copies of the vector will consist of a ligation of vector sequences at either end of the HindIII restriction site.
  • the HindIII fragment ends will have opportunity to ligate to HindIII fragment ends originating from the integration site in the human genome, i.e. to fragment ends from the human genome.
  • a nuclease for example a site-directed nuclease such as a CRISPR-Cas nuclease
  • a fusion sequence which comprises the restriction site and nucleotides from DNA fragment ends of viral origin
  • the sequence of the unique HindIII restriction site (AAGCTT) is shown in bold and italics in the pSUB201 sequence below.
  • the illustrative fusion sequence (GACGCGGAAGCTTCGATCAA) comprises the recognition sequence and a first and second flanking sequence, wherein each flanking sequence is of 7 nucleotides in length.
  • AAAAGTATTA CAGGGTCATA ATGTTTTTGG TACAACCGAT TTAGCTTTAT 5950
  • a method for making a DNA sequencing library of a genomic region of interest comprising a target nucleotide sequence comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or (ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) optionally, determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or (ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
  • fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a CRISPR-Cas nuclease.
  • fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
  • the method further comprises the step of: i) (a) further fragmenting the crosslinked DNA provided in step a) or the ligated crosslinked DNA generated in step c); and (b) ligating the fragmented DNA generated in step i) (a), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step d); or ii) (a) further fragmenting the undigested DNA generated in step f); and (b) optionally, circularising or ligating the fragmented DNA generated in step ii) (a), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
  • the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 6 to 8 nucleotides in length.
  • step e) is performed prior to step d).
  • the fusion sequence is of 15 to 25 nucleotides in length, preferably of 20 nucleotides in length.
  • nuclease of digesting step e) is a restriction enzyme.
  • nuclease of digesting step e) is a site-directed nuclease, preferably wherein the site-directed nuclease is a CRISPR-Cas nuclease.
  • A’) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using the nuclease of step e); and B’) ligating the mixture of digested and undigested DNA generated in step A’); prior to step d).
  • step f) (ii) comprises amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i).
  • amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) comprises using at least one primer which hybridises to the DNA fragment comprising the target nucleotide sequence generated in step b) and optionally further using a plurality of primers which each hybridises to the DNA sequence of one of the DNA fragments comprising the one or more further target nucleotide sequences generated in step b).
  • step f) (ii) comprises capture-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
  • the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested DNA using PCR.
  • the method further comprises the step of size selection prior to or after the enrichment step f) (ii), preferably wherein the size selection step comprises using gel extraction chromatography, gel electrophoresis or density gradient centrifugation.
  • DNA is selected of a size between 20-20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs.
  • step g) is performed using whole genome sequencing.
  • step g) comprises determining at least part of the sequence of the undigested DNA comprising the target nucleotide sequence.
  • step of building a contig comprises the steps of:
  • step b 1) identifying the fragments of step b);
  • step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step e) and coupling of the different ligation products to the identified fragments.
  • first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the genomic region of interest.
  • genomic region of interest comprises a transgene integration site.
  • first and second flanking sequences are from two separate DNA fragments from the transgene and/or from the vector used to deliver the transgene.
  • first and second flanking sequences are each from a separate DNA fragment from the vector; ii) the first flanking sequence is from a DNA fragment from the vector and the second flanking sequence is from a DNA sequence from the transgene; or iii) the first and second flanking sequences are each from a separate DNA fragment from the transgene.
  • a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a); c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence; f) circularising the fragmented DNA generated in step e); g) digesting the ligated DNA generated in step f) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step e) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b

Abstract

The invention relates to methods for making a DNA sequencing library of, and for determining the sequence of, a genomic region of interest comprising a target nucleotide sequence, the methods comprising fragmenting a sample of crosslinked DNA, ligating the fragmented crosslinked DNA, reversing the crosslinking, digesting the DNA with a nuclease, degrading the digested DNA and/or enriching for the target nucleotide sequence, and determining at least part of the sequence of the undigested DNA which comprises the target nucleotide sequence.

Description

METHOD FOR TARGETED SEQUENCING
FIELD OF THE INVENTION
The present invention relates to the field of molecular biology and more in particular to DNA technology. The invention in more detail relates to the sequencing of DNA. The invention relates to strategies for determining (part of) a DNA sequence of a genomic region of interest. In particular, the invention relates to the determination of the sequence of parts of a genome that are in a spatial configuration with each other.
BACKGROUND TO THE INVENTION
The gene and cell therapy fields are advancing rapidly, with a potential to treat and cure a wide range of diseases. Gene therapy is the introduction of exogenous nucleic acids (e.g. transgenes) into cells or organisms to achieve a therapeutic effect. In the majority of cases, gene therapy involves the integration of one or more copies of a transgene into the host cell genome. Although targeted transgene integration using CRISPR-Cas9 and other genome editing techniques holds great promise, the majority of current gene therapies use retroviral, lentiviral or non-viral vectors that are non-targeted and can integrate at multiple sites, with some predilection for open chromatin and transcriptionally active regions. Replication-defective retrovirus-based gene transfer vectors, in particular lentiviruses which efficiently target and integrate into both dividing and non- or slowly dividing cells (e.g. stem cells and neurons), are the vector of choice for many applications. Several lentivirus-based gene therapy clinical trials are ongoing, including targeting hematopoietic stem cells and T cells for a variety of diseases and adoptive T cell therapy (e.g. CAR-T) for cancer treatment.
The first clinical trials of gene transfer into humans raised issues regarding the safety of early viral vectors due to vector toxicity and activation of proto-oncogenes caused by vector-mediated insertional mutagenesis. Whilst there have been no reports of insertional mutagenesis caused by lentivirus integration since the development of self-inactivating lentiviral vectors, it remains important to study vector integration sites to assess safety, prolonged genomic toxicity and posttranscriptional deregulation events. In addition, analysis of vector integration sites can provide critical information on the clonality of gene-modified cells and potential biological impacts of specific transgene insertion sites, including enhanced therapeutic efficacy (e.g. enhanced chimeric antigen receptor (CAR) T cell function through transgene disruption of TET2).
Several methods have been developed to investigate and report locations of newly integrated viral DNA using next generation sequencing (NGS). In general, the analysis of vector integration sites involves primer binding within the vector genome, elongation into the flanking host genome sequences, ligation of a common adapter on the host part, and subsequent polymerase chain reaction (PCR) amplification of the flanking host genomic sequences, followed by sequencing of the PCR amplicons. Most commonly, unidirectional linear amplification-mediated PCR (LAM-PCR) is combined with paired-end Illumina next-generation sequencing to quantify, analyse and map vector integration sites.
The methods established to detect vector integration sites using NGS are limited by low yield, lengthy protocols or inadvertent bias introduced during restriction digest (if applied), ligation, and PCR.
Considerable effort has been devoted to developing “target enrichment” strategies for sequencing, in which target genomic regions from a DNA sample are selected and subsequently sequenced. Some of the most widely used targeted sequencing approaches include capture-based enrichment and PCR-based amplification. For example, in the use of capture probes, on an array or in solution, probes of 60-120 bases in length are used to capture the genomic region of interest via hybridisation. Alternatively, performing a PCR reaction using a single primer pair will amplify, and thus enrich for, a genomic region.
These enrichment strategies have limitations:
• sequence information throughout the genomic region of interest is required beforehand to design probes and/or primers to capture and/or amplify the region of interest;
• the assays are biased by using sequence data for the probes and/or primers which largely cover the genomic region of interest;
• the strategies do not detect sequences that deviate too much from the designed template sequences and will therefore not detect insertions; and
• the approaches typically require fragmenting DNA into fragments which are a few hundred base pairs in length before the analysis, leading to loss of information regarding rearrangements within the region of interest.
Targeted sequencing approaches have also been developed that rely on the physical proximity of sequences as the basis of selection. For example, Targeted Locus Amplification (TLA) enables targeted enrichment and complete sequencing of a genomic region of interest comprising a target nucleotide sequence, i.e. of the linear chromosome template surrounding a target nucleotide sequence. TLA and other physical proximity protocols are based on the concept that, in general, the chance of different fragments being crosslinked correlates inversely with the linear distance, i.e. the frequency of intra-chromosomal crosslinking is on average always higher than that of sequences from physically distant positions in the linear genome sequence or from other DNA fragments (e.g. different isolated DNA molecules, different chromosomes, episomal copies of a vector etc.). Thus, DNA fragments that ligate to the DNA fragment comprising the target nucleotide sequence are representative of the genomic region of interest comprising the target nucleotide sequence.
The TLA approach involves crosslinking of DNA, fragmenting the crosslinked DNA (e.g. with a restriction enzyme), followed by ligation of the crosslinked DNA fragments. The ligated DNA fragments comprising the target nucleotide sequence, and thus the genomic region of interest, may be enriched, e.g. by PCR. The sequence of the genomic region of interest on the linear chromosome template can subsequently be determined using (high throughput) sequencing technologies well known in the art.
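The fragmentation step described above can be pictured in silico. The following Python sketch is an illustration only and is not part of the disclosed protocol: it cuts a linear sequence at every occurrence of a recognition sequence, assuming the HindIII site AAGCTT cut after its first nucleotide. The function name `digest` and the toy sequence are hypothetical, and only one strand is modelled.

```python
def digest(sequence: str, site: str = "AAGCTT", cut_offset: int = 1) -> list[str]:
    """Cut a linear sequence at every occurrence of `site`.

    cut_offset is the number of site nucleotides left on the upstream
    fragment (1 for HindIII, which cuts A^AGCTT). Illustrative model only.
    """
    fragments = []
    previous_cut = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[previous_cut:cut])
        previous_cut = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[previous_cut:])
    return fragments


# Hypothetical toy sequence containing the recognition site twice.
toy = "CCCAAGCTTGGGGAAGCTTTTT"
fragments = digest(toy)
print(len(fragments), fragments)
```

With two recognition sites in a linear molecule the sketch returns three fragments, matching the three viral fragments described in Example 1.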
Conventional targeted sequencing approaches and conventional physical proximity-based protocols cannot distinguish between integrated and episomal copies of a vector / transgene sequence. This is a limitation in the NGS-based analysis of gene therapy products as the quality will often depend on the frequency with which complete correct integrations of the vector / transgene sequence without structural changes in the vector / transgene sequence and/or integration sites occur.
Therefore, there is a need in the art for improved methods for sequencing vector integration sites which avoid the above issues.
SUMMARY OF THE INVENTION
The present invention provides an improved physical proximity-based approach for sequencing of genomic regions of interest using the principles of physical proximity protocols and the deselection of ligation products which are less valuable. In particular, the invention provides a method for making a targeted DNA sequencing library (i.e. a DNA library suitable for sequencing) of a genomic region of interest comprising a target nucleotide sequence and a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence (e.g. from a targeted DNA sequencing library).
According to the invention, sample DNA is crosslinked and then fragmented with a fragmenting strategy that results in known fragment end sequences (e.g. restriction enzymes) and the crosslinked DNA fragments are then ligated. The resulting ligation products are then digested at least once with a nuclease which targets the fusion sequences resulting from one or more known ligation product(s) which are less valuable in terms of the sequence analysis, prior to an enrichment step and sequencing step. Ligation products which are less valuable may, for example, be of the DNA fragment containing the target nucleotide sequence and one or more DNA fragments that originally occurred in immediate or close physical proximity to the target nucleotide sequence on the linear chromosome template.
In this manner, the enrichment and sequencing of ligation products containing these fusion sequences, i.e. ligation products which are less valuable, can be minimised or prevented.
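To make the notion of a fusion sequence concrete, the sketch below assembles and decomposes one in Python. It is a hedged illustration, not the disclosed method: it assumes a religated HindIII site (AAGCTT) flanked by 7 nucleotides on each side, as in the illustrative 20-nucleotide fusion sequence GACGCGGAAGCTTCGATCAA given for pSUB201. The function names `build_fusion` and `split_fusion` and the longer input strings are hypothetical.

```python
RECOGNITION = "AAGCTT"  # HindIII recognition sequence (assumed for this sketch)


def build_fusion(upstream: str, downstream: str,
                 recognition: str = RECOGNITION, flank: int = 7) -> str:
    """Join the last `flank` nucleotides preceding the site on one fragment,
    the recognition sequence, and the first `flank` nucleotides following
    the site on another fragment."""
    return upstream[-flank:] + recognition + downstream[:flank]


def split_fusion(fusion: str, recognition: str = RECOGNITION) -> tuple[str, str, str]:
    """Return (first flank, recognition sequence, second flank)."""
    start = fusion.find(recognition)
    if start < 0:
        raise ValueError("recognition sequence not found in fusion sequence")
    end = start + len(recognition)
    return fusion[:start], fusion[start:end], fusion[end:]


# Hypothetical flanking stretches taken from two separate fragments.
fusion = build_fusion("TTGACGCGG", "CGATCAAGG")
first, site, second = split_fusion(fusion)
print(fusion, len(first), len(site), len(second))
```

A 20-nucleotide target of this kind is in the size range a site-directed nuclease guide would address, which is why the flank length is exposed as a parameter here.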
In conventional physical proximity protocols the majority of ligation events will occur between DNA fragments that originally occurred in close physical proximity (e.g. adjacent to each other) on the linear chromosome template. Such ligation events are over-represented in the sequence material and unlikely to contain sequences of the genomic region of interest originating from meaningful physical distances. This results in high sequencing coverage across sequences in immediate vicinity to the target nucleotide sequence, but lower sequencing coverage across the remainder of the genomic region of interest.
By contrast, the targeted digestion of the methods of the invention results in the digestion of the relatively large number of ligation products that, without this step, would result in high sequencing coverage across sequences in immediate vicinity to the target nucleotide sequence.
Crucially, the selective digestion of these specific fusion sequences will still result in complete sequence information across a genomic region of interest since ligation products will consist of multiple DNA fragments and since DNA fragment ends in the genomic region of interest will, after the ligation step, have ligated to multiple (ends of) other DNA fragments.
The ligations of all DNA fragments thus also result in ligation products that remain undigested following the digestion step using a nuclease which is directed to the specific fusion sequence.
All DNA fragments in a genomic region of interest will therefore still be present in the combination of ligation products that comprise the target nucleotide sequence which remain following the nuclease digestion step.
Thus, the methods of the invention provide more efficient sequencing of an entire genomic region of interest. This is achieved by more preferentially sequencing those DNA sequences within the genomic region of interest with larger physical distances from the target nucleotide sequence in the linear chromosome template than in conventional physical proximity protocols. The methods of the invention therefore improve the sequencing efficiency of any genomic region of interest. This includes regions comprising repeats (e.g. concatemerized sequences) as the method of the invention improves the efficiency with which sequences at either end of such a repeat are sequenced. This also includes regions comprising large structural changes. For example, when the site-directed nuclease digestions are designed to target ligation events of DNA fragments spanning a 10 kb genomic region of interest in the wild-type DNA sequence, ligation products resulting from alleles in which structural changes have occurred (e.g. an insertion or translocation) within this 10kb region will more frequently contain ligation products that will not be digested (e.g. products of ligation events between the target nucleotide sequence and the inserted or translocated sequence). These ligation products which are not digested will therefore be more preferentially sequenced compared to ligation events originating from wild-type alleles that are digested.
The methods of the invention also present particular advantages in the sequencing of integrated transgene sequences and vector integration sites in samples comprising integrated and episomal copies of a vector (which also comprise copies of the integrated transgene sequences). Conventional targeted sequencing approaches and physical proximity approaches cannot distinguish between DNA fragments originating from integrated and episomal copies of a vector / transgene sequence and this limits the efficiency and quality with which integrated copies and their integration sites can be sequenced.
The method of the invention enables the targeted sequencing of integrated transgene copies by exploiting the differences between the ligation products originating from integrated and episomal copies of a vector / transgene sequence. Episomal copies of the vector / transgene sequences occur in DNA molecules which are physically separated from each other and the host genome. The invention is based upon the concept that, in a physical proximity approach, ligation products from episomal copies of a vector / transgene sequence frequently exclusively contain ligation events between DNA fragments from the episomal copies of the vector / transgene, e.g. between the DNA fragment containing the target nucleotide sequence and DNA fragments from the episomal copies of the vector / transgene sequence. On the other hand, integrated copies of the vector sequence are located within a much longer stretch of DNA originating from the host genome such that ligation events are more likely to occur between sequences at either end of the fragmentation site (i.e. recognition sequence) in the vector sequence and sequences originating from its integration site. Thus, ligation products from integrated copies of the vector / transgene sequence will much more frequently contain ligation events between the DNA fragment containing the target nucleotide sequence (e.g. transgene or portion thereof) and DNA fragments from the integration site in the host genome, i.e. sequences of the host genome flanking the transgene. In this manner, the selective nuclease digestion of possible combinations of DNA fragment ends from the vector / transgene sequence will preferentially result in the depletion of ligation events originating from episomal copies.
In the methods of the invention, one or more of the potential fusion sequences resulting from the ligation of the DNA fragment containing the target nucleotide sequence and DNA fragments of the vector / transgene sequence is/are targeted by nuclease digestion. Thus, ligation products which exclusively contain vector / transgene sequences (e.g. vector-vector, vector-transgene and/or transgene-transgene ligation events) are targeted for digestion. Hence, nuclease digestion of one or more of the potential fusion sequences resulting from the ligation of the DNA fragment containing the target nucleotide sequence and DNA fragments of the vector / transgene sequence results in the very preferential enrichment and sequencing of ligation events of the DNA fragment containing the target nucleotide sequence and DNA fragments from the host genome. This then results in complete sequence information across integrated vectors and integration sites in the host genome.
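The set of vector-derived junctions to be targeted can be enumerated mechanically. The Python sketch below is a simplified illustration under stated assumptions: fragment ends are represented only by labels (A to D, as in Example 2), each junction is treated as an unordered pair of two different ends, and orientation and self-ligation are ignored. The function name `candidate_junctions` is hypothetical.

```python
from itertools import combinations


def candidate_junctions(fragment_ends: list[str]) -> list[str]:
    """Return every unordered pair of distinct fragment ends as a label such as 'AB'."""
    return ["".join(pair) for pair in combinations(fragment_ends, 2)]


viral_ends = ["A", "B", "C", "D"]  # fragment end labels as in Example 2
junctions = candidate_junctions(viral_ends)
print(junctions)  # ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']
```

For n fragment ends this gives n(n-1)/2 junctions, so four viral fragment ends yield the six ligation products AB, AC, AD, BC, BD and CD listed for the multiplex digestion in Example 2.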
An important advantage of the methods of the invention is the provision of complete sequence information across the genomic region of interest despite the deselection of ligation products which are less valuable. In the context of vector integration sites, this enables the quality control of integrated sequences and will also provide the breakpoint sequence between the integrated vector / transgene and the host genome at the exact position of the integration site.
As used herein, the terms “vector integration site” and “transgene integration site” are used interchangeably to refer to the genomic locus within which the vector / transgene has integrated.
The methods of the invention can also be applied to sequencing of targeted transgene integration (e.g. using CRISPR-Cas9 or other genome editing techniques) and to in vivo analysis of integration sites.
Thus, the methods of the invention facilitate safety evaluation of preclinical lentiviral vector gene therapies by providing vector integration site analysis with improved confidence.
Accordingly, in a first aspect the invention provides a method for making a DNA sequencing library of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) optionally, determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
In a second aspect, the invention provides a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or (ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
In some embodiments, the genomic region of interest comprises one or more further target nucleotide sequences.
In some embodiments, a DNA sequencing library of a plurality of genomic regions of interest is made.
In some embodiments, the sequences of a plurality of genomic regions of interest are determined.
In some embodiments, the fragmenting step b) comprises fragmenting with a restriction enzyme.
In some embodiments, the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a CRISPR-Cas nuclease.
In some embodiments, the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
In some embodiments, the method further comprises the step of:
i) (a) further fragmenting the crosslinked DNA provided in step a) or the ligated crosslinked DNA generated in step c); and (b) ligating the fragmented DNA generated in step i) (a), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step d); or
ii) (a) further fragmenting the undigested DNA generated in step f); and (b) optionally, circularising or ligating the fragmented DNA generated in step ii) (a), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
In some embodiments, the further steps i) (a) and i) (b) are repeated at least once.
In some embodiments, the further step i) (a) is performed using random fragmentation.
In some embodiments, the further steps i) (a) and ii) (a) are performed using non-random fragmentation at a recognition sequence.
In some embodiments, the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence used in the further fragmenting step i) (a) or step ii) (a).
In some embodiments, the average length of the DNA fragments generated in fragmenting step b) is greater than the average length of the DNA fragments generated in further fragmenting step i) (a) or step ii) (a).
In some embodiments, the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 6 to 8 nucleotides in length.
In some embodiments, step e) is performed prior to step d).
In some embodiments, the method further comprises the step of circularising the DNA of step d), prior to step e).
In some embodiments, the fusion sequence of digesting step e) is of greater length than the recognition sequence of fragmenting step b).
In some embodiments, the fusion sequence is of 15 to 25 nucleotides in length, preferably of 20 nucleotides in length.
In some embodiments, the nuclease of digesting step e) is a restriction enzyme.
In some embodiments, the nuclease of digesting step e) is a site-directed nuclease. In some preferred embodiments, the site-directed nuclease is a CRISPR-Cas nuclease.
In some embodiments, the digesting step e) uses a multiplex nuclease digestion specific for a plurality of specific fusion sequences.
In some embodiments, the method further comprises the steps of:
A’) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using the nuclease of step e); and
B’) ligating the mixture of digested and undigested DNA generated in step A’); prior to step d).
In some embodiments, the further steps A’) and B’) are repeated at least once.
In some embodiments, the enriching step f) (ii) comprises amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i).
In some embodiments, amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i) comprises using at least one primer which hybridises to the DNA fragment comprising the target nucleotide sequence generated in step b) and optionally further using a plurality of primers which each hybridises to the DNA sequence of one of the DNA fragments comprising the one or more further target nucleotide sequences generated in step b).
In some embodiments, at least one primer directs amplification towards the recognition sequence of step b).
In some embodiments, an identifier is included in the at least one primer.
In some embodiments, the enriching step f) (ii) comprises capture-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
In some embodiments, the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested DNA using PCR. Preferably, the PCR uses at least one primer facing a fragment end generated by the nuclease of step e).
In some embodiments, the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested circularised DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested circularised DNA generated in step e) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested linearised DNA using PCR.
In some embodiments, the method further comprises the step of size selection prior to or after the enrichment step f) (ii), preferably wherein the size selection step comprises using gel extraction chromatography, gel electrophoresis or density gradient centrifugation.
In some embodiments, DNA is selected of a size between 20-20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs.
In some embodiments, step g) is performed using whole genome sequencing.
In some embodiments, step g) comprises determining at least part of the sequence of the undigested DNA comprising the target nucleotide sequence.
In some embodiments, the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step g).
In some embodiments, when the cell ploidy of the genomic region of interest is greater than 1, a contig is built for each ploidy.
In some embodiments, the step of building a contig comprises the steps of:
1) identifying the fragments of step b);
2) assigning the fragments to a genomic region;
3) building a contig for the genomic region.
In some embodiments, the step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step e) and coupling of the different ligation products to the identified fragments.
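By way of illustration only, steps 1) to 3) can be sketched computationally. The sketch below is a simplified, reference-guided reading of those steps and is not the procedure of the invention: it assumes exact matching of fragments to a known reference sequence, ignores reverse-complement reads and sequencing errors, and represents the "contig" merely as merged covered intervals. The function names, the toy reference and the toy reads are all hypothetical.

```python
def split_read(read, site):
    """Step 1: split a read into the restriction fragments it contains.

    The recognition site is kept at the start of each downstream fragment.
    """
    parts = read.split(site)
    fragments = [parts[0]] + [site + p for p in parts[1:]]
    return [f for f in fragments if f]  # drop empty pieces


def build_contig_intervals(reads, site, target, reference):
    """Steps 2 and 3: assign fragments to the region and merge their positions."""
    hits = []
    for read in reads:
        # Step 2 (assumption): a ligation product belongs to the region of
        # interest if it contains the target nucleotide sequence.
        if target not in read:
            continue
        for fragment in split_read(read, site):
            pos = reference.find(fragment)  # exact match only (assumption)
            if pos != -1:
                hits.append((pos, pos + len(fragment)))
    # Step 3: merge overlapping or abutting intervals into contiguous stretches.
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy data: four fragments separated by GATC sites; the target is "AAAA".
reference = "TTTT" + "GATCAAAA" + "GATCCCCC" + "GATCGGGG"
reads = [
    "GATCAAAAGATCGGGG",  # target fragment ligated to a distal fragment
    "GATCCCCCGATCAAAA",  # target fragment ligated to its neighbour
    "TTTTGATCGGGG",      # ligation product without the target: ignored
]
intervals = build_contig_intervals(reads, "GATC", "AAAA", reference)
print(intervals)
```

In practice an aligner and an assembler would replace the exact string matching used here.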
In some embodiments, the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the genomic region of interest.
In some embodiments, the genomic region of interest comprises a transgene integration site.
In some embodiments, the target nucleotide sequence comprises a transgene.
In some embodiments, the first and second flanking sequences are from two separate DNA fragments from the transgene and/or from the vector used to deliver the transgene.
In some embodiments: i) the first and second flanking sequences are each from a separate DNA fragment from the vector; ii) the first flanking sequence is from a DNA fragment from the vector and the second flanking sequence is from a DNA sequence from the transgene; or iii) the first and second flanking sequences are each from a separate DNA fragment from the transgene.
In some embodiments, the target nucleotide sequence comprises an allele of the genomic region of interest.
In some embodiments, the first and second flanking sequences are from separate DNA fragments from the allele.
In a third aspect, the invention provides a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a); c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence; f) circularising the fragmented DNA generated in step e); g) digesting the circularised DNA generated in step f) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step e) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); h) (i) degrading the digested DNA generated in step g) using an exonuclease; and/or (ii) enriching the mixture of digested and undigested DNA generated in step g) or the undigested DNA generated in step h)(i) for DNA comprising the target nucleotide sequence; and i) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
BRIEF DESCRIPTION OF THE FIGURES
Figure 1: Schematic of an illustrative approach for targeted sequencing of integrated viruses and integration sites in accordance with the invention. A. A viral genome sequence with annotation of a restriction site which occurs twice in the viral genome sequence. The fragmentation of the viral genome with the restriction enzyme which recognizes this restriction site will result in three fragments, shown in different shades. Resulting fragment ends A, B and C generated in the fragmentation of the viral genome are shown. A target nucleotide sequence (outlined with a black box) is chosen in proximity to one of the restriction sites. In episomal copies, ligation of the DNA fragment end in proximity to the target nucleotide sequence with other DNA fragments from the vector sequence will result in known fusion sequences which may be targeted with nucleases. B. A portion of the human genome sequence showing the presence of multiple restriction sites (recognised by the same restriction enzyme as in Figure 1A) in the human genome in proximity to any integration site of the virus. In integrated copies, the DNA fragment end in proximity to the target nucleotide sequence will frequently ligate to DNA fragments from the integration site, i.e. the host genome.
Figure 2: Schematic of an illustrative approach for whole genome sequencing in accordance with the invention. A. A viral genome sequence with annotation of a restriction site which occurs twice in the viral genome sequence. The fragmentation of the viral genome with the restriction enzyme which recognizes this restriction site will result in three fragments, shown in different shades. Resulting fragment ends A, B, C and D generated in the fragmentation of the viral genome are shown. B. A portion of the human genome sequence showing the presence of multiple restriction sites (recognised by the same restriction enzyme as in Figure 2A) in the human genome in proximity to any integration site of the virus. The selective digestion of transgene-transgene, transgene-vector and vector-vector fusion sequences will help minimise or prevent the sequencing of episomal copies.
Figure 3: Plasmid map of pSUB201 showing the position of unique restriction sites within the plasmid sequence.
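By way of illustration only, the known fusion sequences referred to in the figure legends can be enumerated in silico from a vector sequence. The sketch below is a minimal example under stated assumptions: the vector sequence is a toy sequence, the recognition sequence is the 6-nucleotide site GAATTC, and the flank length of 7 nucleotides is chosen only so that the fusion sequence is 20 nucleotides long. Only head-to-tail ligation of two different, non-adjacent fragments is modelled; inverted (head-to-head and tail-to-tail) junctions are omitted for brevity. The function names are hypothetical.

```python
def site_flanks(vector, site, flank):
    """Return the (upstream, downstream) flanking sequences of every site."""
    flanks = []
    pos = vector.find(site)
    while pos != -1:
        upstream = vector[max(0, pos - flank):pos]
        downstream = vector[pos + len(site):pos + len(site) + flank]
        flanks.append((upstream, downstream))
        pos = vector.find(site, pos + 1)
    return flanks


def fusion_sequences(vector, site, flank=7):
    """Enumerate fusion sequences formed by head-to-tail ligation.

    With n sites the vector yields fragments 0..n. Fragment x ends at site x
    and fragment y starts at site y - 1. A junction x -> y is skipped when
    y == x (same fragment) or y == x + 1 (native, unrearranged sequence).
    """
    flanks = site_flanks(vector, site, flank)
    n = len(flanks)
    fusions = {}
    for x in range(n):
        for y in range(1, n + 1):
            if y == x or y == x + 1:
                continue
            fusions[(x, y)] = flanks[x][0] + site + flanks[y - 1][1]
    return fusions


# Toy vector with two GAATTC sites, hence three fragments (0, 1 and 2).
vector = "ACGTACG" + "GAATTC" + "TTGCAAT" + "GAATTC" + "CCATGGA"
fusions = fusion_sequences(vector, "GAATTC", flank=7)
for (x, y), sequence in fusions.items():
    print(f"fragment {x} -> fragment {y}: {sequence} ({len(sequence)} nt)")
```

In this toy case the only non-native head-to-tail junction joins the first and third fragments, which is a sequence that does not occur in the intact vector and so could serve as a nuclease target.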
DETAILED DESCRIPTION OF THE INVENTION
In the following description and examples, a number of terms are used. In order to provide a clear and consistent understanding of the specification and claims, including the scope to be given such terms, the following definitions are provided. Unless otherwise defined herein, all technical and scientific terms used have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Methods of carrying out the conventional techniques used in methods of the invention will be evident to the skilled worker. The practice of conventional techniques in molecular biology, biochemistry, computational chemistry, cell culture, recombinant DNA, bioinformatics, genomics, sequencing and related fields are well-known to those of skill in the art and are discussed, for example, in the following literature references: Sambrook et al., Molecular Cloning. A Laboratory Manual, 2nd Edition, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N. Y., 1989; Ausubel et al., Current Protocols in Molecular Biology, John Wiley & Sons, New York, 1987 and periodic updates; and the series Methods in Enzymology, Academic Press, San Diego.
As used herein, the singular forms "a," "an" and "the" include plural referents unless the context clearly dictates otherwise. For example, a method for isolating "a" DNA molecule, as used above, includes isolating a plurality of molecules (e.g. 10's, 100's, 1000's, 10's of thousands, 100's of thousands, millions, or more molecules).
As used herein, the term “nucleic acid” may include any polymer or oligomer of pyrimidine and purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively (see Albert L. Lehninger, Principles of Biochemistry, at 793-800 (Worth Pub. 1982) which is herein incorporated by reference in its entirety for all purposes). The present invention contemplates any deoxyribonucleotide, ribonucleotide or peptide nucleic acid component, and any chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of these bases, and the like. The polymers or oligomers may be heterogeneous or homogenous in composition, and may be isolated from naturally occurring sources or may be artificially or synthetically produced. In addition, the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states.
As used herein, the terms “aligning” and “alignment” mean the comparison of two or more nucleotide sequences based on the presence of short or long stretches of identical or similar nucleotides. Methods and computer programs for alignment are well known in the art. One computer program which may be used or adapted for aligning is "Align 2", authored by Genentech, Inc., which was filed with user documentation in the United States Copyright Office, Washington, D.C. 20559, on Dec. 10, 1991.
Method for determining the sequence of a genomic region of interest
According to a first aspect of the invention, a method is provided for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence to form fragmented crosslinked DNA; c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA; d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA; e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an exonuclease such that only undigested DNA remains; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence to enrich for undigested DNA comprising the target nucleotide sequence; and g) determining at least part of the sequence of the undigested DNA to provide at least part of the sequence of the undigested DNA remaining after step f), preferably using high throughput sequencing.
As used herein, the term “genomic region of interest” refers to a DNA sequence of an organism of which it is desirable to determine at least part of the DNA sequence. For instance, a genomic region which comprises, or is suspected of comprising, an allele associated with a disease may be a genomic region of interest. Another example is a genomic region which comprises a vector insertion site. In this case, the whole genome sequence may be determined following deselection of episomal copies of the vector/transgene sequence. Thus, in some embodiments, the genomic region of interest is the whole genome.
As used herein, the term “target nucleotide sequence” refers to a DNA sequence of interest within a genomic region of interest. For example, the target nucleotide sequence may be a transgene or a portion thereof. Suitably, the target nucleotide sequence may be an allele or a portion thereof. The target nucleotide sequence may be used for the design of the non-random fragmenting strategies described herein as well as in the enrichment steps described herein.
By fragmenting a sample of crosslinked DNA, the DNA fragments that originate from a genomic region of interest remain in proximity of each other because they are crosslinked. When these crosslinked DNA fragments are subsequently ligated, DNA fragments of the genomic region of interest, which are in the proximity of each other due to the crosslinks, are ligated. This type of ligation is also referred to as proximity ligation. DNA fragments comprising the target nucleotide sequence may ligate with DNA fragments within a large linear distance at the sequence level. By determining (at least part of) the sequence of ligation products that comprise the fragment comprising the target nucleotide sequence, sequences of DNA fragments within the spatial surrounding of the genomic region of interest are obtained. Each individual target nucleotide sequence is likely to be crosslinked to multiple other DNA fragments. As a consequence, often more than one DNA fragment may be ligated to a fragment comprising the target nucleotide sequence and, in a sample comprising multiple copies of a genomic region of interest, each individual DNA fragment comprising the target nucleotide sequence may ligate to different combinations of DNA fragments originating from the genomic region of interest. By combining (partial) sequences of the (amplified) ligation products in which DNA fragments were ligated with a fragment comprising the target nucleotide sequence, a sequence of the genomic region of interest may be built. A DNA fragment ligated with the fragment comprising the target nucleotide sequence includes any fragment which may be present in ligation products.
As used herein, the term “ligation product” means a DNA sequence which is generated by ligating DNA fragments together. Thus, a ligation product comprises at least two DNA fragments. In the context of the present invention, the DNA fragments which are subsequently ligated have been produced by a previous fragmentation step.
Methods are known in the art that involve crosslinking DNA, as well as fragmenting and ligating the DNA fragments (e.g. WO 2007/004057 or WO 2008/08845). Thus, approaches for performing steps a)-d) of the methods of the invention are known.
The methods of the invention have the advantages that extensive sequence information is not required to focus on the genomic region of interest and the method is not sequence-biased (i.e. bias by using oligonucleotides and/or probes which cover the transgene of interest, allelic sequence of interest, or flanking sequences surrounding the sequence of interest, is avoided).
In some embodiments, the methods of the invention may be used in the analysis of the 3D folding of regions of interest. Methods for the analysis of the 3D folding of regions of interest are known in the art (see, for example, Sungalee et al (2021) Nature Genetics 53: 650-662). Thus, the methods of the invention can be applied to the analysis of the 3D folding of regions of interest.
Sample of crosslinked DNA
In step a) a sample of crosslinked DNA is provided.
As used herein, the term “sample DNA” refers to a sample that is obtained from an organism or from a tissue of an organism, or from tissue and/or cell culture, which comprises DNA. A sample DNA from an organism may be obtained from any type of organism, e.g. microorganisms, viruses, plants, fungi, animals, humans and bacteria, or combinations thereof. For example, a tissue sample from a human patient suspected of a bacterial and/or viral infection may comprise human cells, but also viruses and/or bacteria. The sample may comprise cells and/or cell nuclei. Suitably, the sample may comprise or consist of isolated DNA.
In some embodiments, the sample DNA is from a patient or a person who may be at risk of, suspected of having, or has a particular disease, for example cancer, a viral infection (e.g. HIV-1) or any other condition which warrants the investigation of their DNA.
In some embodiments, the sample DNA is from a patient or a person who is undergoing or has undergone gene therapy, for example using a lentiviral vector.
Samples may be taken from a patient and/or from diseased tissue, and may also be derived from other organisms or from separate sections of the same organism, such as two samples from one patient, one from healthy tissue and one from diseased tissue. Samples may thus be analysed according to the invention and compared with a reference sample, or different samples may be analysed and compared with each other. For example, for a patient suspected of having cancer, a biopsy may be obtained from the suspected tumour. Another biopsy may be obtained from non-diseased tissue. Both tissue biopsies may be analysed according to the invention. Genomic regions of interest may be those containing a gene associated with the cancer type (e.g. the BRCA1 and BRCA2 genes, which are 83 and 86 kb long, respectively (reviewed in Mazoyer, 2005, Human Mutation 25:415-422), for suspected breast cancer). By determining the sequence of the genomic region of interest according to the invention and comparing the sequences of the genomic region from the different biopsies with each other and/or with a reference gene sequence (e.g. a reference BRCA gene sequence), genetic mutations may be found that will assist in diagnosing the patient and/or determining treatment of the patient and/or predicting prognosis of disease progression.
As used herein, the term “crosslinking” means reacting DNA at two different positions, such that these two different positions may be connected. The connection between the two different positions may be direct, forming a covalent bond between DNA strands. Two DNA strands may be crosslinked directly using UV-irradiation, forming covalent bonds directly between DNA strands. The connection between the two different positions may be indirect, via an agent, e.g. a crosslinker molecule. A first DNA section may be connected to a first reactive group of a crosslinker molecule comprising two reactive groups, that second reactive group of the crosslinker molecule may be connected to a second DNA section, thereby crosslinking the first and second DNA section indirectly via the crosslinker molecule. A crosslink may also be formed indirectly between two DNA strands via more than one molecule. For example, a typical crosslinker molecule that may be used is formaldehyde. Formaldehyde induces protein-protein and DNA-protein crosslinks. Formaldehyde thus may crosslink different DNA strands to each other via their associated proteins. For example, formaldehyde can react with a protein and DNA, connecting a protein and DNA via the crosslinker molecule.
Hence, two DNA sections may be crosslinked using formaldehyde forming a connection between a first DNA section (DNA1) and a protein, the protein may form a second connection with another formaldehyde molecule that connects to a second DNA section (DNA2), thus forming a crosslink which may be depicted as DNA1-crosslinker-protein-crosslinker-DNA2. In any case, it is understood that crosslinking according to the invention involves forming connections (directly or indirectly) between strands of DNA that are in physical proximity of each other. DNA strands may be in physical proximity of each other in the cell, as DNA is highly organised, while being separated from a linear sequence point of view e.g. by 100 kb. As long as the crosslinking method is compatible with subsequent fragmenting and ligation steps, such crosslinking may be contemplated for the purpose of the invention.
As used herein, the term a “sample of crosslinked DNA” refers to sample DNA which has been subjected to crosslinking. Crosslinking the sample DNA has the effect that the three-dimensional state of the DNA within the sample remains largely intact. This way, DNA strands that are in physical proximity of each other remain in each other’s vicinity. Thus, crosslinking the sample DNA as it is present in the sample results in largely maintaining the three-dimensional architecture of the DNA.
Non-random fragmentation at a recognition sequence
The sample of crosslinked DNA is fragmented in step b). By fragmenting the crosslinked DNA, DNA fragments are produced which are held together by the crosslinks.
The fragmenting step b) may comprise fragmenting with one or more restriction enzymes, or combinations thereof.
The fragmenting step b) may also comprise fragmenting with one or more site-directed nucleases, or combinations thereof.
Fragmenting with a restriction enzyme or site-directed nuclease is advantageous as it may allow control of the average fragment size. The fragments that are formed may have compatible overhangs or blunt ends that allow ligation of the fragments in the subsequent step c).
Furthermore, when dividing a sample of cross-linked DNA into a plurality of subsamples, for each subsample restriction enzymes or site-directed nucleases with different recognition sites may be used. This is advantageous because by using different restriction enzymes or site-directed nucleases having different recognition sites, different DNA fragments can be obtained from each subsample.
Accordingly, in some embodiments, the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
As used herein, the term “fragmenting DNA” includes any technique that, when applied to DNA, whether crosslinked or not, results in DNA fragments. Techniques well known in the art are sonication, shearing and/or enzymatic restriction, but other techniques can also be envisaged. Fragmenting techniques may result in random fragmentation of the DNA (e.g. sonication or shearing). Suitably, the fragmenting technique may result in non-random (i.e. targeted) fragmentation of the DNA (e.g. restriction enzymes or site-directed nucleases). Where a given step of the methods of the invention involves random and/or non-random fragmentation, this is specified herein.
The methods of the invention may also be performed using DNA fragmentation techniques that preferentially fragment either methylated DNA or unmethylated DNA and may be used for the targeted sequencing of either methylated or unmethylated alleles of genomic regions of interest, respectively. For example, certain restriction enzymes preferentially fragment methylated DNA (as compared to unmethylated DNA) whilst other restriction enzymes preferentially fragment unmethylated DNA (as compared to methylated DNA).
Thus, the methods of the invention are applicable to the sequencing of alleles in which known sequences are either methylated or unmethylated. For example, the promoter sequences of actively transcribed genes are typically unmethylated, whereas the corresponding gene body sequences can contain enriched levels of methylation. Thus, the digestion of unmethylated DNA or methylated DNA will result in the deselection of either the promoter or corresponding gene body sequence, respectively. Accordingly, the methods of the invention permit the enrichment and sequencing of the promoter or corresponding gene body sequence.
In conventional DNA methylation analyses bisulphite treatment is used. Treatment of DNA with bisulphite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. Such analyses do not enable the selective sequencing of alleles in which target nucleotide sequences have or have not been methylated. In addition, these approaches do not provide information regarding which genetic variants occur in methylated alleles of interest or unmethylated alleles of interest beyond the length of the sequencing reads generated in the sequencing analysis.
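By way of illustration only, the effect of bisulphite treatment on a single strand can be modelled as follows. This is a minimal sketch assuming complete conversion of every unmethylated cytosine; the function name and the use of zero-based positions to mark 5-methylcytosine residues are assumptions made for the example.

```python
def bisulphite_convert(sequence, methylated_positions=frozenset()):
    """Convert unmethylated C to U; leave 5-methylcytosine (listed by index) as C."""
    return "".join(
        "U" if base == "C" and index not in methylated_positions else base
        for index, base in enumerate(sequence)
    )


# The cytosine at index 1 is methylated and is therefore protected.
converted = bisulphite_convert("ACGTCC", methylated_positions={1})
print(converted)
```

After amplification the uracil residues are read as thymine, so methylation status is inferred read by read, which is why such information does not extend beyond the length of a sequencing read.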
Suitably, the methods of the invention may be used in combination with bisulfite treatment. In this way, the methods may be used for the sequencing and quantification of epigenetic changes in alleles in which known sequences are either methylated or unmethylated.
Examples of nucleases that can be used in such analyses are site-specific methyl-directed (MD) DNA endonucleases (e.g. GlaI). These enzymes recognise and cleave methylated DNA sequences only and do not cleave unmethylated DNA sequences (Tarasova et al. (2008) BMC Mol. Biol. 9: 7). Suitably, a restriction enzyme or site-directed nuclease may be used.
Examples of methylation-sensitive restriction enzymes that fragment unmethylated DNA include:
• DpnI and DpnII for N6-methyladenine detection within GATC recognition site; and
• HpaII and MspI for C5-methylcytosine detection within CCGG recognition site.
In the application of the methods of the invention to the analysis of the 3D folding of regions of interest, nucleases which preferentially fragment methylated and/or unmethylated DNA may be used.
By “random fragmentation” of the DNA it is meant that the fragmenting technique results in DNA fragments with unknown end sequences. Sonication results in the fragmenting of DNA at random sites, yielding ends which can be either blunt or can have 3’- or 5’-overhangs. As these DNA breakage points occur randomly, the DNA may be repaired (enzymatically), filling in possible 3’- or 5’-overhangs, such that DNA fragments are obtained which have blunt ends that allow ligation of the fragments to adaptors and/or to each other in a subsequent step. Alternatively, the overhangs may also be made blunt ended by removing overhanging nucleotides, using e.g. exonucleases.
In the methods of the invention, the sample of crosslinked DNA is fragmented in step b) by non-random fragmentation of the DNA at a recognition sequence.
By “non-random fragmentation” of the DNA it is meant that the fragmenting technique results in DNA fragments with known end sequences, i.e. that the fragmenting is targeted. Suitably, non-random fragmentation involves fragmenting at a specific recognition sequence. Suitably, the non-random fragmentation of the DNA may be performed using a site-directed nuclease or a restriction enzyme which targets a specific recognition sequence. In the context of the use of a restriction enzyme, the specific recognition sequence is also referred to herein as a “restriction enzyme site” or “restriction site”.
As used herein, the term “recognition sequence” means a specific nucleotide sequence which is recognised by a fragmenting technique (e.g. a site-directed nuclease or restriction enzyme) and directs cleavage of the DNA molecule at or near the recognition sequence. The specific nucleotide sequence which is recognized may determine the frequency of cleaving, e.g. a nucleotide sequence of 6 nucleotides occurs on average every 4096 nucleotides, whereas a nucleotide sequence of 4 nucleotides occurs much more frequently, on average every 256 nucleotides. In some embodiments, the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 4 to 8 nucleotides in length, more preferably 4 to 6 nucleotides in length.
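The average spacings quoted above follow from a simple calculation, sketched below under the assumption that the four bases occur with equal frequency and independently of one another (real genomes deviate from this). The function name `expected_site_spacing` is illustrative only.

```python
def expected_site_spacing(site_length, alphabet_size=4):
    """Average number of nucleotides between occurrences of a recognition
    sequence of the given length, assuming equiprobable, independent bases
    (a simplifying assumption)."""
    return alphabet_size ** site_length


if __name__ == "__main__":
    for length in (4, 6, 8):
        print(length, expected_site_spacing(length))
```

For recognition sequences of 4 and 6 nucleotides this gives 256 and 4096 nucleotides respectively, matching the figures above.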
In one embodiment, the recognition sequence of fragmenting step b) is of 3 (suitably, of 4, 5, 6, 7, 8, 9, 10, 11 or 12) nucleotides in length.
In one embodiment, the recognition sequence of fragmenting step b) is of 4 nucleotides in length.
In one embodiment, the recognition sequence of fragmenting step b) is of 5 nucleotides in length.
In one embodiment, the recognition sequence of fragmenting step b) is of 6 nucleotides in length.
In one embodiment, the recognition sequence of fragmenting step b) is of 7 nucleotides in length.
In one embodiment, the recognition sequence of fragmenting step b) is of 8 nucleotides in length.
In some embodiments, the fragmenting step b) comprises fragmenting with a restriction enzyme.
As used herein, the terms “restriction endonuclease” and “restriction enzyme” are used interchangeably to mean an enzyme that recognizes a specific nucleotide sequence (i.e. recognition sequence) in a double-stranded DNA molecule, and will cleave both strands of the DNA molecule at or near every recognition sequence, leaving a blunt end or a 3’- or 5’- overhanging end.
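Non-random fragmentation at a recognition sequence can be illustrated in silico. The following is a minimal sketch which models only one strand and ignores overhang structure; the function name `digest` and the `cut_offset` parameter (the 0-based position within the recognition sequence at which the modelled strand is cut) are assumptions introduced for illustration.

```python
def digest(seq, site, cut_offset=0):
    """Cut seq at every occurrence of the recognition sequence `site`.

    Only a single strand is modelled. cut_offset is the position within
    the site at which the strand is cut (0 = immediately before the site).
    Returns the resulting fragments in their original order.
    """
    seq = seq.upper()
    site = site.upper()
    cut_positions = []
    start = seq.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = seq.find(site, start + 1)

    fragments = []
    previous = 0
    for cut in cut_positions:
        fragments.append(seq[previous:cut])
        previous = cut
    fragments.append(seq[previous:])
    # Drop empty fragments that arise when a site lies at the very start.
    return [fragment for fragment in fragments if fragment]


if __name__ == "__main__":
    print(digest("AAGATCTTGATCCC", "GATC"))
```

With `cut_offset=0`, every fragment after the first begins with the recognition sequence, which is the sense in which the fragment ends are known.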
In some embodiments, the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a clustered regularly interspaced short palindromic repeats (CRISPR)-CRISPR-associated protein (Cas) nuclease.
As used herein, the term “site-directed nuclease” means a DNA-cutting enzyme (nuclease) which is directed to recognize a predetermined specific nucleotide sequence (i.e. recognition sequence) and to cleave both strands of the DNA molecule at or near every recognition sequence. The site-directed nuclease may be engineered to target a desired recognition sequence. Suitably, the site-directed nuclease may be a zinc-finger nuclease (ZFN), transcription activator-like effector nuclease (TALEN), Argonaute (Ago) protein (Song et al., Nucleic Acids Research; 2020; 48(4); e19) or a CRISPR-Cas nuclease (e.g. Cas9).
Preferably, the site-directed nuclease is a CRISPR-Cas nuclease.
CRISPR is a family of DNA sequences found in the genomes of prokaryotic organisms such as bacteria and archaea which play a key role in the anti-bacteriophage defence system of prokaryotes and provide a form of acquired immunity. As the name suggests, CRISPR loci (also termed CRISPR arrays) comprise regularly spaced repeat sequences with each repeat separated by a unique spacer sequence. These spacer sequences are typically derived from the genome of invading bacteriophages or extrachromosomal DNA (e.g. plasmids) and are used to detect and destroy DNA having a similar sequence during subsequent invasions.
A CRISPR array is accompanied by a set of homologous genes that make up the CRISPR-associated (cas) genes. To date, 93 cas genes have been identified and grouped into 35 families based on sequence similarity of the encoded proteins. The Cas proteins comprise helicase and/or nuclease motifs, and are involved in the maintenance and formation of the dynamic structure of the CRISPR loci as well as recognition and degradation of invading DNA.
There are several CRISPR system subtypes. Type II CRISPR-Cas systems require a transactivating CRISPR RNA (tracrRNA) which plays a role in the maturation of CRISPR RNAs (crRNAs). crRNAs are transcribed from the CRISPR locus; specifically, the spacer sequences are used to generate crRNA. tracrRNA is a small trans-encoded RNA which is partially complementary to and base pairs with a pre-crRNA, thereby forming an RNA duplex. This duplex is cleaved by the RNA-specific ribonuclease RNase III to form a crRNA/tracrRNA hybrid which acts as a guide RNA (gRNA). By contrast, Type V CRISPR-Cas systems require the crRNA but not tracrRNA. In such systems, crRNAs are transcribed from the CRISPR locus and then incorporated into effector complexes comprising a CRISPR-Cas nuclease, where the crRNA acts as a gRNA. Thus, depending upon the CRISPR/Cas system, either a crRNA alone or a crRNA/tracrRNA hybrid forms the gRNA. A gRNA guides the Cas enzyme to destroy DNA (e.g. bacteriophage DNA or plasmids) having a specific sequence (i.e. guides the Cas nuclease to provide immunity against repeat invasions). To achieve this, the gRNA associates with a CRISPR-Cas nuclease and directs the enzyme to recognize and cleave the DNA target complementary to the gRNA sequence.
The CRISPR-Cas nuclease will typically bind to and cleave the DNA sequence adjacent to a protospacer adjacent motif (PAM). A PAM is a DNA sequence of 2 to 6 base pairs. PAMs were initially identified as a consensus sequence found adjacent to the protospacer (the sequence in the bacteriophage or plasmid which will form the spacer in the CRISPR locus). There is a close correlation between the sequence identity of the PAM and the CRISPR subtype. The PAM is a component of the invading bacteriophage or plasmid, but is not found in the bacterial host genome and hence is not a component of the bacterial CRISPR locus. Thus, the bacterial CRISPR locus does not contain a PAM sequence, and thus will not be cut by the CRISPR-Cas nuclease, but the protospacer in the invading virus or plasmid (or the target sequence in e.g. genome editing approaches) will contain the PAM sequence, and thus will be cleaved by the nuclease. As such, the PAM is a targeting component which prevents the CRISPR locus from being targeted and destroyed by the CRISPR-Cas nuclease. To improve the ability of CRISPR-Cas nucleases to target DNA sequences at any desired genome location, the nucleases can be engineered to recognize different PAMs.
Some CRISPR-Cas nucleases of Type II and Type V systems have been exploited in gene editing technologies, e.g. CRISPR-associated protein 9 (Cas9) and CRISPR-associated protein 12a (Cas12a).
The native Cas9 endonuclease is a four-component system that includes the crRNA and tracrRNA. The Cas9 endonuclease has been re-engineered into a two-component system by fusing the two RNA molecules into a single gRNA. By manipulating the nucleotide sequence of the gRNA, the synthetic Cas9 system can be directed to target any DNA sequence for cleavage. Cas9 provides a “blunt” cut in the target DNA strand. Cas9 cuts the DNA 3 base pairs upstream of the PAM site, such that the NHEJ pathway results in indel mutations that destroy the recognition sequence in a cell, thereby preventing further rounds of cutting. The canonical Cas9 PAM is the sequence 5'-NGG-3', where "N" is any nucleobase followed by two guanine ("G") nucleobases.
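The relationship between the protospacer, the 5'-NGG-3' PAM and the Cas9 cut position described above can be sketched as follows. This is a simplified illustration which searches one strand only and requires an exact match; the function name `cas9_cut_positions` and its return convention (0-based index of the first base downstream of the blunt cut) are assumptions, not part of any published tool.

```python
def cas9_cut_positions(seq, protospacer):
    """Return 0-based cut positions for a protospacer on the given strand.

    A site is reported only if the protospacer is immediately followed by
    a canonical 5'-NGG-3' PAM. The blunt cut is placed 3 base pairs
    upstream of the PAM, i.e. the returned index is the first base after
    the cut. The opposite strand and mismatches are not modelled.
    """
    seq = seq.upper()
    protospacer = protospacer.upper()
    cuts = []
    start = seq.find(protospacer)
    while start != -1:
        pam_start = start + len(protospacer)
        pam = seq[pam_start:pam_start + 3]
        if len(pam) == 3 and pam[1:] == "GG":
            cuts.append(pam_start - 3)
        start = seq.find(protospacer, start + 1)
    return cuts


if __name__ == "__main__":
    target = "ACGTACGTACGTACGTACGT"  # hypothetical 20-nt protospacer
    print(cas9_cut_positions("TT" + target + "TGG" + "AA", target))
```

A protospacer that is not followed by a PAM yields no cut positions, reflecting the PAM requirement described above.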
The nuclease Cas12a (formerly known as Cpf1) requires only the crRNA for successful targeting and provides a 3’ or 5’ overhanging end in the double stranded target DNA following cleavage. Thus, Cas12a may be more suitable for multiplexed CRISPR than Cas9, as more of the small crRNAs can be packaged in one vector than can Cas9's gRNAs. In addition, Cas12a cleaves DNA 18-23 base pairs downstream from the PAM site. This means there is no disruption to the recognition sequence after repair in a cell, and so Cas12a enables multiple rounds of DNA cleavage. Moreover, the sticky 5' end left by Cas12a can be used for DNA assembly or ligation that is much more target-specific than traditional restriction enzyme cloning. Similarly to Cas9, Cas12a can be directed to target any DNA sequence for cleavage by manipulating the nucleotide sequence of the gRNA.
In some embodiments, step b) comprises fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence using (synthetic) Cas9 or Cas12a. Methods for manipulating a gRNA sequence to direct a CRISPR-Cas nuclease to target a desired DNA sequence for cleavage are known in the art (see, for example, Jinek M, et al. (2012) Science 337: 816-821; and Kim H et al., (2017) Nature Communications 8: 14406). Thus, designing a suitable gRNA and directing a CRISPR-Cas nuclease to target a desired DNA sequence for cleavage is within the ambit of the skilled person.
Ligation
In the next step c), the fragments are ligated.
As used herein, “ligating” involves the joining of separate DNA fragments. The DNA fragments may be blunt ended, or may have compatible overhangs (also termed sticky overhangs or sticky ends) such that the overhangs can hybridise with each other. The joining of the DNA fragments may be enzymatic, with a ligase enzyme, DNA ligase. However, a non-enzymatic ligation may also be used, as long as DNA fragments are joined, i.e. forming a covalent bond. Typically a phosphodiester bond between the hydroxyl and phosphate group of the separate strands is formed.
Since a fragment comprising a target nucleotide sequence may be crosslinked to multiple other DNA fragments, more than one DNA fragment may be ligated to the fragment comprising the target nucleotide sequence. This may result in combinations of DNA fragments which are in proximity of each other as they are held together by the cross links. Different combinations and/or order of the DNA fragments in ligated DNA fragments may be formed. In case the DNA fragments are obtained via enzymatic restriction or using a site-directed nuclease, the recognition sequence of the restriction enzyme or site-directed nuclease is known. This makes it possible to identify the fragments, as remains of recognition sequences, or reconstituted recognition sequences, may indicate the separation between different DNA fragments.
Irrespective of what fragmenting method is used, the ligation step c) may be performed in the presence of an adaptor, ligating adaptor sequences in between fragments. Alternatively the adaptor may be ligated in a separate step. This is advantageous because the different fragments can be easily identified by identifying the adaptor sequences which are located in between the fragments. For example, in case DNA fragment ends were blunt ended, the adaptor sequence would be adjacent to each of the DNA fragment ends, indicating the boundary between separate DNA fragments.
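The use of adaptor sequences as boundary markers can be illustrated with a minimal sketch. It assumes an exact, single-orientation adaptor sequence that does not occur within the fragments themselves; the function name `split_on_adaptor` and the example sequences are hypothetical.

```python
def split_on_adaptor(ligation_product, adaptor):
    """Split a ligation product into its constituent fragments at each
    exact occurrence of the adaptor sequence (one orientation only)."""
    pieces = ligation_product.upper().split(adaptor.upper())
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    adaptor = "GTCAGTCAGT"  # hypothetical 10-bp adaptor
    product = "AAAC" + adaptor + "TTTA" + adaptor + "CCAT"
    print(split_on_adaptor(product, adaptor))
```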
As used herein, the term “adaptor” refers to a short double-stranded oligonucleotide molecule with a limited number of base pairs, e.g. about 10 to about 30 base pairs in length, which is designed such that it can be ligated to the ends of fragments. Adaptors are generally composed of two synthetic oligonucleotides which have nucleotide sequences which are partially complementary to each other. When mixing the two synthetic oligonucleotides in solution under appropriate conditions, they will anneal to each other forming a double-stranded structure. After annealing, one end of the adaptor molecule may be designed such that it is compatible with the end of a restriction fragment and can be ligated thereto; the other end of the adaptor can be designed so that it cannot be ligated, but this need not be the case, for instance when an adaptor is to be ligated in between DNA fragments.
Reversing the crosslinking
Next, the crosslinking is reversed in step d), which results in a pool of ligated DNA fragments that comprise two or more fragments. A subpopulation of the pool of ligated DNA fragments comprises a DNA fragment which comprises the target nucleotide sequence. By reversing the crosslinking, the structural/spatial fixation of the DNA is released and the DNA sequence becomes available for subsequent steps, e.g. amplification and/or sequencing, as crosslinked DNA may not be a suitable substrate for such steps. The subsequent step e) may be performed after the reversal of the crosslinking, however, step e) may also be performed while the ligated DNA fragments are still in the crosslinked state.
Thus, in some embodiments, step e) is performed prior to step d). Step e) may be performed following step c) and prior to step d). Suitably, the method may comprise the steps of:
a) providing a sample of crosslinked DNA;
b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence;
c) ligating the fragmented crosslinked DNA generated in step b);
e) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
d) reversing the crosslinking in the mixture of digested and undigested crosslinked DNA generated in step e);
f) (i) degrading the digested DNA generated in step d) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step d) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and
g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
As used herein, “reversing crosslinking” comprises breaking the crosslinks such that the DNA that has been crosslinked is no longer crosslinked and is suitable for subsequent amplification and/or sequencing steps. For example, performing a proteinase K treatment on sample DNA that has been crosslinked with formaldehyde will digest the protein present in the sample. Because the crosslinked DNA is connected indirectly via protein, the protease treatment in itself may reverse the crosslinking between the DNA. However, the protein fragments that remain connected to the DNA may hamper subsequent sequencing and/or amplification. Hence, reversing the connections between the DNA and the protein may also result in “reversing crosslinking”. The DNA-crosslinker-protein connection may be reversed through a heating step, for example by incubating at 70°C. Since large amounts of protein are present in sample DNA, it is often desirable to digest the protein with a protease in addition. Hence, any “reversing crosslinking” method may be contemplated wherein the DNA strands that are connected in a crosslinked sample become suitable for sequencing and/or amplification.
Digesting the DNA using a nuclease
Prior to enrichment and sequencing steps, the ligation products generated in step d) are then digested with at least one nuclease (e.g. a site-directed nuclease such as a CRISPR-Cas nuclease).
As used herein, the term “digesting” means a process by which a polynucleotide chain is cleaved by a nuclease at a specific site or specific sites which are dictated by the nucleotide sequence. Suitably, the specific site(s) is/are fusion sequences as described herein. Thus, the digesting step provides linearized DNA which has been cleaved by the nuclease. Said linearized DNA may subsequently be degraded using an exonuclease as described herein.
In physical proximity protocols (e.g. TLA protocol), ligation products consisting of DNA fragments that were originally adjacent to each other in the linear sample DNA sequence are often over-represented in the sequenced material. Some ligation products may also comprise DNA fragments that were originally immediately adjacent to each other in the linear sample DNA sequence, i.e. a sequence which originally occurred in the linear sample DNA sequence may be reformed during the method steps. Such ligation products comprising DNA fragments that were originally (immediately) adjacent to each other are unlikely to contain sequences of the genomic region of interest originating from meaningful physical distances. The deselection of these less valuable ligation products will thus increase the efficiency with which an entire region of interest is enriched and can be sequenced.
The nuclease may be specifically selected or designed to target one or more known fusion sequences resulting from ligation events of the DNA fragment containing the target nucleotide sequence and DNA fragments that originally occurred in immediate proximity to the target nucleotide sequence in the original sequence (e.g. in the genomic region of interest or, in the context of transgene insertion site sequencing, in the vector sequence). In this way, the enrichment and sequencing of ligation products containing these fusion sequences of ligation products which are less valuable can be minimised, or preferably prevented.
In physical proximity protocols (e.g. TLA protocol), some partially digested DNA sequences comprising the recognition sequence may remain following the non-random fragmentation step b). For example, a partially digested DNA sequence comprising two DNA fragment ends and potential DNA fragments A and B separated by the recognition sequence may remain following step b). The two DNA fragment ends will ligate to each other or to another DNA fragment end in ligation step c), resulting in ligation products which comprise a sequence that was originally present in the linear sample DNA sequence and was not cleaved during the fragmenting step b).
Thus, the nuclease may digest ligation products comprising DNA fragments that were originally adjacent to each other in the linear sample DNA sequence, ligation products comprising DNA fragments that were originally immediately adjacent to each other in the linear sample DNA sequence (i.e. reconstituted DNA sequences) and/or DNA sequences which were not cleaved in fragmenting step b).
This digestion step will result in the digestion of the relatively large number of ligation products that, without this step, result in high sequencing coverage across sequences in immediate vicinity to the target nucleotide sequence. The methods of the invention thus provide more efficient sequencing of an entire genomic region of interest by enabling more preferential sequencing of DNA fragments that originated from larger physical distances from the target nucleotide sequence. By “ligation event” is meant the ligation of two or more DNA fragments to form a ligation product.
As used herein, the term “fusion sequence” refers to the sequence of the site at which two DNA fragments are ligated, i.e. the bridging DNA sequence joining the two DNA fragments in a ligation product. In the context of the present invention, the fusion sequence comprises the recognition site of step b) and a first and second flanking sequence, each of which corresponds to a portion of one of the DNA fragments forming the ligation product (i.e. the portion of the DNA fragment which is immediately adjacent to the recognition sequence). The first and second flanking sequences correspond to a portion of a different DNA fragment forming the ligation product. Thus, each fusion sequence is specific for a pair of DNA fragments which have been ligated. In this way, particular ligation products can be targeted via the fusion sequence.
Since the fragmentation step b) is a non-random fragmentation, the recognition sequence is known. As a result, the potential ligation products generated in step c) for a given genomic region of interest or a given viral vector sequence are known and the associated potential fusion sequences are also known. Thus, specific ligation products which are less valuable can be targeted. For example, in the sequencing of an allele of interest, ligation products comprising DNA fragments from a corresponding allele which is less valuable can be targeted. As such, the known sequence of the genomic region of interest (comprising the allele which is less valuable, e.g. a wild type allele) is exploited in the targeting steps. Alternatively, in the sequencing of a transgene integration site, ligation products consisting (exclusively) of DNA fragments from the vector/transgene sequence can be targeted. This permits the targeting, predominantly, of ligation products formed from episomal copies of the viral vector and a small number of ligation products formed from integrated copies of the viral vector which are less valuable since they do not inform on the vector integration site.
In some embodiments, the fusion sequence of digesting step e) is of greater length than the recognition sequence of fragmenting step b). Suitably, the fusion sequence of digesting step e) comprises the recognition sequence of fragmenting step b) and at least 1 (suitably, 2, 3, 4, 5, 6, 7, 8, 9, 10 or 11) additional nucleotide at each end of the recognition sequence. The additional nucleotide(s) at each end of the recognition sequence are also termed “flanking sequences”.
The first and second flanking sequences may be of the same length or of two different lengths. Preferably, the first and second flanking sequences are of sufficient length that the fusion sequence is specific for a particular ligation product. In some embodiments, the fusion sequence is of 15 to 25 nucleotides in length, preferably of 20 nucleotides in length. Suitably, the fusion sequence is of 15, 16, 17, 18, 19, 20, 21, 22, 23, 24 or 25 nucleotides in length.
In some embodiments, the recognition sequence of fragmenting step b) is of 3 to 12 nucleotides in length and the fusion sequence is of 15 to 25 nucleotides in length.
In some embodiments, the recognition sequence of fragmenting step b) is of 4 to 8 nucleotides in length and the fusion sequence is of 20 nucleotides in length.
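The composition of a fusion sequence (first flanking sequence, recognition sequence, second flanking sequence) can be sketched as below. The sketch assumes that each fragment is supplied without the recognition sequence, that the recognition sequence is fully reconstituted on ligation, and that the flanking length is split as evenly as possible between the two sides; the function name `fusion_sequence` and these conventions are illustrative assumptions only.

```python
def fusion_sequence(upstream_fragment, downstream_fragment,
                    recognition_sequence, total_length=20):
    """Build a fusion sequence of total_length nucleotides.

    upstream_fragment ends immediately before the recognition sequence and
    downstream_fragment starts immediately after it (assumed convention).
    At least one flanking nucleotide is required on each side.
    """
    flank_total = total_length - len(recognition_sequence)
    if flank_total < 2:
        raise ValueError("fusion sequence must exceed the recognition "
                         "sequence by at least one nucleotide per side")
    first_flank_len = flank_total // 2
    second_flank_len = flank_total - first_flank_len
    if (len(upstream_fragment) < first_flank_len
            or len(downstream_fragment) < second_flank_len):
        raise ValueError("fragment too short to supply the flanking sequence")
    first_flank = upstream_fragment[-first_flank_len:]
    second_flank = downstream_fragment[:second_flank_len]
    return (first_flank + recognition_sequence + second_flank).upper()


if __name__ == "__main__":
    print(fusion_sequence("AAAAACCCCCGGGGG", "TTTTTAAAAACCCCC", "GATC"))
```

For a 4-nucleotide recognition sequence and a 20-nucleotide fusion sequence, the sketch takes 8 flanking nucleotides from each fragment; the flanking sequences may equally be of two different lengths, as noted above.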
In some embodiments, the digesting step e) uses a multiplex nuclease digestion which recognises a plurality of specific fusion sequences. In some embodiments, the digesting step e) uses a multiplex nuclease digestion which recognises a plurality of specific fusion sequences, wherein the first and second flanking sequences of each fusion sequence are from a different pair of separate DNA fragments of step b). Suitably, the plurality of specific fusion sequences are each present within a different ligation product.
A multiplex nuclease digestion may be performed which recognises all possible fusion sequences resulting from the ligation of two DNA fragments from the genomic region of interest or, in the context of transgene integration site sequencing, from the ligation of two DNA fragments of viral vector origin.
For example, to preferentially linearize ligation products originating from episomal vector copies, a multiplex CRISPR-Cas nuclease digestion is performed specific for all possible fusion sequences resulting from the ligation of DNA fragment ends of viral origin. In the case of a viral vector comprising DNA sequences A, B, C and D each separated by the same recognition sequence, a multiplex CRISPR-Cas nuclease digestion is performed specific for all possible fusion sequences of viral origin, i.e. for the ligation products AB, AC, AD, BC, BD and CD. By contrast, ligation products comprising a sequence of viral origin (e.g. DNA fragment B) and a sequence from the host genome would not be digested.
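The set of pairwise ligation products listed in the example above can be enumerated programmatically. The following minimal sketch treats each pair as unordered and ignores fragment orientation, as in the example; the function name `pairwise_ligation_products` is illustrative.

```python
from itertools import combinations


def pairwise_ligation_products(fragment_names):
    """List every unordered pair of distinct fragments, e.g. 'AB' for the
    ligation of fragments A and B. Orientation is not modelled."""
    return ["".join(pair) for pair in combinations(fragment_names, 2)]


if __name__ == "__main__":
    print(pairwise_ligation_products(["A", "B", "C", "D"]))
```

For n fragments of vector origin this gives n(n-1)/2 pairs, i.e. six pairs for the four fragments A to D, each of which would be associated with its own fusion sequence in a multiplex digestion.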
As used herein, the term “recognises a fusion sequence” means that the nuclease is directed to the fusion sequence and cleaves at or near the fusion sequence. Thus, the nuclease may be specific for a fusion sequence.
By “specific for a fusion sequence” is meant that the nuclease preferentially binds to the fusion sequence as opposed to another DNA sequence of the same length. Thus, the nuclease may be designed to specifically target a known fusion sequence.
In some embodiments, the nuclease of digesting step e) is a restriction enzyme. In some embodiments, the nuclease of digesting step e) is a site-directed nuclease. Suitably, a site-directed nuclease as described herein is used.
In some embodiments, the site-directed nuclease for use in the digesting step e) is a CRISPR-Cas nuclease. Suitably, a CRISPR-Cas nuclease as described herein is used.
In some embodiments, step e) comprises digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a site-directed nuclease which is specific for a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b). Preferably, the site-directed nuclease is a CRISPR-Cas nuclease.
In some embodiments, the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the genomic region of interest. Suitably, the first and second flanking sequences are from two separate DNA fragments which occur immediately adjacent to one another in the linear DNA sequence of the genomic region of interest. Suitably, the first and second flanking sequences are from two separate DNA fragments which occur within the linear DNA sequence of the genomic region of interest. Accordingly, the first and second flanking sequences are from two separate DNA fragments which occur within a base pair (bp) distance of the linear DNA sequence of the genomic region of interest which corresponds to the length of the entire genomic region of interest. For example, if a 10 kb genomic region of interest comprising a target nucleotide sequence is selected, then the first and second flanking sequences may be from two separate DNA fragments which occur within up to 10 kb of one another in the linear genomic DNA sequence.
Suitably, the first and second flanking sequences may be from two separate DNA fragments which occur within about 250 bp, 500 bp, 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb, 8 kb, 9 kb or 10 kb of one another in the linear genomic DNA sequence, preferably within about 5 kb of one another in the linear genomic DNA sequence.
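The selection of fragment pairs by linear distance can be sketched as follows. The sketch assumes that each fragment is described by hypothetical (start, end) coordinates in the linear sequence and that the distance between two fragments is the gap between the end of the upstream fragment and the start of the downstream fragment; other distance definitions are equally possible.

```python
from itertools import combinations


def pairs_within(fragments, max_distance):
    """Return the pairs of fragments lying within max_distance base pairs.

    fragments maps a fragment name to (start, end) coordinates. The
    distance is taken as the gap between the two fragments (an assumed
    definition); adjacent fragments therefore have a distance of 0.
    """
    ordered = sorted(fragments.items(), key=lambda item: item[1][0])
    selected = []
    for (name_a, (_, end_a)), (name_b, (start_b, _)) in combinations(ordered, 2):
        gap = max(0, start_b - end_a)
        if gap <= max_distance:
            selected.append((name_a, name_b))
    return selected


if __name__ == "__main__":
    example = {"A": (0, 1000), "B": (1000, 3000),
               "C": (3000, 9000), "D": (9000, 12000)}
    print(pairs_within(example, 5000))
```

Fusion sequences would then be designed only for the selected pairs, so that ligation products joining more distant fragments remain undigested.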
In some embodiments, in the context of vector insertion site sequencing, the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the vector. Suitably, the first and second flanking sequences are from two separate DNA fragments which occur immediately adjacent to one another in the linear DNA sequence of the vector. Suitably, the first and second flanking sequences are from two separate DNA fragments which occur within the linear DNA sequence of the vector. Accordingly, the first and second flanking sequences are from two separate DNA fragments which occur within a base pair (bp) distance of the linear DNA sequence of the vector which corresponds to the length of the entire vector sequence. For example, if the entire vector sequence comprising a target nucleotide sequence is 10 kb in length, then the first and second flanking sequences may be from two separate DNA fragments which occur within up to 10 kb of one another in the linear vector sequence.
Suitably, the first and second flanking sequences may be from two separate DNA fragments which occur within about 250 bp, 500 bp, 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb, 8 kb, 9 kb or 10 kb of one another in the linear vector sequence, preferably within about 5 kb of one another in the linear vector sequence.
In some embodiments, the method further comprises the steps of:
A’) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using the nuclease of step e); and
B’) ligating the mixture of digested and undigested DNA generated in step A’); prior to step d).
In some embodiments, the further steps A’) and B’) are repeated at least once. In this manner, DNA fragments have multiple opportunities to ligate to DNA fragments originating from greater physical distances in the crosslinked DNA sequence and this will help increase the ratio of undigested vs digested ligation products resulting from the final nuclease digestion step.
Degrading and/or enriching the DNA
Next, the method comprises degrading the digested DNA generated in step e) using an exonuclease; and/or enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence.
As used herein, the term “exonuclease” means an enzyme that cleaves successive nucleotides, one at a time, from the end of a polynucleotide chain. The cleavage can occur at either the 5’ or the 3’ end of the polynucleotide chain.
As used herein, the term “degrading the digested DNA” means breaking the bonds between the nucleotides in the digested polynucleotide chain (i.e. in the digested DNA) to completely cleave the digested polynucleotide chain into nucleotides. The digested DNA is linear. Therefore, the digested DNA has both a 5’ and a 3’ end such that the exonuclease can cleave the DNA as described herein.
Thus, step f) (i) leads to the degradation of the digested DNA generated in step e), such that the remaining undigested DNA is processed in the subsequent steps.
In some preferred embodiments, the method comprises degrading the digested DNA generated in step e) using an exonuclease.
In some preferred embodiments, universal adapters that prevent exonuclease based degradation of linear DNA molecules are ligated to ligated DNA sequences (e.g. ligated DNA sequences generated in step c)) prior to the digestion step e) and degradation step f).
In some preferred embodiments, the method comprises degrading the digested DNA generated in step e) using an exonuclease; and enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence.
As used herein, the term “enriching ... for DNA comprising the target nucleotide sequence” means a process by which the (absolute) amount and/or proportion of the DNA comprising the target nucleotide sequence is increased compared to the amount and/or proportion of DNA comprising the target nucleotide sequence in the starting material (i.e. in the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i)). In this regard, enrichment by amplification increases the amount and proportion of DNA comprising the target nucleotide sequence. Both enrichment by degradation and capture-based enrichment increase the proportion of DNA comprising the target nucleotide sequence.
The methods of the invention are compatible with a wide variety of enrichment approaches.
The DNA generated in step e) or step f) (i) comprising the target nucleotide sequence may be amplified using at least one oligonucleotide primer which hybridises to the target nucleotide sequence, and optionally at least one additional primer which hybridises to the at least one adaptor as the step of ligating an adaptor is optional.
Thus, in some embodiments, the enriching step f) (ii) comprises amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i). As used herein, the terms “oligonucleotide primers” or “primers” are used interchangeably, in general, to refer to strands of nucleotides which can prime the synthesis of DNA. DNA polymerase cannot synthesize DNA de novo without primers. A primer hybridises to the DNA, i.e. base pairs are formed. Nucleotides that can form base pairs, that are complementary to one another, are e.g. cytosine and guanine, thymine and adenine, adenine and uracil, guanine and uracil. The complementarity between the primer and the existing DNA strand does not have to be 100%, i.e. not all bases of a primer need to base pair with the existing DNA strand. From the 3’-end of a primer hybridised with the existing DNA strand, nucleotides are incorporated using the existing strand as a template (template directed DNA synthesis). The synthetic oligonucleotide molecules which are used in an amplification reaction may be referred to as “primers”.
As used herein, the term “amplifying” refers to a polynucleotide amplification reaction, namely, a reaction in which a population of polynucleotides is replicated from one or more starting sequences. Amplifying may refer to a variety of amplification reactions, including but not limited to polymerase chain reaction (PCR), linear polymerase reactions, nucleic acid sequence-based amplification, rolling circle amplification and like reactions.
In some embodiments, amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i) comprises using at least one primer which hybridises to the DNA fragment comprising the target nucleotide sequence generated in step b). In some embodiments, when the genomic region of interest comprises one or more further target nucleotide sequences, this step further comprises using a plurality of primers which each hybridises to the DNA sequence of one of the DNA fragments comprising the one or more further target nucleotide sequences generated in step b).
In some embodiments, at least one primer directs amplification towards the recognition sequence of step b).
In some embodiments, an identifier is included in the at least one primer.
As used herein, the term “identifier” refers to a short sequence that can be added to an adaptor or a primer or included in its sequence or otherwise used as a label to provide a unique identifier. Such a sequence identifier (or tag) can be a unique base sequence of varying but defined length, typically 4-16 bp, used for identifying a specific nucleic acid sample. For instance, 4 bp tags allow 4^4 = 256 different tags. Typical examples are ZIP sequences, known in the art as commonly used tags for unique detection by hybridization (Iannone et al. Cytometry 39:131-140, 2000). Identifiers are useful according to the invention, as by using such an identifier, the origin of a sample (e.g. a PCR sample) can be determined upon further processing. In the case of combining processed products originating from different nucleic acid samples, the different nucleic acid samples may be identified using different identifiers. For instance, as according to the invention sequencing may be performed using high throughput sequencing, multiple samples may be combined. Identifiers may then assist in identifying the sequences corresponding to the different samples. Identifiers may also be included in adaptors for ligation to DNA fragments, assisting in the identification of DNA fragment sequences. Identifiers preferably differ from each other by at least two base pairs and preferably do not contain two identical consecutive bases to prevent misreads. The identifier function can sometimes be combined with other functionalities such as adaptors or primers.
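By way of illustration only, the identifier constraints described above can be sketched in Python. The sketch assumes that "differ from each other by at least two base pairs" is read as a pairwise Hamming distance of at least 2 between tags of equal length; the function names and the greedy selection strategy are hypothetical and are not part of the described method.

```python
from itertools import product

BASES = "ACGT"

def has_adjacent_repeat(tag):
    """Return True if any two consecutive bases in the tag are identical."""
    return any(a == b for a, b in zip(tag, tag[1:]))

def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    return sum(x != y for x, y in zip(a, b))

def design_identifiers(length, min_distance=2):
    """Greedily select tags without adjacent repeats that differ pairwise
    in at least min_distance positions (an assumed reading of the text)."""
    chosen = []
    for bases in product(BASES, repeat=length):
        tag = "".join(bases)
        if has_adjacent_repeat(tag):
            continue
        if all(hamming(tag, other) >= min_distance for other in chosen):
            chosen.append(tag)
    return chosen

# 4 bp tags: 4^4 = 256 possible sequences in total.
all_tags = len(BASES) ** 4
# Tags without two identical consecutive bases: 4 * 3 * 3 * 3 = 108.
no_repeat = sum(
    1 for t in product(BASES, repeat=4) if not has_adjacent_repeat("".join(t))
)
tags = design_identifiers(4)
```

A greedy selection does not necessarily yield the largest possible set of tags; it merely yields one set that satisfies both preferences.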
In an alternative embodiment, in any of the methods as described herein, in step f) (ii) primers are used carrying a moiety, e.g. biotin, for the optional purification of (amplified) ligated DNA fragments through binding to a solid support. Capture-based enrichment using the moiety may then be performed as described below in the context of a hybridisation probe.
In one embodiment, the enriching step f) (ii) may comprise using inverse PCR on a circular template. Thus, the method may comprise the steps: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) (i) reversing the crosslinking in the ligated crosslinked DNA generated in step c);
(ii) circularising the ligated DNA generated in step d) (i); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a site-directed nuclease which is specific for a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) optionally, degrading the digested DNA generated in step e) using an exonuclease; and (ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence using PCR with inverse primers specific for the target nucleotide sequence; and g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
In an alternative embodiment, the enriching step f) (ii) may comprise using linear PCR. Suitably, prior to PCR, a site-directed nuclease-based enrichment as described herein can be performed, after which the linear amplification is performed with primers at either end of the digestion site.
Alternatively, universal adapters are ligated to both ends of the ligated DNA fragments (e.g. ligated DNA fragments generated in step c)), followed by step e) and optionally step f) (i), before enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence using PCR with combinations of a target nucleotide sequence-specific primers and primers specific for the adapters.
In some embodiments, the enriching step f) (ii) comprises capture-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
In one embodiment, the undigested DNA fragments comprising the target nucleotide sequence may be captured with a hybridisation probe (also termed a capture probe) that hybridises to a target nucleotide sequence. The hybridisation probe may be attached directly to a solid support, or may comprise a moiety, e.g. biotin, to allow binding to a solid support suitable for capturing biotin moieties (e.g. beads coated with streptavidin). In any case, the undigested DNA fragments comprising a target nucleotide sequence are captured thus allowing separation of ligation products comprising the target nucleotide sequence from ligation products not comprising the target nucleotide sequence. Hence, such a capture step allows enrichment for ligation products comprising the target nucleotide sequence. For a genomic region of interest comprising a target nucleotide sequence, at least one capture probe for the target nucleotide sequence may be used. For a genomic region of interest comprising a plurality of target nucleotide sequences, more than one probe may be used for multiple target nucleotide sequences (e.g. at least one probe for each target nucleotide sequence may be used). For example, one primer corresponding to 1 of 5 target nucleotide sequences may be used as a capture probe (A, B, C, D or E). Alternatively, the 5 primers may be used in a combined fashion (A, B, C, D and E) to capture the genomic region of interest.
In one embodiment, a capture probe may be used that hybridises to an adaptor sequence comprised in (amplified) undigested DNA fragments.
In one embodiment, an amplification step and capture step are combined, e.g. first performing a capture step and then an amplification step or vice versa.
Since capture-based enrichment protocols do not discriminate between digested and undigested DNA, in some preferred embodiments an exonuclease degradation step and capture-based enrichment step are combined, i.e. first performing an exonuclease degradation step and then a capture-based enrichment step.
Site-directed nuclease digestion can also be used for the selective amplification of ligation products of interest. Site-directed digestion can be used to selectively add adaptors to and enable the amplification of linear ligation products comprising the target nucleotide sequence.
Site-directed nuclease digestion can also be used for the selective linearization of ligation products of interest. If a proximity ligation protocol (e.g. a TLA protocol as described herein) is used for the generation of a circular DNA template (e.g. a circular TLA template), a site-directed nuclease can be used to selectively linearize TLA ligation products of interest. Once linearized, these sequences can be selectively enriched using PCR or capture-based approaches as described above. They can also be selectively sequenced, for example, with nanopore sequencing approaches which will very preferentially sequence linearized DNA molecules.
Thus, in some embodiments, the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested DNA using (inverse) PCR or capture-based enrichment.
In one embodiment, an amplification step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then an amplification step or vice versa. In one embodiment, a capture step and site-directed nuclease-based enrichment step are combined, e.g. first performing a site-directed nuclease-based enrichment step and then a capture step or vice versa.
After the steps of reversing the crosslinking and site-directed nuclease-based deselection, it can be beneficial to include a whole genome amplification step prior to any enrichment steps. In some of the analyses for which the present invention provides advantages, the number of alleles of interest (for example, the total number of integrated copies of a transgene in a population of cells) can be limited. Thus, in order to improve any further enrichment of the allele(s) of interest, it can be beneficial to perform whole genome amplification and thereby generate multiple copies of each allele of interest. Methods for amplifying the whole genome are known in the art (see, for example, Kittler et al. (2002) Analytical Biochemistry 300: 237-244; and Hard et al. (2021) bioRxiv 439527; doi: https://doi.org/10.1101/2021.04.13.439527).
As used herein, the term “whole genome amplification” means the use of a non-specific amplification technique which generates an amplified product that is completely representative of the initial starting material. Thus, when a whole genome amplification step is applied to a mixture of unique ligation products resulting from the conventional TLA protocol, each unique ligation product is expected to be amplified.
Thus, in some embodiments, the method further comprises the step of amplifying the whole genome using the digested DNA generated in step e), prior to step f).
There can be instances in which, after the steps of reversing the crosslinking and site-directed nuclease-based deselection, the following enrichment step can include a new round of TLA (i.e. repeating steps a) to d) after step e), therefore crosslinking, fragmenting, ligating and reversing the crosslinking again) prior to degrading and/or enriching step f). Such an approach can, for instance, be useful in populations of cells that contain episomal copies of a vector / transgene sequence and a limited number of integration events. After an initial round of deselection of episomal copies (i.e. after performing steps a)-e) once), an additional round of TLA on a larger number of copies of integration events will help increase the efficiency and completeness with which the rare integrated copies and integration sites can be sequenced.
Thus, in some embodiments, the method further comprises repeating steps a) to d) following step e) and prior to step f). Repeating the TLA steps could, for example, be applicable in the further analysis of the whole genome amplified product described above. Thus, the methods of the invention may further comprise the step of amplifying the whole genome using the digested DNA generated in step e) followed by repeating steps a)-d), prior to step f).
There can be instances in which it is advantageous to perform whole genome amplification of the ligation products generated after steps a) to d), and then perform multiple separate targeted enrichments directed to different target nucleotide sequences as described herein followed by a nuclease digestion step targeting a fusion sequence. Preferably, each target nucleotide sequence is located in proximity to an instance of the recognition sequence of step b), i.e. in proximity to a DNA fragment end, and the fusion sequence used in the digestion step is specific for the same fragment end. In this manner, a single ligation product can produce multiple amplicons that will all have originated from valuable ligation events. This approach is particularly advantageous in the analysis of transgene integration sites because this approach allows sequence information to be determined for all transgene sequences that have integrated and for the entire sequence originating from each individual integration site.
Accordingly, the invention can be applied in order to enable the enrichment of ligation products wherein only one fusion sequence at one end of a target nucleotide sequence remains undigested following site-directed nuclease-based digestions targeted at less valuable fusion sequences, e.g. the fusion sequence at the other end of the restriction fragment that comprises the target nucleotide sequence.
Ligation products comprising DNA fragments used as target nucleotide sequence in which structural changes have occurred will thus remain unfragmented in the digestion step e) at the fragment end in which the sequence change has occurred. For example, if a transgene sequence has integrated, the DNA fragment comprising the fusion sequence between the transgene and the host genome will not be digested by a digestion step targeting less valuable transgene-transgene ligation products due to the fact that the fusion with the host genome sequence will result in a novel fragment end not targeted in the digestion step e). In other words, ligation products which inform on the integration site will not be digested in digesting step e) and can thus be preferentially sequenced using the method of the invention.
Accordingly, in some embodiments, the method comprises the additional step of amplifying the ligated DNA generated in step d) by whole genome amplification following reversing the crosslinking in step d), followed by steps f) (ii), e), optionally f) (i) and (g) in that order. Thus, in some embodiments, the method comprises the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) (i) reversing the crosslinking in the ligated crosslinked DNA generated in step c);
(ii) amplifying the ligated DNA generated in step d) by whole genome amplification; f) (ii) enriching the amplified DNA generated in step d) (ii) for DNA comprising the target nucleotide sequence; e) digesting the enriched DNA generated in step f) (ii) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) optionally, degrading the digested DNA generated in step e) using an exonuclease; and g) determining at least part of the sequence of the remaining undigested DNA, preferably using high throughput sequencing.
In one embodiment, step f) (ii) comprises amplifying the amplified DNA comprising the target nucleotide sequence generated in step d) (ii) using a primer pair specific for one DNA fragment generated in step b) wherein the primers comprise adapters that prevent exonuclease digestion. Thus, the subsequent digestion in step e) of the linear PCR products comprising less valuable ligation products which comprise the specific DNA fragment for which the primers were designed enables the selective removal of less valuable PCR products prior to sequencing.
In one embodiment, steps d) (ii) to f) (i) are repeated at least once following the first instance of step f) (i) and prior to step g), wherein the further enrichment step f) (ii) enriches for a different target nucleotide sequence. This is advantageous in the analysis of integrated transgene sequences because all DNA fragments generated in step b) which comprise a transgene sequence can be used and thus all ligation events between a DNA fragment originating from the transgene and the host genome will contribute to generated sequence information.
Determining the sequence of ligated DNA fragments
Next, the sequence of at least part of the undigested DNA generated in step f) is determined. Suitably, the sequence of the undigested DNA may be determined.
The undigested DNA may be prepared as a DNA sequencing library and/or sequenced according to standard protocols. Conventional whole genome sequencing or high-throughput sequencing (e.g. NGS) approaches can be used. Determining the sequence is preferably performed using high throughput sequencing technology, as this is more convenient and allows a high number of sequences to be determined to cover the complete genomic region of interest.
In some embodiments, step g) is performed using whole genome sequencing. In particular, when the genomic region of interest comprises a transgene integration site, it may be desirable to sequence the whole genome.
In some embodiments, step g) comprises determining at least part of the sequence of the undigested DNA comprising the target nucleotide sequence. Suitably, step g) comprises determining the whole sequence of the undigested DNA comprising the target nucleotide sequence.
The sequence of at least part of the undigested DNA comprising the target nucleotide sequence may be determined. Suitably, the sequence of the undigested DNA comprising the target nucleotide sequence may be determined.
As used herein, the term “sequencing” refers to determining the order of nucleotides (base sequences) in a nucleic acid sample, e.g. DNA or RNA. Many techniques are available, such as Sanger sequencing and high throughput sequencing technologies such as those offered by Roche, Illumina and Thermo Fisher.
The step of determining the sequence of undigested DNA fragments preferably comprises high throughput sequencing. High throughput sequencing methods are well known in the art, and in principle any method may be contemplated to be used in the invention. High throughput sequencing technologies may be performed according to the manufacturer’s instructions (as e.g. provided by Roche, Illumina or Thermo Fisher). In general, sequencing adaptors may be ligated to the (amplified) undigested DNA fragments. In case the linear or circularized fragment is amplified, by using for example PCR as described herein, the amplified product is linear, allowing the ligation of the adaptors. Suitable ends may be provided for ligating adaptor sequences (e.g. blunt, complementary staggered ends). Alternatively, primer(s) used for PCR or other amplification method, may include adaptor sequences, such that amplified products with adaptor sequences are formed in the amplification step f) (ii). In case the circularized fragment is not amplified, the circularized fragment may be fragmented, preferably by using for example a restriction enzyme in between primer binding sites for the inverse PCR reaction, such that DNA fragments ligated with the DNA fragment comprising the target nucleotide sequence remain intact. Sequencing adaptors may also be included in step c) and the steps i)-iii) of the methods of the invention.
Preferably long reads may be generated in the high throughput sequencing method used. Long reads may allow reading across multiple DNA fragments within undigested DNA fragments (which contain ligated DNA fragments). This way, DNA fragments of step b) may be identified. DNA fragment sequences may be compared to a reference sequence and/or compared with each other.
Hence, it is not required to provide for a complete sequence of the undigested DNA fragments (i.e. ligation products). It is preferred to at least sequence across (multiple) DNA fragments, such that DNA fragment sequences are determined.
It may also be contemplated to read even shorter sequences, for instance, short reads of 50-100 nucleotides. In case a standard sequencing protocol would be used, this may mean that the information regarding the undigested DNA fragments may be lost. With short reads it may not be possible to identify a complete DNA fragment sequence. In case such short reads are contemplated, it may be envisioned to provide additional processing steps such that separate ligated DNA fragments when fragmented, are ligated or equipped with identifiers, such that from the short reads, contigs may be built for the ligated DNA fragments. Such high throughput sequencing technologies involving short sequence reads may involve paired end sequencing. By using paired end sequencing and short sequence reads, the short reads from both ends of a DNA molecule used for sequencing, which DNA molecule may comprise different DNA fragments, may allow coupling of DNA fragments that were ligated. This is because two sequence reads can be coupled spanning a relatively large DNA sequence relative to the sequence that was determined from both ends. This way, contigs may be built for the (amplified) undigested DNA fragments.
However, using short reads may be contemplated without identifying DNA fragments, because from the short sequence reads a genomic region of interest may be built, especially when the genomic region of interest has been amplified. Information regarding DNA fragments and/or separate genomic regions of interest (for instance of a diploid cell) may be lost, but DNA mutations may still be identified.
Thus, the step of determining at least part of the sequence of the (amplified) undigested DNA sequence may comprise short sequence reads, but preferably longer sequence reads are determined such that DNA fragment sequences may be identified. In addition, it may also be contemplated to use different high throughput sequencing strategies for the (amplified) undigested DNA fragments, e.g. combining short sequence reads from paired end sequencing with the ends relatively far apart with longer sequence reads. This way, contigs may be built for the (amplified) undigested DNA fragments.
When analyzing (short) sequence reads, it may be of interest to prevent sequencing the primers used in the enrichment step. Thus, in an alternative embodiment of the methods described herein, the primer sequence may be removed prior to the sequencing step g) (e.g. the high throughput sequencing step).
Thus, in one embodiment, the enrichment step f) (ii) comprises:
(a) (i) amplifying the undigested DNA fragments comprising the target nucleotide sequence generated in step e) or step f) (i) using at least one primer that preferably (1) contains a 5’ overhang carrying a type III restriction enzyme recognition site and (2) hybridises to the target nucleotide sequence; or
(ii) amplifying the undigested DNA fragments comprising the target nucleotide sequence generated in step e) or step f) (i) using at least one primer that preferably (1) contains a 5’ overhang carrying a type III restriction enzyme recognition site and (2) hybridises to the target nucleotide sequence and at least one primer which hybridises to the at least one adaptor when an adaptor is present;
(b) digesting the amplified nucleotide sequences of interest with a type III restriction enzyme, followed by a size selection step to remove the released double-strand primer sequences;
(c) fragmenting the DNA, preferably by sonication; and
(d) optionally, ligating double-stranded adaptor sequences needed for next generation sequencing, prior to step g).
In the methods of the invention, from determined sequences generated in step g), a contig may be built of the genomic region of interest. When sequences of the DNA fragments are determined, overlapping reads may be obtained from which the genomic region of interest may be built. By increasing the sample size, e.g. increasing the number of cells analysed, the reliability of the genomic region of interest that is built may be increased.
Thus, in some embodiments, the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step g).
As used herein, the term "contig" is used in connection with DNA sequence analysis, and refers to reassembled contiguous stretches of DNA derived from two or more DNA fragments having contiguous nucleotide sequences. Thus, a contig may be a set of overlapping DNA fragments that provides a (partial) contiguous sequence of a genomic region of interest. A contig may also be a set of DNA fragments that, when aligned to a reference sequence, may form a contiguous nucleotide sequence. For example, the term "contig" encompasses a series of (ligated) DNA fragment(s) which are ordered in such a way as to have sequence overlap of each (ligated) DNA fragment(s) with at least one of its neighbours. The linked or coupled (ligated) DNA fragment(s), may be ordered either manually or, preferably, using appropriate computer programs such as FPC, PHRAP, CAP3 etc, and may also be grouped into separate contigs.
Alternatively, when in step b) a plurality of subsamples is generated, using different restriction enzymes or site-directed nucleases, overlapping reads will also be obtained. By increasing the plurality of subsamples, the number of overlapping fragments will increase, which may increase the reliability of the contig of the genomic region of interest that is built. From these determined sequences which may overlap, a contig may be built. Alternatively, if sequences do not overlap, e.g. when a single restriction enzyme may have been used in step b), alignment of (undigested) DNA fragments with a reference sequence may allow to build a contig of the genomic region of interest.
In some embodiments, when the cell ploidy of the genomic region of interest is greater than 1, a contig is built for each ploidy.
In some embodiments, the step of building a contig comprises the steps of:
1) identifying the fragments of step b);
2) assigning the fragments to a genomic region;
3) building a contig for the genomic region.
In some embodiments, the step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step e) and coupling of the different ligation products to the identified fragments.
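Purely as an illustrative sketch, steps 2) and 3) may be pictured in Python as follows, assuming that the fragments identified in step 1) have already been aligned to a reference sequence and are available as hypothetical (chromosome, start, end) tuples with half-open coordinates. The function names and example coordinates are assumptions made for illustration and do not correspond to any particular software named herein.

```python
def assign_to_region(fragments, chrom, region_start, region_end):
    """Step 2 (sketch): keep the aligned fragments that overlap the region."""
    return [
        (start, end)
        for frag_chrom, start, end in fragments
        if frag_chrom == chrom and end > region_start and start < region_end
    ]

def build_contigs(intervals):
    """Step 3 (sketch): merge overlapping or touching intervals into contigs."""
    contigs = []
    for start, end in sorted(intervals):
        if contigs and start <= contigs[-1][1]:
            # The fragment overlaps (or abuts) the current contig: extend it.
            contigs[-1][1] = max(contigs[-1][1], end)
        else:
            # A gap in coverage: start a new contig.
            contigs.append([start, end])
    return [tuple(c) for c in contigs]

# Hypothetical aligned fragments (chromosome, start, end).
fragments = [
    ("chr1", 100, 400),
    ("chr1", 350, 800),
    ("chr7", 50, 90),
    ("chr1", 1200, 1500),
    ("chr1", 780, 1000),
]
in_region = assign_to_region(fragments, "chr1", 0, 2000)
contigs = build_contigs(in_region)
# contigs == [(100, 1000), (1200, 1500)]
```

In this example the uncovered stretch between positions 1000 and 1200 results in two separate contigs rather than one.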
In one embodiment, the invention may be used to provide for quality control of generated sequence information. In the analysis of the sequences as provided by a method of high throughput sequencing, sequencing errors may occur. A sequencing error may occur for example during the elongation of the DNA strand, wherein an incorrect (i.e. non-complementary to the template) base is incorporated in the DNA strand. A sequencing error is different from a mutation, as the original DNA which is amplified and/or sequenced would not comprise that incorrect base. According to the invention, DNA fragment sequences may be determined, with (at least part of) sequences of DNA fragments ligated thereto, which sequences may be unique. The uniqueness of the ligated DNA fragments as they are formed in step c) may provide for quality control of the determined sequence in step g). When undigested DNA fragments are amplified, and sequenced at a sufficient depth, multiple copies of the same unique (ligated) DNA fragment(s) will be sequenced. Sequences of copies that originate from the same original undigested DNA fragment may be compared and amplification and/or sequencing errors may be identified.
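The comparison of copies described above may be sketched, by way of example only, as a per-position majority vote in Python. The sketch assumes that reads have already been grouped by a key identifying the unique ligation product from which they originate (here the hypothetical labels "junction_1" and "junction_2") and that reads within a group are of equal length and aligned position by position; these simplifications are assumptions and not requirements of the method.

```python
from collections import Counter, defaultdict

def consensus_and_errors(reads):
    """Majority-vote consensus over copies of one original ligation product.

    Returns the consensus sequence and a list of discordant calls as
    (read_index, position, observed_base, consensus_base) tuples.
    """
    consensus = []
    errors = []
    for pos, column in enumerate(zip(*reads)):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
        for read_index, observed in enumerate(column):
            if observed != base:
                errors.append((read_index, pos, observed, base))
    return "".join(consensus), errors

def qc_by_ligation_product(tagged_reads):
    """Group reads by their (hypothetical) ligation-product key, then compare."""
    groups = defaultdict(list)
    for key, seq in tagged_reads:
        groups[key].append(seq)
    return {key: consensus_and_errors(seqs) for key, seqs in groups.items()}

tagged_reads = [
    ("junction_1", "ACGTACGT"),
    ("junction_1", "ACGTACGT"),
    ("junction_1", "ACGAACGT"),  # one copy carries a discordant base
    ("junction_2", "TTGCAATG"),
    ("junction_2", "TTGCAATG"),
]
report = qc_by_ligation_product(tagged_reads)
# report["junction_1"] == ("ACGTACGT", [(2, 3, "A", "T")])
```

A base seen in only one of several copies of the same ligation product is more likely an amplification or sequencing error than a mutation, whereas a base shared by all copies would be retained in the consensus.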
Further embodiments
Multiple target nucleotide sequences
According to the methods of the invention, from a sample of crosslinked DNA, the sequences of multiple genomic regions of interest may be determined.
Thus, in some embodiments, the sequences of a plurality of genomic regions of interest are determined.
For each genomic region of interest, a target nucleotide sequence is provided. In the enrichment step, corresponding primer(s) may be designed for each target nucleotide sequence. The multiple genomic regions of interest may also overlap, thereby increasing the size of the region of which the sequence may be determined. For instance, if a genomic region of interest comprising a target nucleotide sequence would typically comprise 1 MB, combining 5 partially overlapping genomic regions of interest, each with a corresponding target nucleotide sequence and with an overlap of 0.1 MB between neighbouring regions, would result in a sequence of 4.6 MB (0.9 + 3 * (0.1 + 0.8) + 0.1 + 0.9 = 4.6 MB), thereby greatly extending the size of the genomic region of interest of which the sequence may be determined or otherwise analysed. Multiple target nucleotide sequences at defined distances within a genomic region of interest may also be used to increase the average coverage and/or the uniformity of coverage across the genomic region.
Thus, in some embodiments, the genomic region of interest comprises one or more further target nucleotide sequences. Suitably, the genomic region of interest comprises 1, 2, 3, 4, 5, 6, 7, 8, 9 or 10 further target nucleotide sequences.
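The arithmetic in the example above can be checked with a short Python sketch. It assumes equally sized regions arranged in a row, with the same overlap between each pair of neighbouring regions; the function name is hypothetical.

```python
def combined_span_mb(n_regions, region_size_mb, overlap_mb):
    """Total span covered by n equally sized regions arranged in a row,
    where each pair of neighbouring regions shares overlap_mb."""
    if n_regions < 1:
        raise ValueError("at least one region is required")
    return n_regions * region_size_mb - (n_regions - 1) * overlap_mb

# Five regions of 1 MB with 0.1 MB overlap between neighbours.
total = combined_span_mb(5, 1.0, 0.1)

# The same total written as in the text: unique and shared stretches in order.
breakdown = 0.9 + 3 * (0.1 + 0.8) + 0.1 + 0.9

print(round(total, 1), round(breakdown, 1))
```

Both expressions give approximately 4.6 MB: five regions contribute 5 MB in total, from which the four shared stretches of 0.1 MB are subtracted once each.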
In one embodiment, a method for determining the sequence of a genomic region of interest comprising two target nucleotide sequences is provided. The enrichment step now uses not one target nucleotide sequence, but two.
This method may involve the same steps as outlined above up until the enrichment step. The enrichment step now uses not one target nucleotide sequence, but several. When amplification is used for the enrichment step, different primers are used in a PCR reaction, one primer for each target nucleotide sequence. When two primer binding sites from two of the target nucleotide sequences are present in a ligation product, the two primers will amplify the sequence in between the two primer binding sites provided that the primer binding sites have the right orientation. Having a circularized DNA fragment may be advantageous as the chance for the two primer binding sites having the right orientation is higher as compared to a linear DNA fragment (two out of four orientations will amplify, as compared to one in four for a linear ligated DNA fragment). By combining multiple target nucleotide sequences and corresponding primers in a single amplification, the chance that combinations of primers will produce an amplicon is increased.
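The orientation argument above may be illustrated with a simplified Python enumeration. The model is an assumption made for illustration: primer A is taken to bind upstream of primer B, each primer extends either towards increasing ("+") or decreasing ("-") coordinates, and a product is counted only when the two primers copy opposite strands and extend towards each other, which on a circular template is possible in either of the two opposite-strand arrangements.

```python
from itertools import product

def amplifies(dir_a, dir_b, circular):
    """Return True if primers A and B can generate an amplicon.

    dir_a, dir_b: '+' if the primer extends towards increasing coordinates,
    '-' otherwise. Primer A is assumed to bind upstream of primer B.
    """
    if dir_a == dir_b:
        # Both primers copy the same strand: no exponential product.
        return False
    if circular:
        # Primers extending in opposite directions always meet on a circle
        # (divergent primers meet the other way around, as in inverse PCR).
        return True
    # Linear template: only primers pointing towards each other amplify.
    return dir_a == "+" and dir_b == "-"

orientations = list(product("+-", repeat=2))
linear_ok = [o for o in orientations if amplifies(*o, circular=False)]
circular_ok = [o for o in orientations if amplifies(*o, circular=True)]
# linear_ok   == [('+', '-')]              -> one out of four
# circular_ok == [('+', '-'), ('-', '+')]  -> two out of four
```

The sketch ignores practical limits such as the maximum amplicon length, so it reflects only the counting of orientations given in the text.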
For example, different target nucleotide sequences can be used for a gene of interest (e.g. a transgene). A PCR may be performed by combining a primer from one target nucleotide sequence (also referred to as a viewpoint), e.g. target nucleotide sequence A, with a primer from another, e.g. B. Also, a PCR may be performed using a primer from each of the target nucleotide sequences A, B, C, D and E. As these target nucleotide sequences are in physical proximity of each other, performing such an amplification will enrich for the genomic region of interest, provided that the primer binding sites are present in ligation products such that an amplicon can be generated.
Hence, methods are provided for determining the sequence of a genomic region of interest according to the invention, wherein the genomic region of interest comprises one or more further target nucleotide sequences, and wherein in the amplification step a primer is provided that hybridises with the target nucleotide sequence and one or more primers are provided for the corresponding one or more further target nucleotide sequences, wherein the undigested DNA is amplified, linearized DNA is amplified or circularized DNA is amplified, using the primers.
In addition, an identifier may be included in at least one of the oligonucleotide primers of step f) (ii) when PCR is used. Identifiers may also be included in adaptor sequences, such as may be used for ligation in between fragments during the ligation steps c) and i)-iii). By including an identifier in the oligonucleotide primer, when analysing a plurality of samples or a plurality of subsamples of crosslinked DNA simultaneously, the origin of each sample may easily be determined. Samples or subsamples of crosslinked DNA may have been processed differently while the original sample of crosslinked DNA is the same, and/or samples of DNA may have been obtained for example from different organisms or patients. Identifiers allow the combination of differently processed samples when the processing of samples may converge, e.g. identical procedural steps are performed. Such convergence of processing may in particular be advantageous when the sequencing step g) involves high throughput sequencing.
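How identifiers allow pooled samples to be traced back to their origin after sequencing can be sketched as follows. The sketch assumes that the identifier sits at the very start of each read, that it is matched exactly, and that no identifier is a prefix of another; the identifier sequences, sample names and the function `demultiplex` are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, identifiers):
    """Group reads by the identifier found at their 5' end.

    identifiers maps an identifier sequence to a sample name. The identifier
    is trimmed from each assigned read; reads without a known identifier are
    collected under 'unassigned'.
    """
    binned = defaultdict(list)
    for read in reads:
        for tag, sample in identifiers.items():
            if read.startswith(tag):
                binned[sample].append(read[len(tag):])
                break
        else:
            # No identifier matched this read.
            binned["unassigned"].append(read)
    return dict(binned)


# Hypothetical identifiers for two differently processed subsamples.
tags = {"ACGT": "sample_1", "TGCA": "sample_2"}
pooled_reads = ["ACGTTTGACC", "TGCAGGATCA", "GGGGAAAA"]
print(demultiplex(pooled_reads, tags))
```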
Transgene integration site
As discussed above, the methods of the invention are particularly advantageous in the sequencing of integrated transgene sequences and transgene integration sites in samples in which episomal copies of the vector sequence also occur. Conventional targeted sequencing approaches and conventional physical proximity-based protocols cannot distinguish between integrated and episomal (i.e. non-integrated) copies of a vector/transgene sequence. This is a limitation in NGS-based analysis of gene-therapy products as the sequencing quality will often depend on the frequency with which integrations occur and whether these integrations result in the complete clean integration of vector sequences without structural changes.
The methods of the invention enable the targeted sequencing of integrated copies of the transgene sequence. In this regard, a target nucleotide sequence within the transgene is selected. The site-directed nuclease digestion of (all) possible fusion sequences resulting from the ligation of the DNA fragment containing the target nucleotide sequence and other DNA fragments of the vector/transgene sequence results in the preferential enrichment and sequencing of ligation products comprising the DNA fragment containing the target nucleotide sequence and DNA fragments originating from the host genome. This can then result in complete sequence information across integrated vector sequences and vector integration sites.
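The selection principle can be illustrated in silico. The sketch below is a simplified model, not the laboratory procedure: a ligation product is represented as an ordered list of fragment labels, and it is assumed that the nuclease cuts every junction joining two vector fragments. The labels and the function `escapes_digestion` are hypothetical.

```python
def escapes_digestion(ligation_product, vector_fragments):
    """Return True if no junction in the product joins two vector fragments."""
    junctions = zip(ligation_product, ligation_product[1:])
    return not any(
        left in vector_fragments and right in vector_fragments
        for left, right in junctions
    )


vector = {"V1", "V2"}
products = [
    ["V1", "V2"],                       # vector-vector junction (e.g. episomal copy)
    ["V1", "host_chr3"],                # vector-host junction (integrated copy)
    ["host_chr3", "V2", "host_chr3b"],  # vector fragment flanked by host DNA
]
kept = [p for p in products if escapes_digestion(p, vector)]
print(kept)
```

Under these assumptions only the products containing a vector fragment joined to host DNA remain available for enrichment and sequencing.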
Accordingly, in some embodiments, the genomic region of interest comprises a transgene integration site. The target nucleotide sequence may be any sequence of interest.
In some embodiments, the target nucleotide sequence comprises a transgene or a portion thereof. Suitably, the target nucleotide sequence comprises a portion of the transgene which is adjacent to the recognition site of step b). Suitably, the portion of the transgene is of sufficient length to permit specific enrichment of sequences comprising the target nucleotide sequence.
In some embodiments, the first and second flanking sequences are from two separate DNA fragments from the transgene and/or from the vector used to deliver the transgene.
In some embodiments: i) the first and second flanking sequences are each from a separate DNA fragment from the vector; ii) the first flanking sequence is from a DNA fragment from the vector and the second flanking sequence is from a DNA sequence from the transgene; or iii) the first and second flanking sequences are each from a separate DNA fragment from the transgene.
Thus, the site-directed nuclease of digesting step e) may be specific for a vector-vector, vector-transgene or transgene-transgene ligation event, i.e. the site-directed nuclease of digesting step e) may be specific for ligation products consisting (exclusively) of DNA fragments from the vector sequence.
Preferably, the recognition site of fragmenting step b) occurs once within the vector sequence. The fragmenting step b) will then generate two DNA fragments from the vector sequence. A site-directed nuclease which is specific for a single fusion sequence may then be used in step e).
The recognition site of fragmenting step b) may occur multiple times within the vector sequence. The fragmenting step b) will then generate more than two DNA fragments from the vector sequence. For example, if the recognition site of fragmenting step b) occurs three times in the vector sequence, the fragmenting step b) will generate four DNA fragments from the vector sequence (e.g. A, B, C and D). A multiplex site-directed nuclease approach which is specific for each fusion sequence within the ligation products consisting of DNA fragments of vector origin may then be used in step e). For example, in the case of a viral vector comprising DNA sequences A, B, C and D each separated by the same recognition sequence, a multiplex CRISPR digestion may be performed which is specific for all possible fusion sequences of viral origin, i.e. for the ligation products AB, AC, AD, BC, BD and CD.
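The list of pairwise ligation products to be targeted can be enumerated programmatically, which may be convenient when the vector yields more fragments. This is a minimal sketch that, like the example above, treats each pair of fragments as a single fusion and ignores fragment orientation and self-ligation; the function name `fusion_pairs` is hypothetical.

```python
from itertools import combinations


def fusion_pairs(fragments):
    """All unordered pairs of distinct vector fragments, e.g. 'AB' for A and B."""
    return ["".join(pair) for pair in combinations(fragments, 2)]


print(fusion_pairs("ABCD"))
```

For n fragments this yields n * (n - 1) / 2 pairs, so the number of guide designs in a multiplex digestion grows quadratically with the number of recognition sites in the vector.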
Allele
The methods of the invention provide improved efficiency of detection of large structural changes. For example, if the site-directed nuclease digestions of step e) are designed to be specific for ligation events of DNA fragments spanning a 10kb region in the wild-type genome sequence, ligation products resulting from alleles in which structural changes have occurred (e.g. an insertion or translocation) within this 10kb region of interest will more likely contain ligation events that will not be digested. These will therefore be more preferentially sequenced compared to ligation events originating from wild-type alleles in the methods of the invention.
Accordingly, in some embodiments, the genomic region of interest comprises an allele of interest.
In some embodiments, the target nucleotide sequence comprises an allele of the genomic region of interest or a portion thereof. Suitably, the portion of the allele is of sufficient length to permit specific enrichment of sequences comprising the target nucleotide sequence.
In other embodiments, a target nucleotide sequence which is in proximity to, but not within, the sequence of the allele of interest may be selected. In this way, the methods of the invention can be performed without requiring sequence information of the allele of interest.
In some embodiments, the first and second flanking sequences are from separate DNA fragments from the allele. Thus, the site-directed nuclease of digesting step e) is specific for an allele-allele ligation event.
As used herein, the term “allele(s)” means any of one or more alternative forms of a gene at a particular genomic locus. In a diploid cell of an organism, alleles of a given gene are located at a specific location, or locus (loci plural) on a chromosome. One allele is present on each chromosome of the pair of homologous chromosomes. Thus, in a diploid cell, two alleles and thus two separate (different) genomic regions of interest may exist.
Size selection
Prior to or after the enrichment step f) (ii), according to the methods of the invention, a size selection step may be performed. Such a size selection step may be performed using gel extraction chromatography, gel electrophoresis or density gradient centrifugation, which are methods generally known in the art. Preferably, DNA is selected of a size between 20-20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs. A size selection step allows selection of (amplified) ligated DNA fragments in a size range that may be optimal for PCR amplification and/or optimal for the sequencing of long reads by next generation sequencing. Sequencing of reads of >1000 nucleotides is currently commercially available, and recent advances such as the Single Molecule Real Time (SMRT™) DNA Sequencing technology developed by Pacific Biosciences (http://www.pacificbiosciences.com/) indicate that reads of beyond 10,000 nucleotides are possible. Using nanopore sequencing, even longer reads are generated (https://nanoporetech.com/).
As used herein, “size selection” involves techniques with which particular size ranges of molecules, e.g. (ligated) DNA fragments or amplified (ligated) DNA fragments, are selected. Techniques that can be used are for instance gel electrophoresis, size exclusion and gel extraction chromatography, but are not limited thereto. As long as molecules with a particular size can be selected, such a technique will suffice.
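An in silico analogue of size selection, e.g. for filtering sequences before further analysis, can be sketched as follows. The default bounds use the most preferred range of 100-3,000 base pairs stated above and are treated as inclusive; the function `select_by_size` is a hypothetical helper and does not replace the physical size selection step.

```python
def select_by_size(fragments, min_bp=100, max_bp=3000):
    """Keep only sequences whose length lies within [min_bp, max_bp]."""
    return [seq for seq in fragments if min_bp <= len(seq) <= max_bp]


# Four dummy fragments of 50, 100, 3000 and 3001 base pairs.
fragments = ["A" * 50, "C" * 100, "G" * 3000, "T" * 3001]
print([len(seq) for seq in select_by_size(fragments)])
```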
Further fragmentation
In some embodiments, the ligated DNA fragments generated in step c) may be further fragmented prior to digestion and enrichment. Such additional fragmentation (and ligation) steps may also be performed prior to step b). The further fragmentation may be a random or a non-random fragmentation as described herein. In this manner, rarer cutters (i.e. enzymes that fragment less frequently) can be used in step b) and more frequent (including random) fragmentation strategies can be used in a further fragmentation step. The digesting step e) is then based upon a lower (more manageable) number of possible ligation events from the non-random fragmenting step (e.g. step b)) and the further fragmenting step ensures the entire genomic region of interest can be enriched.
The non-random fragmenting step (e.g. step b)) and the optional further fragmenting step may be aimed at obtaining ligated DNA fragments of a size which is compatible with the subsequent enrichment step (e.g. amplification step) and/or sequence determination step. In addition, a further fragmenting step, preferably with an enzyme, may result in ligated fragment ends which are compatible with the optional ligation of an adaptor. The further fragmenting step may be performed after reversing the crosslinking, however, it is also possible to perform the further fragmenting step and/or ligation step while the DNA fragments are still crosslinked. At least one adaptor may be ligated to the obtained ligated DNA fragments generated in the further fragmenting step. The ends of the ligated DNA fragments need to be compatible with ligation of such an adaptor. As the ligated DNA fragments may be linear DNA, ligation of an adaptor may provide for a primer hybridisation sequence. The adaptor sequence ligated with ligated DNA fragments comprising the target nucleotide sequence will provide for DNA molecules which may be amplified using PCR as described herein.
Ligated adaptor sequences can also be used as described herein to prevent exonuclease-based digestion.
Preferably, the DNA is further fragmented with a restriction enzyme or site-directed nuclease as described herein.
Accordingly, in some embodiments, the method further comprises the steps of:
i) (a) further fragmenting the crosslinked DNA provided in step a) or the ligated crosslinked DNA generated in step c); and
(b) ligating the fragmented DNA generated in step i) (a), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step b) or prior to step d); or
ii) (a) further fragmenting the undigested DNA generated in step f); and
(b) optionally, circularising or ligating the fragmented DNA generated in step ii) (a), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
In one embodiment, the method further comprises the steps of:
(i) further fragmenting the crosslinked DNA provided in step a); and
(ii) ligating the fragmented DNA generated in step (i), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step b).
In one embodiment, the method further comprises the steps of:
(i) further fragmenting the ligated crosslinked DNA generated in step c); and
(ii) ligating the fragmented DNA generated in step (i), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step d).
In one embodiment, the method further comprises the steps of:
(i) further fragmenting the undigested DNA generated in step f); and
(ii) optionally, circularising or ligating the fragmented DNA generated in step (i), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
In some embodiments, the further steps i) (a) and i) (b), ii) (a) and ii) (b), or (i) and (ii) are repeated at least once (suitably, once, twice, three times, four times or five times). In this way, ligated DNA fragments of a size which is compatible with the subsequent steps may be obtained.
In some embodiments, the further steps i) (a) and (i) are performed using random fragmentation.
In some embodiments, the further steps i) (a), ii) (a) and (i) are performed using non-random fragmentation at a recognition sequence.
If both the non-random and further fragmenting steps comprise the use of restriction enzymes or site-directed nucleases, it is preferred that the recognition sequence of non-random fragmentation step b) is longer than the recognition sequence of the further fragmentation step. The enzyme of step b) thus cuts at a lower frequency than the enzyme of the further fragmentation step. This means that the average DNA fragment size of the further fragmentation step is smaller than the average fragment size generated in step b). In this way, relatively large fragments are formed and subsequently ligated in the non-random fragmenting step b), whereas the second enzyme of the further fragmentation step cuts more frequently than the enzyme of step b).
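The relationship between recognition sequence length and cutting frequency can be made concrete with a rough estimate. The sketch below assumes a random DNA sequence with equal base frequencies and a fully specified (non-degenerate) recognition sequence, in which case a site of length n is expected on average once every 4^n base pairs; real genomes deviate from this, and the function name `expected_fragment_bp` is hypothetical.

```python
def expected_fragment_bp(site_length: int) -> int:
    """Average spacing between sites of the given length in random DNA."""
    if site_length < 1:
        raise ValueError("recognition sequence must be at least 1 bp long")
    return 4 ** site_length


for n in (4, 6, 8):
    print(f"{n}-bp recognition sequence: about {expected_fragment_bp(n)} bp between sites")
```

Under this assumption a 6-bp recognition sequence in step b) gives fragments roughly 16 times larger on average than a 4-bp recognition sequence in the further fragmentation step.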
Thus, in some embodiments, the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence of the further fragmenting step i) (a) or step ii) (a). In some embodiments, the step d) of reversing the crosslinking may be performed after steps i) (a) and i) (b) and prior to the fragmenting step b). In ligation products in which no reshuffling of DNA fragments has occurred in step i) and/or step b), the flanking sequences of the fusion sequence resulting from non-random fragmentation of ligated DNA fragments in step b) are known since these sequences correspond to the original wild-type sequence of the genomic region of interest. Thus, the circularisation of the fragmented DNA generated in step b) or the ligated DNA generated in step c) will result in a known fusion sequence that can be targeted by nuclease digestion in step e) as described herein. Any fusion sequence(s) resulting from the circularisation of the fragmented DNA generated in step b) or the ligated DNA generated in step c) in which reshuffling of DNA fragments has occurred will comprise different flanking sequences compared to the flanking sequences of a fusion sequence within ligation products in which no DNA reshuffling has occurred. Therefore, due to the different flanking sequences, ligation products in which reshuffling of DNA fragments has occurred will not be digested in the targeted nuclease step e). In this manner, the efficiency with which a region of interest is sequenced can be further increased.
In one embodiment, the method comprises the steps of (wherein the steps are preferably performed in the order presented in this paragraph):
a) (i) providing a sample of crosslinked DNA;
(ii) fragmenting the crosslinked DNA provided in step a) (i);
(iii) ligating the fragmented DNA generated in step a) (ii);
d) reversing the crosslinking in the ligated crosslinked DNA generated in step a) (iii);
b) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence;
c) circularising the fragmented DNA generated in step b);
e) digesting the circularised DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence; and
g) determining at least part of the sequence of the undigested DNA remaining after step f), preferably using high throughput sequencing.
The non-random fragmentation of ligation products in which no DNA reshuffling has occurred during the initial steps of fragmenting and ligation on crosslinked DNA (i.e. during step a)) will result in linear DNA fragments with known ends. The circularisation of these DNA fragments with known ends will result in a known fusion sequence. It can therefore be beneficial to remove these circularised DNA products to minimise or prevent them from being enriched or sequenced.
A further fragmentation step using a random fragmentation strategy may result in the cleavage of a small number of copies of the target nucleotide sequence. However, sufficient copies of the target nucleotide sequence would remain in order to provide complete sequence information across the genomic region of interest comprising the target nucleotide sequence.
Circularized ligated fragments
In some embodiments, the method further comprises the step of circularising the DNA of step d), prior to step e). In this embodiment, the obtained ligated DNA fragments of step d), of which crosslinking has been reversed, are next circularized. It may be advantageous to reverse crosslinking before the circularization, because it may be unfavourable to circularize DNA while it is still crosslinked. However, circularization may also be performed while the ligated DNA fragments are crosslinked. It may even be possible that an additional circularization step is not required, as during the ligation step, circularized ligated DNA fragments are already formed, and hence the circularization step would occur simultaneously with step c). However, it is preferred to perform an additional circularization step. Circularization involves the ligation of the ends of the ligated DNA fragments such that a closed circle is formed.
The circularized DNA, comprising DNA fragments which comprise the target nucleotide sequence, may subsequently be enriched for circularised DNA comprising the target nucleotide sequence. For example, the circularized DNA, comprising DNA fragments which comprise the target nucleotide sequence, may be amplified using at least one primer which hybridises to the target nucleotide sequence. For the amplification step, reversing the crosslinking is required, as crosslinked DNA may hamper or prevent amplification. Preferably, two primers are used that hybridise to the target nucleotide sequence in an inverse PCR reaction. In this way, DNA fragments of the circularized DNA which are ligated to the DNA fragment comprising the target nucleotide sequence may be amplified.
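The inverse PCR principle can be illustrated with a toy string model. The sketch assumes exact primer matches, a single binding site per primer, and that the forward primer matches the top strand while the reverse primer is the reverse complement of a top-strand site located upstream of it, so that both primers point outwards from the target nucleotide sequence. All sequences and the function `inverse_pcr_product` are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def inverse_pcr_product(circle: str, fwd_primer: str, rev_primer: str):
    """Return the top-strand amplicon from a circular template, or None."""
    n = len(circle)
    doubled = circle + circle  # lets matches wrap around the arbitrary origin
    start = doubled.find(fwd_primer)
    if start == -1 or start >= n:
        return None
    rev_site = reverse_complement(rev_primer)
    site = doubled.find(rev_site, start + len(fwd_primer))
    if site == -1 or site + len(rev_site) > start + n:
        return None
    # From the forward primer, around the circle, up to the reverse primer site.
    return doubled[start:site + len(rev_site)]


target_left, target_right = "CCTGAAGT", "ACGGATTC"  # known target sequence
unknown = "TTTTGGGGCACA"                             # fragment ligated to the target
circle = target_left + target_right + unknown
amplicon = inverse_pcr_product(circle, target_right, reverse_complement(target_left))
print(amplicon)
```

In this model the amplicon starts and ends in the known target sequence and carries the ligated, previously unknown fragment in between.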
Method for making a DNA sequencing library
Library construction plays an important role for high-throughput NGS. A plethora of library construction methods have been developed, including the traditional ligation-based methods and the more recently developed transposase-based Nextera method. As described above, whilst significant improvements have been made in genome sequencing approaches, methodologies currently used for sequencing of transgene insertion sites suffer from various limitations. The methods of the invention overcome these limitations.
According to a second aspect of the invention, there is provided a method for making a DNA sequencing library of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of:
a) providing a sample of crosslinked DNA;
b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence to form fragmented crosslinked DNA;
c) ligating the fragmented crosslinked DNA generated in step b) to form ligated crosslinked DNA;
d) reversing the crosslinking in the ligated crosslinked DNA generated in step c) to form ligated DNA;
e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a site-directed nuclease which is specific for a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
f) (i) degrading the digested DNA generated in step e) using an exonuclease such that only undigested DNA remains; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) (i) for DNA comprising the target nucleotide sequence to enrich for undigested DNA comprising the target nucleotide sequence; and
g) optionally, determining at least part of the sequence of the undigested DNA to provide at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
Steps a) to g) may be performed as described herein.
Suitably, the method of the second aspect of the invention may be performed as described herein with respect to the method of the first aspect of the invention, except that step g) is optional.
Suitably, the method of the second aspect of the invention may be performed as described herein with respect to the method of the third aspect of the invention, except that step i) is optional.
In one embodiment, the methods of the invention further comprise the step of determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
As used herein, the term “DNA sequencing library” means a sequencing-ready DNA library. Thus, the methods of the invention generate a compatible library (e.g. an NGS compatible library) for sequencing applications.
In some embodiments, a DNA sequencing library of a plurality of genomic regions of interest is made.
Further method for determining the sequence of a genomic region of interest
In a third aspect, the invention provides a method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of:
a) providing a sample of crosslinked DNA;
b) fragmenting the crosslinked DNA of step a);
c) ligating the fragmented crosslinked DNA generated in step b);
d) reversing the crosslinking in the ligated crosslinked DNA generated in step c);
e) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence;
f) circularising the fragmented DNA generated in step e);
g) digesting the circularised DNA generated in step f) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step e) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
h) (i) degrading the digested DNA generated in step g) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step g) or the undigested DNA generated in step h) (i) for DNA comprising the target nucleotide sequence; and
i) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
The method of the third aspect of the invention may be performed as described herein with respect to the first aspect of the invention. Thus, the steps of the method of the third aspect of the invention can be carried out as described herein for the corresponding steps of the first aspect of the invention. In addition, all embodiments of the invention described herein with respect to the first aspect of the invention are applicable to the third aspect of the invention.
Identifying mutations
In alternative aspects of the invention, methods are provided for identifying the presence or absence of a genetic mutation.
In a first embodiment, a method is provided for identifying the presence or absence of a genetic mutation, comprising the steps a)-g) of any of the methods of the invention as described above, wherein contigs are built for a plurality of samples, comprising the further steps of: h) aligning the contigs of a plurality of samples; and i) identifying the presence or absence of a genetic mutation in the genomic regions of interest from the plurality of samples.
Alternatively, a method for identifying the presence or absence of a genetic mutation is provided, comprising the steps a)-g) of any of the methods of the invention as described above, comprising the further steps of: h) aligning the contig to a reference sequence; and i) identifying the presence or absence of a genetic mutation in the genomic region of interest.
Genetic mutations can be identified, for instance, by comparing the contigs of multiple samples. In case one (or more) of the samples comprises a genetic mutation, this may be observed as the sequence of its contig is different when compared to the sequence of the other samples, i.e. the presence of a genetic mutation is identified. In case no sequence differences between the contigs of the samples are observed, the absence of a genetic mutation is identified. Alternatively, a reference sequence may also be used to which the sequence of a contig may be aligned. When the sequence of the contig of the sample is different from the reference sequence, a genetic mutation is observed, i.e. the presence of a genetic mutation is identified. In case no sequence differences between the contig of the sample or samples and the reference sequence are observed, the absence of a genetic mutation is identified.
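The comparison of a contig with a reference sequence can be sketched for the simplest case. The sketch assumes that the two sequences have already been aligned and have equal length, so that only substitutions are reported; insertions, deletions and rearrangements would require a proper alignment tool. The function `find_differences` and the example sequences are hypothetical.

```python
def find_differences(contig: str, reference: str):
    """List (position, reference_base, contig_base) for each mismatch (0-based)."""
    if len(contig) != len(reference):
        raise ValueError("sequences must be aligned to equal length")
    return [
        (pos, ref_base, contig_base)
        for pos, (contig_base, ref_base) in enumerate(zip(contig, reference))
        if contig_base != ref_base
    ]


differences = find_differences("ACGTAC", "ACGAAC")
print("mutation present" if differences else "no mutation found", differences)
```

An empty list corresponds to the absence of a genetic mutation in the compared region; the same function can be applied to pairs of contigs from different samples.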
It is not required to build a contig for identifying the presence or absence of a genetic mutation. As long as DNA fragment sequences may be aligned, with each other or with a reference sequence, the presence or absence of a genetic mutation may be identified. Thus, in alternative embodiments of the invention, a method is provided for identifying the presence or absence of a genetic mutation, according to any of the methods as described above, without the further step of building a contig.
Such a method comprises the steps a)-g) of any of the methods as described above and the further steps of: h) aligning the determined sequences of the (amplified) undigested DNA fragments generated in step g) to a reference sequence; and i) identifying the presence or absence of a genetic mutation in the determined sequences.
Alternatively, a method is provided for identifying the presence or absence of a genetic mutation, wherein sequences of (amplified) undigested DNA fragments are determined for a plurality of samples, comprising the steps a)-g) of any of the methods as described above, comprising the further steps of: h) aligning the determined sequences (generated in step g)) of the (amplified) undigested DNA fragments of a plurality of samples; and i) identifying the presence or absence of a genetic mutation in the determined sequences.
Ratio of alleles or cells carrying a genetic mutation or transgene
As already mentioned above, when a sample of crosslinked DNA is provided from heterogeneous cell populations (e.g. cells with different origin or cells from an organism which comprises normal cells and genetically mutated cells (e.g. cancer cells)), for each genomic region of interest corresponding to different genomic environment (which may e.g. be different genomic environments from different alleles in a cell or different genomic environments from different cells) contigs may be built. In addition, the ratio of fragments or ligation products carrying an allele, transgene or genetic mutation may be determined, which may correlate to the ratio of alleles or cells carrying the genetic mutation or the transgene. Since the ligation of DNA fragments is a random process, the collection and order of DNA fragments that are part of the ligation products may be unique and represent a single cell and/or a single genomic region of interest from a cell.
Thus, identifying ligation products comprising the fragment with the allele, genetic mutation or transgene may also comprise identifying ligation products with a unique order and collection of DNA fragments. The ratio of alleles or cells carrying a genetic mutation or transgene may be of importance in evaluation of therapies, e.g. in case patients are undergoing therapy for cancer, such as gene therapy. Cancer cells may carry a particular genetic mutation or cells may carry a particular transgene. The percentage of cells carrying such a mutation or the transgene may be a measure for the success or failure of a therapy. In alternative embodiments, methods are provided for determining the ratio of fragments carrying an allele, genetic mutation or transgene, and/or the ratio of ligation products carrying a genetic mutation. In this embodiment, a genetic mutation is defined as a particular genetic mutation or a selection of particular genetic mutations.
In one aspect, a method is provided for determining the ratio of fragments carrying an allele, genetic mutation or transgene from a cell population suspected of being heterogeneous comprising the steps a)-g) of any of the methods as described above, comprising the further steps of: h) identifying the fragments of step b); i) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; j) determining the number of fragments carrying the allele, genetic mutation or transgene; k) determining the number of fragments not carrying the allele, genetic mutation or transgene; and
l) calculating the ratio of fragments carrying the allele, genetic mutation or transgene.
In another aspect, a method is provided for determining the ratio of ligation products carrying a fragment with an allele, genetic mutation or transgene from a cell population suspected of being heterogeneous comprising the steps a)-g) of any of the methods as described above, comprising the further steps of: h) identifying the fragments of step b); i) identifying the presence or absence of an allele, genetic mutation or transgene in the fragments; j) identifying the ligation products of step c) carrying the fragments with or without the allele, genetic mutation or transgene; k) determining the number of ligation products carrying the fragments with the allele, genetic mutation or transgene; l) determining the number of ligation products carrying the fragments without the allele, genetic mutation or transgene; and m) calculating the ratio of ligation products carrying the allele, genetic mutation or transgene.
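The counting steps can be sketched as follows. The sketch makes several assumptions that are not fixed by the text: a ligation product is represented as an ordered tuple of fragment labels, identical tuples are treated as copies of one original product and counted once (reflecting the unique order and collection of fragments discussed above), and the ratio is expressed as the carrying products divided by all informative products. The labels and the function `carrier_fraction` are hypothetical.

```python
def carrier_fraction(ligation_products, mutant_fragment, wildtype_fragment):
    """Fraction of unique ligation products that carry the mutant fragment."""
    unique = {tuple(p) for p in ligation_products}  # collapse duplicate products
    carrying = sum(mutant_fragment in p for p in unique)
    not_carrying = sum(
        wildtype_fragment in p and mutant_fragment not in p for p in unique
    )
    total = carrying + not_carrying
    if total == 0:
        raise ValueError("no informative ligation products")
    return carrying / total


products = [
    ("mut", "f1", "f2"),
    ("mut", "f1", "f2"),  # duplicate of the first product, counted once
    ("wt", "f3"),
    ("wt", "f4", "f5"),
    ("mut", "f6"),
]
print(carrier_fraction(products, "mut", "wt"))
```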
In the methods of these embodiments, the presence or absence of an allele, genetic mutation or transgene may be identified in step i) by aligning to a reference sequence and/or by comparing DNA fragment sequences of a plurality of samples.
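The counting and ratio steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the disclosure: the function name `carrier_ratio`, the toy fragment sequences, and the simplification that every fragment is already aligned so that one index addresses the same reference position are all assumptions. The ratio is taken here as the fraction of all classified fragments that carry the variant, which is one possible reading of the calculating step.

```python
from collections import Counter


def carrier_ratio(fragment_seqs, variant_pos, variant_base):
    """Return (carrying, not_carrying, ratio) for aligned fragment sequences.

    Hypothetical simplification: all fragments are pre-aligned to the
    reference, so the 0-based index `variant_pos` means the same
    reference position in every fragment.
    """
    counts = Counter(
        "carrying" if seq[variant_pos] == variant_base else "not_carrying"
        for seq in fragment_seqs
    )
    carrying = counts["carrying"]
    not_carrying = counts["not_carrying"]
    total = carrying + not_carrying
    # Assumption: "ratio" is read as carrying / (carrying + not carrying).
    ratio = carrying / total if total else float("nan")
    return carrying, not_carrying, ratio


# Toy data (invented for illustration): three fragments carry a C at index 2.
fragments = ["ACGTA", "ACCTA", "ACGTA", "ACCTA", "ACCTA"]
print(carrier_ratio(fragments, variant_pos=2, variant_base="C"))
```

The same function applies to ligation products if each ligation product is first reduced to the fragment that covers the position of interest.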
In the methods according to the invention, an identified genetic mutation may be a single nucleotide polymorphism (SNP), an insertion, an inversion and/or a translocation. In case a deletion and/or insertion is observed, the number of fragments and/or ligation products from a sample carrying the deletion and/or insertion may be compared with a reference sample in order to identify the deletion and/or insertion. A deletion, insertion, inversion and/or translocation may also be identified based on the presence of chromosomal breakpoints in analyzed fragments.
In another embodiment, in the methods as described above, the presence or absence of methylated nucleotides is determined in DNA fragments, ligated DNA fragments, and/or genomic regions of interest. For example, the DNA of step a)-e) may be treated with bisulphite. Treatment of DNA with bisulphite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected. Thus, bisulphite treatment introduces specific changes in the DNA sequence that depend on the methylation status of individual cytosine residues, yielding single-nucleotide resolution information about the methylation status of a segment of DNA. By dividing samples into subsamples, wherein one of the subsamples is treated and the other is not, methylated nucleotides may be identified. Alternatively, sequences from a plurality of samples treated with bisulphite may also be aligned, or a sequence from a sample treated with bisulphite may be aligned to a reference sequence.
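The comparison of a bisulphite-treated sequence with an untreated reference can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: only one strand is considered, converted uracil is read as thymine after amplification and sequencing, and the read is already aligned base for base to the reference. The function names and the toy sequence are hypothetical.

```python
def bisulphite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulphite treatment of one strand as seen after sequencing.

    Unmethylated C is read as T (C -> U, read as T after amplification);
    C at a 0-based position listed in `methylated_positions` stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, treated_read):
    """Classify every reference cytosine by comparing it with the treated read."""
    calls = {}
    for i, (ref, obs) in enumerate(zip(reference, treated_read)):
        if ref == "C":
            calls[i] = "methylated" if obs == "C" else "unmethylated"
    return calls


# Toy example (invented): only the cytosine at index 1 is methylated.
reference = "ACGTCCGA"
read = bisulphite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

A real analysis would also need to handle the complementary strand, incomplete conversion and genuine C-to-T variants, none of which is modelled here.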
This disclosure is not limited by the exemplary methods and materials disclosed herein, and any methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of this disclosure. Numeric ranges are inclusive of the numbers defining the range. Unless otherwise indicated, any nucleic acid sequences are written left to right in 5' to 3' orientation; amino acid sequences are written left to right in amino to carboxy orientation, respectively.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within this disclosure. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within this disclosure, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in this disclosure.
It must be noted that as used herein and in the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise.
The terms "comprising", "comprises" and "comprised of" as used herein are synonymous with "including", "includes" or "containing", "contains", and are inclusive or open-ended and do not exclude additional, non-recited members, elements or method steps. The terms "comprising", "comprises" and "comprised of" also include the term "consisting of".
The publications discussed herein are provided solely for their disclosure prior to the filing date of the present application. Nothing herein is to be construed as an admission that such publications constitute prior art to the claims appended hereto. The invention will now be further described by way of Examples, which are meant to serve to assist one of ordinary skill in the art in carrying out the invention and are not intended in any way to limit the scope of the invention.
EXAMPLES
Example 1: Illustrative example of targeted sequencing of integrated viruses and integration sites
This is an example of an approach to preferentially sequence integrated copies of a virus vector sequence in a human cell sample that also contains episomal (i.e. non-integrated) copies of the virus vector sequence.
Fixation and cell lysis
Cultured cells are washed with PBS and fixated with PBS/10% FCS/2% formaldehyde for 10 minutes at RT. The cells are subsequently washed and collected, taken up in lysis buffer (50 mM Tris-HCl pH 7.5, 150 mM NaCl, 5 mM EDTA, 0.5% NP-40, 1% TX-100 and 1X Complete protease inhibitors (Roche #11245200)) and incubated for 10 minutes on ice. Cells are subsequently washed and taken up in MilliQ.
Fragmenting 1
In a first fragmentation step, the fixated lysed cells are digested with a restriction enzyme of which, as shown in Figure 1A, the restriction site sequence occurs twice in the viral genome sequence. This means that the fragmentation of the viral genome with this restriction enzyme will result in three fragments (shown in different shades). A target nucleotide sequence (outlined with a black box) is chosen in proximity to one of the restriction sites.
The chosen restriction enzyme will also fragment the human genome.
As shown in Figure 1B, multiple restriction sites will thus also occur in the human genome in proximity to any integration site of the virus.
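The relationship between restriction sites and fragments in the first fragmentation step can be checked with a short in silico digest. The sketch below is illustrative only: the toy sequence is invented, the genome is treated as linear, and HindIII (cutting A^AGCTT, one base into the site on the top strand) is used purely as an example enzyme, since this Example does not name one.

```python
def digest_linear(seq, site, cut_offset):
    """Cut a linear sequence at every occurrence of `site`.

    `cut_offset` is the top-strand cut position inside the site
    (1 for HindIII, A^AGCTT). Returns the fragments in order.
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


# Invented toy "viral genome" containing the site exactly twice.
toy_genome = "GGGG" + "AAGCTT" + "CCCCC" + "AAGCTT" + "GGGG"
fragments = digest_linear(toy_genome, "AAGCTT", cut_offset=1)
print(len(fragments), fragments)
```

Two sites in a linear molecule give three fragments, matching the description of Figure 1A; a circular molecule with two sites would give two.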
Ligating 1
The restriction enzyme is heat-inactivated and subsequently a ligation step is performed using T4 DNA Ligase (Roche, #799009).
Fragmenting 2 and ligating 2
A second round of fragmentation and ligation is performed using a fragmentation approach which fragments more frequently than the restriction enzyme used in the first fragmentation step.
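As a rough guide to what "more frequently" means for restriction enzymes, the expected spacing of a fixed recognition site of length n in random sequence is 4^n base pairs. The sketch below assumes equiprobable, independent bases and a non-degenerate site; real genomes deviate from this, and the function name is hypothetical.

```python
def expected_fragment_length(site_length):
    """Mean spacing (bp) of a fixed recognition site in random sequence.

    Assumes all four bases are equally likely and independent.
    """
    return 4 ** site_length


for n in (4, 6, 8):
    print(n, expected_fragment_length(n))
```

Under these assumptions a 4-bp cutter yields fragments of about 256 bp on average, against about 4096 bp for a 6-bp cutter.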
Reversing cross-linking
To the sample, Proteinase K (10 mg/ml) is added and the sample is incubated at 65°C. RNase A (10 mg/ml, Roche #10109169001) is subsequently added and the sample is incubated at 37°C. Next, phenol-chloroform extraction is performed, and the supernatant comprising the DNA is precipitated and pelleted. The pellet is dissolved in 10 mM Tris-HCl pH 7.5.
Fragmenting 3
The digested and ligated sample is digested with a fragmenting strategy that, on average, will result in fragments of around 3000 bp in size (as a result of which the resulting circularised DNA can be amplified effectively).
Circularisation
The resulting linear DNA is circularised with a ligase enzyme.
Selective digestion of religation events prior to enrichment
In non-integrated copies of the viral vector / transgene sequence, the DNA fragment end containing the target nucleotide sequence generated in the first fragmentation will, given the physical proximity of only sequences originating from the same individual copy of the virus, have very preferentially ligated to fragment ends A, B and C as shown in Figure 1A.
In integrated copies, the DNA fragment end containing the target nucleotide sequence generated in the first fragmentation will have much more frequently ligated to fragment ends originating from its integration site in the host genome, i.e. ligated to fragment ends from the host genome.
To preferentially linearize ligation events originating from non-integrated copies of the viral vector / transgene sequence, a multiplex CRISPR-Cas nuclease digestion is performed which is specific for the fusion sequences resulting from the ligation of the DNA fragment end containing the target nucleotide sequence with DNA fragment ends A, B, and C, i.e. the ligation products target nucleotide sequence-A, target nucleotide sequence-B and target nucleotide sequence-C.
Exonuclease
An exonuclease treatment is performed to digest all linearized DNA generated in the selective digestion step.
Amplifying ligated DNA fragments: PCR
The primers used for the PCR-enrichment are designed as inverted unique primers specific for the target nucleotide sequence.
This will result in the amplification of the remaining circularized DNA which consists of ligation events originating from integrated copies of the virus.
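The idea of inverted (outward-facing) primers on a circular template can be sketched as follows. This is a generic inverse-PCR illustration under stated assumptions, not the primer design used in the Example: the function names, the primer length and the toy target are hypothetical, and no melting temperature or specificity checks are made.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def inverse_primers(target, primer_len=20):
    """Return two outward-facing primers for a known target sequence.

    The first primer is the last `primer_len` bases of the target and
    extends away from the target's 3' end; the second is the reverse
    complement of the first `primer_len` bases and extends away from
    the target's 5' end. On a circle, extension runs around the
    unknown ligated sequence and back to the target.
    """
    if len(target) < 2 * primer_len:
        raise ValueError("target too short for two non-overlapping primers")
    forward = target[-primer_len:]
    reverse = reverse_complement(target[:primer_len])
    return forward, reverse


# Invented 16-nt toy target with 4-nt primers, for readability only.
print(inverse_primers("ATGCCGTAGGATCCAA", primer_len=4))
```

On linearized molecules the two primers point away from each other, so the intervening circle is not amplified, which is consistent with only the remaining circularized DNA being enriched.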
Sequencing the amplified ligated DNA fragments
The amplified undigested DNA can be library prepped and sequenced according to standard protocols.
Example 2: Illustrative example of whole genome sequencing
This is an example of an approach to preferentially sequence whole genomes including integrated copies of a virus sequence in a human cell sample that also contains episomal (i.e. non-integrated) copies of the virus.
Fixation and cell lysis
Cultured cells are washed with PBS and fixated with PBS/10% FCS/2% formaldehyde for 10 minutes at RT. The cells are subsequently washed and collected, taken up in lysis buffer (50 mM Tris-HCl pH 7.5, 150 mM NaCl, 5 mM EDTA, 0.5% NP-40, 1% TX-100 and 1X Complete protease inhibitors (Roche #11245200)) and incubated for 10 minutes on ice. Cells are subsequently washed and taken up in MilliQ.
Fragmenting 1
In a first fragmenting step, the fixated lysed cells are digested with a restriction enzyme of which, as shown in Figure 2A, the restriction site sequence occurs twice in the viral genome sequence. This means that the fragmentation of the viral genome with this restriction enzyme will result in three fragments (shown in different shades). A target nucleotide sequence (outlined in a black box) is chosen in proximity to one of the restriction sites.
The chosen restriction enzyme will also fragment the human genome. As shown in Figure 2B, multiple restriction sites will thus also occur in the human genome in proximity to any integration site of the virus.
Ligating 1
The restriction enzyme is heat-inactivated and subsequently a ligation step is performed using T4 DNA Ligase (Roche, #799009).
Reversing cross-linking
To the sample, Proteinase K (10 mg/ml) is added and the sample is incubated at 65°C. RNase A (10 mg/ml, Roche #10109169001) is subsequently added and the sample is incubated at 37°C. Next, phenol-chloroform extraction is performed, and the supernatant comprising the DNA is precipitated and pelleted. The pellet is dissolved in 10 mM Tris-HCl pH 7.5.
Adapter ligation
Adapters are ligated to the ends of the ligation products. The adapters used prevent exonuclease based degradation of these ends.
Selective digestion of religation events prior to sequencing
In non-integrated copies of the viral vector / transgene sequence, the viral vector / transgene fragment ends generated in the first fragmentation will, given the physical proximity of only sequences originating from the same individual copy of the virus, have very preferentially ligated to other viral vector / transgene fragment ends, i.e. have very preferentially ligated to fragment ends A, B, C and D as shown in Figure 2A.
In integrated copies of the viral vector / transgene sequence, the fragment end containing the target nucleotide sequence generated in the first fragmentation will have much more frequently ligated to fragment ends originating from its integration site in the host genome, i.e. ligated to fragment ends from the host genome.
To preferentially linearize ligation events originating from non-integrated copies, a multiplex CRISPR-Cas nuclease digestion is performed which is specific for all possible sequences resulting from the religation of two fragment ends of viral origin; i.e. the ligation products AB, AC, AD, BC, BD and CD.
Exonuclease
An exonuclease treatment is performed to digest all DNA fragmented in the selective digestion step.
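The six viral religation products targeted in the selective digestion step (AB, AC, AD, BC, BD and CD) are the unordered pairs of the four viral fragment ends A to D. A short Python sketch enumerates them; the function name is hypothetical, and orientation and self-ligation of a single end are deliberately ignored, as in the list given in the text.

```python
from itertools import combinations


def pairwise_junctions(fragment_ends):
    """List every unordered pair of distinct fragment ends as a junction label."""
    return ["".join(pair) for pair in combinations(fragment_ends, 2)]


print(pairwise_junctions(["A", "B", "C", "D"]))
# ['AB', 'AC', 'AD', 'BC', 'BD', 'CD']
```

For n viral fragment ends the multiplex digestion would need to cover n(n-1)/2 junctions, so the number of fusion sequences grows quadratically with the number of restriction sites in the vector.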
Sequencing the remaining DNA fragments
The remaining DNA can be library prepped and sequenced according to standard protocols.
Example 3: Illustrative example of targeted sequencing of integrated AAV and integration sites
pSUB201 is a well-established AAV plasmid construct (Nicola J Philpott et al. (2002) J Virol. 76:5411-21).
Figure 3 provides a circular map of the construct and shows the positions of unique restriction sites within the construct (https://www.addgene.org/browse/sequence_vdb/4231/). Any of these restriction sites may be used in the methods of the invention.
By way of specific example, the unique HindIII restriction site can be used as the recognition site in the non-random fragmentation step b).
For reasons described above, the majority of ligations in crosslinked non-integrated copies of the vector will consist of a ligation of vector sequences at either end of the HindIII restriction site. By contrast, in integrated copies, the HindIII fragment ends will have opportunity to ligate to HindIII fragment ends originating from the integration site in the human genome, i.e. to fragment ends from the human genome.
A nuclease (for example a site-directed nuclease such as a CRISPR-Cas nuclease) which is specific for a fusion sequence which comprises the restriction site and nucleotides from DNA fragment ends of viral origin will thus result in the very preferential fragmentation of ligation events originating from non-integrated copies of the vector. Since ligation events from integrated copies of the vector and consisting of vector and human genome sequences will not have been fragmented in this targeted digestion these can now, using strategies described above, be more efficiently enriched and/or sequenced.
The sequence of the unique HindIII restriction site (AAGCTT) is shown in bold and italics in the pSUB201 sequence below. An illustrative example of a corresponding fusion sequence for use in the nuclease digestion step e) is shown underlined in the pSUB201 sequence below. The illustrative fusion sequence (GACGCGGAAGCTTCGATCAA) comprises the recognition sequence and a first and second flanking sequence, wherein each flanking sequence is of 7 nucleotides in length.
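The composition of the illustrative fusion sequence can be reproduced from the listing with a few lines of Python. The sketch uses only the row of the pSUB201 listing numbered 1851 to 1900; the function name and the choice to work on a linear excerpt are assumptions made for illustration.

```python
def fusion_sequence(seq, site, flank_len):
    """Return the first occurrence of `site` plus `flank_len` bases on each side.

    Raises ValueError if the site is absent or lies too close to either
    end of the supplied linear excerpt.
    """
    pos = seq.find(site)
    if pos == -1:
        raise ValueError("site not found")
    start = pos - flank_len
    end = pos + len(site) + flank_len
    if start < 0 or end > len(seq):
        raise ValueError("not enough flanking sequence in excerpt")
    return seq[start:end]


# Positions 1851-1900 of the pSUB201 listing, written in blocks of ten.
excerpt = (
    "CAGTTGCGCA" "GCCATCGACG" "TCAGACGCGG" "AAGCTTCGAT" "CAACTACGCA"
)
fusion = fusion_sequence(excerpt, "AAGCTT", flank_len=7)
print(fusion, len(fusion))  # GACGCGGAAGCTTCGATCAA 20
```

The result has 7 + 6 + 7 = 20 nucleotides, with the HindIII site at positions 1881 to 1886 of the listing and the fusion sequence spanning positions 1874 to 1893.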
pSUB201 nucleic acid sequence:
1 CAGCAGCTGC GCGCTCGCTC GCTCACTGAG GCCGCCCGGG CAAAGCCCGG 50
51 GCGTCGGGCG ACCTTTGGTC GCCCGGCCTC AGTGAGCGAG CGAGCGCGCA 100
101 GAGAGGGAGT GGCCAACTCC ATCACTAGGG GTTCCTTGTA GTTAATGATT 150
151 AACCCGCCAT GCTACTTATC TACGTAGCCA TGCTCTAGAG TCCTGTATTA 200
201 GAGGTCACGT GAGTGTTTTG CGACATTTTG CGACACCATG TGGTCACGCT 250
251 GGGTATTTAA GCCCGAGTGA GCACGCAGGG TCTCCATTTT GAAGCGGGAG 300
301 GTTTGAACGC GCAGCCGCCA TGCCGGGGTT TTACGAGATT GTGATTAAGG 350
351 TCCCCAGCGA CCTTGACGGG CATCTGCCCG GCATTTCTGA CAGCTTTGTG 400
401 AACTGGGTGG CCGAGAAGGA ATGGGAGTTG CCGCCAGATT CTGACATGGA 450
451 TCTGAATCTG ATTGAGCAGG CACCCCTGAC CGTGGCCGAG AAGCTGCAGC 500
501 GCGACTTTCT GACGGAATGG CGCCGTGTGA GTAAGGCCCC GGAGGCCCTT 550
551 TTCTTTGTGC AATTTGAGAA GGGAGAGAGC TACTTCCACA TGCACGTGCT 600
601 CGTGGAAACC ACCGGGGTGA AATCCATGGT TTTGGGACGT TTCCTGAGTC 650
651 AGATTCGCGA AAAACTGATT CAGAGAATTT ACCGCGGGAT CGAGCCGACT 700
701 TTGCCAAACT GGTTCGCGGT CACAAAGACC AGAAATGGCG CCGGAGGCGG 750
751 GAACAAGGTG GTGGATGAGT GCTACATCCC CAATTACTTG CTCCCCAAAA 800
801 CCCAGCCTGA GCTCCAGTGG GCGTGGACTA ATATGGAACA GTATTTAAGC 850
851 GCCTGTTTGA ATCTCACGGA GCGTAAACGG TTGGTGGCGC AGCATCTGAC 900
901 GCACGTGTCG CAGACGCAGG AGCAGAACAA AGAGAATCAG AATCCCAATT 950
951 CTGATGCGCC GGTGATCAGA TCAAAAACTT CAGCCAGGTA CATGGAGCTG 1000
1001 GTCGGGTGGC TCGTGGACAA GGGGATTACC TCGGAGAAGC AGTGGATCCA 1050
1051 GGAGGACCAG GCCTCATACA TCTCCTTCAA TGCGGCCTCC AACTCGCGGT 1100
1101 CCCAAATCAA GGCTGCCTTG GACAATGCGG GAAAGATTAT GAGCCTGACT 1150
1151 AAAACCGCCC CCGACTACCT GGTGGGCCAG CAGCCCGTGG AGGACATTTC 1200
1201 CAGCAATCGG ATTTATAAAA TTTTGGAACT AAACGGGTAC GATCCCCAAT 1250
1251 ATGCGGCTTC CGTCTTTCTG GGATGGGCCA CGAAAAAGTT CGGCAAGAGG 1300
1301 AACACCATCT GGCTGTTTGG GCCTGCAACT ACCGGGAAGA CCAACATCGC 1350
1351 GGAGGCCATA GCCCACACTG TGCCCTTCTA CGGGTGCGTA AACTGGACCA 1400
1401 ATGAGAACTT TCCCTTCAAC GACTGTGTCG ACAAGATGGT GATCTGGTGG 1450
1451 GAGGAGGGGA AGATGACCGC CAAGGTCGTG GAGTCGGCCA AAGCCATTCT 1500
1501 CGGAGGAAGC AAGGTGCGCG TGGACCAGAA ATGCAAGTCC TCGGCCCAGA 1550
1551 TAGACCCGAC TCCCGTGATC GTCACCTCCA ACACCAACAT GTGCGCCGTG 1600
1601 ATTGACGGGA ACTCAACGAC CTTCGAACAC CAGCAGCCGT TGCAAGACCG 1650
1651 GATGTTCAAA TTTGAACTCA CCCGCCGTCT GGATCATGAC TTTGGGAAGG 1700
1701 TCACCAAGCA GGAAGTCAAA GACTTTTTCC GGTGGGCAAA GGATCACGTG 1750
1751 GTTGAGGTGG AGCATGAATT CTACGTCAAA AAGGGTGGAG CCAAGAAAAG 1800
1801 ACCCGCCCCC AGTGACGCAG ATATAAGTGA GCCCAAACGG GTGCGCGAGT 1850
1851 CAGTTGCGCA GCCATCGACG TCAGACGCGG AAGCTTCGAT CAACTACGCA 1900
1901 GACAGGTACC AAAACAAATG TTCTCGTCAC GTGGGCATGA ATCTGATGCT 1950
1951 GTTTCCCTGC AGACAATGCG AGAGAATGAA TCAGAATTCA AATATCTGCT 2000
2001 TCACTCACGG ACAGAAAGAC TGTTTAGAGT GCTTTCCCGT GTCAGAATCT 2050
2051 CAACCCGTTT CTGTCGTCAA AAAGGCGTAT CAGAAACTGT GCTACATTCA 2100
2101 TCATATCATG GGAAAGGTGC CAGACGCTTG CACTGCCTGC GATCTGGTCA 2150
2151 ATGTGGATTT GGATGACTGC ATCTTTGAAC AATAAATGAT TTAAATCAGG 2200
2201 TATGGCTGCC GATGGTTATC TTCCAGATTG GCTCGAGGAC ACTCTCTCTG 2250
2251 AAGGAATAAG ACAGTGGTGG AAGCTCAAAC CTGGCCCACC ACCACCAAAG 2300
2301 CCCGCAGAGC GGCATAAGGA CGACAGCAGG GGTCTTGTGC TTCCTGGGTA 2350
2351 CAAGTACCTC GGACCCTTCA ACGGACTCGA CAAGGGAGAG CCGGTCAACG 2400
2401 AGGCAGACGC CGCGGCCCTC GAGCACGTCA AAGCCTACGA CCGGCAGCTC 2450
2451 GACAGCGGAG ACAACCCGTA CCTCAAGTAC AACCACGCCG ACGCGGAGTT 2500
2501 TCAGGAGCGC CTTAAAGAAG ATACGTCTTT TGGGGGCAAC CTCGGACGAG 2550
2551 CAGTCTTCCA GGCGAAAAAG AGGGTTCTTG AACCTCTGGG CCTGGTTGAG 2600
2601 GAACCTGTTA AGACGGCTCC GGGAAAAAAG AGGCCGGTAG AGCACTCTCC 2650
2651 TGTGGAGCCA GACTCCTCCT CGGGAACCGG AAAGGCGGGC CAGCAGCCTG 2700
2701 CAAGAAAAAG ATTGAATTTT GGTCAGACTG GAGACGCAGA CTCAGTACCT 2750
2751 GACCCCCAGC CTCTCGGACA GCCACCAGCA GCCCCCTCTG GTCTGGGAAC 2800
2801 TAATACGATG GCTACAGGCA GTGGCGCACC AATGGCAGAC AATAACGAGG 2850
2851 GCGCCGACGG AGTGGGTAAT TCCTCGGGAA ATTGGCATTG CGATTCCACA 2900
2901 TGGATGGGCG ACAGAGTCAT CACCACCAGC ACCCGAACCT GGGCCCTGCC 2950
2951 CACCTACAAC AACCACCTCT ACAAACAAAT TTCCAGCCAA TCAGGAGCCT 3000
3001 CGAACGACAA TCACTACTTT GGCTACAGCA CCCCTTGGGG GTATTTTGAC 3050
3051 TTCAACAGAT TCCACTGCCA CTTTTCACCA CGTGACTGGC AAAGACTCAT 3100
3101 CAACAACAAC TGGGGATTCC GACCCAAGAG ACTCAACTTC AAGCTCTTTA 3150
3151 ACATTCAAGT CAAAGAGGTC ACGCAGAATG ACGGTACGAC GACGATTGCC 3200
3201 AATAACCTTA CCAGCACGGT TCAGGTGTTT ACTGACTCGG AGTACCAGCT 3250
3251 CCCGTACGTC CTCGGCTCGG CGCATCAAGG ATGCCTCCCG CCGTTCCCAG 3300
3301 CAGACGTCTT CATGGTGCCA CAGTATGGAT ACCTCACCCT GAACAACGGG 3350
3351 AGTCAGGCAG TAGGACGCTC TTCATTTTAC TGCCTGGAGT ACTTTCCTTC 3400
3401 TCAGATGCTG CGTACCGGAA ACAACTTTAC CTTCAGCTAC ACTTTTGAGG 3450
3451 ACGTTCCTTT CCACAGCAGC TACGCTCACA GCCAGAGTCT GGACCGTCTC 3500
3501 ATGAATCCTC TCATCGACCA GTACCTGTAT TACTTGAGCA GAACAAACAC 3550
3551 TCCAAGTGGA ACCACCACGC AGTCAAGGCT TCAGTTTTCT CAGGCCGGAG 3600
3601 CGAGTGACAT TCGGGACCAG TCTAGGAACT GGCTTCCTGG ACCCTGTTAC 3650
3651 CGCCAGCAGC GAGTATCAAA GACATCTGCG GATAACAACA ACAGTGAATA 3700
3701 CTCGTGGACT GGAGCTACCA AGTACCACCT CAATGGCAGA GACTCTCTGG 3750
3751 TGAATCCGGG GCCCGCCATG GCAAGCCACA AGGACGATGA AGAAAAGTTT 3800
3801 TTTCCTCAGA GCGGGGTTCT CATCTTTGGG AAGCAAGGCT CAGAGAAAAC 3850
3851 AAATGTGAAC ATTGAAAAGG TCATGATTAC AGACGAAGAG GAAATCGGAA 3900
3901 CAACCAATCC CGTGGCTACG GAGCAGTATG GTTCTGTATC TACCAACCTC 3950
3951 CAGAGAGGCA ACAGACAAGC AGCTACCGCA GATGTCAACA CACAAGGCGT 4000
4001 TCTTCCAGGC ATGGTCTGGC AGGACAGAGA TGTGTACCTT CAGGGGCCCA 4050
4051 TCTGGGCAAA GATTCCACAC ACGGACGGAC ATTTTCACCC CTCTCCCCTC 4100
4101 ATGGGTGGAT TCGGACTTAA ACACCCTCCT CCACAGATTC TCATCAAGAA 4150
4151 CACCCCGGTA CCTGCGAATC CTTCGACCAC CTTCAGTGCG GCAAAGTTTG 4200
4201 CTTCCTTCAT CACACAGTAC TCCACGGGAC ACGGTCAGCG TGGAGATCGA 4250
4251 GTGGGAGCTG CAGAAGGAAA ACAGCAAACG CTGGAATCCC GAAATTCAGT 4300
4301 ACACTTCCAA CTACAACAAG TCTGTTAATC GTGGACTTAC CGTGGATACT 4350
4351 AATGGCGTGT ATTCAGAGCC TCGCCCCATT GGCACCAGAT ACCTGACTCG 4400
4401 TAATCTGTAA TTGCTTGTTA ATCAATAAAC CGTTTAATTC GTTTCAGTTG 4450
4451 AACTTTGGTC TCTGCGTATT TCTTTCTTAT CTAGTTTCCA TGCTCTAGAG 4500
4501 CATGGCTACG TAGATAAGTA GCATGGCGGG TTAATCATTA ACTACAAGGA 4550
4551 ACCCCTAGTG ATGGAGTTGG CCACTCCCTC TCTGCGCGCT CGCTCGCTCA 4600
4601 CTGAGGCCGG GCGACCAAAG GTCGCCCGAC GCCCGGGCTT TGCCCGGGCG 4650
4651 GCCTCAGTGA GCGAGCGAGC GCGCCAGCTG GCGTAATAGC GAAGAGGCCC 4700
4701 GCACCGATCG CCCTTCCCAA CAGTTGCGCA GCCTGAATGG CGAATGGAAT 4750
4751 TCCAGACGAT TGAGCGTCAA AATGTAGGTA TTTCCATGAG CGTTTTTCCT 4800
4801 GTTGCAATGG CTGGCGGTAA TATTGTTCTG GATATTACCA GCAAGGCCGA 4850
4851 TAGTTTGAGT TCTTCTACTC AGGCAAGTGA TGTTATTACT AATCAAAGAA 4900
4901 GTATTGCGAC AACGGTTAAT TTGCGTGATG GACAGACTCT TTTACTCGGT 4950
4951 GGCCTCACTG ATTATAAAAA CACTTCTCAG GATTCTGGCG TACCGTTCCT 5000
5001 GTCTAAAATC CCTTTAATCG GCCTCCTGTT TAGCTCCCGC TCTGATTCTA 5050
5051 ACGAGGAAAG CACGTTATAC GTGCTCGTCA AAGCAACCAT AGTACGCGCC 5100
5101 CTGTAGCGGC GCATTAAGCG CGGCGGGTGT GGTGGTTACG CGCAGCGTGA 5150
5151 CCGCTACACT TGCCAGCGCC CTAGCGCCCG CTCCTTTCGC TTTCTTCCCT 5200
5201 TCCTTTCTCG CCACGTTCGC CGGCTTTCCC CGTCAAGCTC TAAATCGGGG 5250
5251 GCTCCCTTTA GGGTTCCGAT TTAGTGCTTT ACGGCACCTC GACCCCAAAA 5300
5301 AACTTGATTA GGGTGATGGT TCACGTAGTG GGCCATCGCC CTGATAGACG 5350
5351 GTTTTTCGCC CTTTGACGTT GGAGTCCACG TTCTTTAATA GTGGACTCTT 5400
5401 GTTCCAAACT GGAACAACAC TCAACCCTAT CTCGGTCTAT TCTTTTGATT 5450
5451 TATAAGGGAT TTTGCCGATT TCGGCCTATT GGTTAAAAAA TGAGCTGATT 5500
5501 TAACAAAAAT TTAACGCGAA TTTTAACAAA ATATTAACGT TTACAATTTA 5550
5551 AATATTTGCT TATACAATCT TCCTGTTTTT GGGGCTTTTC TGATTATCAA 5600
5601 CCGGGGTACA TATGATTGAC ATGCTAGTTT TACGATTACC GTTCATCGAT 5650
5651 TCTCTTGTTT GCTCCAGACT CTCAGGCAAT GACCTGATAG CCTTTGTAGA 5700
5701 GACCTCTCAA AAATAGCTAC CCTCTCCGGC ATGAATTTAT CAGCTAGAAC 5750
5751 GGTTGAATAT CATATTGATG GTGATTTGAC TGTCTCCGGC CTTTCTCACC 5800
5801 CGTTTGAATC TTTACCTACA CATTACTCAG GCATTGCATT TAAAATATAT 5850
5851 GAGGGTTCTA AAAATTTTTA TCCTTGCGTT GAAATAAAGG CTTCTCCCGC 5900
5901 AAAAGTATTA CAGGGTCATA ATGTTTTTGG TACAACCGAT TTAGCTTTAT 5950
5951 GCTCTGAGGC TTTATTGCTT AATTTTGCTA ATTCTTTGCC TTGCCTGTAT 6000
6001 GATTTATTGG ATGTTGGAAT TCCTGATGCG GTATTTTCTC CTTACGCATC 6050
6051 TGTGCGGTAT TTCACACCGC ATATGGTGCA CTCTCAGTAC AATCTGCTCT 6100
6101 GATGCCGCAT AGTTAAGCCA GCCCCGACAC CCGCCAACAC CCGCTGACGC 6150
6151 GCCCTGACGG GCTTGTCTGC TCCCGGCATC CGCTTACAGA CAAGCTGTGA 6200
6201 CCGTCTCCGG GAGCTGCATG TGTCAGAGGT TTTCACCGTC ATCACCGAAA 6250
6251 CGCGCGAGAC GAAAGGGCCT CGTGATACGC CTATTTTTAT AGGTTAATGT 6300
6301 CATGATAATA ATGGTTTCTT AGACGTCAGG TGGCACTTTT CGGGGAAATG 6350
6351 TGCGCGGAAC CCCTATTTGT TTATTTTTCT AAATACATTC AAATATGTAT 6400
6401 CCGCTCATGA GACAATAACC CTGATAAATG CTTCAATAAT ATTGAAAAAG 6450
6451 GAAGAGTATG AGTATTCAAC ATTTCCGTGT CGCCCTTATT CCCTTTTTTG 6500
6501 CGGCATTTTG CCTTCCTGTT TTTGCTCACC CAGAAACGCT GGTGAAAGTA 6550
6551 AAAGATGCTG AAGATCAGTT GGGTGCACGA GTGGGTTACA TCGAACTGGA 6600
6601 TCTCAACAGC GGTAAGATCC TTGAGAGTTT TCGCCCCGAA GAACGTTTTC 6650
6651 CAATGATGAG CACTTTTAAA GTTCTGCTAT GTGGCGCGGT ATTATCCCGT 6700
6701 ATTGACGCCG GGCAAGAGCA ACTCGGTCGC CGCATACACT ATTCTCAGAA 6750
6751 TGACTTGGTT GAGTACTCAC CAGTCACAGA AAAGCATCTT ACGGATGGCA 6800
6801 TGACAGTAAG AGAATTATGC AGTGCTGCCA TAACCATGAG TGATAACACT 6850
6851 GCGGCCAACT TACTTCTGAC AACGATCGGA GGACCGAAGG AGCTAACCGC 6900
6901 TTTTTTGCAC AACATGGGGG ATCATGTAAC TCGCCTTGAT CGTTGGGAAC 6950
6951 CGGAGCTGAA TGAAGCCATA CCAAACGACG AGCGTGACAC CACGATGCCT 7000
7001 GTAGCAATGG CAACAACGTT GCGCAAACTA TTAACTGGCG AACTACTTAC 7050
7051 TCTAGCTTCC CGGCAACAAT TAATAGACTG GATGGAGGCG GATAAAGTTG 7100
7101 CAGGACCACT TCTGCGCTCG GCCCTTCCGG CTGGCTGGTT TATTGCTGAT 7150
7151 AAATCTGGAG CCGGTGAGCG TGGGTCTCGC GGTATCATTG CAGCACTGGG 7200
7201 GCCAGATGGT AAGCCCTCCC GTATCGTAGT TATCTACACG ACGGGGAGTC 7250
7251 AGGCAACTAT GGATGAACGA AATAGACAGA TCGCTGAGAT AGGTGCCTCA 7300
7301 CTGATTAAGC ATTGGTAACT GTCAGACCAA GTTTACTCAT ATATACTTTA 7350
7351 GATTGATTTA AAACTTCATT TTTAATTTAA AAGGATCTAG GTGAAGATCC 7400
7401 TTTTTGATAA TCTCATGACC AAAATCCCTT AACGTGAGTT TTCGTTCCAC 7450
7451 TGAGCGTCAG ACCCCGTAGA AAAGATCAAA GGATCTTCTT GAGATCCTTT 7500
7501 TTTTCTGCGC GTAATCTGCT GCTTGCAAAC AAAAAAACCA CCGCTACCAG 7550
7551 CGGTGGTTTG TTTGCCGGAT CAAGAGCTAC CAACTCTTTT TCCGAAGGTA 7600
7601 ACTGGCTTCA GCAGAGCGCA GATACCAAAT ACTGTCCTTC TAGTGTAGCC 7650
7651 GTAGTTAGGC CACCACTTCA AGAACTCTGT AGCACCGCCT ACATACCTCG 7700
7701 CTCTGCTAAT CCTGTTACCA GTGGCTGCTG CCAGTGGCGA TAAGTCGTGT 7750
7751 CTTACCGGGT TGGACTCAAG ACGATAGTTA CCGGATAAGG CGCAGCGGTC 7800
7801 GGGCTGAACG GGGGGTTCGT GCACACAGCC CAGCTTGGAG CGAACGACCT 7850
7851 ACACCGAACT GAGATACCTA CAGCGTGAGC TATGAGAAAG CGCCACGCTT 7900
7901 CCCGAAGGGA GAAAGGCGGA CAGGTATCCG GTAAGCGGCA GGGTCGGAAC 7950
7951 AGGAGAGCGC ACGAGGGAGC TTCCAGGGGG AAACGCCTGG TATCTTTATA 8000
8001 GTCCTGTCGG GTTTCGCCAC CTCTGACTTG AGCGTCGATT TTTGTGATGC 8050
8051 TCGTCAGGGG GGCGGAGCCT ATGGAAAAAC GCCAGCAACG CGGCCTTTTT 8100
8101 ACGGTTCCTG GCCTTTTGCT GGCCTTTTGC TCACATGTTC TTTCCTGCGT 8150
8151 TATCCCCTGA TTCTGTGGAT AACCGTATTA CCGCCTTTGA GTGAGCTGAT 8200
8201 ACCGCTCGCC GCAGCCGAAC GACCGAGCGC AGCGAGTCAG TGAGCGAGGA 8250
8251 AGCGGAAGAG CGCCCAATAC GCAAACCGCC TCTCCCCGCG CGTTGGCCGA 8300
8301 TTCATTAATG 8310
All publications mentioned in the above specification are herein incorporated by reference. Various modifications and variations of the described methods and system of the invention will be apparent to those skilled in the art without departing from the scope and spirit of the invention. Although the invention has been described in connection with specific preferred embodiments, it should be understood that the invention as claimed should not be unduly limited to such specific embodiments. Indeed, various modifications of the described modes for carrying out the invention which are obvious to those skilled in molecular biology or related fields are intended to be within the scope of the following claims.
Various preferred features and embodiments of the present invention will now be described with reference to the following numbered paragraphs:
1. A method for making a DNA sequencing library of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) optionally, determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
2. A method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of: a) providing a sample of crosslinked DNA; b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence; c) ligating the fragmented crosslinked DNA generated in step b); d) reversing the crosslinking in the ligated crosslinked DNA generated in step c); e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b); f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
3. The method according to paragraph 1 or paragraph 2, wherein the genomic region of interest comprises one or more further target nucleotide sequences.
4. The method according to paragraph 1 or paragraph 3, wherein a DNA sequencing library of a plurality of genomic regions of interest is made.
5. The method according to any of the preceding paragraphs, wherein the sequences of a plurality of genomic regions of interest are determined.
6. The method according to any of the preceding paragraphs, wherein the fragmenting step b) comprises fragmenting with a restriction enzyme.
7. The method according to any one of paragraphs 1 to 5, wherein the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a CRISPR-Cas nuclease.
8. The method according to any of the preceding paragraphs, wherein the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
9. The method according to any of the preceding paragraphs, wherein the method further comprises the step of: i) (a) further fragmenting the crosslinked DNA provided in step a) or the ligated crosslinked DNA generated in step c); and (b) ligating the fragmented DNA generated in step i) (a), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step d); or ii) (a) further fragmenting the undigested DNA generated in step f); and
(b) optionally, circularising or ligating the fragmented DNA generated in step ii) (a), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
10. The method according to paragraph 9, wherein the further steps i) (a) and i) (b) are repeated at least once.
11. The method according to paragraph 9, wherein the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence of the further fragmenting step i) (a) or step ii) (a).
12. The method according to any one of the preceding paragraphs, wherein the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 6 to 8 nucleotides in length.
13. The method according to any one of the preceding paragraphs, wherein step e) is performed prior to step d).
14. The method according to any one of the preceding paragraphs, wherein the method further comprises the step of circularising the DNA of step d), prior to step e).
15. The method according to any one of the preceding paragraphs, wherein the fusion sequence of digesting step e) is of greater length than the recognition sequence of fragmenting step b).
16. The method according to any one of the preceding paragraphs, wherein the fusion sequence is of 15 to 25 nucleotides in length, preferably of 20 nucleotides in length.
17. The method according to any one of the preceding paragraphs, wherein the nuclease of digesting step e) is a restriction enzyme.
18. The method according to any one of paragraphs 1 to 16, wherein the nuclease of digesting step e) is a site-directed nuclease, preferably wherein the site-directed nuclease is a CRISPR-Cas nuclease.
19. The method according to any one of the preceding paragraphs, wherein the digesting step e) uses a multiplex nuclease digestion which recognises a plurality of specific fusion sequences.
20. The method according to any one of the preceding paragraphs, wherein the method further comprises the steps of:
A’) digesting the ligated DNA generated in step c) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using the nuclease of step e); and
B’) ligating the mixture of digested and undigested DNA generated in step A’); prior to step d).
21. The method according to paragraph 20, wherein the further steps A’) and B’) are repeated at least once.
22. The method according to any one of the preceding paragraphs, wherein the enriching step f) (ii) comprises amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i).
23. The method according to paragraph 22, wherein amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i) comprises using at least one primer which hybridises to the DNA fragment comprising the target nucleotide sequence generated in step b) and optionally further using a plurality of primers which each hybridises to the DNA sequence of one of the DNA fragments comprising the one or more further target nucleotide sequences generated in step b).
24. The method according to paragraph 23, wherein at least one primer directs amplification towards the recognition sequence of step b).
25. The method according to paragraph 23 or paragraph 24, wherein an identifier is included in the at least one primer.
26. The method according to any one of paragraphs 1 to 21, wherein the enriching step f) (ii) comprises capture-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
27. The method according to any one of paragraphs 1 to 21, wherein the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested DNA using PCR.
28. The method according to any one of the preceding paragraphs, wherein the method further comprises the step of size selection prior to or after the enrichment step f) (ii), preferably wherein the size selection step comprises using gel extraction chromatography, gel electrophoresis or density gradient centrifugation.
29. The method according to paragraph 28, wherein DNA is selected of a size between 20-20,0000 base pairs, preferably 50-10,0000 base pairs, most preferably between 100-3,000 base pairs.
30. The method according to any one of the preceding paragraphs, wherein step g) is performed using whole genome sequencing.
31. The method according to any one of the preceding paragraphs, wherein step g) comprises determining at least part of the sequence of the undigested DNA comprising the target nucleotide sequence.
32. The method according to any one of the preceding paragraphs, wherein the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step g).
33. The method according to paragraph 32, wherein, when the cell ploidy of the genomic region of interest is greater than 1, a contig is built for each ploidy.
34. The method according to paragraph 32 or paragraph 33, wherein the step of building a contig comprises the steps of:
1) identifying the fragments of step b);
2) assigning the fragments to a genomic region;
3) building a contig for the genomic region.
35. The method according to paragraph 34, wherein the step 2) of assigning the fragments to a genomic region comprises identifying the different ligation products of step e) and coupling of the different ligation products to the identified fragments.
35. The method according to any one of the preceding paragraphs, wherein the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the genomic region of interest.
36. The method according to any one of the preceding paragraphs, wherein the genomic region of interest comprises a transgene integration site.
37. The method according to any one of the preceding paragraphs, wherein the target nucleotide sequence comprises a transgene.
38. The method according to paragraph 36 or paragraph 37, wherein the first and second flanking sequences are from two separate DNA fragments from the transgene and/or from the vector used to deliver the transgene.
39. The method according to any one of paragraphs 36 to 38, wherein: i) the first and second flanking sequences are each from a separate DNA fragment from the vector; ii) the first flanking sequence is from a DNA fragment from the vector and the second flanking sequence is from a DNA sequence from the transgene; or iii) the first and second flanking sequences are each from a separate DNA fragment from the transgene.
40. The method according to any one of paragraphs 1 to 35, wherein the target nucleotide sequence comprises an allele of the genomic region of interest.
41. The method according to paragraph 40, wherein the first and second flanking sequences are from separate DNA fragments from the allele.
42. A method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of:
a) providing a sample of crosslinked DNA;
b) fragmenting the crosslinked DNA of step a);
c) ligating the fragmented crosslinked DNA generated in step b);
d) reversing the crosslinking in the ligated crosslinked DNA generated in step c);
e) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence;
f) circularising the fragmented DNA generated in step e);
g) digesting the ligated DNA generated in step f) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step e) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
h) (i) degrading the digested DNA generated in step g) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step g) or the undigested DNA generated in step h)(i) for DNA comprising the target nucleotide sequence; and
i) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
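As a purely illustrative aid, the three contig-building steps of paragraph 34 and the coupling of paragraph 35 can be sketched in silico. Everything in the sketch is an assumption made for the example and is not taken from the specification: the 4-nucleotide recognition sequence `SITE`, the toy fragment sequences, the single-strand string matching, and in particular the use of a known reference to order fragments, which stands in for real contig assembly.

```python
SITE = "CATG"  # assumed 4-nt recognition sequence, chosen only for illustration


def identify_fragments(read, site=SITE):
    """Step 1): split a sequenced ligation product into its fragments."""
    return [frag for frag in read.split(site) if frag]


def assign_to_region(reads, target, site=SITE):
    """Step 2): couple fragments to the region of interest.

    Every fragment that shares a ligation product with a fragment
    carrying the target sequence is assigned to the region.
    """
    region = set()
    for read in reads:
        fragments = identify_fragments(read, site)
        if any(target in frag for frag in fragments):
            region.update(fragments)
    return region


def build_layout(region_fragments, reference):
    """Step 3): order assigned fragments along a reference.

    This is a toy stand-in for contig building; it assumes a reference
    is available and ignores the reverse strand.
    """
    placed = [(reference.find(frag), frag)
              for frag in region_fragments if frag in reference]
    return sorted(placed)


# Hypothetical fragments: f1..f4 are neighbours in the region, x and y are unrelated.
f1, f2, f3, f4 = "AACCGGTTAC", "GGTTCACGTA", "TTAGGCCAAT", "CCGATTAGGA"
x, y = "GTGTGTGAGA", "ACACACTCTC"
reference = SITE.join([f1, f2, f3, f4])
target = "TCACG"  # lies inside f2

reads = [
    SITE.join([f2, f4]),  # target fragment ligated to a proximal fragment
    SITE.join([f1, f2]),
    SITE.join([f3, f2]),
    SITE.join([x, y]),    # ligation product without the target
]

region = assign_to_region(reads, target)
layout = build_layout(region, reference)
```

In this toy run the fragments `x` and `y` are never coupled to the target-carrying fragment, so they are left out of the region, while `f1` to `f4` are ordered by their position in the reference.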

Claims

1. A method for making a DNA sequencing library of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of:
a) providing a sample of crosslinked DNA;
b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence;
c) ligating the fragmented crosslinked DNA generated in step b);
d) reversing the crosslinking in the ligated crosslinked DNA generated in step c);
e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and
g) optionally, determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
2. A method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of:
a) providing a sample of crosslinked DNA;
b) fragmenting the crosslinked DNA of step a) by non-random fragmentation of the DNA at a recognition sequence;
c) ligating the fragmented crosslinked DNA generated in step b);
d) reversing the crosslinking in the ligated crosslinked DNA generated in step c);
e) digesting the ligated DNA generated in step d) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step b) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
f) (i) degrading the digested DNA generated in step e) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f)(i) for DNA comprising the target nucleotide sequence; and
g) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
3. The method according to claim 1 or claim 2, wherein the genomic region of interest comprises one or more further target nucleotide sequences.
4. The method according to claim 1 or claim 3, wherein a DNA sequencing library of a plurality of genomic regions of interest is made.
5. The method according to any of the preceding claims, wherein the sequences of a plurality of genomic regions of interest are determined.
6. The method according to any of the preceding claims, wherein:
(i) the fragmenting step b) comprises fragmenting with a restriction enzyme; or
(ii) the fragmenting step b) comprises fragmenting with a site-directed nuclease, preferably a CRISPR-Cas nuclease.
7. The method according to any of the preceding claims, wherein the fragmenting step b) comprises fragmenting a plurality of subsamples, each subsample having a different recognition sequence.
8. The method according to any of the preceding claims, wherein the method further comprises the step of:
i) (a) further fragmenting the crosslinked DNA provided in step a) or the ligated crosslinked DNA generated in step c); and
(b) ligating the fragmented DNA generated in step i) (a), preferably wherein the ligation is performed in the presence of an adaptor, ligating adaptor sequences in between fragments; prior to step d); or
ii) (a) further fragmenting the undigested DNA generated in step f); and
(b) optionally, circularising or ligating the fragmented DNA generated in step ii) (a), preferably ligating the fragmented DNA to at least one adaptor, prior to step g).
9. The method according to claim 8, wherein:
(i) the further steps i) (a) and i) (b) are repeated at least once; and/or
(ii) the recognition sequence of the fragmenting step b) is of greater length than the recognition sequence of the further fragmenting step i) (a) or step ii) (a).
10. The method according to any one of the preceding claims, wherein:
(i) the recognition sequence of fragmenting step b) is of 4 to 12 nucleotides in length, preferably of 6 to 8 nucleotides in length;
(ii) step e) is performed prior to step d);
(iii) the method further comprises the step of circularising the DNA of step d), prior to step e);
(iv) the fusion sequence of digesting step e) is of greater length than the recognition sequence of fragmenting step b); and/or
(v) the fusion sequence of digesting step e) is of 15 to 25 nucleotides in length, preferably of 20 nucleotides in length.
11. The method according to any one of the preceding claims, wherein:
(i) the nuclease of digesting step e) is a restriction enzyme; or
(ii) the nuclease of digesting step e) is a site-directed nuclease, preferably wherein the site-directed nuclease is a CRISPR-Cas nuclease.
12. The method according to any one of the preceding claims, wherein the enriching step f) (ii) comprises amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i).
13. The method according to claim 12, wherein amplifying the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i) comprises using at least one primer which hybridises to the DNA fragment comprising the target nucleotide sequence generated in step b) and optionally further using a plurality of primers which each hybridises to the DNA sequence of one of the DNA fragments comprising the one or more further target nucleotide sequences generated in step b), preferably wherein at least one primer directs amplification towards the recognition sequence of step b).
14. The method according to any one of claims 1 to 11, wherein the enriching step f) (ii) comprises capture-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the capture-based enrichment is specific for a defined sequence at one end of the target nucleotide sequence.
15. The method according to any one of claims 1 to 11, wherein the enriching step f) (ii) comprises site-directed nuclease-based enrichment of the undigested DNA comprising the target nucleotide sequence generated in step e) or generated in step f) (i), preferably wherein the site-directed nuclease-based enrichment comprises digesting the mixture of digested and undigested DNA generated in step e) or the undigested DNA generated in step f) using a site-directed nuclease which is specific for a recognition sequence within the target nucleotide sequence followed by amplifying the digested DNA using PCR.
16. The method according to any one of the preceding claims, wherein:
(i) the method further comprises the step of size selection prior to or after the enrichment step f) (ii), preferably wherein the size selection step comprises using gel extraction chromatography, gel electrophoresis or density gradient centrifugation;
(ii) step g) is performed using whole genome sequencing; and/or
(iii) step g) comprises determining at least part of the sequence of the undigested DNA comprising the target nucleotide sequence.
17. The method according to any one of the preceding claims, wherein the method further comprises the step of building a contig of the genomic region of interest from the determined sequences generated in step g), preferably wherein, when the cell ploidy of the genomic region of interest is greater than 1, a contig is built for each ploidy.
18. The method according to any one of the preceding claims, wherein the first and second flanking sequences are from two separate DNA fragments which occur in close proximity to one another in the linear DNA sequence of the genomic region of interest.
19. The method according to any one of the preceding claims, wherein the genomic region of interest comprises a transgene integration site.
20. The method according to any one of the preceding claims, wherein the target nucleotide sequence comprises a transgene.
21. The method according to claim 19 or claim 20, wherein the first and second flanking sequences are from two separate DNA fragments from the transgene and/or from the vector used to deliver the transgene.
22. The method according to any one of claims 19 to 21, wherein: i) the first and second flanking sequences are each from a separate DNA fragment from the vector; ii) the first flanking sequence is from a DNA fragment from the vector and the second flanking sequence is from a DNA sequence from the transgene; or iii) the first and second flanking sequences are each from a separate DNA fragment from the transgene.
23. The method according to any one of claims 1 to 18, wherein the target nucleotide sequence comprises an allele of the genomic region of interest.
24. The method according to claim 23, wherein the first and second flanking sequences are from separate DNA fragments from the allele.
25. A method for determining the sequence of a genomic region of interest comprising a target nucleotide sequence, the method comprising the steps of:
a) providing a sample of crosslinked DNA;
b) fragmenting the crosslinked DNA of step a);
c) ligating the fragmented crosslinked DNA generated in step b);
d) reversing the crosslinking in the ligated crosslinked DNA generated in step c);
e) fragmenting the ligated DNA generated in step d) by non-random fragmentation of the DNA at a recognition sequence;
f) circularising the fragmented DNA generated in step e);
g) digesting the ligated DNA generated in step f) to provide a mixture of digested and undigested DNA; wherein the digestion is performed using a nuclease which recognises a fusion sequence which comprises the recognition sequence of step e) flanked by a first and a second flanking sequence, and wherein the first and second flanking sequences are from two separate DNA fragments generated in step b);
h) (i) degrading the digested DNA generated in step g) using an exonuclease; and/or
(ii) enriching the mixture of digested and undigested DNA generated in step g) or the undigested DNA generated in step h)(i) for DNA comprising the target nucleotide sequence; and
i) determining at least part of the sequence of the undigested DNA, preferably using high throughput sequencing.
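For orientation only, the selection logic of steps e) and f) of claim 1 can be mimicked on strings. The sketch below is not the claimed wet-lab procedure. It assumes a 6-nucleotide recognition sequence (`SITE`), 7-nucleotide flanking sequences (giving a 20-nucleotide fusion sequence, within the ranges of claim 10), hypothetical fragment sequences, and single-strand exact matching; exonuclease degradation is modelled simply as discarding the digested products.

```python
SITE = "AAGCTT"  # assumed 6-nt recognition sequence, for illustration only


def ligate(fragments, site=SITE):
    """Join fragments end to end, regenerating the recognition sequence at each junction."""
    return site.join(fragments)


def fusion_sequence(left, right, site=SITE, flank=7):
    """Recognition sequence flanked by the ends of two separate fragments."""
    return left[-flank:] + site + right[:flank]


def digest(products, fusion):
    """Step e): products carrying the fusion sequence are cut, the rest stay intact."""
    digested = [p for p in products if fusion in p]
    undigested = [p for p in products if fusion not in p]
    return digested, undigested


def enrich(products, target):
    """Step f)(ii): keep only products that contain the target sequence."""
    return [p for p in products if target in p]


# Hypothetical fragments: frag_a carries the target, frag_b is its linear neighbour,
# frag_c is a spatially proximal fragment, frag_d is unrelated.
frag_a = "CCGTTAGCATGGACT"
frag_b = "TTGACCGATAGGCAT"
frag_c = "GGATCCATTCGAGTA"
frag_d = "ACGTACGTTGCAAGT"
target = "TAGCATGG"

fusion = fusion_sequence(frag_a, frag_b)
products = [
    ligate([frag_a, frag_b]),  # junction of the two linear neighbours
    ligate([frag_a, frag_c]),  # informative ligation product
    ligate([frag_d, frag_c]),  # product without the target
]

digested, undigested = digest(products, fusion)  # f)(i) would degrade `digested`
library = enrich(undigested, target)
```

Under these assumptions only the product joining `frag_a` to `frag_c` survives both the digestion and the enrichment, which is the kind of molecule the library is meant to retain.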

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB2111195.0A GB202111195D0 (en) 2021-08-03 2021-08-03 Method for targeted sequencing
GB2111195.0 2021-08-03

Publications (1)

Publication Number Publication Date
WO2023012193A1 (en) 2023-02-09

Family

ID=77651387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/071758 WO2023012193A1 (en) 2021-08-03 2022-08-02 Method for targeted sequencing

Country Status (2)

Country Link
GB (1) GB202111195D0 (en)
WO (1) WO2023012193A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007004057A2 (en) 2005-07-04 2007-01-11 Erasmus University Medical Center Chromosome conformation capture-on-chip (4c) assay
WO2008008845A2 (en) 2006-07-11 2008-01-17 Microchips, Inc. Multi-reservoir pump device for dialysis, biosensing, or delivery of substances
WO2016210224A1 (en) * 2015-06-24 2016-12-29 Dana-Farber Cancer Institute, Inc. Selective degradation of wild-type dna and enrichment of mutant alleles using nuclease
WO2021003432A1 (en) * 2019-07-02 2021-01-07 Fred Hutchinson Cancer Research Center Recombinant ad35 vectors and related gene therapy improvements
US20210010065A1 (en) * 2018-03-15 2021-01-14 Twinstrand Biosciences, Inc. Methods and reagents for enrichment of nucleic acid material for sequencing applications and other nucleic acid material interrogations
US10934575B2 (en) * 2013-11-18 2021-03-02 Erasmus Universiteit Medisch Centrum Rotterdam Method for analysing the interaction of nucleotide sequences in a three-dimensional DNA structure

Non-Patent Citations (13)

* Cited by examiner, † Cited by third party
Title
ALBERT L. LEHNINGER: "Principles of Biochemistry", 1982, ACADEMIC PRESS, pages: 793 - 800
ANNETTE DENKER ET AL: "The second decade of 3C technologies: detailed insights into nuclear organization", 1 January 2016 (2016-01-01), pages 1357 - 1382, XP055340371, Retrieved from the Internet <URL:http://genesdev.cshlp.org/content/30/12/1357.full.pdf> [retrieved on 20170130], DOI: 10.1101/gad.281964.116 *
AUSUBEL ET AL.: "Current Protocols in Molecular Biology", 1987, JOHN WILEY & SONS
HARD ET AL., BIORXIV 439527, 2021
JINEK M, SCIENCE, vol. 337, 2012, pages 816 - 821
KIM H, NATURE COMMUNICATIONS, vol. 8, 2017, pages 14406
KITTLER, ANALYTICAL BIOCHEMISTRY, vol. 300, 2002, pages 237 - 244
IANNONE ET AL., CYTOMETRY, vol. 39, 2000, pages 131 - 140
MAZOYER, HUMAN MUTATION, vol. 25, 2005, pages 415 - 422
NICOLA J PHILPOTT ET AL., J VIROL, vol. 76, 2002, pages 5411 - 21
SAMBROOK ET AL.: "Molecular Cloning. A Laboratory Manual", 1989, COLD SPRING HARBOR LABORATORY PRESS
SONG, NUCLEIC ACIDS RESEARCH, vol. 48, no. 4, 2020, pages e19
SUNGALEE ET AL., NATURE GENETICS, vol. 53, 2021, pages 650 - 662

Also Published As

Publication number Publication date
GB202111195D0 (en) 2021-09-15

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22761466; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)