GB2621392A

GB2621392A - Methods and uses

Info

Publication number: GB2621392A
Application number: GB2211784.0A
Authority: GB
Inventors: Umay Demirci Ilke; Anton Magnus Larsson John; Kristoffer Frisén Jonas; Rickard Håkan Sandberg Thore; Henrik Hagemann-Jensen Michael
Original assignee: Thore Rickard Haakan Sandberg
Current assignee: Thore Rickard Haakan Sandberg
Priority date: 2022-08-12
Filing date: 2022-08-12
Publication date: 2024-02-14
Also published as: WO2024033411A1; GB202211784D0

Abstract

A method for determining the location of a target sequence, such as an integrated and/or heterologous sequence, in the genomic DNA of a cell. The methods of the invention generally comprise the steps of performing linear amplification on the genomic DNA to generate a single-stranded DNA molecule comprising nucleotide sequence from the target sequence and flanking genomic DNA, followed by tagmentation and sequencing. Independent claims are included for using the technique on an individual cell, a single cell from a population of different cells, identifying the presence or location from an integrated target, a kit, and a library of sequences. The tagmentation may utilise a Tn5 transposase.

Description

METHODS AND USES

The present application relates to methods for determining the location of a target sequence, such as an integrated and/or heterologous sequence, in the genomic DNA of a cell. The methods of the invention generally comprise the steps of performing linear amplification on the genomic DNA to generate a single-stranded DNA molecule comprising nucleotide sequence from the target sequence and flanking genomic DNA, followed by tagmentation and sequencing.

The present invention finds utility, for example, in lineage tracing methods and/or in the mapping of integration sites and events, such as in gene therapy approaches and gene-editing approaches.

All prospective lineage tracing methods generally require the introduction of exogenous material into cells or tissues to some degree, for example for marking individual cells and their descendants, which limits the use of such strategies in human subjects. Lineage-tracing studies in humans rely on the detection of naturally occurring somatic mutations, including single-nucleotide variants (SNVs), copy number variants (CNVs), and variation in short tandem repeat sequences (such as microsatellites), which are stably propagated to daughter cells but are absent in distantly related cells. Sequence variation in mitochondrial DNA (mtDNA) can be a promising alternative for lineage tracing, due to mitochondrial genome size, high mutation rates and presence in multiple copies (Ludwig et al, 2019. Cell, 179:1325-1339). Not restricted by the above limitation is introduction of exogenous material into human cells in gene therapy recipient patients, where each vector-marked cell gets subsequently barcoded by a vector integration site (Schmidt et al, 2007. Nat. Methods, 4:1051-1057; Biasco et al, 2016. Cell Stem Cell, 19:107-119).

Integration site sequencing methods have limitations regarding their efficiency and they require a high number of cells as input, rendering them unsuitable for inferring lineages of rare cell types, where fewer cells are available for analysis.

The type of tissue where lineage tracing is aimed to be carried out plays an important role in the number of cells required to be processed in order to draw biologically-sound conclusions. Highly diverse tissues such as blood are likely to have a more complex lineage structure where processing of many cells may be required. This is challenging to fully investigate with existing technologies.

Advances in genome sequencing have improved the discovery and characterization of integrated sequences, for example in lineage tracing methods or mapping of integration sites. For example, advances in genome sequencing have improved the discovery and characterisation of long interspersed nuclear elements 1 ("LINE-1") for harnessing mutations to infer cell lineages. The identification of somatic mosaicism causing genomic diversity in different cell types has paved the way for more elaborate studies where the frequency and degree of shared somatic mutations was evaluated for them to be subsequently used for tracking lineages.

Previous methods for lineage tracing, for example detecting LINE-1 insertion sites, include LAM-PCR, L1-IP, RC-seq and whole-genome sequencing based approaches.

The studies which rely on LINE-1 to retrospectively reconstruct cell lineages are limited, largely because existing methods have low sensitivity and are not scalable for simultaneous analyses of high numbers of cells.

L1-IP (Evrony et al, 2012. Cell, 151:483-496) and RC-seq (Upton et al, 2015. Cell, 161:228-239) both use whole genome amplification (WGA) techniques to amplify DNA before enrichment (specifically Multiple Displacement Amplification ("MDA") and Multiple Annealing and Looping-Based Amplification ("MALBAC") respectively). L1-IP uses random seed primers for reverse priming at LINE-1 3' flanks and RC-seq uses DNA shearing and ligation. Whole genome amplification can be achieved through different methods such as multiple annealing and looping based amplification cycles (Zong et al, 2012. Science, 338(6114):1622-6) and LIANTI (Chen et al, 2017. Science, 356(6334):189-194). Methods involving WGA techniques suffer from errors introduced into the amplified sequence or drop-out of genomic regions during random priming and amplification.

Linear amplification-mediated PCR ("LAM-PCR"; Schmidt et al, 2007. Nat. Methods, 4:1051-1057) combines repeated linear primer extension, primer tag selection, and asymmetric oligonucleotide cassette ligation with nested exponential PCR amplification. LAM-PCR is technically complex and time-consuming to perform, as it includes a magnetic capture step of a biotinylated linear PCR product, and subsequent reactions then take place in solid phase. Generation of an oligonucleotide cassette for the restriction digestion and restriction digestion steps are also time consuming.

Accordingly, further methods for lineage tracing, mapping of integration sites, and identifying events in gene therapy approaches and gene-editing approaches are needed, especially with single-cell sensitivity.

Against this background, the inventors have developed a new method, termed "LUSTRE" (Lineages Using Somatic TRansposable Elements), which is a highly scalable, high-throughput method that enables the direct determination of the location of target DNA sequences in the genomic DNA of single cells. The present invention avoids amplification errors that arise from whole-genome amplification by using single DNA molecules as input, and directly enriching for the target(s) being determined, and the method is amenable to single cells. The inventors' method therefore provides an improved approach for lineage tracing and/or mapping of integration sites, and also has utility for identifying sites of gene therapy vector integration and/or CRISPR/CAS-mediated integration events.

In a first aspect, the invention provides a method for determining the location of a target sequence in the genomic DNA of a cell, the method comprising the steps of: (i) providing genomic DNA from the cell to be analysed; (ii) performing linear amplification on the genomic DNA using one or more oligonucleotide primer specific for the target sequence, to generate one or more single-stranded DNA molecule comprising nucleotide sequence from the target sequence and the flanking genomic DNA; (iii) generating double-stranded DNA from the one or more single-stranded DNA molecule of step (ii); (iv) tagmenting and sequencing the double-stranded DNA of step (iii), to determine the nucleotide sequence of the target sequence and the flanking genomic DNA; and (v) based on the nucleotide sequence obtained in step (iv), determining the location of the target sequence in the genomic DNA of the cell.

Thus, the invention provides a method in which genomic DNA is obtained from single cells and subjected to linear amplification. Linear amplification results in single stranded DNA molecules which comprise target sequence and genomic flank sequence.

Double-stranded DNA is then synthesized and subjected to tagmentation. Tagmented DNA is in turn subjected to DNA sequencing (which steps can involve the steps of amplification, indexing and pooling of the amplified DNA) and then the position of the target sequence determined by comparison with a known reference sequence (for example, the known genomic DNA sequence from the relevant cell, as held in a sequence database; or the genomic DNA of a comparable control cell which lacks the target sequence).
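As a purely illustrative sketch of the final comparison step, the following Python snippet places a sequencing read on a reference by splitting it into the known end of the target sequence and the flanking sequence that follows it. The function name, the toy sequences and the exact-match search are assumptions made for illustration only; they are not taken from the application, and a real analysis would typically use a dedicated read aligner rather than exact string matching.

```python
from typing import Optional


def locate_integration(read: str, target_end: str, reference: str) -> Optional[int]:
    """Return the 0-based reference position where the flank begins, or None.

    read       -- sequenced fragment expected to contain target + flank
    target_end -- known sequence at the end of the target (hypothetical)
    reference  -- reference sequence that lacks the target
    """
    idx = read.find(target_end)
    if idx == -1:
        return None  # the read does not contain the known target sequence
    flank = read[idx + len(target_end):]
    if not flank:
        return None  # the read ended inside the target; nothing to place
    pos = reference.find(flank)
    if pos == -1 or reference.find(flank, pos + 1) != -1:
        return None  # flank is absent from, or ambiguous in, the reference
    return pos


# Toy data (hypothetical): a 30-base reference and a read spanning the junction.
reference = "ACGTTGCAAGGCTTAACCGGATATCGCGTA"
target_end = "TTTTCCCC"
read = "GG" + target_end + "AACCGGATAT"

print(locate_integration(read, target_end, reference))
```

In this toy case the flank maps to a single position in the reference, which is reported as the integration site; reads whose flank is missing or maps to more than one place are discarded rather than guessed.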

As presented in the Examples, the inventors show that the method of the invention can detect target sequence (such as integrated target sequence) with high accuracy and specificity. In particular, somatic transposable elements, such as retro-transposition events at unknown locations in the genomic DNA, were detected by the method. The inventors observed very high sensitivity of the method by correlating the results obtained from single cells with bulk data. In particular, the inventors found that the method of the invention performs with very high sensitivity in single cells at a sequencing read depth of around one million reads per cell, which is more sensitive than previously described methods (for example, 6.7 million sequencing reads per cell was reported in Evrony et al, 2012. Cell, 151:483-496; 38 million sequencing reads per cell was reported in Upton et al, 2015. Cell, 161:228-239).

As discussed in more detail herein, the present invention utilises one or more oligonucleotide primer specific for the target sequence, and which ensure robust priming from the target sequence. Moreover, it uses transposase (such as the Tn5 transposase) to cut the resulting double-stranded DNA and insert a known oligonucleotide sequence in order to capture sequences. This approach allows for the amplification of unknown regions in the genome and circumvents the need for whole-genome sequencing-based approaches. The inventors have shown that the present invention can successfully amplify the target sequence for thousands of simultaneous loci in single cells, in a very effective manner, which is of particular utility in lineage-tracing applications.

The present invention presents a number of advantages compared to the previous methods. The present invention is considerably more sensitive for detecting target sequences (including insertions, such as LINE-1 insertions) than the previous methods, and it exhibits that sensitivity on single cells. As shown in the accompanying Examples, the single cell sensitivity and high data quality is, surprisingly, comparable to previous methods performed on a population of cells. Furthermore, this sensitivity is achieved at lower sequencing reads per cell compared to previous methods, which reduces the time and resource required to perform analysis.

The present invention is also more streamlined compared to previous methods. For example, many prior art methods involved WGA and gel electrophoresis-based selection steps of sequences of a particular size (which can introduce sequence errors, and can be cumbersome and time-consuming to perform), which the present invention does not require.

The present invention is also more sensitive than many prior art methods. In particular, the inventors surprisingly discovered that the present invention is three times more sensitive than, for example, the L1-IP assay, as demonstrated in the Examples. Increased sensitivity of the present invention is obtained, in part, due to lack of steps conducted on the solid phase, and due to the use of tagmentation.

Specifically:

- As mentioned earlier, in LAM-PCR, biotinylated linear PCR products are captured on magnetic beads, where subsequent reactions take place. The inventors have found this step to impact the sensitivity of the assay due to loss of DNA. In the present invention, no reaction occurs in solid phase while almost every reaction of LAM-PCR occurs in solid phase (in particular, "on the beads"). This results in increased sensitivity and dramatically less time spent, from 8-48 hours (for LAM-PCR) to 15 minutes (for the method of the invention).

- Unlike other methods, the present invention uses tagmentation, which the inventors have found to have significant advantages over, for example, restriction enzyme digestions and/or linker ligation, as explained in Example 2. Briefly, tagmentation effectively allows 100% of the genome (in particular accessible genome) to be fragmented (i.e. there is no limitation or bias in terms of genome coverage), amplicon size can be readily modulated, and the use of transposase (such as Tn5 transposase) is a much more efficient reaction which makes the present invention more suitable for low DNA level applications, like single cell analyses.

The present invention therefore provides a versatile method which enables the detection of any target sequence in genomic DNA, provided that sufficient sequence information of the target is known to permit design of an oligonucleotide primer capable of specifically recognising it. In practice, only a short sequence of the target is actually needed, for example: a sequence of six or more nucleotide bases.

The present method can therefore be used, among other things: to detect sites of gene therapy vector and/or CRISPR/CAS-like mediated integration; to detect the loci of vector integration; to quantify integration events in gene therapy recipients; to discover transposable elements in individual cells and/or donors; to identify insertions and/or sequences that are polymorphic in a population of cells and/or organisms (for example, polymorphisms in the human population); to detect somatic variation within an individual.

The present invention also finds particular utility in lineage tracing, which is a particularly preferred aspect of the invention. Progeny of a single cell can be identified by lineage tracing, which becomes an essential tool for studying stem cell properties in adult mammalian tissues. Lineage tracing provides a powerful means of understanding tissue development, homeostasis, and disease, especially when it is combined with experimental manipulation of signals regulating cell-fate decisions (Kretzschmar, 2012. Cell, 148(1-2):33-45).

Genomic DNA (also referred to as genomic deoxyribonucleic acid, or gDNA) comprises chromosomal DNA, and is genetically transmitted and/or transmittable from parent to offspring. As will be appreciated, genomic DNA is complex and varies in complexity from one organism to another. In particular, genomes of most eukaryotes are larger and more complex than those of prokaryotes.

Although one may expect to find more genes in more complex organisms, genome size of many eukaryotes does not appear to be related to genetic complexity. A substantial portion of some genomes consists of highly repeated noncoding DNA sequences, for example, SINEs (short interspersed elements) or LINEs (long interspersed elements); while about a million such sequences are dispersed throughout the mammalian genome, their function is unknown. The genome of Escherichia coli is approximately 4.6x10^6 base pairs long and contains 4288 genes, with nearly 90% of the DNA used as protein-coding sequence; the Saccharomyces cerevisiae genome consists of 1.2x10^7 base pairs, contains approximately 6000 protein-coding genes, with 70% of the DNA used as protein-coding sequence; the Caenorhabditis elegans genome is 9.7x10^7 base pairs, contains approximately 19,000 protein-coding genes, and only about 25% of the C. elegans sequence corresponds to exons. In comparison, in the human genome, only a small fraction of the 3x10^9 base pairs is expected to correspond to protein-coding sequence and it is estimated that humans have between 20,000 and 25,000 genes.

It will be appreciated that the method of the present invention will perform well on any genomic DNA, regardless of the organism of origin. In some embodiments of the methods disclosed herein, the size range of the genomic DNA is 1x10^4 - 1x10^12 base pairs, such as 1x10^4 - 1x10^11 base pairs, such as 1x10^4 - 1x10^10 base pairs, such as 1x10^4 - 1x10^9 base pairs, such as 1x10^4 - 1x10^8 base pairs, such as 1x10^4 - 1x10^7 base pairs, such as 1x10^4 - 1x10^6 base pairs, or 1x10^4 - 1x10^5 base pairs.

In some embodiments of the methods disclosed herein, the size of the genomic DNA is at least 1x10^3 base pairs in length, at least 1x10^4 base pairs in length, at least 1x10^5 base pairs in length, at least 1x10^6 base pairs in length, at least 1x10^7 base pairs in length, at least 1x10^8 base pairs in length, at least 1x10^9 base pairs in length, at least 1x10^10 base pairs in length, at least 1x10^11 base pairs in length, or at least 1x10^12 base pairs in length.

It will be appreciated that approaches for isolating, purifying and storing genomic DNA are routine in the art.

The term "isolated" as used herein includes nucleic acid (such as genomic DNA) separated from at least one other component (e.g., polypeptide), such as those present with the nucleic acid in its natural source.

In a typical procedure, cells which are to be studied need to be collected or sorted.

Exposing the cells to a lysis buffer breaks the cell membranes and exposes the genomic DNA. Detergents and surfactants dissolve cellular and nuclear lipids while proteases break down proteins which further exposes genomic DNA. The solution is then treated with a saline and cellular debris is separated. Finally, DNA is purified from the lysis solution by, for example, alcohol precipitation, phenol-chloroform extraction or column purification.

In some embodiments of the methods disclosed herein, genomic DNA is released as a result of incubation in lysis buffer followed by a thermolabile proteinase K treatment.

For such embodiments, DNA purification from a lysis solution may not be necessary.

It will be appreciated that for different cells to be analysed, such as from different tissues or origin or organisms, different DNA isolation protocols may be followed which are optimised for these particular cell types. It will also be appreciated that the method of the invention can be performed on any genome.

The term "purified" as used herein includes the meaning that the indicated nucleic acid (such as genomic DNA) is present in the substantial absence of other biological macromolecules, e.g., proteins, and the like. In one embodiment, the polynucleotide is purified such that it constitutes at least 70% or 80% or 85% or 90% or 95% by weight, more preferably at least 99% by weight, of the indicated biological macromolecules present (but water, buffers, and other small molecules, especially molecules having a molecular weight of less than 1000 Daltons, can be present).

The typical amount of genomic DNA derived from a single cell, such as a mammalian cell (for example, a human cell) is approximately 0.006 nanograms. The results disclosed in the Examples are obtained from single cells. The skilled person would appreciate that the present invention is applicable, and performs very well, with low amounts of starting material, for example degraded DNA samples, or DNA from single cells or DNA from different organisms. Importantly, methods known in the prior art are not optimised to be performed on single cells. For example, LAM-PCR is reported to be applicable on 0.010-10,000 nanograms of genomic DNA (Schmidt et al, 2007. Nat. Methods, 4:1051-1057).
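The figure of approximately 0.006 nanograms per cell can be checked with a short back-of-the-envelope calculation. The sketch below assumes an average mass of about 650 g/mol per base pair (a commonly used approximation, not a value stated in this application) and a diploid cell carrying two copies of the roughly 3x10^9 base pair human genome mentioned above. The function and constant names are illustrative only.

```python
AVOGADRO = 6.02214076e23       # molecules per mole
BP_MASS_G_PER_MOL = 650.0      # assumed average mass of one base pair


def genome_mass_ng(base_pairs: float, copies: int = 2) -> float:
    """Approximate mass, in nanograms, of `copies` copies of a genome."""
    grams = base_pairs * copies * BP_MASS_G_PER_MOL / AVOGADRO
    return grams * 1e9  # convert grams to nanograms


# Diploid human cell, using the ~3x10^9 bp haploid genome size.
print(round(genome_mass_ng(3e9), 4))
```

Under these assumptions the result is a little over 0.006 ng, which is consistent with the per-cell amount given above and below the 0.010 ng lower bound reported for LAM-PCR.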

In some embodiments of the methods disclosed herein, the amount of genomic DNA to be used in the methods of the invention is at least 0.001 ng, at least 0.006 ng, at least 0.01 ng, at least 0.1 ng, at least 1 ng, at least 10 ng, at least 100 ng, or at least 1 µg.

Preferably, the amount of genomic DNA to be used in the methods of the invention is between 0.001 ng and 100 ng, more preferably between 0.006 ng and 1 ng.

As will be appreciated from the description of the invention herein, in the genomic DNA, a target sequence of interest (such as an integrated target sequence) is surrounded by flanking sequences of genomic DNA. Sequences integrated into genomic DNA using natural mechanisms, for example, transposable elements, can change positions within the genome or copy themselves over time, altering the cell's genetic identity and genome size. In gene therapy, exogenous DNA is artificially integrated into genomic DNA of cells to achieve a therapeutic effect. Similar to transposable elements, an artificial vector may display random patterns of integration making accurate identification of target sequence integration site challenging.

It will be appreciated that, if the sequence integration site is not known, it cannot be enriched or identified using conventional, targeted approaches, e.g. polymerase chain reaction (PCR), as primers used in PCR need to be designed to both ends of a known DNA sequence (and that is not possible when the target is integrated at an unknown location). In such cases, using primers complementary to the integrated target sequence alone would amplify said sequences from all genomic locations, but without their integration context. Methods established to detect integrations using next-generation whole genome sequencing (NGS) are limited by their inability to work on single cells, high cost and lengthy protocols.

By "target sequence" we include the meaning of any DNA sequence within a genomic DNA of a cell whose sequence and/or location and/or quantity it is desirable to identify. In the context of the present invention, it should be appreciated that a target sequence could originate from one or more of: a repeated sequence, including tandem repeats and interspersed repeats, homopolymers, a mobile element such as a transposable element, including retrotransposons (for example long terminal repeats, LTRs; long interspersed nuclear elements, LINEs, LINE-1s, or L1s; short interspersed nuclear elements, SINEs) and DNA transposons.

The target sequence may also be a DNA sequence which was artificially integrated into genomic DNA, for example, via gene therapy.

In one embodiment, a target sequence can comprise or consist of a specific polymorphism relative to a consensus target sequence. A disease or disorder may be associated with a polymorphism in a nucleotide sequence encoded on a genomic DNA or in an integrated or heterologous target sequence. A polymorphism associated with a disease or disorder may alter the function of the nucleotide sequence or nucleic acid encoding said polymorphism, compared to a wild-type nucleotide sequence or nucleic acid that does not encode said polymorphism. Accordingly, a polymorphism associated with a disease may give rise to a loss-of-function, gain-of-function, antimorphic, dominant negative, hypomorphic, neomorphic, or lethal mutation in the nucleotide sequence associated with a disease.

The method of the invention may be used to detect such polymorphisms/variants. The polymorphism may be one or more polymorphism selected from the group comprising or consisting of: single nucleotide polymorphism (SNP), point mutation, insertion, deletion, substitution, chromosomal amplification, chromosomal deletion, chromosomal duplication, chromosomal inversion, chromosomal insertion, chromosomal translocation, loss of heterozygosity, frameshift, synonymous mutation, non-synonymous mutation, missense mutation, nonsense mutation, or any combination thereof.

In some embodiments, the method of the invention may be used to identify any single base pair polymorphisms/variants within a target sequence that exists in the genomic DNA. As shown in the accompanying Examples (such as Figure 12), the inventors achieved very good target specificity with at least three base pair polymorphisms, based on the human-specific LINE-1 mutations.

By "flanking genomic DNA", we include the genomic DNA sequence adjacent to (for example, immediately adjacent to; for example, directly attached to) one or both sides of the target sequence, for example the genomic sequence into which the target sequence has been integrated. Accordingly, it will be appreciated that an integrated target sequence will have one flanking sequence "downstream" (relative from its 3' end), and one flanking sequence "upstream" (relative from its 5' end).

As is clear from the description herein, the method of the invention does not involve whole genome amplification or whole genome sequencing. It also does not require prior knowledge of the flanking region sequence to identify the location of the target sequence.

The method of the invention uses known sequence within the target sequence to design oligonucleotides to linearly amplify the target sequence and the adjacent flanking region. It should be appreciated that although the accompanying Examples show identification of a flanking region relative to the 3' end of the target sequence, flanking regions at either or both ends of the target sequence can be determined, for example by using an oligonucleotide primer that is complementary to the 5' end of the target sequence. In this case, 5' region of a target sequence is linearly amplified together with a flanking region relative to the 5' end of the target sequence. Both flanking regions may be analysed sequentially or simultaneously, and results combined to provide further evidence of the presence of the target sequence. The skilled person would appreciate that determination of either flanking sequence may be sufficient to determine the location of an integration site of a target sequence.

The terms "nucleotide sequence" or "nucleic acid" or "polynucleotide" or "oligonucleotide" refer to a heteropolymer of nucleotides or the sequence of these nucleotides. These phrases also refer to DNA or RNA of genomic or synthetic origin which may be single-stranded or double-stranded and may represent the sense or the antisense strand, to peptide nucleic acid (PNA) or to any DNA-like or RNA-like material. In the sequences herein, A is adenine, C is cytosine, T is thymine, G is guanine and N is A, C, G or T (U). It is contemplated that where the polynucleotide is RNA, the T (thymine) in the sequences provided herein is substituted with U (uracil).

In the method disclosed herein, step (ii) comprises linear amplification of the target. By "linear amplification", we include the meaning of increasing target complementary sequence copies in a linear fashion. Linear amplification reactions are performed using a single oligonucleotide primer and produce a single complementary DNA copy of the target sequence with each cycle, rather than the exponential amplification that occurs with traditional polymerase chain reaction (PCR).

Linear amplification reaction reagents, ingredients and concentrations are similar to PCR and well known in the art. In the Examples, linear amplification was carried out for 40 amplification cycles; however, the number of amplification cycles can be adjusted to achieve the desired level of target sequence amplification (for example, 20, 30, 40, 50 or 60 or more amplification cycles may be used).
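To illustrate the difference between linear and exponential amplification, the following sketch compares idealised copy numbers per starting template. It assumes perfect efficiency in every cycle, which real reactions do not achieve, and the function names are hypothetical.

```python
def linear_copies(templates: int, cycles: int) -> int:
    """Single-stranded copies made by linear amplification.

    One primer is used, so only the original template is copied:
    each template yields one new strand per cycle.
    """
    return templates * cycles


def exponential_copies(templates: int, cycles: int) -> int:
    """Idealised PCR yield, assuming every molecule doubles each cycle."""
    return templates * 2 ** cycles


for n in (20, 40, 60):
    print(n, linear_copies(1, n), exponential_copies(1, n))
```

With the 40 cycles used in the Examples, a single template gives 40 copies under this idealised linear model, whereas perfect doubling would give 2^40 molecules. Because each linear product is copied directly from the original genomic template rather than from an earlier product, copying errors are not themselves re-amplified in later cycles.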

The terms "oligonucleotide" or "oligonucleotide primer" or "primer" refer to a nucleic acid molecule having a sequence of nucleotide residues of at least about 5 nucleotides, more preferably at least about 7 nucleotides, more preferably at least about 9 nucleotides, more preferably at least about 11 nucleotides, and even more preferably at least about 17 nucleotides. In a preferred embodiment, the oligonucleotide primer is preferably between 5-50 nucleotides in length, more preferably between 10-40 nucleotides in length, and even more preferably between 18-30 nucleotides in length. In the context of the present invention, oligonucleotide primers hybridize with the target DNA sequence and define the region of the target DNA that will be amplified.

A skilled person would understand that the method of the invention is not restricted to any particular oligonucleotide primer sequences. The oligonucleotide primers may be designed to identify the desired target sequence, for example, particular transposable elements either present in the human genome, integrated into the genome or present in the genome of other organisms. In some embodiments, the method of the invention may be applied to identify transposable elements in human genomic DNA (e.g. human-specific AluY subfamilies), mouse genomic DNA (e.g. L1MdTf) or transposable elements present in any other organism with a specific known sequence. It should therefore be appreciated that the method of the invention can be used to detect any insertions, by using the consensus sequence of a particular type of target sequence (such as a transposable element).

The oligonucleotide primer used in step (ii) must be "specific" for the target sequence, in that it must specifically bind to the target sequence. By "specifically bind" we include the meaning that the oligonucleotide primer binds preferentially to the target sequence, and does not bind to sequences that are different to the target sequence (for example, sequences that have less than 50% or 40% or 30% or 20% or 10% sequence identity).
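As a minimal sketch of what a sequence identity threshold means in practice, the snippet below computes the percentage of matching positions between two equal-length, ungapped sequences. This is a deliberately simplified definition chosen for illustration; the application does not specify how identity is to be calculated, and the example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of identical positions between two equal-length sequences.

    Simplification: no gaps or alignment are considered.
    """
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


target_site = "ACGTACGTAC"   # hypothetical primer binding site in the target
other_site = "ACGTTCGTAC"    # hypothetical off-target site, one mismatch

print(percent_identity(target_site, other_site))
```

A candidate site could then be compared against whichever identity threshold is chosen to judge whether it is sufficiently different from the target sequence.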

After a specific hybridisation to a complementary region of the target DNA, the oligonucleotide primer will provide the 3' hydroxyl end by which DNA polymerase mediated synthesis proceeds.

In a preferred embodiment, the one or more oligonucleotide primer may be modified, for example to improve recognition of the target sequence and/or performance in the amplification reaction. In a preferred embodiment, the oligonucleotide is 5' biotinylated, and/or used together with LNA (locked nucleic acid) modification. This combination increases the melting temperature of the oligonucleotide primer which creates higher specificity and protects against exonuclease activity of polymerases.

A skilled person would appreciate that oligonucleotide primers used in the method of the invention should have appropriate GC content, for example between 40% and 60%, and that it is more advantageous to use a primer with G or C base on its 3'-OH extremity, which creates a GC clamp and facilitates efficient binding at the target sequence. Preferably, primers should not contain repeats of a given base (PCR Primer: A Laboratory Manual. New York: Cold Spring Harbor Press, 1995).
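The primer design guidelines above can be expressed as a simple check. The sketch below tests GC content against the 40% to 60% range, looks for a G or C at the 3' end, and flags long runs of a single base. The run-length limit of three and the example primer are assumptions introduced for illustration; neither is taken from the application.

```python
import re


def gc_content(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    p = primer.upper()
    return 100.0 * sum(base in "GC" for base in p) / len(p)


def longest_run(primer: str) -> int:
    """Length of the longest stretch of one repeated base."""
    return max(len(m.group(0)) for m in re.finditer(r"(.)\1*", primer.upper()))


def check_primer(primer: str, min_gc: float = 40.0, max_gc: float = 60.0,
                 max_run: int = 3) -> dict:
    """Apply the three guidelines; max_run is an assumed threshold."""
    if not primer:
        raise ValueError("primer must not be empty")
    return {
        "gc_ok": min_gc <= gc_content(primer) <= max_gc,
        "gc_clamp": primer.upper()[-1] in "GC",   # 3' end is written last
        "no_long_runs": longest_run(primer) <= max_run,
    }


# Hypothetical 20-base primer, written 5' to 3'.
print(check_primer("AGCTGACCTGATCAGTCAGC"))
```

A check of this kind only screens for the sequence properties listed above; it does not assess melting temperature, secondary structure or specificity for the target sequence.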

The skilled person would also appreciate that alternative modifications to those presented in the Examples can be applied to oligonucleotide primers to increase efficiency, hybridisation stringency, robustness, and/or resistance to degradation. Such modifications could be one or more from the list comprising: Iso-C and Iso-Gs (which can be used as a 5' modification); and 5-Methyl deoxyCytidine modification (which can be used in the 3' as an alternative to LNA since it also increases the melting temperature of oligonucleotides).

As will be appreciated, a single oligonucleotide primer is used for linear amplification 3' of the target sequence.

The skilled person would appreciate that the oligonucleotide primer should specifically bind relatively close to one end of the target sequence in order for tagmented DNA to contain sequence from the flanking region.

In some embodiments of the methods disclosed herein, the oligonucleotide primer should bind to the target sequence at a location that is immediately adjacent to the flanking sequence, for example, one nucleotide base away from the end of the target sequence, or about 10 nucleotide bases away from the end of the target sequence, or within about 100 nucleotide bases away from the end of the target sequence, or within about 200 nucleotide bases away from the end of the target sequence, or within about 300 nucleotide bases away from the end of the target sequence, or within about 400 nucleotide bases away from the end of the target sequence, or within about 500 nucleotide bases away from the end of the target sequence.

As will be appreciated, step (ii) of the method generates one or more single-stranded DNA molecule comprising sequence from the target sequence and the flanking genomic DNA. It will be appreciated that each cycle of linear amplification generates a copy of such single-stranded DNA molecules, each comprising a region complementary to the target sequence and a region complementary to the 3' and/or 5' flanking region.
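To illustrate why amplification with a single primer is described as linear, the sketch below compares idealised copy numbers for linear and exponential amplification. It assumes perfect efficiency in every cycle, which real reactions do not achieve, and the function names are illustrative.

```python
def linear_copies(templates: int, cycles: int) -> int:
    """Single-primer extension: each cycle copies only the original
    template, so the product grows by one copy per template per cycle."""
    return templates * cycles

def exponential_copies(templates: int, cycles: int) -> int:
    """Two-primer PCR: products also act as templates, so the number of
    molecules doubles in each (idealised) cycle."""
    return templates * 2 ** cycles

# Hypothetical example: one template molecule, 10 cycles.
print(linear_copies(1, 10))       # 10
print(exponential_copies(1, 10))  # 1024
```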

In some embodiments of the methods disclosed herein, the size of the linearly amplified target sequence is at least 10 bases, at least 100 bases, at least 200 bases, at least 300 bases, at least 400 bases, at least 500 bases, at least 600 bases, at least 700 bases, at least 800 bases, at least 900 bases, at least 1,000 bases, at least 1,100 bases, at least 1,200 bases, at least 1,300 bases, at least 1,400 bases, at least 1,500 bases, at least 1,600 bases, at least 1,700 bases, at least 1,800 bases, at least 1,900 bases, at least 2,000 bases, at least 2,100 bases, at least 2,200 bases, at least 2,300 bases, at least 2,400 bases, at least 2,500 bases, at least 2,600 bases, at least 2,700 bases, at least 2,800 bases, at least 2,900 bases, at least 3,000 bases, at least 3,100 bases, at least 3,200 bases, at least 3,300 bases, at least 3,400 bases, at least 3,500 bases, at least 3,600 bases, at least 3,700 bases, at least 3,800 bases, at least 3,900 bases, or at least 4,000 bases.

In some embodiments of the methods disclosed herein, the size range of the linearly amplified target sequence is 10-100 bases in length, 10-200 bases in length, 10-300 bases in length, 10-400 bases in length, 10-500 bases in length, 10-600 bases in length, 10-700 bases in length, 10-800 bases in length, 10-900 bases in length, 10-1,000 bases in length, 10-1,100 bases in length, 10-1,200 bases in length, 10-1,300 bases in length, 10-1,400 bases in length, 10-1,500 bases in length, 10-1,600 bases in length, 10-1,700 bases in length, 10-1,800 bases in length, 10-1,900 bases in length, 10-2,000 bases in length, 10-2,100 bases in length, 10-2,200 bases in length, 10-2,300 bases in length, 10-2,400 bases in length, 10-2,500 bases in length, 10-2,600 bases in length, 10-2,700 bases in length, 10-2,800 bases in length, 10-2,900 bases in length, 10-3,000 bases in length, 10-3,100 bases in length, 10-3,200 bases in length, 10-3,300 bases in length, 10-3,400 bases in length, 10-3,500 bases in length, 10-3,600 bases in length, 10-3,700 bases in length, 10-3,800 bases in length, 10-3,900 bases in length, or 10-4,000 bases in length.

In the methods disclosed herein, step (iii) comprises generating double-stranded DNA from the one or more single-stranded DNA molecule of step (ii).

Reaction reagents to generate double-stranded DNA from single stranded cDNA template are well known in the art. Preferably, such reagents comprise Klenow large fragment polymerase, or any DNA polymerase with strand displacement activity, for example, Bst polymerases.

Accordingly, the double-stranded DNA generated in step (iii) comprises a region complementary to the target sequence and a region complementary to the 3' and/or 5' flanking region.

In some embodiments of the methods disclosed herein, the size of the double-stranded DNA sequence is at least 10 base pairs in length, at least 100 base pairs in length, at least 200 base pairs in length, at least 300 base pairs in length, at least 400 base pairs in length, at least 500 base pairs in length, at least 600 base pairs in length, at least 700 base pairs in length, at least 800 base pairs in length, at least 900 base pairs in length, at least 1,000 base pairs in length, at least 1,100 base pairs in length, at least 1,200 base pairs in length, at least 1,300 base pairs in length, at least 1,400 base pairs in length, at least 1,500 base pairs in length, at least 1,600 base pairs in length, at least 1,700 base pairs in length, at least 1,800 base pairs in length, at least 1,900 base pairs in length, at least 2,000 base pairs in length, at least 2,100 base pairs in length, at least 2,200 base pairs in length, at least 2,300 base pairs in length, at least 2,400 base pairs in length, at least 2,500 base pairs in length, at least 2,600 base pairs in length, at least 2,700 base pairs in length, at least 2,800 base pairs in length, at least 2,900 base pairs in length, at least 3,000 base pairs in length, at least 3,100 base pairs in length, at least 3,200 base pairs in length, at least 3,300 base pairs in length, at least 3,400 base pairs in length, at least 3,500 base pairs in length, at least 3,600 base pairs in length, at least 3,700 base pairs in length, at least 3,800 base pairs in length, at least 3,900 base pairs in length, or at least 4,000 base pairs in length.

In some embodiments of the methods disclosed herein, the size of the double-stranded DNA sequence is 10-100 base pairs in length, 10-200 base pairs in length, 10-300 base pairs in length, 10-400 base pairs in length, 10-500 base pairs in length, 10-600 base pairs in length, 10-700 base pairs in length, 10-800 base pairs in length, 10-900 base pairs in length, 10-1,000 base pairs in length, 10-1,100 base pairs in length, 10-1,200 base pairs in length, 10-1,300 base pairs in length, 10-1,400 base pairs in length, 10-1,500 base pairs in length, 10-1,600 base pairs in length, 10-1,700 base pairs in length, 10-1,800 base pairs in length, 10-1,900 base pairs in length, 10-2,000 base pairs in length, 10-2,100 base pairs in length, 10-2,200 base pairs in length, 10-2,300 base pairs in length, 10-2,400 base pairs in length, 10-2,500 base pairs in length, 10-2,600 base pairs in length, 10-2,700 base pairs in length, 10-2,800 base pairs in length, 10-2,900 base pairs in length, 10-3,000 base pairs in length, 10-3,100 base pairs in length, 10-3,200 base pairs in length, 10-3,300 base pairs in length, 10-3,400 base pairs in length, 10-3,500 base pairs in length, 10-3,600 base pairs in length, 10-3,700 base pairs in length, 10-3,800 base pairs in length, 10-3,900 base pairs in length, or 10-4,000 base pairs in length.

In the methods disclosed herein, step (iv) comprises tagmenting and sequencing the double-stranded DNA of step (iii), to determine the nucleotide sequence of the target sequence and the flanking genomic DNA.

By "tagmentation" we include the meaning of a process of fragmenting double-stranded DNA and integrating a known polynucleotide sequence (sometimes called a "sequencing adapter") into the DNA using a transposase. Those skilled in the art will understand that when performing tagmentation, transposases randomly fragment the double-stranded DNA into smaller fragments and add sequencing adapters simultaneously, thereby generating double-stranded DNA having sequencing adapters at each end.
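The combined "fragment and tag" behaviour described above can be pictured with a toy simulation. This is a sketch under strong simplifying assumptions: cut sites are drawn uniformly at random, the adapters are placeholder strings (`[A]` and `[B]`) rather than real adapter sequences, and details of real transposase chemistry are ignored. The function name `toy_tagment` is hypothetical.

```python
import random

def toy_tagment(seq: str, n_cuts: int, adapter_5: str = "[A]",
                adapter_3: str = "[B]", seed: int = 0) -> list:
    """Cut `seq` at `n_cuts` random interior positions and attach a
    placeholder adapter to both ends of every resulting fragment."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    cuts = sorted(rng.sample(range(1, len(seq)), n_cuts))
    bounds = [0] + cuts + [len(seq)]
    fragments = [seq[a:b] for a, b in zip(bounds, bounds[1:])]
    return [adapter_5 + f + adapter_3 for f in fragments]

# Hypothetical 100-base input cut at 4 random sites gives 5 tagged fragments.
for fragment in toy_tagment("ACGT" * 25, n_cuts=4):
    print(fragment)
```

Because the cut positions are not tied to any recognition motif, repeated runs with different seeds give different fragment boundaries, which is the property contrasted with restriction digestion in the following paragraph.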

As shown in the accompanying Examples, introducing sequencing adapters by way of tagmenting provides a number of advantages over, for example, restriction digestion and ligation. In particular, restriction digestion cleaves the DNA at the enzyme's restriction site, posing limitations to amplicon size and/or diversity. Tagmentation exhibits a large diversity in cleavage site, allowing complete access to the genomic DNA. Moreover, as shown in the accompanying Examples, tagmentation makes the method of the invention more sensitive compared to previous methods.

Any transposase may be used in the methods of the invention. In some embodiments, MuA or TnY transposase could be used for tagmentation (Liscovitch-Brauer et al, 2021. Nat. Biotechnol., 39:1270-1277). In a preferred embodiment, a hyperactive mutant transposase could be used for tagmentation. In a preferred embodiment, the transposase is the Tn5 transposase. It will be appreciated that a transposase, such as Tn5 transposase, can be engineered to introduce any known polynucleotide sequence, or not to insert any sequence, into DNA.

It will be appreciated that the tagmented fragments formed in step (iv) comprise complementary double-stranded DNA molecules, wherein each DNA molecule comprises a region corresponding to the target sequence, a region corresponding to the flanking region of the genomic DNA and a region corresponding to the sequencing adapter.

By "sequencing", we include the meaning of determining the nucleic acid sequence, i.e. the order of nucleotides in a nucleic acid sequence. DNA sequencing is widely applied in determining the sequence of synthetic DNA sequences, whole genes or fragments of genes, larger genetic regions (i.e. clusters of genes or operons), full chromosomes, or entire genomes of any organism.

In the methods disclosed herein, step (v) comprises determining the location of the target sequence in the genomic DNA of the cell, based on the nucleotide sequence obtained in step (iv).
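As a minimal sketch of how a location might be derived from a sequenced fragment, the example below finds the known end of the target sequence in a read, takes the bases that follow it as the flanking sequence, and looks that flank up in a reference by exact string matching. All names and sequences are hypothetical, and the exact-match lookup is a stand-in chosen for brevity; it is not stated in the source, and a practical analysis would normally use a dedicated read aligner.

```python
from typing import Optional

def locate_integration(read: str, target_end: str,
                       reference: str) -> Optional[int]:
    """Return the 0-based reference position where the flank begins,
    or None if the junction or a unique flank match is not found."""
    i = read.find(target_end)
    if i < 0:
        return None                       # read does not span the junction
    flank = read[i + len(target_end):]    # bases after the target end
    if not flank:
        return None
    pos = reference.find(flank)
    if pos < 0 or reference.find(flank, pos + 1) >= 0:
        return None                       # unmapped or ambiguous flank
    return pos

# Hypothetical toy data: the flank is taken from positions 20-31 of the reference.
reference = "TTGACCGTAAGCTTGGCATCGATCGGATTACAGGCTTAACG"
target_end = "CCATGGAATTCC"
read = "AC" + target_end + reference[20:32]
print(locate_integration(read, target_end, reference))
```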

It is preferred in the method of the invention that the genomic DNA provided in step (i) is genomic DNA from a single cell.

As explained herein, the present method is particularly advantageous in that it permits analysis of single cells. Prior art methods (such as the LAM-PCR assay) rely on generating an "average" or "composite" result from many cells (which is often referred to as "bulk analyses"), and are unable to analyse a small number of cells, and therefore lose cellular heterogeneity information. Heterogeneity exists even within the smallest populations of cells. Compared to bulk analyses, single-cell technologies have the advantage of detecting heterogeneity among individual cells, which is especially useful in distinguishing a small number of cells and delineating cell maps.

Preferably, the target sequence is heterologous to the genomic DNA of the cell.

By "heterologous to the genomic DNA of the cell", we mean that the target sequence is foreign to the genomic DNA of the cell. For example, the sequence may differ from the "normal" or "natural" sequence of the genomic DNA of the cell, and/or it may comprise or consist of sequence that has been exogenously introduced or inserted into the genomic DNA of the cell. Such insertions may include, for example, transposable elements, polymorphic germline transposable elements, somatic transposable elements and/or DNA sequences which do not naturally occur in a host genomic DNA (for example, retroviruses, synthetic constructs, whole genes, mutated sequences).

In another embodiment, the target sequence may be one or more sequences which have originated from a common sequence, such as transposable elements. One example is the "LINE-1 sequence". It will be appreciated that the genome may contain many copies of such sequences; for example, the human genome may contain many LINE-1 copies.

It is preferred that the target sequence has been integrated into the genomic DNA of the cell.

By "integrated", we mean that the target sequence has been inserted into the genomic DNA. As will be appreciated, integration can occur in numerous ways, for example via transposition, retrotransposition, or gene editing.

By "gene editing" or "genome editing" we include the meaning of technologies that allow genetic material to be added, removed and/or altered at particular locations in the genome. Several approaches to genome editing have been developed and these are widely known in the art (Li et al, 2020. Signal Transduction and Targeted Therapy, 5:1). For example, such systems include: zinc-finger nucleases (ZFNs), meganucleases, transcription activator-like effector nucleases (TALEN), clustered regularly interspaced short palindromic repeat (CRISPR)-associated 9 (Cas9) nuclease, CRISPR-like nucleases, or synthetic DNA transposons, such as the Sleeping Beauty transposase system. Methods for efficient delivery of transgenes to cells are known to a skilled person, and include, for example, lentiviral transduction or lipofectamine transfection.

Preferably, the target sequence is 1 to 5,000 base pairs in length, preferably 10 to 1,000 base pairs in length, more preferably 50 to 500 base pairs in length.

In some embodiments of the methods disclosed herein, the size of a target sequence is at least 10 base pairs in length, at least 100 base pairs in length, at least 200 base pairs in length, at least 300 base pairs in length, at least 400 base pairs in length, at least 500 base pairs in length, at least 600 base pairs in length, at least 700 base pairs in length, at least 800 base pairs in length, at least 900 base pairs in length, at least 1,000 base pairs in length, at least 1,100 base pairs in length, at least 1,200 base pairs in length, at least 1,300 base pairs in length, at least 1,400 base pairs in length, at least 1,500 base pairs in length, at least 1,600 base pairs in length, at least 1,700 base pairs in length, at least 1,800 base pairs in length, at least 1,900 base pairs in length, at least 2,000 base pairs in length, at least 2,100 base pairs in length, at least 2,200 base pairs in length, at least 2,300 base pairs in length, at least 2,400 base pairs in length, at least 2,500 base pairs in length, at least 2,600 base pairs in length, at least 2,700 base pairs in length, at least 2,800 base pairs in length, at least 2,900 base pairs in length, at least 3,000 base pairs in length, at least 3,100 base pairs in length, at least 3,200 base pairs in length, at least 3,300 base pairs in length, at least 3,400 base pairs in length, at least 3,500 base pairs in length, at least 3,600 base pairs in length, at least 3,700 base pairs in length, at least 3,800 base pairs in length, at least 3,900 base pairs in length, or at least 4,000 base pairs in length.

In some embodiments of the methods disclosed herein, the size range of a target sequence is 10-100 base pairs in length, 10-200 base pairs in length, 10-300 base pairs in length, 10-400 base pairs in length, 10-500 base pairs in length, 10-600 base pairs in length, 10-700 base pairs in length, 10-800 base pairs in length, 10-900 base pairs in length, 10-1,000 base pairs in length, 10-1,100 base pairs in length, 10-1,200 base pairs in length, 10-1,300 base pairs in length, 10-1,400 base pairs in length, 10-1,500 base pairs in length, 10-1,600 base pairs in length, 10-1,700 base pairs in length, 10-1,800 base pairs in length, 10-1,900 base pairs in length, 10-2,000 base pairs in length, 10-2,100 base pairs in length, 10-2,200 base pairs in length, 10-2,300 base pairs in length, 10-2,400 base pairs in length, 10-2,500 base pairs in length, 10-2,600 base pairs in length, 10-2,700 base pairs in length, 10-2,800 base pairs in length, 10-2,900 base pairs in length, 10-3,000 base pairs in length, 10-3,100 base pairs in length, 10-3,200 base pairs in length, 10-3,300 base pairs in length, 10-3,400 base pairs in length, 10-3,500 base pairs in length, 10-3,600 base pairs in length, 10-3,700 base pairs in length, 10-3,800 base pairs in length, 10-3,900 base pairs in length, or 10-4,000 base pairs in length.

In a preferred embodiment, the target sequence is present as a single copy in the genomic DNA of the cell, or is present as multiple copies in the genomic DNA of the cell.

In some embodiments of the methods disclosed herein, the target sequence is present as at least one copy in the genomic DNA of the cell, at least five copies in the genomic DNA of the cell, at least 10 copies in the genomic DNA of the cell, at least 20 copies in the genomic DNA of the cell, at least 50 copies in the genomic DNA of the cell, at least 100 copies in the genomic DNA of the cell, at least 150 copies in the genomic DNA of the cell, at least 200 copies in the genomic DNA of the cell, at least 500 copies in the genomic DNA of the cell, at least 1,000 copies in the genomic DNA of the cell, at least 2,000 copies in the genomic DNA of the cell, at least 3,000 copies in the genomic DNA of the cell, at least 4,000 copies in the genomic DNA of the cell, at least 5,000 copies in the genomic DNA of the cell, at least 10,000 copies in the genomic DNA of the cell, at least 20,000 copies in the genomic DNA of the cell, at least 30,000 copies in the genomic DNA of the cell, at least 40,000 copies in the genomic DNA of the cell, at least 50,000 copies in the genomic DNA of the cell, at least 100,000 copies in the genomic DNA of the cell, at least 200,000 copies in the genomic DNA of the cell, at least 300,000 copies in the genomic DNA of the cell, at least 400,000 copies in the genomic DNA of the cell, at least 500,000 copies in the genomic DNA of the cell, at least 600,000 copies in the genomic DNA of the cell, at least 700,000 copies in the genomic DNA of the cell, at least 800,000 copies in the genomic DNA of the cell, at least 900,000 copies in the genomic DNA of the cell, or at least 1,000,000 copies in the genomic DNA of the cell.

In some embodiments of the methods disclosed herein, the target sequence is present in the genomic DNA of the cell at a copy number of 1-10 copies, 1-20 copies, 1-30 copies, 1-40 copies, 1-50 copies, 1-60 copies, 1-70 copies, 1-80 copies, 1-90 copies, 1-100 copies, 1-125 copies, 1-150 copies, 1-175 copies, 1-200 copies, 1-225 copies, 1-250 copies, 1-275 copies, 1-300 copies, 1-400 copies, 1-500 copies, 1-600 copies, 1-700 copies, 1-800 copies, 1-900 copies, 1-1,000 copies, 1-2,000 copies, 1-3,000 copies, 1-4,000 copies, 1-5,000 copies, 1-10,000 copies, 1-25,000 copies, 1-50,000 copies, 1-75,000 copies, 1-100,000 copies, 1-200,000 copies, 1-300,000 copies, 1-400,000 copies, 1-500,000 copies, 1-1,000,000 copies, or 1,000,000 or more copies. Preferably the target sequence is present at a copy number of 1-500,000 copies, more preferably 1-250,000 copies, yet more preferably 1-100,000 copies, yet more preferably 1-50,000 copies, most preferably 1-10,000 copies.

Preferably, Step (ii) comprises a linear amplification, in the presence of one or more amplification primer which comprises a region that is specific for the target sequence.

It will be appreciated that, following linear amplification, the amplified product of step (ii) provides a sequence that can be specifically bound by subsequent primers and permit subsequent amplification of the population of tagmented fragments in step (iv).

Preferably, Step (iii) comprises subjecting the single-stranded DNA molecule to second-strand synthesis in the presence of degenerate hexamer oligonucleotides, to generate double-stranded DNA.

In a preferred embodiment of the methods disclosed herein, step (iii) comprises hexanucleotide oligonucleotides. By "hexanucleotide oligonucleotides", we include the meaning of a single-stranded DNA oligonucleotide sequence which is six nucleotides in length. To ensure robust and unbiased generation of double-stranded DNA, the hexanucleotide oligonucleotides are degenerate, meaning that some or all nucleotide positions contain a number of possible bases, giving a population of primers with similar sequences that cover all possible nucleotide combinations for a given sequence. A skilled person would understand that the method of the invention is not restricted to any specific hexanucleotide oligonucleotide sequences. Additionally, a skilled person would understand that the hexanucleotide oligonucleotides may be exchanged with shorter or longer oligonucleotides, for example pentamer, heptamer, octamer, nonamer and/or decamer oligonucleotides, without necessarily compromising the method procedure and/or method performance.
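To make the notion of a fully degenerate oligonucleotide pool concrete, the following sketch enumerates every sequence of a given length over the four bases. It assumes full degeneracy at every position, which is only one of the options described above ("some or all" positions), and the function name is illustrative.

```python
from itertools import product

def degenerate_oligos(length: int = 6, alphabet: str = "ACGT") -> list:
    """All possible oligonucleotides of `length` bases (4**length sequences)."""
    return ["".join(bases) for bases in product(alphabet, repeat=length)]

hexamers = degenerate_oligos(6)
print(len(hexamers))   # 4096, i.e. 4**6
print(hexamers[:3])    # ['AAAAAA', 'AAAAAC', 'AAAAAG']
```

The pool size grows as 4 to the power of the oligonucleotide length, so a fully degenerate pentamer pool has 1,024 members and an octamer pool has 65,536.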

In the context of the present invention, these added oligonucleotides (such as hexanucleotide oligonucleotides) hybridize at any position along the complementary target DNA sequence generated during linear amplification in step (ii). In a preferred embodiment, these added oligonucleotides hybridise close to or at the 5' end of the complementary DNA sequence. It will be appreciated that upon extension of the primer, the polymerase creates a sequence complementary to the copy of the target sequence generated during linear amplification.

Preferably, Step (iv) comprises:
- tagmenting the double-stranded DNA to generate a population of tagmented DNA fragments;
- amplifying the population of tagmented fragments;
- indexing the amplified fragments to obtain one or more sequencing library; and
- sequencing the one or more sequencing library.

Preferably, tagmentation generates fragments between 1 and 2,000 base pairs in length, preferably between 100 and 1,000 base pairs in length, more preferably between 500 and 700 base pairs in length.

By "sequencing library", or "library", we include the meaning of a plurality of indexed double stranded DNA molecules obtained in step (iv).

In some embodiments of the methods disclosed herein, the size range of tagmented fragments is about 50 base pairs to about 2000 base pairs in length, about 50 base pairs to about 1900 base pairs in length, about 50 base pairs to about 1800 base pairs in length, about 50 base pairs to about 1700 base pairs in length, about 50 base pairs to about 1600 base pairs in length, about 50 base pairs to about 1500 base pairs in length, about 50 base pairs to about 1400 base pairs in length, about 50 base pairs to about 1300 base pairs in length, about 50 base pairs to about 1200 base pairs in length, about 50 base pairs to about 1100 base pairs in length, about 50 base pairs to about 1000 base pairs in length, about 50 base pairs to about 950 base pairs in length, about 50 base pairs to about 900 base pairs in length, about 50 base pairs to about 850 base pairs in length, about 50 base pairs to about 800 base pairs in length, about 50 base pairs to about 750 base pairs in length, about 50 base pairs to about 700 base pairs in length, about 50 base pairs to about 650 base pairs in length, about 50 base pairs to about 600 base pairs in length, about 50 base pairs to about 550 base pairs in length, about 50 base pairs to about 500 base pairs in length, about 50 base pairs to about 450 base pairs in length, about 50 base pairs to about 400 base pairs in length, about 50 base pairs to about 350 base pairs in length, about 50 base pairs to about 300 base pairs in length, about 50 base pairs to about 250 base pairs in length, about 50 base pairs to about 200 base pairs in length, about 50 base pairs to about 150 base pairs in length, about 50 base pairs to about 100 base pairs in length, about 100 base pairs to about 1500 base pairs in length, about 150 base pairs to about 1400 base pairs in length, about 200 base pairs to about 1300 base pairs in length, about 250 base pairs to about 1200 base pairs in length, about 300 base pairs to about 1100 base pairs in length, about 350 base pairs to about 1000 base pairs in length, about 400 base pairs to about 1000 base pairs in length, about 450 base pairs to about 950 base pairs in length, about 500 base pairs to about 900 base pairs in length, about 550 base pairs to about 850 base pairs in length, about 600 base pairs to about 800 base pairs in length, about 650 base pairs to about 750 base pairs in length, about 700 base pairs to about 1500 base pairs in length, about 750 base pairs to about 1500 base pairs in length, about 800 base pairs to about 1500 base pairs in length, about 850 base pairs to about 1500 base pairs in length, about 900 base pairs to about 1500 base pairs in length, about 950 base pairs to about 1500 base pairs in length, about 1000 base pairs to about 1500 base pairs in length, about 1100 base pairs to about 1500 base pairs in length, about 1200 base pairs to about 1500 base pairs in length, about 1300 base pairs to about 1500 base pairs in length, or about 1400 base pairs to about 1500 base pairs in length. Preferably the size range of tagmented fragments is about 100 base pairs to about 1200 base pairs in length, more preferably 200 base pairs to 800 base pairs in length, most preferably 400 base pairs to 500 base pairs in length.

In some embodiments of the methods disclosed herein, Step (iv) further comprises the step of exponentially amplifying the population of tagmented fragments to generate one or more amplicon of each DNA molecule in the population.

By "amplicon" we include the meaning of a DNA molecule that has been amplified from a DNA template, for example, a PCR product. In some embodiments of the methods disclosed herein, the step of exponentially amplifying the population of tagmented DNA fragments comprises PCR amplification.

Exponential amplification reaction reagents, ingredients and concentrations for PCR are well known in the art.

In the Examples, exponential amplification was carried out for 22 amplification cycles in the presence of a set of amplification primers. A skilled person would understand that the number of amplification cycles can be adjusted to achieve the desired target sequence amplification; for example, 10, 20, 30 or 40 (or more) amplification cycles may be used.

In some embodiments of the methods disclosed herein, the step of amplifying the population of tagmented fragments comprises high-fidelity amplification.

By "high fidelity amplification" we include the meaning of amplification that results in amplicons that have very few or no sequence changes relative to the corresponding sequence in the original template molecule (e.g. the tagmented fragment). In a preferred embodiment, high-fidelity amplification is carried out using a commercial proof-reading DNA polymerase enzyme. Proof-reading DNA polymerases possess a 3'-to-5' exonuclease activity and remove erroneously incorporated bases from the growing DNA strand. High-fidelity amplification enzymes with proof-reading activity exhibit a lower error rate than that of Thermus aquaticus (Taq) DNA polymerase, whose error rate has been reported to range from 1×10⁻⁵ to 2×10⁻⁴ (Hestand et al, 2016. Mutat Res, 784-785:39-45). Proof-reading activity increases the accuracy of DNA synthesis from template DNA, resulting in highly accurate amplification. In another embodiment of the methods disclosed herein, a non-proof-reading DNA polymerase enzyme is used in exponential amplification in step (iv).

Preferably, amplification is performed in the presence of one or more pairs of oligonucleotide primers, wherein each pair comprises a first oligonucleotide primer that is specific for the target sequence, and a second oligonucleotide primer that is complementary to a sequence introduced by tagmentation in step (iv).

In a preferred embodiment, oligonucleotide primers comprise at least about 10 nucleotides, more preferably at least about 15 nucleotides, more preferably at least about 20 nucleotides, more preferably at least about 25 nucleotides, and even more preferably at least about 30 nucleotides. In a preferred embodiment, the oligonucleotide primer is preferably between 5-70 nucleotides in length, more preferably between 20-60 nucleotides in length.

In a preferred embodiment, oligonucleotide primers further comprise an indexing primer region sequence (i.e. an indexing primer adapter) enabling indexing of the amplified fragments.

In some embodiments of the methods disclosed herein, the size range of exponentially amplified fragments is about 50 base pairs to about 2000 base pairs in length, about 50 base pairs to about 1900 base pairs in length, about 50 base pairs to about 1800 base pairs in length, about 50 base pairs to about 1700 base pairs in length, about 50 base pairs to about 1600 base pairs in length, about 50 base pairs to about 1500 base pairs in length, about 50 base pairs to about 1400 base pairs in length, about 50 base pairs to about 1300 base pairs in length, about 50 base pairs to about 1200 base pairs in length, about 50 base pairs to about 1100 base pairs in length, about 50 base pairs to about 1000 base pairs in length, about 50 base pairs to about 950 base pairs in length, about 50 base pairs to about 900 base pairs in length, about 50 base pairs to about 850 base pairs in length, about 50 base pairs to about 800 base pairs in length, about 50 base pairs to about 750 base pairs in length, about 50 base pairs to about 700 base pairs in length, about 50 base pairs to about 650 base pairs in length, about 50 base pairs to about 600 base pairs in length, about 50 base pairs to about 550 base pairs in length, about 50 base pairs to about 500 base pairs in length, about 50 base pairs to about 450 base pairs in length, about 50 base pairs to about 400 base pairs in length, about 50 base pairs to about 350 base pairs in length, about 50 base pairs to about 300 base pairs in length, about 50 base pairs to about 250 base pairs in length, about 50 base pairs to about 200 base pairs in length, about 50 base pairs to about 150 base pairs in length, about 50 base pairs to about 100 base pairs in length, about 100 base pairs to about 1500 base pairs in length, about 150 base pairs to about 1400 base pairs in length, about 200 base pairs to about 1300 base pairs in length, about 250 base pairs to about 1200 base pairs in length, about 300 base pairs to about 1100 base pairs in length, about 350 base pairs to about 1000 base pairs in length, about 400 base pairs to about 1000 base pairs in length, about 450 base pairs to about 950 base pairs in length, about 500 base pairs to about 900 base pairs in length, about 550 base pairs to about 850 base pairs in length, about 600 base pairs to about 800 base pairs in length, about 650 base pairs to about 750 base pairs in length, about 700 base pairs to about 1500 base pairs in length, about 750 base pairs to about 1500 base pairs in length, about 800 base pairs to about 1500 base pairs in length, about 850 base pairs to about 1500 base pairs in length, about 900 base pairs to about 1500 base pairs in length, about 950 base pairs to about 1500 base pairs in length, about 1000 base pairs to about 1500 base pairs in length, about 1100 base pairs to about 1500 base pairs in length, about 1200 base pairs to about 1500 base pairs in length, about 1300 base pairs to about 1500 base pairs in length, or about 1400 base pairs to about 1500 base pairs in length. Preferably the size range of exponentially amplified fragments is about 100 base pairs to about 1200 base pairs in length, more preferably 200 base pairs to 800 base pairs in length, most preferably 400 base pairs to 600 base pairs in length.

In the Examples presented herein, the oligonucleotide primers used in exponential amplification are "nested" in relation to the oligonucleotide primers used in the linear amplification in step (ii). By "nested oligonucleotide primers" we include the meaning that the first set of oligonucleotide primers creates a first amplification reaction product complementary to the DNA target, and the second set of oligonucleotide primers specifically binds to the sequence which is downstream from and/or overlaps with the sequence corresponding to the sequence of primers of the first set, which sequence is introduced into the first amplification reaction product during amplification. In the context of the present invention, the first set of oligonucleotide primers is used in linear amplification in step (ii), and the second set of oligonucleotide primers is used in exponential amplification in step (iv). An assay design where such "nested" primers are used is well known in the art. A skilled person would understand that "nested" primers may improve sensitivity and specificity of DNA amplification.
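The positional relationship between a first (linear amplification) primer and a "nested" second primer can be checked with a short sketch. It assumes that both primers and the template are written 5' to 3' on the same strand and that each primer matches the template exactly once; the function name and the example sequences are hypothetical.

```python
def is_nested(template: str, first_primer: str, second_primer: str) -> bool:
    """True if the second primer binds downstream of (or overlapping, but
    starting after) the site of the first primer on the template."""
    first_pos = template.find(first_primer)
    second_pos = template.find(second_primer)
    if first_pos < 0 or second_pos < 0:
        return False  # one of the primers has no exact binding site
    return second_pos > first_pos

# Hypothetical sequences: the second primer overlaps the first and extends
# further towards the end of the target sequence.
template = "AAACCCGGGTTTACGTACGTTTGACC"
print(is_nested(template, "CCCGGGTT", "GGTTTACG"))  # True
```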

In some embodiments of the methods disclosed herein, Step (iv) further comprises the step of library preparation. By "library preparation" we include the meaning of indexing of amplicons generated in the exponential amplification step, which allows multiple amplicons from different cells or samples to be pooled and sequenced simultaneously. By "indexing" we include the meaning of adding a unique identifier (index) to DNA amplicons during library preparation. In some embodiments of the methods disclosed herein, dual indexing is performed by using custom-designed indexing primers and indexing reaction reagents. In a preferred embodiment, index primers are designed to specifically bind sequence which corresponds to the indexing primer adapter sequences introduced into exponentially amplified DNA fragments of step (iv). Preferably, the indices are between 1 and 50 base pairs in length, more preferably between 5 and 25 base pairs in length, most preferably between 5 and 10 base pairs in length.
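The purpose of dual indexing, namely assigning pooled reads back to the cell or sample they came from, can be sketched as a lookup on the pair of index sequences. The 8-nucleotide index sequences, sample names and function name below are hypothetical, and exact index matching is assumed for simplicity.

```python
from collections import defaultdict

def demultiplex(reads, sample_sheet):
    """Group read sequences by sample using the (index1, index2) pair.
    Reads whose index pair is not in the sheet go to 'undetermined'."""
    by_sample = defaultdict(list)
    for index1, index2, sequence in reads:
        sample = sample_sheet.get((index1, index2), "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)

# Hypothetical sample sheet: each cell has a unique pair of indices.
sample_sheet = {
    ("ACGTACGT", "TTGCAAGC"): "cell_1",
    ("GGCATTCA", "TTGCAAGC"): "cell_2",
}
reads = [
    ("ACGTACGT", "TTGCAAGC", "AAAC"),
    ("GGCATTCA", "TTGCAAGC", "CCGT"),
    ("ACGTACGT", "TTGCAAGC", "GGTA"),
    ("NNNNNNNN", "TTGCAAGC", "TTTT"),  # unreadable index1
]
print(demultiplex(reads, sample_sheet))
```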

In some embodiments of the methods disclosed herein, the step of indexing of amplified DNA fragments comprises PCR amplification. Indexing of amplified DNA fragments, indexing reaction reagents, ingredients and concentrations are well known in the art.

In the Examples, dual indexing was carried out for 12 amplification cycles in the presence of a set of indexing primers. A skilled person would understand that the number of amplification cycles can be adjusted to achieve the desired amplification; for example, 15, 20, 25 or 30 cycles (or more amplification cycles) may be used.

In some embodiments of the methods disclosed herein, the library comprises sequences which are at least 150 base pairs in length, at least 160 base pairs in length, at least 170 base pairs in length, at least 180 base pairs in length, at least 190 base pairs in length, at least 200 base pairs in length, at least 210 base pairs in length, at least 220 base pairs in length, at least 230 base pairs in length, at least 240 base pairs in length, at least 250 base pairs in length, at least 260 base pairs in length, at least 270 base pairs in length, at least 280 base pairs in length, at least 290 base pairs in length, at least 300 base pairs in length, at least 310 base pairs in length, at least 320 base pairs in length, at least 330 base pairs in length, at least 340 base pairs in length, at least 350 base pairs in length, at least 360 base pairs in length, at least 370 base pairs in length, at least 380 base pairs in length, at least 390 base pairs in length, at least 400 base pairs in length, at least 420 base pairs in length, at least 440 base pairs in length, at least 460 base pairs in length, at least 480 base pairs in length, at least 500 base pairs in length, at least 520 base pairs in length, at least 540 base pairs in length, at least 560 base pairs in length, at least 580 base pairs in length, at least 600 base pairs in length, at least 620 base pairs in length, at least 640 base pairs in length, at least 680 base pairs in length, at least 700 base pairs in length, at least 720 base pairs in length, at least 740 base pairs in length, at least 760 base pairs in length, at least 780 base pairs in length, at least 800 base pairs in length, at least 820 base pairs in length, at least 840 base pairs in length, at least 860 base pairs in length, at least 880 base pairs in length, at least 900 base pairs in length, at least 920 base pairs in length, at least 940 base pairs in length, at least 960 base pairs in length, at least 980 base pairs in length, or at least 1000 base pairs in length. Preferably, the library comprises sequences which are about 600 base pairs to about 700 base pairs in length.

After indexing, libraries are then pooled and sequenced, and index reads are used during downstream analysis to identify and separate the libraries obtained from different cells or samples.
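By way of illustration only, the use of index reads to separate pooled libraries can be sketched as follows. This is a minimal sketch, not the procedure used in the Examples: the index sequences, the cell names, the `demultiplex` function and the exact-match rule are all hypothetical assumptions introduced for this illustration.

```python
from collections import defaultdict

# Hypothetical sample sheet: (first index, second index) -> cell identifier.
# The 8-base index sequences are invented for illustration.
INDEX_PAIRS = {
    ("ACGTACGT", "TTGCAAGC"): "cell_A1",
    ("ACGTACGT", "GGATCCTA"): "cell_A2",
    ("CATGCATG", "TTGCAAGC"): "cell_B1",
}


def demultiplex(reads, index_pairs):
    """Group read names by the cell assigned to their dual-index pair.

    `reads` is an iterable of (read_name, index_1, index_2) tuples.
    Only exact index matches are accepted in this sketch; any other
    pair is collected under "undetermined".
    """
    by_cell = defaultdict(list)
    for name, index_1, index_2 in reads:
        cell = index_pairs.get((index_1, index_2), "undetermined")
        by_cell[cell].append(name)
    return dict(by_cell)


reads = [
    ("r1", "ACGTACGT", "TTGCAAGC"),
    ("r2", "CATGCATG", "TTGCAAGC"),
    ("r3", "ACGTACGT", "GGATCCTA"),
    ("r4", "ACGTACGT", "AAAAAAAA"),  # index pair absent from the sample sheet
]
assignments = demultiplex(reads, INDEX_PAIRS)
print(assignments)
```

Because each cell carries a unique pair of indices, a read is assigned to a cell only when both indices match; an unexpected combination is set aside rather than assigned.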

In the methods disclosed herein, Step (iv) comprises the step of sequencing. In a preferred embodiment, sequencing is done by a short-read sequencing method. By "short-read sequencing method" we include the meaning of a sequencing method that does not cover the entirety of the sequenced molecules in a single sequencing read.

In some embodiments of the methods disclosed herein, the size of a sequencing read (for example using a "short-read sequencing method") is at least 10 bases in length, at least 20 bases in length, at least 30 bases in length, at least 40 bases in length, at least 50 bases in length, at least 60 bases in length, at least 70 bases in length, at least 80 bases in length, at least 90 bases in length, at least 100 bases in length, at least 110 bases in length, at least 120 bases in length, at least 130 bases in length, at least 140 bases in length, at least 150 bases in length, at least 160 bases in length, at least 170 bases in length, at least 180 bases in length, at least 190 bases in length, at least 200 bases in length, at least 210 bases in length, at least 220 bases in length, at least 230 bases in length, at least 240 bases in length, at least 250 bases in length, at least 260 bases in length, at least 270 bases in length, at least 280 bases in length, at least 290 bases in length, at least 300 bases in length, at least 310 bases in length, at least 320 bases in length, at least 330 bases in length, at least 340 bases in length, at least 350 bases in length, at least 360 bases in length, at least 370 bases in length, at least 380 bases in length, at least 390 bases in length, at least 400 bases in length, at least 420 bases in length, at least 440 bases in length, at least 460 bases in length, at least 480 bases in length, at least 500 bases in length, at least 520 bases in length, at least 540 bases in length, at least 560 bases in length, at least 580 bases in length, at least 600 bases in length, at least 620 bases in length, at least 640 bases in length, at least 680 bases in length, at least 700 bases in length, at least 720 bases in length, at least 740 bases in length, at least 760 bases in length, at least 780 bases in length, at least 800 bases in length, at least 820 bases in length, at least 840 bases in length, at least 860 bases in length, at least 880 bases in length, at least 900 bases in length, at least 920 bases in length, at least 940 bases in length, at least 960 bases in length, at least 980 bases in length, or at least 1000 bases in length.

In some embodiments, the short-read sequencing method is selected from the list consisting of: massive parallel short-read sequencing; DNA nanoball sequencing (Drmanac et al, 2010. Science, 327(5961):78-81); Illumina dye sequencing (Solexa sequencing) (Meyer and Kircher, 2010. Cold Spring Harb Protoc); 454 pyrosequencing (Nyren and Lundin, 1985. Analytical Biochemistry, 151(2):504-509); SOLiD sequencing (Shendure and Ji, 2008. Nature Biotechnology, 26:1135-1145); Helicos single molecule fluorescent sequencing (Thompson and Steinmann, 2010. Curr Protoc Mol Biol, 7 Unit 7.10); combinatorial probe anchor synthesis (cPAS) (Fehlmann et al, 2016. Clinical Epigenetics, 8(123)); polony sequencing (Porreca, Shendure and Church, 2006. Curr Protoc Mol Biol, 7 Unit 7.8); electrical sequencing chips (e.g. GenapSys) (Lahens et al, 2017. BMC Genomics, 18(602)); or combinations thereof. In some embodiments, long-read sequencing methods can be used, for example, PacBio (Eid et al, 2009. Science, 323(5910):133-138).

In one embodiment, one end of each double-stranded DNA molecule is sequenced (i.e. "single end sequencing"). In an alternative embodiment, both ends of each double-stranded DNA molecule are sequenced (i.e. "paired end sequencing").

In the Examples presented herein, paired-end sequencing is performed. It will be understood that, in paired-end sequencing, both ends of sequences in the library are determined, so the target sequence location information is determined with high accuracy. For paired-end sequencing, the first read ("mate read 1") comprises flanking sequence, while the second read ("mate read 2") comprises one or more of the following: a portion of the immediate flanking sequence and/or a portion of the target sequence.

In a preferred embodiment of the invention where one or more LINE-1 transposable elements are detected, mate read 2 may comprise one or more of the following: 3' L1 sequence, poly-A tract, potential 3'-transduction sequence and/or additional poly-A tract, and/or potentially a portion of the immediate flanking sequence. In some embodiments the methods disclosed herein may be used to detect shorter target sequences (including shorter transposable elements, such as SINE). In such embodiments, mate read 2 may comprise a longer portion of the immediate flanking sequence in addition to a portion of the target sequence. Accordingly, it will be appreciated that short target sequences may be detected with high confidence.

In some embodiments, if long-read sequencing is performed, the sequenced fragments may comprise a LINE-1 sequence (comprising one or more of the following: 3' L1 sequence, poly-A tract, potential 3'-transduction sequence and/or additional poly-A tract), a portion of the immediate flanking sequence and/or a sequence incorporated by the transposase. A person skilled in the art would be able to use sequence alignment and analysis software to analyse long-read sequencing data.

Preferably, Step (v) comprises the step of determining the identity and/or location of the nucleotide sequence from step (iv) in the genomic DNA of the cell.

In the methods disclosed herein, step (v) comprises determining the location of the target sequence in the genomic DNA of the cell, based on the nucleotide sequence obtained in step (iv). Alignment of the flanking region to known genomic DNA can be carried out using conventional computational and/or sequence alignment methods which would be known to a skilled person. For example, software can be used to map all sequence reads obtained to a database of reference sequences and then annotate each sequence read (or read-pair) based on the population of DNA molecules from which that read/read-pair is derived, using, for example, flanking region sequence present in the read/read-pairs. By DNA sequence "mapping" we include the meaning of showing the positions of genes and other sequence features on a reference sequence, for example a genome. The location where the flanking sequence has mapped is close to the target sequence, within the range of the sequenced fragment sizes, as discussed above.
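By way of illustration only, one way of turning mapped flanking reads into candidate target-sequence locations is to group nearby mapping positions. The following is a minimal sketch under stated assumptions and is not the processing described in the Examples: the function name `cluster_flank_positions`, the `max_gap` of 700 bases (chosen to mirror the preferred library size of about 600 to 700 base pairs) and the `min_reads` threshold are hypothetical.

```python
from collections import defaultdict


def cluster_flank_positions(alignments, max_gap=700, min_reads=2):
    """Cluster mapped flanking reads into candidate target-sequence loci.

    alignments: iterable of (chromosome, position) for mapped flanking reads.
    max_gap:    largest distance allowed between neighbouring positions
                within one cluster (assumed, roughly one fragment length).
    min_reads:  minimum number of reads required to report a cluster.
    Returns a list of (chromosome, start, end, read_count) tuples.
    """
    per_chrom = defaultdict(list)
    for chrom, pos in alignments:
        per_chrom[chrom].append(pos)

    clusters = []
    for chrom in sorted(per_chrom):
        positions = sorted(per_chrom[chrom])
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev <= max_gap:
                count += 1
            else:
                # Gap too large: close the current cluster, start a new one.
                if count >= min_reads:
                    clusters.append((chrom, start, prev, count))
                start = pos
                count = 1
            prev = pos
        if count >= min_reads:
            clusters.append((chrom, start, prev, count))
    return clusters


# Hypothetical mapping positions of flanking reads from one cell.
alignments = [
    ("chr1", 10_050), ("chr1", 10_300), ("chr1", 10_420),
    ("chr1", 950_000),  # isolated read, not reported with min_reads=2
    ("chr2", 5_000), ("chr2", 5_100),
]
candidates = cluster_flank_positions(alignments)
print(candidates)
```

Each reported interval indicates a region within which the target sequence is expected to lie; its exact position would still need to be refined from the reads themselves.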

By "reference sequence" we include the meaning of a known sequence, typically from a database, against which sequence reads generated in the method of the invention can be compared and aligned. The reference sequence may or may not be part of a reference genome. Common reference genomes publicly available include, for example, Homo sapiens, Mus musculus, Rattus norvegicus, Escherichia coli, Arabidopsis thaliana, Pneumocystis carinii, Bos taurus, Hepatitis C virus, Caenorhabditis elegans, Saccharomyces cerevisiae, Chlamydomonas reinhardtii, Schizosaccharomyces pombe, Danio rerio, Mycoplasma pneumoniae, Takifugu rubripes, Dictyostelium discoideum, Oryza sativa, Xenopus laevis, Drosophila melanogaster, Plasmodium falciparum, or Zea mays. In one embodiment of the method of the invention, an artificial sequence is used as the reference sequence.

The location of the target sequence can be further refined with specialised analyses called "discordant read-pair mapping" and "split-read mapping", which both exploit information derived from the mapping procedure and knowledge about the size of the sequenced DNA fragments (Fonseca et al, 2012. Bioinformatics, 28:24, 3169-3177; Ewing, 2015, Mobile DNA, 6:24). Examples of software that can be used to perform mapping include bwa mem, blat, last and blast. Further details of how the data obtained with the method of the invention are processed are presented in the Examples disclosed herein.
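By way of illustration only, the idea behind split-read mapping can be sketched by reading the soft-clipped portion of an alignment as evidence of a junction between flanking genomic DNA and the target sequence. This is a minimal sketch, not the analysis used in the Examples: it assumes alignments reported in SAM format (1-based leftmost position and a CIGAR string), and the function name `soft_clip_breakpoints` and the `min_clip` threshold are hypothetical.

```python
import re

# SAM CIGAR operations: length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def soft_clip_breakpoints(pos, cigar, min_clip=10):
    """Return candidate breakpoints implied by soft clips in one alignment.

    pos:      1-based leftmost reference position of the alignment.
    cigar:    SAM CIGAR string, e.g. "60M40S".
    min_clip: shortest soft clip treated as evidence (assumed value).
    Returns a list of (side, reference_position, clip_length) tuples, where
    reference_position is the aligned base adjacent to the clipped bases.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips carry no read bases, so they are ignored here.
    core = [(n, op) for n, op in ops if op != "H"]
    ref_len = sum(n for n, op in core if op in REF_CONSUMING)

    breakpoints = []
    if core and core[0][1] == "S" and core[0][0] >= min_clip:
        breakpoints.append(("left", pos, core[0][0]))
    if len(core) > 1 and core[-1][1] == "S" and core[-1][0] >= min_clip:
        breakpoints.append(("right", pos + ref_len - 1, core[-1][0]))
    return breakpoints


print(soft_clip_breakpoints(1000, "60M40S"))
print(soft_clip_breakpoints(2000, "35S50M2D15M"))
```

Several reads reporting the same breakpoint position would support a junction at that base, whereas discordant read-pair mapping would use the distance and orientation of the two mates instead.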

The number of sequencing reads needed to determine the location of a target of interest is directly related to the number of available targets in the genome of interest. For example, for human-specific LINE-1 approximately 1000 target sequences are detected simultaneously in a single cell. A skilled person would appreciate that when multiple copies of a repetitive sequence are detected, for example LINE-1, the target sequence for all the copies is identical and the flanking regions of the individual copies are distinct sequences. It should be anticipated that distinct flanking sequences are used to detect the presence and location of each individual copy.

It should be appreciated that the optimal number of sequencing reads to achieve a desired result may depend on the sequencing method used. In some embodiments of the methods disclosed herein, in particular in the context of detecting LINE-1 in human cells, if a short-read sequencing method is used, the optimal number of sequencing reads is approximately 0.5 million reads per cell. A skilled person would appreciate that the optimal number of sequencing reads may vary, depending on, for example, the number of target sequences detected simultaneously. Therefore, it should be understood that in some embodiments of the methods disclosed herein, the optimal number of sequencing reads needed may be lower than 0.5 million reads per cell, such as about 0.4 million reads, about 0.3 million reads, about 0.2 million reads, about 0.1 million reads per cell or less than 0.1 million reads per cell; or higher than about 0.5 million reads per cell, such as about 0.6 million reads, about 0.8 million reads, about 1 million reads, about 2 million or more than 2 million reads per cell.
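By way of illustration only, the figures given above can be combined into a simple sequencing budget. This is a minimal sketch: the function names are hypothetical, the plate sizes are taken from the 96- and 384-well formats mentioned herein, and the per-target figure assumes, purely for arithmetic convenience, that reads are spread evenly over all targets.

```python
def reads_required(n_cells, reads_per_cell=500_000):
    """Total reads to budget for a pooled run of `n_cells` cells."""
    if n_cells < 0 or reads_per_cell < 0:
        raise ValueError("counts must be non-negative")
    return n_cells * reads_per_cell


def mean_reads_per_target(reads_per_cell, n_targets):
    """Average reads per target, assuming an even spread over targets."""
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")
    return reads_per_cell / n_targets


plate_96 = reads_required(96)      # one 96-well plate, one cell per well
plate_384 = reads_required(384)    # one 384-well plate, one cell per well
per_target = mean_reads_per_target(500_000, 1000)

print(plate_96, plate_384, per_target)
```

At approximately 0.5 million reads per cell, one 96-well plate corresponds to 48 million reads and one 384-well plate to 192 million reads; with approximately 1000 targets per cell this amounts to about 500 reads per target on average.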

Preferably, the genomic DNA provided in step (i) has not been subjected to whole-genome amplification.

As mentioned above, prior art methods (such as L1-IP and RC-seq) use whole genome amplification (WGA) techniques to amplify DNA before enrichment. Whole genome amplification is a widely used molecular technique as minute amounts of DNA can be multiplied to generate quantities suitable for genetic testing and analysis. However, a disadvantage of WGA is that it introduces sequence errors into the DNA during the course of amplification (Sabina and Leamon, 2015. Methods Mol Biol, 1347:15-41).

The method of the invention conveniently does not require WGA and/or gel electrophoresis-based selection steps. Therefore, it would be appreciated that the method of the invention does not suffer from WGA-introduced amplification errors or biases. The resulting method is therefore faster and more sensitive than previous methods, while still allowing for the analysis of the entire human genome.

Preferably, the cell to be analysed is selected from the following: a eukaryotic cell (for example, from an animal, a plant, or a fungus), a bacterial cell (for example, from Eubacteria), or an archaeal cell (for example, from Archaebacteria). It will be appreciated that the method of the invention can be performed on any genome.

Preferably, the cell to be analysed is provided in a biological sample, for example, a biopsy of a tissue, a whole tissue, blood, serum, urine, or saliva. Methods for obtaining such samples are well known in the art, as are methods for preparing and/or isolating cells in such samples for genetic analysis. A skilled person would therefore be able to obtain such samples and analyse them using the method of the invention without difficulty.

In a further aspect, the invention provides a method for determining the location of a target sequence in the genomic DNA of more than one cell in a population of different cells, the method comprising the steps of: (i) providing genomic DNA from more than one cell in the population of different cells to be analysed; (ii) performing linear amplification of the genomic DNA from the more than one cell in the population of different cells using one or more oligonucleotide primer specific for the target sequence, to generate from the genomic DNA from the more than one cell in the population of different cells one or more single-stranded DNA molecule comprising sequence from the target sequence and the flanking genomic DNA; (iii) generating double-stranded DNA from the one or more single-stranded DNA molecule of step (ii); (iv) tagmenting and sequencing the double-stranded DNA of step (iii), to determine the nucleotide sequence of the target sequence and the flanking genomic DNA; and (v) based on the nucleotide sequence obtained in step (iv), determining the location of the target sequence in the genomic DNA of the more than one cell in the population of different cells.

It will be appreciated that, unless stated otherwise, the definitions and descriptions provided above in relation to the previous aspects of the invention also apply to this further aspect of the invention.

In a preferred embodiment the method determines the location of a target sequence in the genomic DNA of every cell in the population of different cells, or substantially every cell in the population of different cells.

Accordingly, in another aspect, the invention permits analyses of multiple individual cells from a population of cells. In one embodiment, every cell in the population is analysed. A skilled person would appreciate that there is no theoretical limit to how many cells can be analysed with the method of the invention.

In a preferred embodiment, the population of cells comprises: cells derived from embryogenesis; cells derived from a particular organ (for example, whole blood, liver, kidney, heart, lung, bone, muscle, stomach, gallbladder, intestine, bladder, brain, pancreas, adrenal gland, thymus, parathyroid, thyroid, spinal cord, skin, bone marrow, spleen, tonsil, ovary, testis); or cells obtained from cultured cells.

In a further embodiment, the population of cells comprises cells derived from a non-healthy or diseased tissue, such as from cancerous tissue. Preferred cancers include: bladder cancer, breast cancer, colon and rectal cancer, endometrial cancer, kidney cancer, leukemia, liver cancer, lung cancer, melanoma, non-Hodgkin lymphoma, pancreatic cancer, prostate cancer, and thyroid cancer.

It is preferred that at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or 90% or more of the cells in the population of different cells are analysed, preferably simultaneously.

Preferably, between 1-1,000,000 cells from the cell population are analysed, preferably simultaneously. In some embodiments of the methods disclosed herein, the number of cells in the population that are analysed is between 1 to 500,000 cells, 1 to 250,000 cells, 1 to 150,000 cells, 1 to 100,000 cells, or more preferably 1 to 50,000 cells, or most preferably 1 to 20,000 cells.

In a further aspect, the target sequence is heterologous to the genomic DNA of the more than one cell in the population.

In a preferred embodiment, the target sequence has been integrated into the genomic DNA of the more than one cell in the population.

Gene therapy with, for example, retroviral vectors, can induce adverse effects when those vectors integrate in sensitive genomic regions. Therefore, vectors are preferred that target such sensitive regions less frequently. In some embodiments, a DNA sequence integration pattern is different between one or more cells in the population of different cells. This difference can be attributed either to natural biological mechanisms or to the technique used in DNA sequence integration. In some embodiments, the DNA sequence does not integrate into the genomic DNA of one or more cells in the population of different cells. Accordingly, a skilled person would understand that the method of the invention can be applied to investigate DNA integration pattern differences between cells from a population of different cells, for example, to quantitatively and qualitatively study DNA integration patterns between the cells.

It is preferred that the target sequence is 1 to 5,000 base pairs in length, preferably to 1,000 base pairs in length, more preferably 50 to 500 base pairs in length.

Preferably, the target sequence is present as a single copy in the genomic DNA of the more than one cell in the population, or is present as multiple copies in the genomic DNA of the more than one cell in the population.

It is preferred that the target sequence is present in 1 to 1,000,000 copies in the genomic DNA, preferably present in 1 to 100,000 copies in the genomic DNA, more preferably present in 1 to 10,000 copies in the genomic DNA.

It is preferred that in step (i), individual cells from the population of different cells are sorted into one or more microwell plates, most preferably, 96- or 384-well microwell plates.

By "sorting" we include the meaning of isolating a desired cell population from a heterogeneous suspension of cells based on its physical or biological properties (such as size, morphological parameters, viability, and/or extracellular and/or intracellular protein expression) or physically separating one or more cells from a population of different cells into individual reaction chambers. Methods for sorting cells are known in the art and a skilled person would know which methods of sorting cells would be compatible with the methods of the invention. Performing the methods of the invention on cells sorted into microwell plates (such as 96- or 384-well microwell plates) ensures high throughput and physical separation of reaction constituents. A skilled person would understand that it is important to process such plates carefully so as not to mix materials from different wells before pooling.

Preferably, step (i) comprises providing the genomic DNA from the more than one cell in the population of different cells as a separate sample.

Approaches for obtaining, isolating, purifying, and storing genomic DNA from cells sorted into one or more microwell plates are known in the art.

It is preferred that Step (ii) comprises a linear amplification, in the presence of one or more amplification primer which comprises a region that is specific for the target sequence.

Preferably, Step (iv) comprises: - tagmenting the double-stranded DNA to generate a population of tagmented DNA fragments; - amplifying the population of tagmented fragments; - indexing the amplified fragments to obtain one or more sequencing library; and - sequencing the one or more sequencing library.

In a preferred embodiment, tagmentation generates fragments between 1 to 2,000 base pairs in length, preferably between 100 to 1,000 base pairs in length, more preferably between 400 to 500 base pairs in length.

In some embodiments of the methods disclosed herein, Step (iv) further comprises the step of exponentially amplifying and indexing the population of tagmented fragments to generate one or more amplicon of each DNA molecule in the population.

It is preferred that amplification is performed in the presence of one or more pairs of oligonucleotide primers, wherein each pair comprises a first oligonucleotide primer that is specific for the target sequence, and a second oligonucleotide primer that is complementary to a sequence introduced by tagmentation in step (iv).

Preferably, indexing of the amplified fragments to obtain one or more sequencing library is performed by using a unique pair of indexing primers for every cell from a cell population, to identify the cell from which the sequenced fragments originated.

Preferably, Step (v) comprises the step of determining the identity and/or location of the nucleotide sequence from step (iv) in the genomic DNA of the more than one cell in the population of different cells.

In some embodiments of the methods disclosed herein, step (v) further comprises mapping the flanking region of the sequenced fragments to the reference genome, specific for an organism, based on their alignment to some or all of the reference genome. Preferably, step (v) further comprises processing of the mapped reads in order to determine the location of a target sequence in the genomic DNA, and optionally comparing the locations of the detected insertions within the reference genome between cells from the population of cells. It would be appreciated by a skilled person that cells within a population of cells and/or different populations of cells may have target sequence present at different locations. In this context, "population of cells" may mean e.g. cells from different tissues of the same individual, or cells from the same kind of tissue from different individuals, or cells from a primary tumor biopsy and a metastatic tumor biopsy.
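By way of illustration only, a comparison of detected insertion locations between cells can be sketched as a set comparison. This is a minimal sketch under stated assumptions, not the analysis of the Examples: the function names, the cell names, the coordinates and the 1,000-base bin used to treat nearby positions as the same site are all hypothetical, and fixed bins can split a site that lies close to a bin boundary.

```python
def bin_site(chrom, pos, bin_size=1000):
    """Reduce a genomic position to a (chromosome, bin index) key."""
    return (chrom, pos // bin_size)


def compare_cells(sites_by_cell, bin_size=1000):
    """Return (shared, private) insertion sites across cells.

    sites_by_cell: dict mapping cell name -> list of (chromosome, position).
    shared:  set of binned sites detected in every cell.
    private: dict mapping cell name -> binned sites seen in that cell only.
    """
    binned = {
        cell: {bin_site(chrom, pos, bin_size) for chrom, pos in sites}
        for cell, sites in sites_by_cell.items()
    }
    shared = set.intersection(*binned.values()) if binned else set()
    private = {}
    for cell, sites in binned.items():
        others = set().union(*(s for c, s in binned.items() if c != cell))
        private[cell] = sites - others
    return shared, private


# Hypothetical insertion calls for two cells.
sites_by_cell = {
    "cell_1": [("chr1", 10_200), ("chr3", 77_500)],
    "cell_2": [("chr1", 10_350), ("chr5", 1_200_000)],
}
shared, private = compare_cells(sites_by_cell)
print(shared)
print(private)
```

Sites shared by all cells are consistent with a common origin, while sites private to one cell indicate a target sequence present at a location not detected in the other cells.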

In a preferred embodiment, the genomic DNA provided in step (i) has not been subjected to whole-genome amplification.

It is preferred that the cell to be analysed is selected from the following: a eukaryotic cell (for example, from an animal, a plant, or a fungus), a bacterial cell (for example, from Eubacteria), or an archaeal cell (for example, from Archaebacteria).

In a further aspect, every cell in the population of different cells to be analysed is provided in a biological sample, for example, a biopsy of a tissue, a whole tissue, blood, serum, urine, or saliva.

In a further aspect, the invention provides a method for determining the presence and/or location of an integrated target sequence in the genome of a cell, the method comprising the steps of: (i) integrating a target sequence into the genomic DNA of a cell, and providing genomic DNA from the cell; (ii) performing linear amplification of the genomic DNA using one or more oligonucleotide primer specific for the integrated target sequence, to generate one or more single-stranded DNA molecule comprising sequence from the integrated target sequence and the flanking genomic DNA; (iii) generating double-stranded DNA from the one or more single-stranded DNA molecule of step (ii); (iv) tagmenting and sequencing the double-stranded DNA of step (iii), to determine the sequence of the integrated target sequence and the flanking genomic DNA; and (v) based on the nucleotide sequence obtained in step (iv), determining the presence and/or location of the integrated target sequence in the genome of the cell.

It is preferred that the genomic DNA provided in step (i) is from a single cell, or is from more than one cell in a population of different cells.

Preferably, at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or 90% or more of the cells in the cell population are analysed, preferably simultaneously.

In a further aspect, the step of integrating the target sequence into the genome of the cell, or one or more cell in a population of different cells, is performed using a gene-editing technique, for example using CRISPR or a CRISPR-like system.

By "gene editing" or "genome editing" or "genetic-editing" we include the meaning of technologies that allow genetic material, such as target polynucleotide sequence, to be integrated (added), removed and/or altered at particular locations in the genome.

Several approaches to genome editing have been developed and these are widely known in the art. For example, such systems include: zinc-finger nucleases (ZFNs), meganucleases, TALENs, CRISPR-Cas nucleases (for example, CRISPR-Cas9), CRISPR-like nucleases, or synthetic DNA transposons, such as the Sleeping Beauty transposase system. Methods for efficient delivery of transgenes to cells are known to a skilled person, and include, for example, lentiviral transduction or Lipofectamine transfection.

It is preferred that the target sequence is present as a single copy in the genomic DNA of the cell, or is present as a single copy in the genomic DNA of one or more cell in a population of different cells, or is present as multiple copies in the genomic DNA of the cell, or is present as multiple copies in the genomic DNA of one or more cell in a population of different cells.

Preferably, the target sequence has been integrated into the genomic DNA of the cell, or one or more cell in a population of different cells, in 1 to 1,000,000 copies in the genomic DNA, preferably in 1 to 100,000 copies in the genomic DNA, more preferably in 1 to 10,000 copies in the genomic DNA.

Preferably, the target sequence is heterologous to the genomic DNA of the cell.

It will be appreciated that descriptions provided above in relation to the "target sequence" also apply to this further aspect of the invention.

A skilled person will appreciate that for any target sequence, such as an integrated target sequence or an insertion, the distance between the sequence targeted by the oligonucleotide primer used in linear amplification of step (ii) and the adjacent flanking region should be considered. This distance is associated with the sequencing platform employed. For short-read sequencing, as used in the Examples presented herein, the skilled person will appreciate that the oligonucleotide primer should bind relatively close to one end of the integrated target sequence in order for tagmented DNA to contain sequence from the flanking region.

In some embodiments of the methods disclosed herein, the oligonucleotide primer should bind to the integrated target sequence at a location that is immediately adjacent to the flanking sequence, for example, one nucleotide base away from the end of the integrated target sequence, or about 10 nucleotide bases away from the end of the integrated target sequence, or within about 100 nucleotide bases away from the end of the integrated target sequence, or within about 200 nucleotide bases away from the end of the integrated target sequence, or within about 300 nucleotide bases away from the end of the integrated target sequence, or within about 400 nucleotide bases away from the end of the integrated target sequence, or within about 500 nucleotide bases away from the end of the integrated target sequence.
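By way of illustration only, the distance consideration above can be expressed as a simple check during primer design. This is a minimal sketch: the function names, the coordinate convention (1-based positions within the target sequence, with the primer extending towards the end of the target at position `target_length`) and the example values are hypothetical, and the default limit of 500 bases is taken from the largest distance listed above.

```python
def primer_distance_to_end(target_length, primer_3prime_pos):
    """Bases between the primer's 3' end and the end of the target sequence.

    Positions are 1-based within the target; the primer is assumed to
    extend towards the end of the target at position `target_length`.
    """
    if not 1 <= primer_3prime_pos <= target_length:
        raise ValueError("primer position must lie within the target")
    return target_length - primer_3prime_pos


def primer_is_suitable(target_length, primer_3prime_pos, max_distance=500):
    """True if the primer binds within `max_distance` bases of the end."""
    distance = primer_distance_to_end(target_length, primer_3prime_pos)
    return distance <= max_distance


# Hypothetical 6,000-base target sequence.
print(primer_distance_to_end(6000, 5900))   # primer close to the end
print(primer_is_suitable(6000, 5900))
print(primer_is_suitable(6000, 3000))       # primer in the middle
```

A primer binding far from the end of the target sequence would yield short-read fragments consisting mostly of target sequence, leaving little or no flanking sequence to map.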

It is preferred that in step (i), the cell, or one or more cell from a population of different cells, are sorted into one or more microwell plates, most preferably, 96- or 384-well microwell plates.

In a preferred embodiment, step (i) comprises providing the genomic DNA from the more than one cell in the population of different cells as a separate sample.

Preferably, Step (iv) comprises: - tagmenting the double-stranded DNA to generate a population of tagmented DNA fragments; - amplifying the population of tagmented fragments; - indexing the amplified fragments to obtain one or more sequencing library; and - sequencing the one or more sequencing libraries.

It is preferred that tagmentation generates fragments between 1 to 2,000 base pairs in length, preferably between 100 to 1,000 base pairs in length, more preferably between 400 to 500 base pairs in length.

In a preferred embodiment, amplification is performed in the presence of one or more pairs of oligonucleotide primers, wherein each pair comprises a first oligonucleotide primer that is specific for the target sequence, and a second oligonucleotide primer that is complementary to a sequence introduced by tagmentation in step (iv).

It is preferred that the cell, or one or more cell in the population of different cells, to be analysed is selected from the following: a eukaryotic cell (for example, from an animal, a plant, or a fungus), a bacterial cell (for example, from Eubacteria), or an archaeal cell (for example, from Archaebacteria).

Preferably, the cell, or one or more cell in the population of different cells to be analysed, is provided in a biological sample, for example, a biopsy of a tissue, blood, serum, urine, or saliva.

In some embodiments of the methods disclosed herein, the invention is performed in combination with a second RNA-based assay. By "RNA-based assay" we include the meaning of any assay that can be performed to obtain qualitative and quantitative measurements of RNA molecules. In some embodiments of the methods disclosed herein, the RNA molecules comprise one or more RNA molecule selected from the group consisting of: messenger RNA (mRNA), precursor mRNA (pre-mRNA), antisense RNA (asRNA) and precursors thereof, enhancer RNA and precursors thereof, long non-coding RNA (lncRNA) and precursors thereof, microRNA (miRNA) and precursors thereof, ribosomal RNA (rRNA) and precursors thereof, transfer RNA (tRNA) and precursors thereof, histone RNA and precursors thereof, small nucleolar RNA (snoRNA) and precursors thereof, small nuclear RNAs (snRNA) and precursors thereof, mitochondrial RNA and precursors thereof, viral RNA, transposon RNA, synthetic RNA, in vitro transcribed RNA, or combinations thereof. Analysis of RNA transcripts that are produced by the genome, using high-throughput methods (such as RNA sequencing or microarray analysis), allows identification of genes that are differentially expressed between cells from a population of cells and/or distinct cell populations. In some embodiments, differential gene expression results from different conditions to which cells are exposed, for example, treatments with chemical compounds.

For the invention to be combined with another RNA-based assay, the cytosolic and nuclear fractions of a cell and/or cells may be physically (spatially) separated. The second RNA-based assay may then be performed on the RNA present in the cytosolic fraction of a cell and/or cells, while the method of the invention is performed on the genomic DNA present in the nuclear fraction. Methods of separating cellular components (for example, cytosolic and nuclear fractions) while preserving individual functions of each component are known in the art (Zachariadis V et al, 2020. Mol. Cell., 80(3):541-553).

A skilled person would therefore be able to obtain such fractions and analyse them using the method of the invention without difficulty.

In another aspect, the invention provides a kit of parts for determining the location of a target sequence in the genomic DNA of a cell, or population of cells, wherein the kit comprises: (i) one or more oligonucleotide amplification primer and/or oligonucleotide primer pairs; (ii) a DNA polymerase and a transposase; (iii) optionally, instructions for use.

In certain embodiments, kits are provided that may be used to carry out the methods of the invention as described herein, for example for determining the location of a target sequence in the genomic DNA of a cell, or population of cells, using the oligonucleotides, method steps and/or reagents described herein.

Preferred kits may comprise one or more containers (such as vials, tubes, and the like) configured to contain the reagents used in the methods described herein, and optionally may contain instructions or protocols for using such reagents. The kits described herein may comprise one or more components selected from the group consisting of one or more oligonucleotides described herein (including, but not limited to, one or more amplification primers used in step (ii), degenerate hexamer oligonucleotides of step (iii), one or more pairs of oligonucleotide primers of step (iv), one or more pairs of indexing primers of step (iv)), one or more DNA polymerase, one or more transposase, one or more buffer or buffering salts, one or more nucleotides, one or more genomic DNA or template DNA molecules, and/or other reagents for analysis or further manipulation of the products or intermediates produced by the methods described herein. Such additional components may include components used for cell lysis, DNA purification, DNA manipulation (such as magnetic beads or purification columns) and/or DNA sequencing.

In some embodiments of the kits disclosed herein, the number of amplification primers to be used in the kits of the invention is at least one amplification primer, at least two amplification primers, at least three amplification primers, at least four amplification primers, at least five amplification primers, at least 10 amplification primers, at least 16 amplification primers, at least 32 amplification primers, at least 64 amplification primers, at least 96 amplification primers, at least 128 amplification primers, at least 384 amplification primers or more amplification primers.

In some embodiments of the kits described herein, one or more amplification primers used in step (ii) is provided in individual vials or tubes or in a multi-well plate. In some embodiments, the multi-well plate comprises a 96 or 384 well plate. In some embodiments, the kit further comprises a plurality of such multi-well plates, such as two or more, four or more, six or more, eight or more plates. As will be appreciated, the use of multi-well plates can allow kits to be compact, fitting more tubes into the same box, saving on shipment and storage space and costs.

Preferably, the one or more amplification primer in the kit of the invention is as described hereinabove. For example, such amplification primers preferably comprise a sequence of nucleotide residues of at least about five nucleotides, more preferably at least about seven nucleotides, more preferably at least about nine nucleotides, more preferably at least about 11 nucleotides, and even more preferably at least about 17 nucleotides. In a preferred embodiment, the amplification primer is preferably between 5-50 nucleotides in length, more preferably between 10-40 nucleotides in length, and even more preferably between 18-30 nucleotides in length.

It will be appreciated that in some embodiments of the kits described herein, an amplification primer which is designed to specifically bind one DNA target sequence can be present in multiple tubes and/or vials or plates of the kit. In other embodiments, different oligonucleotide primers, designed to specifically bind different target sequences, can be present in separate tubes and/or vials and/or plates.

A skilled person will appreciate that the particular arrangement and number of oligonucleotide primers in the kit will depend on the type of target DNA or targets, as well as the scale and type of equipment used, such as liquid handling robots. In some embodiments of the kits disclosed herein, the kit can be for detecting any DNA sequence target within the genomic DNA of a cell or plurality of cells, whose sequence and/or location and/or quantity it is desirable to identify, as already described herein. It will be appreciated that a target sequence could originate from one or more of: a repeated sequence, including tandem repeats and interspersed repeats; homopolymers; and a mobile element such as a transposable element, including retrotransposons (for example long terminal repeats, LTRs; long interspersed nuclear elements, LINEs, LINE-1s, or L1s; short interspersed nuclear elements, SINEs) and DNA transposons.

In some embodiments, the target sequence may be a DNA sequence which was artificially integrated into genomic DNA, for example, via gene therapy, or comprise or consist of a specific polymorphism/variant relative to a consensus target sequence. In such an embodiment, for example, the integrated DNA target or detected polymorphism relative to a consensus target sequence may be associated with a disease or disorder. A skilled person will understand that amplification primers may be designed to identify any desired target sequence, for example, particular transposable elements either present in the human genome, integrated into the genome or present in the genome of other organisms.

In a preferred embodiment, one or more genomic DNA or template DNA molecules is provided as a "control sample" in the kit. By "control sample" or "control DNA" or "controls" we include the meaning of samples intended to reduce false negative or false positive results, control the performance of the methods described herein and/or aid data analysis and interpretation. In some embodiments the control sample may be a synthetic DNA, genomic DNA obtained from cells, and/or DNA obtained from any suitable biological sample, for example, a biopsy of a tissue, a whole tissue, blood, serum, urine, or saliva.

In some embodiments of the kits disclosed herein, the amount of DNA to be used as controls is at least 0.001 ng, at least 0.006 ng, at least 0.01 ng, at least 0.1 ng, at least 1 ng, at least 10 ng, at least 100 ng, or at least 1 µg. In some embodiments, one or more additional amplification primers of step (ii) are provided to specifically bind a particular DNA target, optimised for the type of control used.

In some embodiments of the kits disclosed herein, one or more buffers or buffering salts and/or reagents are provided, to be used in the linear amplification reaction of step (ii) of the methods disclosed herein. Linear amplification reaction reagents, ingredients and concentrations are, for example, as those described hereinabove and/or provided in the accompanying examples, such as Example 1. In a preferred embodiment, such reagents comprise nucleotides and/or a DNA polymerase, for example a thermostable DNA polymerase.

In some embodiments, the oligonucleotides provided in the kit, are for generating double-stranded DNA from single stranded cDNA template obtained from linear amplification step (ii). In a preferred embodiment of the kits disclosed herein, oligonucleotides to generate double-stranded DNA comprise hexanucleotide oligonucleotides. A skilled person will understand that the kit is not restricted to any specific hexanucleotide oligonucleotide sequences. Additionally, a skilled person will understand that the hexanucleotide oligonucleotides may be exchanged or mixed with shorter or longer oligonucleotides, for example pentamer, heptamer, octamer, nonamer and/or decamer oligonucleotides, without necessarily compromising the method procedure and/or method performance.

In some embodiments, one or more buffers, buffering salts and/or reagents to generate double-stranded DNA from single stranded cDNA template obtained from linear amplification step (ii) is provided in the kit. In a preferred embodiment, such reagents comprise nucleotides, Klenow large fragment polymerase, and/or any DNA polymerase with strand displacement activity, for example, Bst polymerase.

In a preferred embodiment, the kit comprises one or more transposase, which may be used to perform the tagmentation steps of the methods of the invention. Any transposase can be used, such as MuA or TnY transposase. In a preferred embodiment, a hyperactive mutant transposase could be used for tagmentation. Preferably, the transposase is Tn5 transposase.

As already discussed herein, those skilled in the art will understand that transposases randomly fragment the double-stranded DNA into smaller fragments and add sequencing adapters simultaneously, thereby generating double-stranded DNA having sequencing adapters at each end. In some embodiments of the kits disclosed herein, different types of transposases are provided, for example, one transposase designed to add one sequencing adapter, and another transposase designed to add a different sequencing adapter.

In some embodiments of the kits disclosed herein, the number of transposases to be used in the kits of the invention is at least one transposase, at least two transposases, at least three transposases, at least four transposases, at least five transposases, at least 10 transposases, at least 16 transposases, at least 32 transposases, at least 64 transposases, at least 96 transposases, at least 128 transposases, at least 384 transposases or more transposases.

In some embodiments of the kits described herein, one or more transposases used in step (iv) is provided in the kit in individual vials and/or tubes and/or in a multi-well plate.

In some embodiments, one or more transposases is provided in a 96 or 384 well plate. In some embodiments, the kit further comprises a plurality of such plates. In some embodiments, the plurality of plates may include two or more, four or more, six or more, or eight or more plates.

In some embodiments, the kit further comprises one or more pairs of oligonucleotide primers of step (iv) of the methods of the invention, to exponentially amplify the population of tagmented fragments to generate one or more amplicon of each DNA molecule in the population.

Such oligonucleotide primers are as described above in relation to the methods of the invention. In a preferred embodiment, one or more pairs of oligonucleotide primers comprise a nucleic acid molecule having a sequence of nucleotide residues of at least about 10 nucleotides, more preferably at least about 15 nucleotides, more preferably at least about 20 nucleotides, more preferably at least about 25 nucleotides, and even more preferably at least about 30 nucleotides. In a preferred embodiment, the oligonucleotide primer is preferably between 5-70 nucleotides in length, more preferably between 20-60 nucleotides in length.

In a preferred embodiment, each pair of one or more pairs of oligonucleotide primers comprises a first oligonucleotide primer that is specific for the target sequence, and a second oligonucleotide primer that is complementary to a sequence introduced by tagmentation in step (iv).

In a preferred embodiment, one or more pairs of oligonucleotide primers further comprise an indexing primer region sequence (i.e. an indexing primer adapter) enabling indexing of the amplified fragments.

A skilled person will appreciate that in some embodiments of the kits disclosed herein, the number of oligonucleotide primer pairs will depend on the number of transposases provided in the kit, as the second oligonucleotide primer is complementary to a sequence introduced by tagmentation in step (iv). In some embodiments of the kits disclosed herein, the number of oligonucleotide primer pairs to be used in the kits of the invention is at least one primer pair, at least two primer pairs, at least three primer pairs, at least four primer pairs, at least five primer pairs, at least 10 primer pairs, at least 16 primer pairs, at least 32 primer pairs, at least 64 primer pairs, at least 96 primer pairs, at least 128 primer pairs, at least 384 primer pairs or more primer pairs.

In some embodiments of the kits described herein, one or more pairs of oligonucleotide primers used in step (iv) is provided in individual vials and/or tubes and/or in a multi-well plate. In some embodiments, one or more pairs of oligonucleotide primers is provided in a 96 or 384 well plate. In some embodiments, the kit further comprises a plurality of such plates, such as two or more, four or more, six or more, eight or more plates.

In some embodiments, one or more buffers, buffering salts and/or reagents to exponentially amplify the population of tagmented fragments is provided in the kit. Exponential amplification reaction reagents, ingredients and concentrations are, for example, as those discussed above in relation to the methods of the invention and/or provided in the accompanying examples, such as Example 1. In a preferred embodiment, reagents comprise nucleotides and/or a DNA polymerase, for example a thermostable DNA polymerase.

In some embodiments of the kits disclosed herein, indexing primers comprise dual indexing primers, i.e. a primer pair in which each primer comprises a custom-designed index. Such indexing primers are as described above in relation to the methods of the invention. In a preferred embodiment, indexing primers are designed to specifically bind a sequence which corresponds to the indexing primer adapter sequences introduced into exponentially amplified DNA fragments of step (iv). Preferably, a unique pair of indexing primers is used for every individual cell analysed. Preferably, the indices are between 1 and 50 base pairs in length, more preferably between 5 and 25 base pairs in length, most preferably between 5 and 10 base pairs in length.
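By way of illustration only, the requirement that every cell receives a unique pair of indexing primers can be sketched as follows. The sketch assumes a combinatorial layout in which 16 hypothetical "i5" indices are crossed with 24 hypothetical "i7" indices to cover a 384-well plate; the index names, the plate size and the combinatorial scheme are assumptions made for the example and are not specified in the text.

```python
from itertools import product


def assign_index_pairs(n_cells, first_indices, second_indices):
    """Give each cell (numbered 0..n_cells-1) a distinct pair of indices.

    The pairs are taken from all combinations of the two index sets, so the
    number of cells cannot exceed len(first_indices) * len(second_indices).
    """
    pairs = list(product(first_indices, second_indices))
    if n_cells > len(pairs):
        raise ValueError("not enough index combinations for the number of cells")
    return {cell: pairs[cell] for cell in range(n_cells)}


# Hypothetical index names: 16 x 24 = 384 combinations, one per well.
i5 = [f"i5_{k:02d}" for k in range(1, 17)]
i7 = [f"i7_{k:02d}" for k in range(1, 25)]
layout = assign_index_pairs(384, i5, i7)
```

Because every pair is distinct, reads carrying a given pair of indices can be attributed to a single cell after pooling.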

In some embodiments of the kits disclosed herein, the number of indexing primers of step (iv) to be used in the kits of the invention is at least one indexing primer pair, at least two indexing primer pairs, at least three indexing primer pairs, at least four indexing primer pairs, at least five indexing primer pairs, at least 10 indexing primer pairs, at least 16 indexing primer pairs, at least 32 indexing primer pairs, at least 64 indexing primer pairs, at least 96 indexing primer pairs, at least 128 indexing primer pairs, at least 384 indexing primer pairs or more indexing primer pairs.

In some embodiments, one or more buffers, buffering salts and/or reagents for library preparation are provided in the kit. Library preparation reaction reagents, ingredients and concentrations are, for example, as those discussed above in relation to the methods of the invention and/or provided in the accompanying examples, such as Example 1. In a preferred embodiment, reagents comprise nucleotides and/or a DNA polymerase, for example a thermostable DNA polymerase.

In another aspect, the invention provides a library of amplified sequences obtained by, or obtainable by, the methods disclosed herein.

In some embodiments, the library of amplified sequences comprises a library of double stranded DNA molecules comprising one or more of the following sequences: one or more indices, target sequence and/or flanking sequence.

In some embodiments of the kits disclosed herein, the library comprises amplified sequences which are at least 150 base pairs in length, at least 160 base pairs in length, at least 170 base pairs in length, at least 180 base pairs in length, at least 190 base pairs in length, at least 200 base pairs in length, at least 210 base pairs in length, at least 220 base pairs in length, at least 230 base pairs in length, at least 240 base pairs in length, at least 250 base pairs in length, at least 260 base pairs in length, at least 270 base pairs in length, at least 280 base pairs in length, at least 290 base pairs in length, at least 300 base pairs in length, at least 310 base pairs in length, at least 320 base pairs in length, at least 330 base pairs in length, at least 340 base pairs in length, at least 350 base pairs in length, at least 360 base pairs in length, at least 370 base pairs in length, at least 380 base pairs in length, at least 390 base pairs in length, at least 400 base pairs in length, at least 420 base pairs in length, at least 440 base pairs in length, at least 460 base pairs in length, at least 480 base pairs in length, at least 500 base pairs in length, at least 520 base pairs in length, at least 540 base pairs in length, at least 560 base pairs in length, at least 580 base pairs in length, at least 600 base pairs in length, at least 620 base pairs in length, at least 640 base pairs in length, at least 680 base pairs in length, at least 700 base pairs in length, at least 720 base pairs in length, at least 740 base pairs in length, at least 760 base pairs in length, at least 780 base pairs in length, at least 800 base pairs in length, at least 820 base pairs in length, at least 840 base pairs in length, at least 860 base pairs in length, at least 880 base pairs in length, at least 900 base pairs in length, at least 920 base pairs in length, at least 940 base pairs in length, at least 960 base pairs in length, at least 980 base pairs in length, or at least 1000 base pairs in length. Preferably, the library comprises amplified sequences which are about 600 base pairs to about 700 base pairs in length.

In some embodiments of the kits disclosed herein, the library of amplified sequences comprises at least 1 x 10^4 amplified sequences, at least 1 x 10^5 amplified sequences, at least 1 x 10^6 amplified sequences, at least 1 x 10^7 amplified sequences, at least 1 x 10^8 amplified sequences, at least 1 x 10^9 amplified sequences, at least 1 x 10^10 amplified sequences, at least 1 x 10^11 amplified sequences, at least 1 x 10^12 amplified sequences, at least 1 x 10^13 amplified sequences, or more.

In another aspect, the invention provides a method, or a kit, or a library, substantially as described herein with reference to the accompanying claims, description, examples and/or figures.

FIGURES

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying figures, in which:

Figure 1 shows the structure of the full-length LINE-1 in the human genome, which is the main example of an application of the present invention.

Figure 2 shows the core steps that constitute the present invention. First linear amplification is performed using a primer appropriate to the targeted transposable element, in this example LINE-1. Then, second strand synthesis is performed to obtain double-stranded DNA (dsDNA). The resulting dsDNA is tagmented using the Tn5 enzyme. Exponential amplification is then performed using a primer appropriate to the targeted transposable element and a primer targeting the sequence incorporated by tagmentation. The resulting DNA product may then be multiplexed before sequencing.

Figure 3 shows the number of LINE-1 detected per insertion category. The present invention was performed on 384 cells in two donors (74 and 79). Known reference insertions are LINE-1 present in the human reference genome, known non-reference insertions are polymorphic in the human population but not present in the human reference genome, and the unknown insertions are not known in the human population. The latter category is treated as putative somatic insertions.

Figure 4 shows the read depth of the known non-reference, unknown and unknown with softclip support categories of insertions.

Figure 5 shows the sequence logos (i.e. PWM) of KNR (Known Non Reference) and UNK (Unknown) insertions for donor 74 and donor 79 respectively.

Figure 6 shows the validation of putative somatic and germline insertions predicted by the present invention to be present in Donor 79. The predicted locations were amplified using known sequences flanking the somatic LINE-1 and visualized using gel electrophoresis. Two known reference LINE-1 were used as positive control. See Table 1 for more details.

Figure 7 shows a comparison between the number of known non-reference insertions detected performing L1-IP on a large population of cells (21 bulk samples from Donor 79) and performing the present invention on single cells from the same donor (Donor 79).

Figure 8 shows a comparison between the sensitivity to detect transposable elements in single cells with the present invention and with a previously established method, L1-IP. This figure shows the distribution of cells where known non-reference insertions were detected using either the present invention or L1-IP.

Figure 9 shows the restriction sites for various restriction enzymes identified for each L1HS element in the human genome for which the oligonucleotide primers used in the Examples are specific. Each restriction enzyme has a restriction site which may be shared with other restriction enzymes not shown in this figure.

Figure 10 shows Bioanalyzer traces of LAM-PCR using L1HS-specific primers.

Figure 11 shows Bioanalyzer traces for the method of the invention.

Figure 12 shows the target specificity of the method of the invention. The method of the invention is able to detect transposition-active human LINE-1 with high sensitivity. The primers used are specific to the group of transposition-active human LINE-1 and contain two groups of diagnostic nucleotides, i.e. "AC" and "G". Boxplots show the number of cells in which a given insertion was detected: "Neither AC/G", the group of insertions which have neither the AC nor the G diagnostic nucleotides; "Only AC", the group of insertions which have the AC diagnostic nucleotides; "Only G", the group of insertions which have the G diagnostic nucleotide; and "Both AC/G", the group of insertions which, as expected, have both the AC and G diagnostic nucleotides (donor D74, N = 47, 62, 116, 646 insertions respectively and donor D79, N = 43, 51, 116, 658 insertions respectively; 384 cells per donor). Centre lines: median; hinges: the first and third quartiles; whiskers: 1.5x the interquartile range (IQR).

Figure 13 shows long-read validation of an L1HS insertion discovered by LUSTRE using PacBio sequencing. The image is the direct visualisation of sequencing reads from UNK1 using the Integrative Genomics Viewer. The arrow indicates the insertion site. The insertion is only present on one allele, supported by the absence of an insertion in approximately half of the reads.

EXAMPLES

Example 1

A major portion of human DNA is derived from transposable elements (Lander et al, 2001. Nature 412, 565-566). While most of these elements have lost their ability to transpose in the human genome, a subset of the human-specific long-interspersed element-1 (LINE-1) retains the ability to mobilize through a mechanism defined as retro-transposition. Retro-transposition results in the integration of new LINE-1 at other locations in the genome, creating stable genomic tags by virtue of sequence variation at individual loci. LINE-1 are 6 kb long at full length with a known consensus sequence, where human-specific LINE-1 (L1Hs) are identified by bases AC at positions 5930-5931 and a G base at position 6015 (Figure 1).

Briefly, DNA from lysed single cells is subjected to linear amplification as the first step of the method, where a primer targeting the primate specific sequences of LINE-1 is used. This step results in single stranded DNA fragments which contain the 3' end of LINE-1 and the genomic flank. After size selection, double stranded DNA is synthesized through a hexanucleotide priming reaction. This step is required for the downstream tagmentation step, where the oligo inserted upon tagmentation by Tn5 transposase is used for generating amplicons. Another round of size selection is performed, and the newly synthesized double stranded DNA is subjected to tagmentation. Tagmented DNA is then subjected to an exponential amplification, where a primer targeting the 3' L1Hs consensus sequence is used together with a primer complementary to the sequence of the oligonucleotide inserted by Tn5. PCR products are then dual indexed to be pooled together and sequenced on a short-read sequencing platform (Figure 2).

Materials and methods

Peripheral blood mononuclear cell samples

Peripheral blood samples were obtained from two healthy male donors aged 41 and 46.

Mononuclear cells were isolated by density centrifugation and cryopreserved in fetal bovine serum (FBS) with 10% DMSO to be stored in liquid nitrogen for later use for FACS.

Flow cytometry analysis and sorting

Upon thawing, samples were washed in FACS buffer (PBS, 2% FBS, 1 mM EDTA) and filtered through a 40 µm strainer. Peripheral blood mononuclear cells were stained with CD3-FITC (Biolegend) or CD8-PECy5 (BD Biosciences), depending on the experimental condition. For live/dead cell discrimination, DAPI or Propidium Iodide ReadyProbes Reagent (Thermo Scientific) was used. Single cell FACS was performed in a BD Influx Cell Sorter or BD FACSMelody instrument with a 100 µm nozzle.

GENERATION OF L1 INSERTION PROFILING (L1-IP) LIBRARIES

L1-IP libraries were generated according to a previously described protocol (Evrony et al, 2012. Cell, 151:483-496). This method requires whole-genome amplified DNA material from single cells as input. Therefore, multiple displacement amplification (MDA) according to the optimized protocol in the same publication was performed.

Briefly, single-cell FACS was performed to sort T-cells into 96-well plates containing 2.8 µl lysis buffer (200 mM KOH, 5 mM EDTA, 40 mM DTT) in each well and neutralized with 1.4 µl neutralization buffer (400 mM HCl, 600 mM Tris pH 7.5). For the bulk samples, approximately five million cells were sorted from a peripheral blood sample.

Genomic DNA was isolated by using the QIAGEN Blood & Tissue kit according to manufacturer's instructions.

MDA was performed by adding 15.8 µl of reaction mix, containing 1x RepliPHI phi29 reaction buffer (Epicentre), 50 µM random hexamers, 2 mM dNTPs, 40 U RepliPHI phi29 polymerase (Epicentre) and H2O. The reaction was carried out for 10 hours at 30 °C, followed by heat inactivation for 10 min at 65 °C. After quality control, the samples were stored at -20 °C.

Upon thawing, the MDA products were diluted to 100 ng/µl to be used in PCR. The first PCR of L1-IP involves the pairing of a human-specific L1 primer (L1Hs-AC) complementary to the L1 3' UTR and 8 arbitrary seed primers containing 6-base barcodes along with 5 degenerate N nucleotides. Therefore, each sample was divided into 8 wells of 96-well plates and one set of seed primers was used for each sample. After 5 cycles of linear amplification with L1Hs-AC, seed primers are added into each reaction to enable exponential amplification of samples. Next, the PCR products were purified with SPRI AMPure XP beads with a 1:1.8 ratio.

The second round of PCR involves pairing of the L1Hs-G primer complementary to the L1 3' UTR and an oligo complementary to seed sequences which incorporates an Illumina sequencing adapter. Libraries were subsequently visualized by gel electrophoresis and size selection was performed by excising bands between 200-500 bp. Gel bands were cut using sterile scalpels for each sample set (8 seed primer products). DNA was extracted in individual columns by using a gel extraction kit (QIAGEN) according to manufacturer's instructions. Each sample set consisting of 8 seed reactions was pooled equimolarly and DNA was quantified using the Quant-iT PicoGreen dsDNA Assay Kit (Thermo Scientific). Library quality and size distribution were checked on a DNA 1000 chip. In each sequencing library, 32 single cell libraries (32 cells x 8 seeds = 256 samples) were pooled equimolarly and sequenced at 150-bp single-end using NextSeq Mid Output Cartridge kits and a NextSeq 500 sequencer (Illumina).

Data processing and analysis of L1-IP data

Demultiplexing and mapping

Due to the single-end sequencing with custom (non-Illumina) indexes of the L1-IP protocol, no demultiplexing occurs at the bcl2fastq step. Instead, the first 21 bases of the read give information regarding the cell and which seed primer the read belongs to. The cell barcode is present at bases 5-11, while the seed barcode is present at bases 16-21. Cutadapt was used to identify these sequences and the software deML (https://github.com/grenaud/deml) was used to assign the most likely cell and seed combination.
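The barcode layout described above can be illustrated with a minimal sketch. It assumes that the stated positions are 1-based and inclusive, and it simply slices the read at fixed positions; it does not reproduce the adapter matching performed by Cutadapt or the likelihood-based assignment performed by deML. The function name and the example read are hypothetical.

```python
def split_barcodes(read_seq):
    """Return (cell_barcode, seed_barcode) taken from the start of a read.

    Assumption: the positions given in the text are 1-based and inclusive,
    so bases 5-11 are read_seq[4:11] and bases 16-21 are read_seq[15:21].
    """
    if len(read_seq) < 21:
        raise ValueError("read is shorter than the 21-base barcode region")
    cell_barcode = read_seq[4:11]   # bases 5-11 (7 bases)
    seed_barcode = read_seq[15:21]  # bases 16-21 (6 bases)
    return cell_barcode, seed_barcode


# Toy read, made up for illustration only.
example_read = "AAAA" + "CCCCCCC" + "GGGG" + "TTTTTT" + "AAAA"
cell_bc, seed_bc = split_barcodes(example_read)
```

In practice the extracted barcodes would still need to be matched against the known barcode sets while tolerating sequencing errors, which is the role of the tools named above.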

The reads corresponding to the same cell were then identified based on the demultiplexing. Before mapping, repetitive G sequences present on the end of the read were trimmed with cutadapt. The reads were mapped with bowtie2 to a custom version of hg38 where L1HS insertion sequences have been N-masked to ensure unbiased mapping. The resulting .sam file was then turned into a .bam file and sorted using samtools.

Peak calling insertions

The peak calling was done with an in-house script, which is essentially a reimplementation of the peak calling method described by Evrony (Evrony et al, 2012. Cell, 151:483-496). Briefly, a peak is started when the pileup of reads above a given mapping quality at a particular mapping position is above a specified threshold. For the present invention, the thresholds are 10 reads and a mapping quality of 25. The peak is extended until there are no more reads covering a position, with an allowed gap of up to 500 base pairs. Peak statistics were then collected: read depth, strand and unique read start sites.
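The peak calling logic described above can be sketched as follows. This is not the in-house script, which is not reproduced in the text; it is a simplified illustration for reads on a single chromosome and strand, each given as a hypothetical (start, end, mapping quality) tuple with half-open coordinates. It assumes that "above the threshold" means a depth of at least 10 reads with a mapping quality of at least 25.

```python
def call_peaks(reads, min_depth=10, min_mapq=25, max_gap=500):
    """Call peaks from (start, end, mapq) reads on one chromosome and strand."""
    kept = sorted((s, e) for s, e, q in reads if q >= min_mapq)
    # Coverage events: +1 where a read starts, -1 where it ends. Sorting puts
    # read ends before read starts at the same position (half-open intervals).
    events = sorted([(s, 1) for s, _ in kept] + [(e, -1) for _, e in kept])

    intervals = []
    depth = 0
    peak_start = None   # start of the currently open peak, if any
    gap_start = None    # position where coverage last dropped to zero
    for pos, delta in events:
        starting_after_gap = delta == 1 and depth == 0 and peak_start is not None
        if starting_after_gap and pos - gap_start > max_gap:
            # The uncovered stretch is too long: close the peak where coverage ended.
            intervals.append((peak_start, gap_start))
            peak_start = None
        depth += delta
        if peak_start is None and depth >= min_depth:
            peak_start = pos
        if depth == 0:
            gap_start = pos
    if peak_start is not None:
        intervals.append((peak_start, gap_start))

    # Attach a simple statistic: number of retained reads overlapping each peak.
    return [
        {"start": a, "end": b,
         "n_reads": sum(1 for s, e in kept if s < b and e > a)}
        for a, b in intervals
    ]


# Toy input, made up for illustration only.
toy_reads = (
    [(100, 250, 30)] * 12      # deep pileup that starts a peak
    + [(600, 750, 30)]         # within 500 bp of the pileup, extends the peak
    + [(5000, 5150, 30)] * 3   # too shallow to start a new peak
    + [(9000, 9150, 10)] * 20  # deep, but below the mapping quality threshold
)
toy_peaks = call_peaks(toy_reads)
```

A fuller implementation would also record strand and the number of unique read start sites per peak, as listed in the text.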

Softclipping reads and insertion site profiling

During the mapping process, the 5' or 3' part of the read (or both) is allowed not to match the reference sequence to which the read is assigned. This part of the read (if present) is called the softclipped part of the read. For the L1-IP-seq method, this occurs if the L1HS insertion is sequenced, but there is no support for the insertion in the reference genome (i.e., the insertion is either KNR or UNK). Therefore, the inventors extracted the softclipped sequences from reads and mapped them to the L1HS sequence. If the softclipped sequence maps to the L1HS sequence, the peak of which that read is a part has softclip support. Furthermore, this provides information on the exact insertion site sequence of that peak.
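As an illustration of the extraction step, the soft-clipped ends of an aligned read can be recovered from its sequence and its CIGAR string, in which the operation "S" marks soft-clipped bases. The sketch below is a minimal stand-alone parser with a hypothetical function name; it is not the inventors' script, and the subsequent mapping of the clipped sequences to L1HS is not shown.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_parts(seq, cigar):
    """Return (left_clip, right_clip) for a read sequence and its CIGAR string.

    Hard-clipped bases (H) are not stored in the read sequence, so they are
    ignored when looking for soft clips at either end.
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    ops = [item for item in ops if item[1] != "H"]
    left = right = ""
    if ops and ops[0][1] == "S":
        left = seq[:ops[0][0]]
    if len(ops) > 1 and ops[-1][1] == "S":
        right = seq[len(seq) - ops[-1][0]:]
    return left, right


# Toy read, made up for illustration: 5 clipped, 10 aligned, 3 clipped bases.
clips = soft_clipped_parts("AAAAACCCCCCCCCCGGG", "5S10M3S")
```

The position at which the clipped and aligned parts of the read meet gives the candidate insertion site.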

Construction of position weight matrices to analyse insertion site sequences of LINE-1

To investigate whether the insertion site sequence as a group is consistent with previous knowledge, the inventors constructed a PWM (position weight matrix) from the insertion site data in the present example. Construction of a PWM is common for biological sequences and a skilled person will be able to readily construct a PWM. In the present Example, the probability of each nucleotide in a sequence motif was represented based on the observation of each nucleotide in LINE-1 insertion sites.

PWM represents the probability of each of these nucleotides existing over all insertion sites in a dataset at a given position relative to the insertion site. The insertion site sequences were scored based on this PWM and a Logo plot was made for the UNK and KNR insertions present in that sample respectively. A logo plot shows the probability of each letter (A, T, C, G) for each position in the insertion site sequence.
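A minimal sketch of such a matrix is given below, assuming that the insertion site sequences have already been extracted and aligned to equal length around the insertion site. The function name and the four toy sequences are made up for the example and are not data from the present Example.

```python
from collections import Counter


def build_pwm(sequences, alphabet="ACGT"):
    """Return one {nucleotide: probability} dict per position.

    Each probability is the observed frequency of that nucleotide at that
    position across all sequences (no pseudocounts are applied).
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    pwm = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        total = sum(counts[b] for b in alphabet)
        if total == 0:
            raise ValueError(f"no A/C/G/T observed at position {i}")
        pwm.append({b: counts[b] / total for b in alphabet})
    return pwm


# Toy insertion site sequences, for illustration only.
toy_sites = ["TTAAAA", "TTAAAA", "TCAAAA", "TTAAGA"]
toy_pwm = build_pwm(toy_sites)
```

Each row of the matrix corresponds to one column of a logo plot, where the height of a letter reflects how often that nucleotide is observed at that position.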

Assessing sensitivity by KNR insertions

To measure the sensitivity of L1-IP to detect heterozygous insertions, the inventors used known non-reference (KNR) insertions. The reasoning for this is that these insertions are likely to be germline heterozygous due to their polymorphic nature.

Since it is not known which KNR insertions are present in the sequenced individual, the inventors considered KNR insertions detected in at least one sample from the individual. The fraction of samples in which a given insertion is detected is then used to estimate sensitivity. The median fraction across KNR insertions is then calculated as a measure of sensitivity.
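The sensitivity estimate described above amounts to a median of per-insertion detection fractions. A minimal sketch, in which the function name `knr_sensitivity` and the input layout (a mapping from insertion identifier to the set of samples in which it was detected) are assumptions made for illustration:

```python
from statistics import median


def knr_sensitivity(detections, n_samples):
    """Median fraction of samples in which each KNR insertion is detected.

    Insertions not detected in any sample are excluded, since only
    insertions seen at least once in the individual are considered.
    """
    fractions = [
        len(samples) / n_samples
        for samples in detections.values()
        if samples
    ]
    return median(fractions)
```

With four samples and three insertions detected in 4, 2 and 3 samples respectively, the fractions are 1.0, 0.5 and 0.75, giving a median of 0.75.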

LUSTRE

Single cells were sorted into 96-well or 384-well plates containing 3 µl lysis buffer (60 mM Tris-Ac pH 8.3, 2 mM EDTA pH 8, 15 mM DTT) and immediately spun down for storage at -80 °C. Samples were incubated at 75 °C for 30 min. Next, 0.5 µl of thermolabile Proteinase K (1:100 dilution, NEB) was added to samples for incubation at 37 °C for 15 min, followed by 55 °C for 10 min, and samples were immediately placed on ice afterwards or stored at -20 °C for later use.

Linear amplification

Linear amplification was performed by adding 7 µl of PCR mix, containing 1x PrimeStar Max (Takara) and 0.5 µM LUSTRE-AC primer (SEQ ID NO 1: 5'-biotin-GGGAGATATACCTAATGCTAGATGAC+A+C-3', HPLC desalted; IDT). The LUSTRE-AC primer targets human-specific L1 at positions 5930-5931 as previously described (Ewing and Kazazian, 2010. Genome Res. 20:1262-1270; Evrony et al, 2012. Cell, 151:483-496). The inventors have made modifications to ensure robust amplification, where 5' biotinylation and 3' locked nucleic acid technology prevent reverse-complementary priming and inhibit 3' degradation of the primer.

Linear amplification was carried out as follows: 3 min at 95 °C for initial denaturation, 40 cycles of 20 sec at 98 °C, 15 sec at 63 °C, 45 sec at 72 °C and 5 min at 72 °C for final elongation. It should be noted that the linear amplification could be performed with vendor reagents as an alternative to the specific protocol the inventors have developed.

Samples were purified with SPRIselect beads (Beckman Coulter) to size select for fragments >300 bp.

Generation of double-stranded DNA

To generate double-stranded DNA from the linear amplification product, 2 µl of hexanucleotide priming mix containing 1x Buffer 2 (NEB), 0.1 mM dNTPs (NEB), 0.25 µM random hexamers (Invitrogen), 0.125 µl Klenow large fragment (NEB) and 0.5 µl H2O was added to samples and incubated for 1 hour at 37 °C followed by 20 sec at 75 °C for heat inactivation. Samples were purified again with SPRIselect beads (Beckman Coulter) to size select for fragments >200 bp.

Tagmentation

Purified samples were immediately subjected to tagmentation. Tagmentation was carried out as previously described (Picelli et al, 2014. Genome Res. 24:2033-2040), with adjustments to accommodate the target DNA and to achieve the optimal library size with LUSTRE. Tagmentation was carried out in 10 µl, consisting of 1x tagmentation buffer (10 mM Tris pH 7.5, 5 mM MgCl2, 5% DMF) and 0.02 µl TDE1 (Illumina). Tagmentation was carried out for 12 min at 55 °C, followed by the addition of 0.5 µl of 0.2% SDS. The release of Tn5 from the DNA was enabled through incubation for 30 min at 68 °C.

Exponential amplification

Exponential amplification of tagmented DNA was performed by adding 6 µl of PCR mix, containing 1x PrimeStar HS Buffer (Takara), 0.25 mM dNTPs (Takara), 0.04 U/µl PrimeStar HS polymerase (Takara), 1 µM LUSTRE-L1Hs-G primer (SEQ ID NO 2: 5'-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGTGCACATGTACCCTAAAACTT+A+G-3', HPLC desalted; IDT) and 1 µM Illumina-FC-121-1030 (SEQ ID NO 3: 5'-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-3', HPLC desalted; IDT). The LUSTRE-L1Hs-G primer targets human-specific L1 at position 6015 as previously described (Zong et al, 2012. Science, 338(6114):1622-6 and Chen et al, 2017. Science, 356(6334):189-194) and contains the adapter for Illumina-Nextera Index 1, allowing for paired-end sequencing and dual-indexing. Illumina-FC-121-1030 is complementary to a synthetic oligonucleotide inserted by Tn5 transposase upon tagmentation (Adey and Shendure, 2012. Genome Res. 22:1139-1143) and contains the adapter for Illumina-Nextera Index 2. Exponential amplification can be performed with other vendor reagents as an alternative to the protocol the inventors have developed.

Exponential PCR was carried out as follows: 3 min at 95 °C for initial denaturation, 22 cycles of 20 sec at 98 °C, 15 sec at 65 °C, 1 min at 72 °C and 5 min at 72 °C for final elongation. Resulting LUSTRE libraries were purified with SPRI beads (Beckman Coulter) or in-house produced 22% PEG beads with 1:0.8 ratio. Library quality and size distribution were checked on a high-sensitivity DNA chip (Agilent Bioanalyzer). Libraries were quantified using the QuantiFluor dsDNA System (Promega) and normalized to 1 ng/µl.

Library preparation and sequencing

Dual indexing of all libraries was performed by using custom designed Nextera index primers containing 10 bp indexes at 0.1 µM concentration each, with a minimum Levenshtein distance of 2 between any two indexes. For LUSTRE libraries, 1x PrimeStar Max (Takara) and 3 µl Nextera indexes were added to each well containing 3 µl of DNA, and cycled as follows: 3 min at 95 °C, 12 cycles of 20 sec at 98 °C, 15 sec at 55 °C, 30 sec at 72 °C, and 5 min at 72 °C.

After indexing PCR, individual plates containing the samples were pooled, and the final pools were purified with SPRI beads (Beckman Coulter) or in-house produced 22% PEG beads (1:0.8 ratio for LUSTRE and 1:0.7 ratio for Smart-seq3). Library quality and size distribution were checked on a high-sensitivity DNA chip (Agilent Bioanalyzer).
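The index design constraint stated above (a minimum Levenshtein distance of 2 between any two indexes) can be checked with a short script. This is a generic sketch: the function names are hypothetical and the example index sequences in the usage note are invented for illustration, not the inventors' indexes.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def min_pairwise_distance(indexes):
    """Smallest Levenshtein distance over all pairs of index sequences."""
    return min(levenshtein(a, b) for a, b in combinations(indexes, 2))
```

A set of indexes satisfies the constraint when `min_pairwise_distance(indexes) >= 2`, for example for the invented 10 bp indexes `ACGTACGTAC`, `ACGTACGTTT` and `TTGTACGTAC`.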

Data processing and analysis of LUSTRE data

The reads were mapped to the human reference genome (hg38, no alternative contigs) using bwa mem (v0.7.17) (Li and Durbin, 2010. Bioinformatics 26:589-595). The softclipping portions of reads were identified and remapped. These two alignment files were then used to identify LINE-1 insertions.

The first read mate (read 1) contains the flanking sequence, while the second read mate (read 2) contains the 3' L1 sequence, poly-A tract, potential 3'-transduction sequence and/or additional poly-A tract, and/or potentially a portion of the immediate flanking sequence.

The read 2 alignments which serve as anchoring reads were first identified. All the read 2 alignments were collected and subjected to four requirements for further consideration: the alignment is not a supplementary alignment, is paired, is mapped, and the first 23 bases of the forward sequence correspond to the expected L1 sequence up to a Hamming distance of 6. Hamming distance is a well-known concept in the art, which measures how many symbol substitutions are required to change one string into another string of equal length (Hamming, 1950. The Bell System Technical Journal. 29 (2):147-160). Furthermore, the sequence of the 8 bases immediately following the primer sequence is required to correspond to the expected L1 sequence immediately following the primer sequence, AGTATAAT, up to a Hamming distance of 1.

If the alignment passes all these conditions, the read is added to a list of reads which are anchored. Each read is also marked as disconcordant or concordant. In this context, a read pair is disconcordant if any of these three conditions is true: the reads of the pair map to different chromosomes, or the reads of the pair map to the same chromosome at least 15,000 bp away from each other, or the read pair is properly paired but the L1 portion of read 2 does not map to the locus (softclipping). Any region in the genome which contains concordant reads is not considered in the detection of non-reference and somatic insertions.
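The anchoring test can be sketched as below. This is an illustration under stated assumptions: the 23-base expected sequence is taken to be the L1-specific 3' portion of the LUSTRE-L1Hs-G primer (SEQ ID NO 2 without its adapter part), the function and parameter names are hypothetical, and the read sequence is assumed to be given in its original (forward) orientation.

```python
# Assumed: L1-specific 3' portion of SEQ ID NO 2 (23 bases).
PRIMER = "TGCACATGTACCCTAAAACTTAG"
# Expected L1 sequence immediately following the primer (from the text).
DOWNSTREAM = "AGTATAAT"


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def is_anchor(read2_seq, is_supplementary, is_paired, is_mapped,
              max_primer_dist=6, max_downstream_dist=1):
    """Apply the four read 2 requirements plus the downstream check."""
    if is_supplementary or not is_paired or not is_mapped:
        return False
    head = read2_seq[:len(PRIMER)]
    tail = read2_seq[len(PRIMER):len(PRIMER) + len(DOWNSTREAM)]
    if len(head) < len(PRIMER) or len(tail) < len(DOWNSTREAM):
        return False  # read too short to evaluate both windows
    return (hamming(head, PRIMER) <= max_primer_dist
            and hamming(tail, DOWNSTREAM) <= max_downstream_dist)
```

The relatively permissive threshold on the primer window and the strict threshold on the following 8 bases follow the values given in the text.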
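The three-way definition of a disconcordant read pair given above translates directly into a predicate. In this sketch the function name, the argument layout and the use of alignment start coordinates to measure the distance between mates are assumptions made for illustration.

```python
def is_disconcordant(chrom1, pos1, chrom2, pos2, l1_portion_softclipped,
                     max_distance=15_000):
    """Return True if the read pair meets any of the three criteria.

    chrom1/pos1 describe read 1, chrom2/pos2 describe read 2, and
    l1_portion_softclipped states whether the L1 portion of read 2
    fails to map to the locus (i.e. is softclipped).
    """
    if chrom1 != chrom2:
        return True   # mates map to different chromosomes
    if abs(pos1 - pos2) >= max_distance:
        return True   # same chromosome, at least 15,000 bp apart
    return l1_portion_softclipped  # properly paired, but L1 part softclips
```

A pair that fails all three tests is concordant, and regions containing such pairs would be excluded from non-reference and somatic insertion calling.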

Once the anchor reads were determined, all read 1 alignments are collected and are subjected to 3 requirements for further consideration: the read 2 mate is anchored, the alignment is not a supplementary alignment and the mapping quality of the alignment is above 0. If a read contains more than 90% As in the aligned portion of the sequence, the non-aligning portion of the read is mapped to the reference genome again, to recover reads which do not.

If the alignment passed all these conditions, multiple pieces of information are saved: contig, reference start and end, strandedness, disconcordance and whether read 1 or read 2 softclips. If read 1 softclips, that information is used to adjust the read start or end depending on the strand. The reads are then merged across cells by location and strand; reads are considered part of the same cluster if they are within 1000 base pairs of each other. In the case of germline insertions, this usually results in a genomic window of 500-3000 base pairs where the reads are considered to support that insertion. For somatic insertions, which are only present in a subset of cells, the window may be narrower due to the lower number of total reads and therefore insert sizes.

Then, each insertion peak is quality filtered. The mean mapping quality is required to be above 25. The softclipping reads need to support the presence of a polyA tract at the predicted insertion site (most common softclip position): the consensus of the softclipping sequence needs to contain at least 8 As in the first 10 bases of the sequence.
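The merging of reads into clusters by location and strand can be sketched as a single pass over sorted reads. The function name `cluster_reads`, the tuple layout, and the choice to measure the 1000 bp distance from the end of the growing cluster to the start of the next read are assumptions made for this example.

```python
def cluster_reads(reads, max_distance=1000):
    """Merge reads pooled across cells into clusters.

    reads: iterable of (contig, start, end, strand) tuples.
    Reads on the same contig and strand join the current cluster when
    they start within max_distance of the cluster's current end.
    """
    clusters = []
    for contig, start, end, strand in sorted(
            reads, key=lambda r: (r[0], r[3], r[1])):
        last = clusters[-1] if clusters else None
        if (last is not None
                and last["contig"] == contig
                and last["strand"] == strand
                and start - last["end"] <= max_distance):
            last["end"] = max(last["end"], end)
            last["n_reads"] += 1
        else:
            clusters.append({"contig": contig, "strand": strand,
                             "start": start, "end": end, "n_reads": 1})
    return clusters
```

Each resulting cluster corresponds to one candidate insertion window that can then be quality filtered.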

If the consensus of the softclipping sequence is less than 10 bases long, but present, all bases are required to be A. Insertion peaks which do not have softclip support are filtered away.
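The polyA support rule, together with the mean mapping quality requirement, can be expressed as follows. The function names are hypothetical, and the consensus softclipping sequence is assumed to be oriented so that the polyA tract reads as a run of As.

```python
def has_polya_support(consensus_softclip):
    """Check the polyA requirement on a consensus softclipping sequence."""
    if not consensus_softclip:
        return False  # no softclip support at all
    if len(consensus_softclip) < 10:
        # Short consensus: every base must be A.
        return set(consensus_softclip) == {"A"}
    # Otherwise at least 8 of the first 10 bases must be A.
    return consensus_softclip[:10].count("A") >= 8


def passes_quality_filter(mapping_qualities, consensus_softclip,
                          min_mean_mapq=25):
    """Combine the mean mapping quality and polyA support criteria."""
    if not mapping_qualities:
        return False
    mean_mapq = sum(mapping_qualities) / len(mapping_qualities)
    return mean_mapq > min_mean_mapq and has_polya_support(consensus_softclip)
```

For instance, a consensus beginning `AAAAAAAAGG` passes (8 As in the first 10 bases), whereas `AAAAAAAGGG` does not.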

Validation of LINE-1 insertion sites with long-read sequencing

To further validate the L1Hs insertion sites identified with LUSTRE, the inventors selected a set of 12 loci containing detected L1Hs insertion sites in the inventors' single-cell datasets, together with two germline insertion sites which are present in the reference genome. High molecular weight DNA was isolated from PBMCs from a 41-year-old male donor according to manufacturer's instructions (QIAGEN MagAttract HMW DNA kit). Long-range PCR was performed by adding 18 µl of PCR mix, containing 1x PrimeStar GXL buffer (Takara), 0.2 mM dNTPs each, 0.5 U/µl PrimeStar GXL polymerase (Takara), 0.4 µM forward primer, 0.4 µM reverse primer, and H2O to 40 ng of template DNA. Forward and reverse primers were designed to span 10 kb regions in the genome where the inventors identified L1Hs insertions in the inventors' single-cell datasets and 2 additional regions where there is a previously reported germline insertion present (Table 1).

Identifier | Chromosome | Insertion site position | Strand | Type | Forward primer (5'-3') | Reverse primer (5'-3') | Size (bp) | SEQ ID NO
KNR1* | 3 | 169552163 | + | KNR | TCCTCGATTCTTCATCTGCCTGAG | AGTCCCTGCCACAATGTAAAGACT | 3907 | 4, 5
KNR2 | 3 | 59867162 | + | KNR | TCTCATGGGTTTATCAGGGGCTTC | AGAGGACTGGCATTTTGGTGTACT | 3909 | 6, 7
KNR3 | 4 | 109327180 |  | KNR | CTCAGTAGGCTCAGCAGAAACTCA | GAAGCCCAACTACCCTGATGAAGA | 3859 | 8, 9
KNR4 | 4 | 137492408 |  | KNR | TTCCTTGTCTCTTCTGGGTTCTGG | AATCCCACCAAAGCCATTTAAGCC | 3942 | 10, 11
UNK1 |  | 14289222 | + | UNK | CCTGGAGATGAGGGAATGTTAGGG | CCGTCTTTACCTTCTCAGCATCOT | 3814 | 12, 13
KNR5 | 12 | 11038777 |  | KNR | TAGGCCTAGACAAGTACCCTCACA | GCACCTTAACTCCATCTTCTCAGG | 3825 | 14, 15
UNK2 | 12 | 99486207 |  | UNK | TCCACCCTTTTGCCCTCAGTTTAT | TTTTCTCCTGCCCAGTGTTCCAAA | 4000 | 16, 17
KNR6 | 18 | 61912726 | + | KNR | GAGTCATGTGATCTAGAGCCCAGG | TTCATCTTTTCCTAGTGCACCCCA | 3968 | 18, 19
UNK3 | 22 | 12607422 | + | UNK | CACTTAACCAGCCTGTTTTCCGTT | TAACCACGAACTAACAAAGGCCCA | 3756 | 20, 21
UNK4 |  | 12154205 |  | UNK | GTGGCCAGCTGAGTAGTATCTGAA | GTGAAAGAGATGTACATGCTGGGC | 3992 | 22, 23
KNR7* | 22 | 16765164 | + | KNR | GATGCCAGCTACTCGTTCTATGGA | CTGTTGGAAAAGCGCCGTATTAGG | 3829 | 24, 25
UNK5 | X | 106388170 |  | UNK | CTGGGTGCTGTTTTGTGGAGTATG | ACACTTAGATATGGACATGCTGCC | 3866 | 26, 27
chr1:209913772-209919823 | 1 | 209919823 |  | KR | TCAAGTGGAGATAAGGCTGCTGTT | AGCTTGAGAACCACTTGCTCCATA | 9000 | 28, 29
chr7:30439243-30445274 | 7 | 30445274 | + | KR | TACCTTCAGAAACCCCATCAGCAA | CAGGGCTTTGCGTTATGTCCTAAG | 9233 | 30, 31

Table 1. Primer design for each putative somatic, polymorphic germline and homozygous germline LINE-1 insertion. Insertions marked with an asterisk (*) are polymorphic germline LINE-1 insertions which are not expected to validate due to known contamination.

Long-range PCR was carried out as follows: 30 cycles of 10 sec at 98 °C and 10 min at optimized temperatures ranging between 66-70 °C for each primer pair used. PCR products were subsequently visualized by gel electrophoresis and each amplicon was excised using a sterile scalpel for each band. DNA was extracted by using a gel purification kit (QIAGEN) according to manufacturer's instructions. SMRTbell libraries were prepared by adding sequencing adapters to the indexed amplicons generated by LUSTRE, according to instructions provided by Pacific Biosciences (available on the manufacturer's website), and sequenced on a PacBio Sequel II sequencer.

Data analysis of LINE-1 insertion sites with long-read sequencing

First, the HiFi reads were demultiplexed using the known primer sequences together with the ligation adapter sequences to separate reads by sample. Reads which did not contain any expected combination of sequences were discarded at this step. Then, reads were aligned to the hg38 build of the human reference genome, using pbmm2 with the HIFI alignment mode.

The reads at each locus in the experiment were inspected using the software Integrative Genomics Viewer (IGV) to find large insertions or softclipping reads at the expected insertion site (Figure 13). If found, characteristics of the sequences were collected: target site duplication (TSD), polyA tract length, and allelic mutations, by inspecting the reads in IGV and their specific insertion sequence. A target site duplication consists of the identical sequences present immediately 5' and 3' of the transposon, which originate from the insertion of the transposon. TSDs are typically 7-20 bp in length.

Results

The inventors have applied LUSTRE to peripheral blood mononuclear cells (PBMCs) isolated from human donors, here signified as Donor 74 and Donor 79. As a result, the inventors were able to detect and classify insertions as homozygous germline, which are L1Hs found in both alleles; polymorphic germline, which are L1Hs found in one allele; and undocumented L1Hs, which are heterozygous and have occurred through retrotransposition in somatic cells or are previously undocumented polymorphic germline events. The median numbers of detected L1Hs per cell through LUSTRE are 666, 83 and 3 for Donor 74 and 742, 96 and 4 for Donor 79, respectively, for each category of insertions (Figure 3).

To validate the detected putative somatic L1Hs, the read count distributions of heterozygous germline LINE-1 were compared to those of the putative somatic L1Hs. The germline polymorphic insertions share the same characteristics as somatic insertions and should not give rise to different read counts. The read count distributions were found to be comparable, which supports the authenticity of the detected putative somatic L1Hs (Figure 4).

The pre-integration sequences detected at the insertion sites are expected to be consistent with a canonical LINE-1 endonuclease mediated cleavage (5'-TTTT/A-3').

Indeed, the detected integration sequences are consistent with this mechanism of insertion as represented by the sequence logos of KNR and putative somatic L1Hs for donor 74 and 79 respectively (Figure 5).

To further validate the L1Hs insertion sites identified with LUSTRE, the inventors selected and amplified a set of loci containing somatic and germline L1Hs in the inventors' single-cell datasets. These loci were amplified from DNA extracted from cells from the same individual by designing primers flanking the 5' and 3' ends of the predicted L1Hs insertion site. If there is an insertion present, it would result in two bands in the gel: a long band representing the allele containing the somatic or heterozygous germline insertion (the LINE-1 of 6 kb plus 2 kb of spacing on each end, resulting in approx. 10 kb) and a short band representing the empty allele, which is only 4 kb. The shorter band contains DNA from the allele without the insertion and from the other allele in cells which do not harbour the insertion. The longer band contains DNA from the allele with the insertion from the cells containing the L1Hs. Germline insertions yield a single band with a size of 10 kb, since they are homozygous. The inventors found that 9/12 tested loci contain two bands, which indicates the presence of heterozygous L1Hs in the cells of the donor (Figure 6).

To evaluate the performance of LUSTRE, the inventors first assessed the sensitivity of the method of the invention by measuring the number of polymorphic germline insertions in bulk samples and single cells from human PBMCs. Polymorphic germline insertions are mainly heterozygous insertions, which would have the same probability of detection as a somatic insertion since somatic insertions are also found as one copy in the genome. Since the number of polymorphic germline insertions in human genomes is known (Ewing and Kazazian, 2010. Genome Res. 20:1262-1270), this allowed the inventors to make a precise assessment of how the method of the invention performs.

The inventors observed that the method of the invention performs at 80% sensitivity in single cells at a read depth of ~1 million reads per cell, which is a lower sequencing depth than that of previously described methods which have been employed in single cells (6.7 million reads was reported in Evrony et al, 2012. Cell, 151:483-496; 38 million reads was reported in Upton et al, 2015. Cell, 161:228-239). Moreover, the sensitivity of the method of the invention in single cells is comparable to bulk data, meaning that the inventors were able to achieve a level of information regarding the LINE-1 insertion sites in single cells with minimal missing data of the kind that arises from handling low-input DNA (Figure 7).

The inventors next benchmarked the method of the invention against a previously described approach (L1-IP) which aims to profile human-specific LINE-1 elements in individual neurons. The inventors have used the polymorphic germline insertions detected in a given individual in order to calculate the fraction of cells in which that specific insertion was detected. For example, for n single cells in the experiment where a given insertion is detected in x cells, the sensitivity to detect that insertion is calculated as x/n. This calculation shows a median detection probability of 0.66 for the method of the invention, compared with 0.24 for L1-IP in a typical experiment (Figure 8). For both methods, all 96 sequenced cells were included for this calculation regardless of detection performance of the individual cells.

Example 2

Introduction

To compare LAM-PCR (Schmidt et al, 2007. Nat. Methods, 4:1051-1057) with the present invention, the inventors applied the LAM-PCR method with the L1HS-specific primers and MseI as the restriction enzyme to DNA extracted from human primary cells. The LAM-PCR method was applied to varying amounts of DNA, with the lowest input amount equivalent to the amount present in a single cell.

LUSTRE on bulk genomic DNA samples

Genomic DNA was isolated from PBMCs by using the QIAGEN Blood & Tissue kit according to manufacturer's instructions. Serial dilution of genomic DNA was performed to obtain templates with a DNA concentration of 60 ng/µl, 6 ng/µl, 0.6 ng/µl, 0.06 ng/µl and 0.006 ng/µl.

The single-cell LUSTRE protocol was adjusted for volumes to enable amplification in bulk genomic DNA samples as follows: linear amplification was performed in 20 µl of reaction mix containing 1x PrimeStar Max (Takara), 0.5 µM LUSTRE-AC primer and H2O, with genomic DNA as template at the respective concentrations. The hexanucleotide reaction and tagmentation were carried out in the same way as for single-cell LUSTRE.

Exponential amplification of tagmented DNA was performed by adding 20 µl of PCR mix, containing 1x PrimeStar HS Buffer (Takara), 0.25 mM dNTPs (Takara), 0.04 U/µl PrimeStar HS polymerase (Takara), 1 µM LUSTRE-L1Hs-G primer, and H2O. All size selection with beads and all reaction protocols were performed in the same way as for single-cell LUSTRE.

The resulting amplicons were visualized for quality and size distribution on a high sensitivity DNA chip (Agilent Bioanalyzer).

GENERATION OF LINEAR AMPLIFICATION-MEDIATED PCR (LAM-PCR) SAMPLES

Genomic DNA was isolated from PBMCs by using the QIAGEN Blood & Tissue kit according to manufacturer's instructions. Serial dilution of genomic DNA was performed to obtain templates with a DNA concentration of 60 ng/µl, 6 ng/µl, 0.6 ng/µl, 0.06 ng/µl and 0.006 ng/µl. LAM-PCR products were generated according to the previously published protocol (Evrony et al, 2012. Cell, 151:483-496).

Briefly, linear amplification was performed by adding 49 µl of PCR mix, containing 1x QIAGEN buffer, 0.2 mM dNTPs, 0.5 µM L1HsPIA2 primer (HPLC desalted; IDT), 2.5 U/µl Taq polymerase (QIAGEN) and H2O to 1 µl of DNA at the respective concentrations.

Linear amplification was carried out as follows: 5 min at 98 °C for initial denaturation, 50 cycles of 1 min at 98 °C, 45 sec at 60 °C, 90 sec at 72 °C and 10 min at 72 °C for final elongation. After the first 50 cycles of linear amplification, 2.5 U of Taq polymerase was added to each sample and a second 50-cycle amplification was performed. Magnetic beads solution was prepared as follows: 20 µl of Dynabeads (ThermoScientific) were placed on a magnetic particle concentrator (MPC) for 1 min and the supernatant was discarded; the attached beads were then resuspended in 40 µl of PBS with 0.1% BSA; the beads were then collected on the MPC and the supernatant was discarded; this step was done twice. The beads were then washed in 20 µl binding solution (BS), collected with the MPC and the supernatant was then discarded. The beads were then resuspended in 50 µl of BS. 50 µl of the magnetic beads solution was then added to the linear amplification product and gently mixed.

The DNA-bead complexes were incubated overnight at room temperature on a horizontal shaker at 300 RPM. The DNA-bead complexes were then collected on the MPC, the supernatant was discarded, and the DNA-bead complexes were resuspended in 100 µl of H2O. The DNA-bead complexes were then again collected on the MPC, the supernatant was discarded, and the DNA-bead complexes were resuspended in 20 µl of hexanucleotide priming mix, containing 1x hexanucleotide mixture (Roche), 0.2 mM dNTPs (NEB), 2 U/µl Klenow Large Fragment (NEB) and H2O, and incubated at 37 °C for 1 hour.

Then, 80 µl of H2O was added to the hexanucleotide reaction. The DNA-bead complexes were collected on the MPC, the supernatant was discarded, and the DNA-bead complexes were resuspended with 100 µl of H2O. The DNA-bead complexes were collected on the MPC, the supernatant was discarded and the DNA-bead complexes were resuspended with 20 µl of restriction digestion mixture containing 1x CutSmart buffer (NEB), 4 U/µl MseI (NEB) and H2O.

Restriction digestion was carried out at 37 °C for 1 hour to digest the unknown flanking regions. 80 µl of H2O was added to the reaction, the DNA-bead complexes were collected on the MPC, the supernatant was discarded, and the DNA-bead complexes were resuspended with 100 µl of H2O. 10 µl of ligation mix, containing 1x T4 buffer with ATP (NEB), 2 µl linker cassette, 2 U/µl T4 DNA ligase (NEB) and H2O, was added to the DNA-bead complexes suspended in 100 µl of H2O. The reaction was incubated at 37 °C for 1 hour.

µl of H2O was added to the ligation reaction, the DNA-bead complexes were collected on the MPC, the supernatant was discarded, and the DNA-bead complexes were resuspended with 100 µl of H2O. The DNA-bead complexes were collected on the MPC, the supernatant was discarded and the DNA-bead complexes were resuspended with 5 µl of 0.1 N NaOH and incubated at room temperature for 30 min to denature DNA-bead complexes. The DNA-bead complexes were collected on the MPC and the DNA was transferred to a new plate.

PCR amplification of the DNA was performed by adding 2 µl of DNA to 48 µl of PCR mix containing 1x Taq reaction buffer (QIAGEN), 0.2 mM dNTPs (NEB), 2.5 U/µl Taq polymerase (QIAGEN), 0.5 µM of forward primer L1HsG targeting the 3' human-specific consensus sequence (HPLC desalted; IDT), 0.5 µM of reverse primer LC1 targeting the linker cassette sequence (HPLC desalted; IDT), and H2O. PCR amplification was carried out as follows: 5 min at 98 °C for initial denaturation, 35 cycles of 45 sec at 60 °C, 90 sec at 72 °C and 10 min at 72 °C for final elongation. PCR products were visualized for quality and size distribution on a high-sensitivity DNA chip (Agilent Bioanalyzer).

Results

The tagmentation step of the method of the invention conceptually replaces two major steps of LAM-PCR: restriction digestion and linker ligation. As shown herein, the inventors have identified multiple reasons why tagmentation is preferable to restriction digestion and linker ligation.

Advantages of tagmentation over restriction digestion.

Restriction digestion has limited amplicon size diversity. A restriction enzyme only cleaves the DNA at the enzyme's restriction site, which is very specific. Since the flanking sequence is unknown, the location of the closest restriction site in the flanking sequence, if one is present at all, is also unknown. In fact, the restriction enzyme preferably used in LAM-PCR by Schmidt et al, HpyCH4IV, only allows access to 28.7% of the human genome (Figure 9).

These limitations give rise to two major complications. First, an insertion may not be detected due to a failure to cleave the DNA sufficiently close to the insertion site to generate an appropriately sized amplicon. Second, since shorter DNA fragments are usually more efficiently amplified, insertions with a restriction site close to the insertion site will be preferentially amplified and may outcompete longer amplicons.

The inventors' bioinformatic analysis concludes that the most appropriate restriction enzyme that could be used would generate amplicons of an appropriate size for only about 40% of insertions present in the reference genome (Figure 9). Therefore, the risk of missing uncatalogued insertions with a LAM-PCR approach is large.

Tagmentation (Tn5) avoids these two limitations by having a large diversity of cleavage sites. This effectively allows 100% access to the human genome, so the tagmentation approach has no limitation in terms of genome coverage. Furthermore, the amplicon size can be modulated by modifying the Tn5 concentration, which can be exploited to obtain the desired size.
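The idea behind such an accessibility estimate can be illustrated with a toy calculation: for each insertion, find the nearest restriction site in the flanking sequence and ask whether the resulting fragment falls within a usable size window. The sketch below is not the inventors' bioinformatic analysis. The function names, the size window and the simplification of measuring the distance to the start of the recognition site are assumptions; TTAA is the recognition sequence of MseI, the enzyme used in this Example.

```python
def distance_to_site(flank, site="TTAA"):
    """Distance from the insertion junction to the nearest restriction site.

    flank: flanking genomic sequence, starting at the insertion junction.
    Returns None when the flank contains no recognition site.
    """
    index = flank.find(site)
    return None if index < 0 else index


def fraction_accessible(flanks, min_len, max_len, site="TTAA"):
    """Fraction of insertions whose nearest site gives a usable fragment."""
    usable = 0
    for flank in flanks:
        distance = distance_to_site(flank, site)
        if distance is not None and min_len <= distance <= max_len:
            usable += 1
    return usable / len(flanks)
```

Insertions whose nearest site is too close, too far away or absent are counted as inaccessible, which is the situation that tagmentation avoids.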

The restriction digestion reaction is not efficient enough to be used on the low amount of DNA which is present in a single cell. The inventors posit that tagmentation by Tn5 transposase is a much more efficient reaction and this contributes to the better performance of the method of the present invention over LAM-PCR.

Tagmentation is arguably more efficient than ligation, which is a crucial aspect in the context of low-input reactions. Ligation reactions may also have substantially longer incubation periods.

In summary, the method of the invention is faster and allows for the analysis of the entire human genome, compared to roughly 30% for LAM-PCR. The method of the invention is also more sensitive compared to previous methods.

The reproducibility of LAM-PCR can be assessed without sequencing, since the amplicon sizes are determined by the restriction enzyme. The inventors therefore ran capillary electrophoresis using the Agilent Bioanalyzer instead. For the method to be considered reliable, the size distribution of the DNA fragments should be very similar across samples.

For each trace, each peak corresponds to one or more individual insertions. These traces should look very similar, since the DNA was extracted from the same individual.

The results indicate that while the LAM-PCR assay is reproducible at 60 ng of DNA, corresponding to about 10 000 cells, the assay is not reliable below 60 ng (representative Bioanalyzer traces shown in Figure 10).
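The correspondence between DNA mass and cell number used above can be checked with simple arithmetic. The sketch assumes roughly 6 pg of genomic DNA per diploid human cell, a round figure that is consistent with the values stated in this Example (60 ng for about 10 000 cells, 0.006 ng for a single cell); the function name is hypothetical.

```python
def cell_equivalents(dna_ng, pg_per_cell=6.0):
    """Approximate number of diploid cells represented by a DNA mass.

    pg_per_cell is an assumed round value for one diploid human genome.
    """
    return dna_ng * 1000.0 / pg_per_cell


# Input amounts of the dilution series when 1 µl of template is used.
for dna_ng in (60, 6, 0.6, 0.06, 0.006):
    print(f"{dna_ng} ng ~ {cell_equivalents(dna_ng):g} cell equivalents")
```

Under this assumption the lowest input of the dilution series corresponds to approximately one cell.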

Because of the difference between tagmentation and restriction digestion it is difficult to show a direct comparison between the method of the invention and LAM-PCR. While restriction digestion should produce very distinct DNA traces based on the restriction site, tagmentation produces traces similar to a "bell curve", which represents a more uniform coverage genome-wide. For a direct comparison, the inventors used the same genomic DNA as input template for the method of the invention and applied the method across varying DNA concentrations.

The method of the invention is optimized for single cells; therefore, the results where >0.6 ng of input DNA is used are variable, although minimally so (Figure 11). The inventors were able to achieve consistent traces for <0.6 ng DNA input and no background for negative (water) controls. The inventors were also able to achieve the optimal amplicon size of 400-500 bp when 0.006 ng DNA is used as input, representative of a single cell.

REFERENCES

Adey, A. & Shendure, J. Ultra-low-input, tagmentation-based whole-genome bisulfite sequencing. Genome Res. 22:1139-1143 (2012).

Biasco, L. et al. In Vivo Tracking of Human Hematopoiesis Reveals Patterns of Clonal Dynamics during Early and Steady-State Reconstitution Phases. Cell Stem Cell, 19:107-119 (2016).

Chen C. et al. Single-cell whole-genome analyses by Linear Amplification via Transposon Insertion (LIANTI). Science, 356(6334):189-194 (2017).

Drmanac et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science, 327(5961):78-81 (2010).

Evrony, G. D. et al. Cell Lineage Analysis in Human Brain Using Endogenous Retroelements. Neuron 85:49-59 (2015).

Evrony, G. D. et al. Single-neuron sequencing analysis of L1 retrotransposition and somatic mutation in the human brain. Cell 151:483-496 (2012).

Eid J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323(5910):133-138 (2009).

Ewing A. Transposable element detection from whole genome sequence data. Mobile DNA. 6(24) https://doi.org/10.1186/s13100-015-0055-3 (2015).

Ewing, A. D. & Kazazian, H. H. High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res. 20:1262-1270 (2010).

Fehlmann T et al. cPAS-based sequencing on the BGISEQ-500 to explore small non-coding RNAs. Clinical Epigenetics, 8(123) (2016).

Fonseca N et al. Tools for mapping high-throughput sequencing data. Bioinformatics, 28(24):3169-3177 (2012).

Hamming, R. W. Error detecting and error correcting codes. The Bell System Technical Journal. 29 (2): 147-160 (1950).

Kretzschmar K and Watt F. Lineage Tracing. Cell 148(1-2):33-45 (2012).

Lahens N et al. A comparison of Illumina and Ion Torrent sequencing platforms in the context of differential gene expression. BMC Genomics 18(602) (2017).

Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 412:565-566 (2001).

Ludwig L.S. et al. Lineage Tracing in Humans Enabled by Mitochondrial Mutations and Single-Cell Genomics. Cell 176:1325-1339 (2019).

Li H et al. Applications of genome editing technology in the targeted therapy of human diseases: mechanisms, advances and prospects. Signal Transduction and Targeted Therapy 5(1) https://doi.org/10.1038/s41392-019-0089-y (2020).

Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589-595 (2010).

Liscovitch-Brauer N et al. Profiling the genetic determinants of chromatin accessibility with scalable single-cell CRISPR screens. Nat. Biotechnol. 39:1270-1277 (2021).

Hestand MS et al. Polymerase specific error rates and profiles identified by single molecule sequencing. Mutat Res. 784-785:39-45 (2016).

Meyer M and Kircher Ni. Illumina Sequencing Library Preparation for Highly Multiplexed Target Capture and Sequencing. Cold Spring Harb Protoc. doi:10.1101/pdb.prot5448 (2010).

Nyren P and Lundin A. Enzymatic method for continuous monitoring of inorganic pyrophosphate synthesis. Analytical Biochemistry. 151(2):504-509 (1985).

PCR Primer: A Laboratory Manual. New York: Cold Spring Harbor Press, 1995.

Picelli, S. et al. Tn5 transposase and tagmentation procedures for massively scaled sequencing projects. Genome Res. 24:2033-2040 (2014).

Porreca GJ, Shendure J and Church GM. Polony DNA sequencing. Curr Protoc Mol Biol. 7 Unit 7.8 (2006).

Sabina J and Leamon J. Bias in Whole Genome Amplification: Causes and Considerations. Methods Mol Biol. 1347:15-41 (2015).

Schmidt, M. et al. High-resolution insertion-site analysis by linear amplification-mediated PCR (LAM-PCR). Nat. Methods 4:1051-1057 (2007).

Shendure J and Ji H. Next-generation DNA sequencing. Nature Biotechnology 26:1135-1145 (2008).

Thompson J and Steinmann KE. Single Molecule Sequencing with a HeliScope Genetic Analysis System. Curr Protoc Mol Biol 7 Unit 7.10 (2010).

Upton, K. R. et al. Ubiquitous L1 mosaicism in hippocampal neurons. Cell 161:228-239 (2015).

Zachariadis V et al. A Highly Scalable Method for Joint Whole-Genome Sequencing and Gene-Expression Profiling of Single Cells. Mol. Cell. 80(3):541-553 (2020).

Zong C et al. Genome-wide detection of single-nucleotide and copy-number variations of a single human cell. Science 338(6114):1622-6 (2012).

Claims

1. A method for determining the location of a target sequence in the genomic DNA of a cell, the method comprising the steps of: (i) providing genomic DNA from the cell to be analysed; (ii) performing linear amplification on the genomic DNA using one or more oligonucleotide primer specific for the target sequence, to generate one or more single-stranded DNA molecule comprising nucleotide sequence from the target sequence and the flanking genomic DNA; (iii) generating double-stranded DNA from the one or more single-stranded DNA molecule of step (ii); (iv) tagmenting and sequencing the double-stranded DNA of step (iii), to determine the nucleotide sequence of the target sequence and the flanking genomic DNA; and (v) based on the nucleotide sequence obtained in step (iv), determining the location of the target sequence in the genomic DNA of the cell.
2. The method according to Claim 1, wherein the genomic DNA provided in step (i) is genomic DNA from a single cell.
3. The method according to Claim 1 or 2, wherein the target sequence is heterologous to the genomic DNA of the cell.
4. The method according to any preceding claim, wherein the target sequence has been integrated into the genomic DNA of the cell.
5. The method according to any preceding claim, wherein the target sequence is 1 to 5,000 base pairs in length, preferably 10 to 1,000 base pairs in length, more preferably 50 to 500 base pairs in length.
6. The method according to any preceding claim, wherein the target sequence is present as a single copy in the genomic DNA of the cell, or is present as multiple copies in the genomic DNA of the cell.
7. The method according to Claim 6, wherein the target sequence is present in 1 to 1,000,000 copies in the genomic DNA, preferably present in 1 to 100,000 copies in the genomic DNA, more preferably present in 1 to 10,000 copies in the genomic DNA.
8. The method according to any preceding claim, wherein Step (ii) comprises a linear amplification, in the presence of one or more amplification primer which comprises a region that is specific for the target sequence.
9. The method according to any preceding claim, wherein Step (iii) comprises subjecting the single-stranded DNA molecule to second-strand synthesis in the presence of degenerate hexamer oligonucleotides, to generate double-stranded DNA.
10. The method according to any preceding claim, wherein Step (iv) comprises: - tagmenting the double-stranded DNA to generate a population of tagmented DNA fragments; - amplifying the population of tagmented fragments; - indexing the amplified fragments to obtain one or more sequencing library; and - sequencing the one or more sequencing libraries.
11. The method according to any preceding claim, wherein tagmentation generates fragments between 1 to 2,000 base pairs in length, preferably between 100 to 1,000 base pairs in length, more preferably between 500 to 700 base pairs in length.
12. The method according to any of Claims 10 or 11, wherein amplification is performed in the presence of one or more pairs of oligonucleotide primers, wherein each pair comprises a first oligonucleotide primer that is specific for the target sequence, and a second oligonucleotide primer that is complementary to a sequence introduced by tagmentation in step (iv).
13. The method according to any preceding claim, wherein Step (v) comprises the step of determining the identity and/or location of the nucleotide sequence from step (iv) in the genomic DNA of the cell.
14. The method according to any preceding claim, wherein the genomic DNA provided in step (i) has not been subjected to whole-genome amplification.
15. The method according to any preceding claim, wherein the cell to be analysed is selected from the following: a eukaryotic cell (for example, from an animal, a plant, or a fungus), a bacterial cell (for example, from Eubacteria), or an archaeal cell (for example, from Archaebacteria).
16. The method according to Claim 15, wherein the cell to be analysed is provided in a biological sample, for example, a biopsy of a tissue, a whole tissue, blood, serum, urine, or saliva.
17. A method for determining the location of a target sequence in the genomic DNA of more than one cell in a population of different cells, the method comprising the steps of: (i) providing genomic DNA from more than one cell in the population of different cells to be analysed; (ii) performing linear amplification of the genomic DNA from the more than one cell in the population of different cells using one or more oligonucleotide primer specific for the target sequence, to generate from the genomic DNA from the more than one cell in the population of different cells one or more single-stranded DNA molecule comprising sequence from the target sequence and the flanking genomic DNA; (iii) generating double-stranded DNA from the one or more single-stranded DNA molecule of step (ii); (iv) tagmenting and sequencing the double-stranded DNA of step (iii), to determine the nucleotide sequence of the target sequence and the flanking genomic DNA; and (v) based on the nucleotide sequence obtained in step (iv), determining the location of the target sequence in the genomic DNA of the more than one cell in the population of different cells.
18. The method according to Claim 17, wherein the method determines the location of a target sequence in the genomic DNA of every cell in the population of different cells, or substantially every cell in the population of different cells.
19. The method according to Claim 17 or Claim 18, wherein at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or 90% or more of the cells in the population of different cells is analysed, preferably simultaneously.
20. The method according to any of Claims 17-19, wherein between 1-1,000,000 cells from the cell population are analysed, preferably simultaneously.
21. The method according to any of Claims 17-20, wherein the target sequence is heterologous to the genomic DNA of the more than one cell in the population.
22. The method according to any of Claims 17-21, wherein the target sequence has been integrated into the genomic DNA of the more than one cell in the population.
23. The method according to any of Claims 17-22, wherein the target sequence is 1 to 5,000 base pairs in length, preferably 10 to 1,000 base pairs in length, more preferably 50 to 500 base pairs in length.
24. The method according to any of Claims 17-23, wherein the target sequence is present as a single copy in the genomic DNA of the more than one cell in the population, or is present as multiple copies in the genomic DNA of the more than one cell in the population.
25. The method according to Claim 24, wherein the target sequence is present in 1 to 1,000,000 copies in the genomic DNA, preferably present in 1 to 100,000 copies in the genomic DNA, more preferably present in 1 to 10,000 copies in the genomic DNA.
26. The method according to any of Claims 17-25, wherein in step (i), individual cells from the population of different cells are sorted into one or more microwell plates, most preferably, 96- or 384-well microwell plates.
27. The method according to any of Claims 17-26, wherein step (i) comprises providing the genomic DNA from the more than one cell in the population of different cells as a separate sample.
28. The method according to any of Claims 17-27, wherein Step (ii) comprises a linear amplification, in the presence of one or more amplification primer which comprises a region that is specific for the target sequence.
29. The method according to any of Claims 17-28, wherein Step (iii) comprises subjecting the single-stranded DNA molecule to second-strand synthesis in the presence of degenerate hexamer oligonucleotides, to generate double-stranded DNA.
30. The method according to any of Claims 17-29, wherein Step (iv) comprises: - tagmenting the double-stranded DNA to generate a population of tagmented DNA fragments; - amplifying the population of tagmented fragments; - indexing the amplified fragments to obtain one or more sequencing library; and - sequencing the one or more sequencing library.
31. The method according to any of Claims 17-30, wherein tagmentation generates fragments between 1 to 2,000 base pairs in length, preferably between 100 to 1,000 base pairs in length, more preferably between 500 to 700 base pairs in length.
32. The method according to Claim 30 or 31, wherein amplification is performed in the presence of one or more pairs of oligonucleotide primers, wherein each pair comprises a first oligonucleotide primer that is specific for the target sequence, and a second oligonucleotide primer that is complementary to a sequence introduced by tagmentation in step (iv).
33. The method according to any of Claims 17-32, wherein Step (v) comprises the step of determining the identity and/or location of the nucleotide sequence from step (iv) in the genomic DNA of the more than one cell in the population of different cells.
34. The method according to any of Claims 17-33, wherein the genomic DNA provided in step (i) has not been subjected to whole-genome amplification.
35. The method according to any of Claims 17-34, wherein the cell to be analysed is selected from the following: a eukaryotic cell (for example, from an animal, a plant, or a fungus), a bacterial cell (for example, from Eubacteria), or an archaeal cell (for example, from Archaebacteria).
36. The method according to Claim 35, wherein every cell in the population of different cells to be analysed is provided in a biological sample, for example, a biopsy of a tissue, a whole tissue, blood, serum, urine, or saliva.
37. A method for determining the presence and/or location of an integrated target sequence in the genome of a cell, the method comprising the steps of: (i) integrating a target sequence into the genomic DNA of a cell, and providing genomic DNA from the cell; (ii) performing linear amplification of the genomic DNA using one or more oligonucleotide primer specific for the integrated target sequence, to generate one or more single-stranded DNA molecule comprising sequence from the integrated target sequence and the flanking genomic DNA; (iii) generating double-stranded DNA from the one or more single-stranded DNA molecule of step (ii); (iv) tagmenting and sequencing the double-stranded DNA of step (iii), to determine the sequence of the integrated target sequence and the flanking genomic DNA; and (v) based on the nucleotide sequence obtained in step (iv), determining the presence and/or location of the integrated target sequence in the genome of the cell.
38. The method according to Claim 37 wherein the genomic DNA provided in step (i) is from a single cell, or is from more than one cell in a population of different cells.
39. The method according to Claim 37 or 38, wherein at least 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% or 90% or more of the cells in the cell population are analysed, preferably simultaneously.
40. The method according to any of Claims 37-39, wherein between 1-1,000,000 cells from the cell population are analysed, preferably simultaneously.
41. The method according to any of Claims 37-40, wherein the step of integrating the target sequence into the genome of the cell or one or more cell in a population of different cells, is performed using gene-editing technique, for example using CRISPR or a CRISPR-like system.
42. The method according to any of Claims 37-41, wherein the target sequence is present as a single copy in the genomic DNA of the cell, or is present as a single copy in the genomic DNA of one or more cell in a population of different cells, or is present as multiple copies in the genomic DNA of the cell, or is present as multiple copies in the genomic DNA of one or more cell in a population of different cells.
43. The method according to any of Claims 37-42, wherein the target sequence has been integrated into the genomic DNA of the cell, or one or more cell in a population of different cells, in 1 to 1,000,000 copies in the genomic DNA, preferably in 1 to 100,000 copies in the genomic DNA, more preferably in 1 to 10,000 copies in the genomic DNA.
44. The method according to any of Claims 37-43, wherein the target sequence is heterologous to the genomic DNA of the cell.
45. The method according to any of Claims 37-44, wherein the target sequence is 1 to 5,000 base pairs in length, preferably 10 to 1,000 base pairs in length, more preferably 50 to 500 base pairs in length.
46. The method according to any of Claims 37-45, wherein in step (i), the cell, or one or more cell from a population of different cells, are sorted into one or more microwell plates, most preferably, 96- or 384-well microwell plates.
47. The method according to any of Claims 37-46, wherein step (i) comprises providing the genomic DNA from the more than one cell in the population of different cells as a separate sample.
48. The method according to any of Claims 37-47, wherein Step (ii) comprises a linear amplification, in the presence of one or more amplification primer which comprises a region that is specific for the target sequence.
49. The method according to any of Claims 37-48, wherein Step (iii) comprises subjecting the single-stranded DNA molecule to second-strand synthesis in the presence of degenerate hexamer oligonucleotides, to generate double-stranded DNA.
50. The method according to any of Claims 37-49, wherein Step (iv) comprises: tagmenting the double-stranded DNA to generate a population of tagmented DNA fragments; amplifying the population of tagmented fragments; indexing the amplified fragments to obtain one or more sequencing library; and sequencing the one or more sequencing library.
51. The method according to any of Claims 37-50, wherein tagmentation generates fragments between 1 to 2,000 base pairs in length, preferably between 100 to 1,000 base pairs in length, more preferably between 500 to 700 base pairs in length.
52. The method according to Claim 50 or 51, wherein amplification is performed in the presence of one or more pairs of oligonucleotide primers, wherein each pair comprises a first oligonucleotide primer that is specific for the target sequence, and a second oligonucleotide primer that is complementary to a sequence introduced by tagmentation in step (iv).
53. The method according to any of Claims 37-52, wherein Step (v) comprises the step of determining the identity and/or location of the nucleotide sequence from step (iv) in the genomic DNA of the cell.
54. The method according to any of Claims 37-53, wherein the genomic DNA provided in step (i) has not been subjected to whole-genome amplification.
55. The method according to any of Claims 37-54, wherein the cell, or one or more cell in the population of different cells, to be analysed is selected from the following: a eukaryotic cell (for example, from an animal, a plant, or a fungus), a bacterial cell (for example, from Eubacteria), or an archaeal cell (for example, from Archaebacteria).
56. The method according to Claim 55, wherein the cell, or one or more cell in the population of different cells to be analysed, is provided in a biological sample, for example, a biopsy of a tissue, a whole tissue, blood, serum, urine, or saliva.
57. A kit of parts for determining the location of a target sequence in the genomic DNA of a cell, or population of cells, wherein the kit comprises: (i) one or more oligonucleotide, amplification primer and/or oligonucleotide primer pairs as defined in any one of Claims 1 to 56; (ii) a DNA polymerase and a transposase; (iii) optionally, instructions for use.
58. A kit of parts according to Claim 57, wherein the transposase is Tn5 transposase.
59. A library of amplified sequences obtained using the method according to any one of Claims 1 to 56.
60. A method, or a kit, or a library, substantially as described herein with reference to the accompanying claims, description, examples and/or figures.