WO2023014967A1

WO2023014967A1 - Methods for detecting indel produced by genome editing protocol

Info

Publication number: WO2023014967A1
Application number: PCT/US2022/039567
Authority: WO
Inventors: Dario Boffelli; Stacia WYMAN; David IK MARTIN; Wendy J. MAGIS; Mark Dewitt
Original assignee: The Regents Of The University Of California; Children's Hospital & Research Center At Oakland
Priority date: 2021-08-05
Filing date: 2022-08-05
Publication date: 2023-02-09

Abstract

Certain embodiments of the present invention provide a method for unbiased discovery of the full spectrum of indels induced by a genome editing protocol. In certain embodiments, the present invention provides a method for discovering large and/or distant indels induced by a genome editing protocol.

Description

METHODS FOR DETECTING INDEL PRODUCED BY GENOME EDITING PROTOCOL

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims priority to U.S. Provisional Application No. 63/229,943, filed August 5, 2021, the contents of which are incorporated by reference herein.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was made with government support under HL 151319 awarded by National Institutes of Health. The government has certain rights in the invention.

BACKGROUND

Gene editing typically involves the use of a targeted nuclease to induce double-strand DNA breaks (DSBs) at specific genomic sites. DSBs are then repaired by one of two cellular mechanisms: homology-directed repair (HDR) uses a DNA template to repair the DSB, while nonhomologous end joining (NHEJ) directly repairs the DSB but frequently creates an insertion or deletion mutation (indel) at the DSB site. Because HDR is usually less efficient than NHEJ, even protocols that use a DNA template to edit the region around a DSB result in a large proportion of repaired alleles containing an indel. Current methods for determining the genotypes produced by genome editing involve Polymerase Chain Reaction (PCR) amplification of DNA fragments followed by deep sequencing. These methods have limitations and do not reveal the full extent of the induced indel landscape. Accordingly, new and improved methods for discovering indels induced by a genome editing process are needed.

SUMMARY

Certain embodiments of the invention provide a method of providing an unbiased, full landscape of mutations induced by a genome editing protocol (e.g., CRISPR-Cas, TALEN, or ZFN based protocol, or other genome-editing protocol).

Certain embodiments of the invention provide a method of identifying in a sample a DNA variant induced by a genome editing protocol, comprising: contacting a genomic DNA of the genome edited sample with one or more targeted nucleases (e.g., one targeted nuclease, or a pair of targeted nucleases) that is capable of excising a DNA fragment (e.g., a high molecular weight DNA fragment) from the genomic DNA; isolating the DNA fragment; and sequencing the isolated DNA fragment; wherein the genomic DNA comprises an editing site targeted by the genome editing protocol, and the DNA fragment comprises the editing site and the DNA variant (e.g., large and/or remote DNA variant).

DETAILED DESCRIPTION

Genome editing holds great promise in a wide range of applications from advancing basic research to revolutionizing treatment for certain intractable diseases. Genome editing protocols are known in the art, and the field continues to evolve. Currently CRISPR/Cas based protocols are efficient and facile genome editing approaches. Before the advent of CRISPR/Cas, transcription activator-like effector nuclease (TALEN), and Zinc finger nuclease (ZFN) based platforms were also widely adopted genome editing technologies.

In the process of genome editing, a double stranded break (DSB) and/or a single stranded break may be generated by a targeted nuclease or nickase at a specifically targeted editing site. During the DNA repair or ligation for the break, unintended DNA modifications may be randomly introduced as DNA variant artifacts. DNA repair mechanisms often involved in genome editing processes may include, but are not limited to, homology directed repair (HDR) and non-homologous end joining (NHEJ). Without wanting to be bound by theory, the NHEJ process may be particularly prone to introducing unintended DNA modifications relative to the original DNA sequence. DNA repair or ligation mechanism(s), however, are not fully elucidated. In addition, the unbiased, full spectrum of DNA variants (e.g., indels) unintentionally introduced by a genome editing process is not well characterized or understood.

Conventionally, validation and characterization of post-editing outcomes involve amplifying a DNA fragment surrounding the editing site via polymerase chain reaction (PCR). For example, a pair of PCR primers are designed and located upstream and downstream of the editing site. Depending on the nature of PCR primers and the genome-edited template DNA, however, PCR reaction kinetics may bias the amplified fragments, leading to a skewed and/or incomplete representation of the indel landscape. For example, since a DNA variant can be unintentionally and unpredictably introduced by a genome editing protocol, the resulting template DNA may lack the sequence sufficiently complementary to the designed PCR primer or even lack the sequence entirely due to deletion. Therefore, PCR based post-editing assessments or quality controls by themselves may be insufficient to faithfully enumerate the whole gamut of indels. There are different types of DNA variants (e.g., indels) that have been unrecognized or underappreciated so far. The inadequacy of current post-editing evaluation workflow could have major consequences for genome editing applications including their adoption in medicine to deliver effective and safe therapies. Certain embodiments of the invention provide methods of generating a catalogue of mutations induced by a genome editing protocol. Certain embodiments described herein provide efficient methods suitable to provide an unbiased, full landscape of mutations induced by a genome editing protocol, including but not limited to, large indels, and/or remote indels that are distant from the targeted editing site.

In certain embodiments, the invention provides a method of identifying in a sample a DNA variant unintendedly induced by a genome editing protocol, comprising: contacting a genomic DNA of the genome edited sample with one or more targeted nucleases (e.g., one targeted nuclease or a pair of targeted nucleases) that is capable of excising a DNA fragment from the genomic DNA, isolating the DNA fragment, and sequencing the DNA fragment; wherein the genomic DNA comprises an editing site targeted by the genome editing protocol, and the DNA fragment comprises the editing site and the DNA variant. In certain embodiments, the sample is genome edited with a genome editing protocol selected from the group consisting of CRISPR-Cas based protocol, TALEN based protocol, and ZFN based protocol. In certain embodiments, the sample is genome edited using a CRISPR-Cas based genome-editing protocol.

In certain embodiments, the invention provides a method of identifying in a sample a DNA variant unintendedly induced by a genome editing protocol, comprising: contacting a genomic DNA of the genome edited sample with one or more targeted nucleases (e.g., one targeted nuclease or a pair of targeted nucleases) that is capable of excising a high molecular weight (HMW) DNA fragment from the genomic DNA, isolating the HMW DNA fragment, and sequencing the HMW DNA fragment; wherein the genomic DNA comprises an editing site targeted by the genome editing protocol, and the HMW DNA fragment comprises the editing site and the DNA variant.

The term “DNA variant” as described herein refers to an unintended DNA sequence modification induced by a genome editing protocol. In the process of genome editing, a double stranded break (DSB) may be generated by a targeted nuclease at a specifically targeted editing site. During DNA repair processes, unintended DNA modifications may be randomly introduced as a side effect on the edited DNA molecule. In certain embodiments, the DNA variant is an indel (insertion or deletion). In certain embodiments, the DNA variant is a deletion. In certain embodiments, the DNA variant is an insertion. In certain embodiments, the DNA variant is a point mutation. In certain embodiments, two or more point mutation DNA variants may be contiguous; for example, multiple point mutations in a row form a segment mutation (e.g., 2 base pairs (bp) or longer in length), so that the segment sequence is entirely replaced but the length of the segment does not change. In certain embodiments, a DNA variant may be disadvantageous, or even harmful (e.g., to an edited cell, or to a host, or to a recipient of the edited cell). In certain embodiments, a DNA variant may be harmless, or even beneficial. The term “DNA variant” as described herein also encompasses DNA rearrangement and/or translocation as an unintended DNA modification induced by a genome editing protocol. For example, chromothripsis is a mutational phenomenon of clustered chromosomal rearrangements occurring in localized genomic region(s). DNA rearrangement and/or translocation may involve the deletion of a DNA segment at one location of the genomic DNA and the insertion of the DNA segment at another location of the genomic DNA. In certain embodiments, the DNA variant is a deletion of a DNA segment that has been, completely or partially, rearranged or translocated or inserted into another location of the genomic DNA. 
In certain embodiments, the DNA variant is an insertion of a DNA segment that has been, completely or partially, rearranged or translocated or deleted at another location of the genomic DNA. In certain embodiments, the DNA variant is an insertion of a DNA segment that has been, completely or partially, copied from another location of the genomic DNA so the copy number of the DNA segment might be changed (e.g., increased) in the genome.

Methods described herein are suitable for discovering DNA variants that may have been missed by currently available protocols. In certain embodiments the DNA variants include short and long DNA variants, near and remote DNA variants, and isolated and clustered DNA variants. In certain embodiments, the DNA variant has a length of at least about 1bp, 2bp, 3bp, 4bp, 5bp, 10bp, 20bp, 30bp, 40bp, 50bp, 60bp, 70bp, 80bp, 90bp, 100bp, 200bp, 300bp, 400bp, 500bp, 600bp, 700bp, 800bp, 900bp, 1kb, 1.1kb, 1.2kb, 1.3kb, 1.4kb, 1.5kb, 1.6kb, 1.7kb, 1.8kb, 1.9kb, 2kb, 2.5kb, 3kb, 3.5kb, 4kb, 4.5kb, 5kb, 5.5kb, 6kb, 6.5kb, 7kb, 7.5kb, 8kb, 8.5kb, 9kb, 9.5kb, 10kb, 11kb, 12kb, 13kb, 14kb, 15kb, 16kb, 17kb, 18kb, 19kb, 20kb, 25kb, 30kb, 35kb, 40kb, 45kb, 50kb, 55kb, 60kb, 65kb, 70kb, 75kb, 80kb, 85kb, 90kb, 95kb, 100kb, or longer. As used herein, a short DNA variant indicates a DNA variant that is less than about 100bp in length. As used herein, a long or large DNA variant indicates a DNA variant that is at least about 100bp in length. In certain embodiments, the large DNA variant (e.g., an indel) has a length of about 100bp, 200bp, 300bp, 400bp, 500bp, 600bp, 700bp, 800bp, 900bp, 1kb, 1.1kb, 1.2kb, 1.3kb, 1.4kb, 1.5kb, 1.6kb, 1.7kb, 1.8kb, 1.9kb, 2kb, 2.5kb, 3kb, 3.5kb, 4kb, 4.5kb, 5kb, 5.5kb, 6kb, 6.5kb, 7kb, 7.5kb, 8kb, 8.5kb, 9kb, 9.5kb, 10kb, 11kb, 12kb, 13kb, 14kb, 15kb, 16kb, 17kb, 18kb, 19kb, 20kb, 25kb, 30kb, 35kb, 40kb, 45kb, 50kb, 55kb, 60kb, 65kb, 70kb, 75kb, 80kb, 85kb, 90kb, 95kb, 100kb, or longer. In certain embodiments, the DNA variant is an indel that has a length of at least 1 kilobase (kb). In certain embodiments, the DNA variant is an indel that has a length of at least 2kb. In certain embodiments, the DNA variant is an indel that has a length of at least 3kb. In certain embodiments, the DNA variant is an indel that has a length of at least 4kb. In certain embodiments, the DNA variant is an indel that has a length of at least 5kb. 
In certain embodiments, the DNA variant is an indel that has a length of at least 6kb. In certain embodiments, the DNA variant is an indel that has a length of at least 7kb. In certain embodiments, the DNA variant is an indel that has a length of at least 8kb. In certain embodiments, the DNA variant is an indel that has a length of at least 9kb. In certain embodiments, the DNA variant is an indel that has a length of at least 10kb. As non-limiting examples, in certain embodiments, a large DNA variant (e.g., indel) has a length of about 100bp to 100kb, 1kb to 90kb, 2kb to 80kb, 5kb to 70kb, 10kb to 60kb, or 15kb to 50kb. Accordingly, the method described herein is capable of detecting a large DNA variant (e.g., having a length of about 100bp to 100kb, such as 2kb, 5kb or 15kb) as described above.

The method described herein is particularly suitable for generating an unbiased, full spectrum of DNA variants, including remote DNA variants, induced by a genome editing protocol. As used herein, a near DNA variant indicates a DNA variant that is less than about 500 bases from the editing site. As used herein, a remote DNA variant indicates a DNA variant that is at least about 500 bases from the editing site. In certain embodiments, the DNA variant (e.g., indel) is at least about 500bp, 600bp, 700bp, 800bp, 900bp, 1kb, 2kb, 3kb, 4kb, 5kb, 6kb, 7kb, 8kb, 9kb, 10kb, 15kb, 20kb, 25kb, 30kb, 35kb, 40kb, 45kb, 50kb, 55kb, 60kb, 65kb, 70kb, 75kb, 80kb, 85kb, 90kb, 95kb, 100kb, 110kb, 120kb, 130kb, 140kb, 150kb, 160kb, 170kb, 180kb, 190kb, 200kb or further, away from the editing site. In certain embodiments, the DNA variant (e.g., insertion or deletion) is at least about 500 bases away from the editing site. In certain embodiments, the DNA variant is at least about 1kb, 2kb, 3kb, 4kb or 5kb away from the editing site. In certain embodiments, the DNA variant is at least about 10kb away from the editing site. In certain embodiments, the DNA variant is at least about 20kb away from the editing site. In certain embodiments, the DNA variant is at least about 30kb away from the editing site. In certain embodiments, the DNA variant is at least about 40kb away from the editing site. In certain embodiments, the DNA variant is at least about 50kb away from the editing site. In certain embodiments, the DNA variant is at least about 60kb away from the editing site. In certain embodiments, the DNA variant is at least about 100kb away from the editing site. In certain embodiments, the DNA variant is at least about 200kb, or further, away from the editing site. As non-limiting examples, in certain embodiments, the DNA variant is about 500bp to 200kb, 1kb to 190kb, 2kb to 180kb, 5kb to 170kb, 10kb to 160kb, or 15kb to 150kb away from the editing site. 
Accordingly, the method described herein is capable of detecting a remote DNA variant (e.g., from about 500bp to 200kb away, such as 10kb, 50kb or 60kb away from the editing site) as described above.

In certain embodiments, a DNA variant may have any combination of a length described herein and a distance from the editing site described herein. Accordingly, the method described herein is capable of detecting a large and/or remote DNA variant as described herein. For example, in certain embodiments, the method described herein is capable of detecting a large and remote DNA variant that is 5kb in length and at least 60kb away from the editing site. In certain embodiments, a large and remote DNA variant may have a length of at least 100bp and be at least 500bp away from the editing site. In certain embodiments, a DNA variant may have a length of at least 300bp and be at least 800bp away from the editing site. In certain embodiments, a DNA variant may have a length of at least 500bp and be at least 1kb away from the editing site. In certain embodiments, a DNA variant may have a length of at least 800bp and be at least 3kb away from the editing site. In certain embodiments, a DNA variant may have a length of at least 1kb and be at least 5kb away from the editing site. In certain embodiments, a DNA variant may have a length of at least 2kb and be at least 10kb away from the editing site. In certain embodiments, a DNA variant may have a length of at least 3kb and be at least 20kb away from the editing site. In certain embodiments, a DNA variant may have a length of at least 4kb and be at least 30kb away from the editing site. In certain embodiments, a DNA variant may have a length of at least 5kb and be at least 40kb away from the editing site.
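
As a purely illustrative sketch, and not part of the described method, the short/large and near/remote definitions given above can be expressed as a small classifier. The names `Variant` and `classify` are hypothetical, and the terms "about 100bp" and "about 500 bases" are treated as exact cut-offs for simplicity.

```python
from dataclasses import dataclass

# Thresholds taken from the definitions in the text; "about" is treated as exact here.
LARGE_MIN_LENGTH_BP = 100    # a long/large DNA variant is at least about 100 bp in length
REMOTE_MIN_DISTANCE_BP = 500  # a remote DNA variant is at least about 500 bases from the editing site


@dataclass(frozen=True)
class Variant:
    """Hypothetical record: variant length and distance from the editing site, both in bp."""
    length_bp: int
    distance_bp: int


def classify(variant: Variant) -> tuple[str, str]:
    """Return (size class, position class) for one DNA variant."""
    size = "large" if variant.length_bp >= LARGE_MIN_LENGTH_BP else "short"
    position = "remote" if variant.distance_bp >= REMOTE_MIN_DISTANCE_BP else "near"
    return size, position


if __name__ == "__main__":
    # The 5kb variant located 60kb from the editing site, as in the example above.
    print(classify(Variant(length_bp=5_000, distance_bp=60_000)))
```

Under these assumptions, the 5kb variant located 60kb from the editing site is classified as both large and remote.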

In certain embodiments, methods described herein are suitable for generating a full spectrum of DNA variants or characterizing extensive DNA variants induced by a genome-editing protocol. In certain embodiments, one DNA variant is characterized. In certain embodiments, two or more DNA variants are characterized. In certain embodiments, at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90 or more DNA variants are characterized. In certain embodiments, at least 2 DNA variants are characterized. In certain embodiments, at least 5 DNA variants are characterized. In certain embodiments, at least 10 DNA variants are characterized. In certain embodiments, at least 20 DNA variants are characterized. In certain embodiments, at least 30 DNA variants are characterized. In certain embodiments, at least 40 DNA variants are characterized. In certain embodiments, at least 50 DNA variants are characterized. In certain embodiments, at least 60 DNA variants are characterized. As used herein, an isolated DNA variant indicates one DNA variant having no other DNA variant in the proximity within 5kb upstream of the DNA variant 5' end and 5kb downstream of the DNA variant 3' end. As used herein, clustered DNA variants indicate that the distance between two DNA variants is shorter than 5kb (distance from 3' end of one DNA variant to the 5' end of another DNA variant is <5kb).
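
The isolated/clustered definitions above can be illustrated with a minimal sketch. It assumes, hypothetically, that each variant is given as a `(start, end)` coordinate pair on the same DNA molecule, with `start` at the 5' end and `end` at the 3' end, and that a gap of exactly 5kb counts as isolated. The function names are not taken from the source.

```python
CLUSTER_GAP_BP = 5_000  # variants separated by less than 5 kb are treated as clustered


def gap_bp(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Distance from the 3' end of the upstream variant to the 5' end of the
    downstream one; 0 if the two intervals overlap or touch."""
    (a_start, a_end), (b_start, b_end) = sorted((a, b))
    return max(0, b_start - a_end)


def label_variants(intervals: list[tuple[int, int]]) -> dict[tuple[int, int], str]:
    """Label each variant 'clustered' if any other variant lies within 5 kb,
    otherwise 'isolated'."""
    labels = {}
    for i, current in enumerate(intervals):
        has_neighbor = any(
            gap_bp(current, other) < CLUSTER_GAP_BP
            for j, other in enumerate(intervals)
            if j != i
        )
        labels[current] = "clustered" if has_neighbor else "isolated"
    return labels


if __name__ == "__main__":
    # Hypothetical coordinates: two nearby variants and one distant variant.
    example = [(1_000, 1_200), (3_000, 3_100), (20_000, 25_000)]
    for interval, label in label_variants(example).items():
        print(interval, label)
```

The pairwise comparison is quadratic in the number of variants, which is adequate for the tens of variants contemplated above.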

The term “editing site” as described herein refers to the intended target site on a genomic DNA (e.g., chromosomal, mitochondrial or plasmid DNA) for editing in a genome editing protocol. The target site for editing is purposefully and rationally chosen. For example, the editing site in a CRISPR-Cas based genome editing protocol may be targeted by a specifically designed guide RNA (gRNA), while in a TALEN or ZFN based protocol the editing site is targeted via a specifically designed TALE nuclease or zinc finger nuclease. The desired outcome of genome editing at the editing site may include, but is not limited to, correction of a single point mutation at the editing site, replacement of a deleterious DNA segment with a beneficial DNA segment at the editing site, knocking out of an undesirable DNA segment at the editing site, or knocking in of a desirable DNA segment at the editing site. The editing site may be located at a protein coding region or a non-coding region. The editing site may be located at a regulatory region such as a promoter, enhancer, or 5' or 3'-untranslated region (UTR). The editing site may be located at a transposon or retrotransposon. The editing site may be located at a microRNA coding region. The editing site may be located at a site encoding a splice signal. The editing site may be located within chromosomal DNA, mitochondrial DNA, or plasmid DNA.

The editing site may be a single nucleotide in length (e.g., for point mutation editing). The editing site may be a DNA segment (longer than a single nucleotide) that has a 5' end and a 3' end. As non-limiting examples, the editing site (e.g., for gene replacement, or knock-out) may have a length of about 1bp, 2bp, 3bp, 4bp, 5bp, 10bp, 25bp, 50bp, 100bp, 200bp, 500bp, 1kb, 2kb, 5kb, 10kb, or longer. The editing site may have the same length prior to and after the genome editing protocol (e.g., single point edit, or replacement of a DNA segment of equal length). The editing site may have different lengths prior to and after the genome editing protocol (e.g., knock out, knock in, or replacement of a DNA segment of differing length). Given the rational design nature of a genome editing protocol, the editing site and the immediately adjacent nucleotide(s) surrounding the editing site have definitive location/loci in a genome map (e.g., chromosome map), and the sequence at the editing site and its close proximity can be located and probed precisely before and/or after the genome editing protocol. For example, prior to a genome editing process, a cell may carry a disease-causing allele at the editing site, the sequence of which can be probed and ascertained with PCR/sequencing or any other suitable sequencing, genotyping or diagnostic methods. Likewise, after the genome editing process, DNA sequence at the editing site can be probed and ascertained to provide an indication whether correct DNA sequence is now present at the editing site as intended. Accordingly, the editing site of a genomic DNA in a cell may have successfully edited DNA sequence as intended by the genome editing protocol. Alternatively, the editing site of a genomic DNA in a cell may have the original, unedited DNA sequence prior to or after the genome editing protocol. In some instances, the editing site of a genomic DNA may display correct editing on one chromosome and not on its homologous chromosome. 
It is also possible that the editing site of a genomic DNA in a cell may have partially edited DNA sequence that falls short of the full length of the intended DNA segment for knock out, knock in, or replacement.

In certain embodiments, the methods described herein comprise contacting an edited genomic DNA with one or more targeted nucleases (e.g., one single targeted nuclease or a pair of targeted nucleases) that is capable of excising a DNA fragment, such as a high molecular weight (HMW) DNA fragment, from the edited genomic DNA.

In certain embodiments, the one or more targeted nucleases comprises one, two, or more targeted nucleases. In certain embodiments, the one or more targeted nucleases comprises one single targeted nuclease. In certain embodiments, the single targeted nuclease may be designed to cut a linear genomic DNA for excision and release of a HMW DNA fragment of interest that comprises the editing site and one end of the linear genomic DNA.

In certain embodiments, the one or more targeted nucleases comprises a pair of targeted nucleases. The pair of targeted nucleases (downstream nuclease and upstream nuclease) is designed to cut downstream and upstream of the editing site, respectively, for excision and release of a DNA fragment (e.g., HMW DNA fragment) that includes the editing site and flanking sequences.

Accordingly, the HMW DNA fragment may also comprise one or more DNA variants induced by a genome editing protocol (e.g., induced by NHEJ repair following genome editing). In certain embodiments, the HMW DNA fragment may comprise an unbiased, full spectrum of any DNA variants, including large and/or remote DNA variant(s) as described herein.

In certain embodiments, the one or more targeted nucleases (e.g., one targeted nuclease or a pair of targeted nucleases) comprises a CRISPR-Cas nuclease, a transcription activator-like effector nuclease (TALEN), a zinc-finger nuclease (ZFN), or a meganuclease. In certain embodiments, the one or more targeted nucleases comprise a CRISPR-Cas nuclease. In certain embodiments, the one or more targeted nucleases comprise a CRISPR-Cas9 nuclease. In certain embodiments, the one or more targeted nucleases comprise a Streptococcus pyogenes Cas9 nuclease (SpCas9). In certain embodiments, the one or more targeted nucleases comprise a Staphylococcus aureus Cas9 nuclease (SaCas9). In certain embodiments, the one or more targeted nucleases comprise a CRISPR-Cas12a nuclease.

In certain embodiments, the pair of targeted nucleases comprises two of the same class of nucleases (e.g., two Cas nucleases, or two ZFNs). In certain embodiments, the pair of targeted nucleases comprises a pair of CRISPR-Cas9 nucleases. In certain embodiments, the pair of targeted nucleases comprises two types of nucleases within the same class (e.g., a SpCas9 and a SaCas9; or a Cas9 and a Cas12a). In certain embodiments, the pair of targeted nucleases comprises two different classes of nucleases (e.g., a Cas nuclease and a non-Cas nuclease such as a TALEN or ZFN).

The methods of using targeted nuclease systems (e.g., CRISPR-Cas, TALEN, or ZFN) for selectively creating a DNA break at a targeted location of a DNA molecule are known in the art and described herein. For example, Class 2 Clustered Regularly Interspaced Short Palindromic Repeat (CRISPR) systems, which form an adaptive immune system in bacteria, were adapted for genome engineering. Due to its comparative simplicity and adaptability, CRISPR has rapidly become the most popular genome editing approach and has gained widespread adoption in both industry and academic labs. Exemplary CRISPR-Cas systems comprise two components: a guide RNA (gRNA or sgRNA) and a CRISPR-associated endonuclease (Cas protein). The gRNA is a short synthetic RNA comprising a scaffold sequence necessary for Cas-binding and a user-defined spacer of about 20 nucleotides that defines the genomic target to be modified. Thus, one can change the genomic target of the Cas protein by changing the sequence of the gRNA. Overall, the design and generation of targeted nuclease systems, or further engineered CRISPR, TALEN and ZFN derivative systems, are known/practiced in the field and further supported by commercially available services. In addition, exemplary U.S. patents directed to targeted nuclease systems, such as U.S. Patent 8,586,363; U.S. Patent 9,393,257; U.S. Patent 9,982,277; U.S. Patent 10,266,850; and U.S. Patent 10,570,418 are incorporated by reference herein for all purposes.

The term “high molecular weight DNA” or “HMW DNA fragment” as described herein refers to a DNA molecule that is at least 10kb in length. For example, in certain embodiments, the HMW DNA fragment may have a length of at least about 10kb, 15kb, 20kb, 25kb, 30kb, 35kb, 40kb, 45kb, 50kb, 55kb, 60kb, 65kb, 70kb, 75kb, 80kb, 85kb, 90kb, 95kb, 100kb, 110kb, 120kb, 130kb, 140kb, 150kb, 160kb, 170kb, 180kb, 190kb, 200kb, 210kb, 220kb, 230kb, 240kb, 250kb, 260kb, 270kb, 280kb, 290kb, 300kb, 310kb, 320kb, 330kb, 340kb, 350kb, 360kb, 370kb, 380kb, 390kb, 400kb, 410kb, 420kb, 430kb, 440kb, 450kb, 460kb, 470kb, 480kb, 490kb, 500kb, or longer. In certain embodiments, the HMW DNA fragment has a length of at least about 15kb. In certain embodiments, the HMW DNA fragment has a length of at least about 20kb. In certain embodiments, the HMW DNA fragment has a length of at least about 50kb. In certain embodiments, the HMW DNA fragment has a length of at least about 75kb. In certain embodiments, the HMW DNA fragment has a length of at least about 100kb. In certain embodiments, the HMW DNA fragment has a length of at least about 150kb. In certain embodiments, the HMW DNA fragment has a length of at least about 200kb. In certain embodiments, the HMW DNA fragment has a length of at least about 250kb. In certain embodiments, the HMW DNA fragment has a length of at least about 300kb. In certain embodiments, the HMW DNA fragment has a length of at least about 350kb. In certain embodiments, the HMW DNA fragment has a length of at least about 400kb. In certain embodiments, the HMW DNA fragment has a length of about 10kb to 500kb, 20kb to 450kb, 30kb to 400kb, 40kb to 350kb, or 50kb to 300kb, as described above. In certain embodiments, the HMW DNA fragment has a length of about 15kb to 490kb, 25kb to 430kb, 35kb to 410kb, 45kb to 390kb, or 55kb to 360kb, as described above.

In certain embodiments, one or more targeted nucleases (e.g., a single targeted nuclease) may be used for excision and release of a DNA fragment of interest from one end of a genomic DNA (e.g., a linear genomic DNA), wherein the DNA fragment comprises the editing site, genome-editing induced DNA variant(s), and the end of the genomic DNA. In certain embodiments, the targeted nuclease cuts the linear genomic DNA at least about 10kb, 15kb, 20kb, 25kb, 30kb, 35kb, 40kb, 45kb, 50kb, 55kb, 60kb, 65kb, 70kb, 75kb, 80kb, 85kb, 90kb, 95kb, 100kb, 110kb, 120kb, 130kb, 140kb, 150kb, 160kb, 170kb, 180kb, 190kb, 200kb, 210kb, 220kb, 230kb, 240kb, 250kb, 260kb, 270kb, 280kb, 290kb, 300kb, 310kb, 320kb, 330kb, 340kb, 350kb, 360kb, 370kb, 380kb, 390kb, 400kb, 410kb, 420kb, 430kb, 440kb, 450kb, 460kb, 470kb, 480kb, 490kb, 500kb, or further, away from the end (5' end or 3' end) of the linear genomic DNA. In certain embodiments, one or more targeted nucleases (e.g., a single targeted nuclease) cuts the linear genomic DNA at about 10kb to 500kb, 20kb to 450kb, 30kb to 400kb, 40kb to 350kb, or 50kb to 300kb, as described above, away from the end of the linear genomic DNA. For example, the targeted nuclease may cut the linear genomic DNA at least about 10kb (e.g., about 100kb or 200kb) downstream of the 5' end of the linear genomic DNA and release a HMW DNA fragment comprising the editing site, genome editing induced DNA variant(s), and the 5' end of the genomic DNA. Alternatively, the targeted nuclease may cut the linear genomic DNA at least about 10kb (e.g., about 100kb or 200kb) upstream of the 3' end of the linear genomic DNA and release a HMW DNA fragment comprising the editing site, genome editing induced DNA variant(s), and the 3' end of the genomic DNA. In certain embodiments, the released HMW DNA fragment of interest further comprises a telomere region. 
As used herein, a telomere region is the end of a linear chromosome and a region of repetitive nucleotide sequences that can be recognized by specialized protein(s) including telomerase. In cases wherein the linear genomic DNA comprises a telomere at the genomic DNA terminus, the distance(s) described above are measured from the cutting location to the first non-telomere nucleotide that abuts the telomere region. For example, the targeted nuclease may cut the linear genomic DNA at about 10kb to 500kb, 20kb to 450kb, 30kb to 400kb, 40kb to 350kb, or 50kb to 300kb, as described above, away from the first non-telomere nucleotide that abuts the telomere region. As one non-limiting example, the targeted nuclease may cut the linear genomic DNA at 200kb away from the first non-telomere nucleotide that abuts a telomere region of 8kb in length, releasing a DNA fragment of about 208kb in length.
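
The arithmetic in the telomere example above can be sketched as follows. This is an illustrative helper only; the function name and its parameters are hypothetical, and it assumes the released fragment spans from the cut to the physical end of the linear DNA, including the whole telomere region.

```python
def terminal_fragment_length_bp(cut_distance_bp: int, telomere_length_bp: int = 0) -> int:
    """Approximate length of a fragment released from the end of a linear genomic DNA
    by a single cut.

    cut_distance_bp: distance from the cut to the first non-telomere nucleotide
        abutting the telomere (or to the DNA end when there is no telomere).
    telomere_length_bp: length of the telomere region carried on the fragment.
    """
    if cut_distance_bp < 0 or telomere_length_bp < 0:
        raise ValueError("distances must be non-negative")
    return cut_distance_bp + telomere_length_bp


if __name__ == "__main__":
    # Worked example from the text: a cut 200 kb from the telomere boundary,
    # with an 8 kb telomere, releases a fragment of about 208 kb.
    print(terminal_fragment_length_bp(200_000, 8_000))
```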

In certain embodiments, for a pair of targeted nucleases capable of excising a DNA fragment of interest, one targeted nuclease of the pair (downstream nuclease) cuts the genomic DNA at least about 100bp, 200bp, 300bp, 400bp, 500bp, 600bp, 700bp, 800bp, 900bp, 1kb, 2kb, 3kb, 4kb, 5kb, 6kb, 7kb, 8kb, 9kb, 10kb, 15kb, 20kb, 25kb, 30kb, 35kb, 40kb, 45kb, 50kb, 55kb, 60kb, 65kb, 70kb, 75kb, 80kb, 85kb, 90kb, 95kb, 100kb, 110kb, 120kb, 130kb, 140kb, 150kb, 160kb, 170kb, 180kb, 190kb, 200kb, or further, downstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at least about 5kb downstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at least about 10kb downstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at least about 50kb downstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at least about 100kb downstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at least about 150kb downstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at least about 200kb downstream of the editing site. The distance of the downstream cutting location relative to the editing site is measured from the downstream cutting location to the first neighboring nucleotide downstream of the 3' end of the editing site.

In certain embodiments, for a pair of targeted nucleases capable of excising a DNA fragment of interest, one targeted nuclease of the pair (upstream nuclease) cuts the genomic DNA at least about 100bp, 200bp, 300bp, 400bp, 500bp, 600bp, 700bp, 800bp, 900bp, 1kb, 2kb, 3kb, 4kb, 5kb, 6kb, 7kb, 8kb, 9kb, 10kb, 15kb, 20kb, 25kb, 30kb, 35kb, 40kb, 45kb, 50kb, 55kb, 60kb, 65kb, 70kb, 75kb, 80kb, 85kb, 90kb, 95kb, 100kb, 110kb, 120kb, 130kb, 140kb, 150kb, 160kb, 170kb, 180kb, 190kb, 200kb, or further, upstream of the editing site. In certain embodiments, the upstream nuclease cuts the genomic DNA at least about 5kb upstream of the editing site. In certain embodiments, the upstream nuclease cuts the genomic DNA at least about 10kb upstream of the editing site. In certain embodiments, the upstream nuclease cuts the genomic DNA at least about 50kb upstream of the editing site. In certain embodiments, the upstream nuclease cuts the genomic DNA at least about 100kb upstream of the editing site. In certain embodiments, the upstream nuclease cuts the genomic DNA at least about 150kb upstream of the editing site. In certain embodiments, the upstream nuclease cuts the genomic DNA at least about 200kb upstream of the editing site. The distance of the upstream cutting location relative to the editing site is measured from the upstream cutting location to the first neighboring nucleotide upstream of the 5' end of the editing site. The distances of the upstream and downstream cutting locations relative to the editing site may be approximately symmetric or asymmetric. In certain embodiments, the pair of targeted nucleases cut symmetrically or asymmetrically at any combination of a downstream cutting distance described herein and an upstream cutting distance described herein.

For example, in certain embodiments, one targeted nuclease of the pair (downstream nuclease) cuts the genomic DNA at about 8kb downstream of the editing site, and the other targeted nuclease of the pair (upstream nuclease) cuts the genomic DNA at about 8kb upstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at about 50kb downstream of the editing site, and the upstream targeted nuclease cuts the genomic DNA at about 50kb upstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at about 60kb downstream of the editing site, and the upstream targeted nuclease cuts the genomic DNA at about 60kb upstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at about 100kb downstream of the editing site, and the upstream nuclease cuts the genomic DNA at about 100kb upstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at about 180kb downstream of the editing site, and the upstream nuclease cuts the genomic DNA at about 180kb upstream of the editing site.

In certain embodiments, one targeted nuclease of the pair (downstream nuclease) cuts the genomic DNA at about 10kb downstream of the editing site and the other targeted nuclease of the pair (upstream nuclease) cuts the genomic DNA at about 18kb upstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at about 72kb downstream of the editing site and the upstream nuclease cuts the genomic DNA at about 66kb upstream of the editing site. In certain embodiments, the downstream nuclease cuts the genomic DNA at about 120kb downstream of the editing site and the upstream nuclease cuts the genomic DNA at about 160kb upstream of the editing site.
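As an illustration only, the symmetric and asymmetric cutting distances above can be translated into the coordinates and expected size of the excised fragment. The following minimal Python sketch assumes 1-based inclusive coordinates and counts each distance from the first nucleotide flanking the editing site on that side; the function name `excised_fragment` and its parameters are hypothetical and are not part of the methods described herein.

```python
def excised_fragment(edit_start, edit_end, upstream_bp, downstream_bp):
    """Return (fragment_start, fragment_end, fragment_length).

    Assumptions (illustrative only):
    - coordinates are 1-based and inclusive;
    - the editing site spans edit_start..edit_end;
    - upstream_bp nucleotides are retained 5' of the editing site and
      downstream_bp nucleotides are retained 3' of it.
    """
    if edit_start > edit_end:
        raise ValueError("edit_start must not exceed edit_end")
    if upstream_bp < 0 or downstream_bp < 0:
        raise ValueError("cutting distances must be non-negative")
    fragment_start = edit_start - upstream_bp
    fragment_end = edit_end + downstream_bp
    if fragment_start < 1:
        raise ValueError("upstream cut falls before the start of the sequence")
    return fragment_start, fragment_end, fragment_end - fragment_start + 1


# Asymmetric example from the text: about 18kb upstream, about 10kb downstream,
# around a hypothetical 20 bp editing site at positions 100000..100019.
start, end, length = excised_fragment(100_000, 100_019, 18_000, 10_000)
print(start, end, length)
```

Under these assumptions the fragment length is simply the upstream distance plus the length of the editing site plus the downstream distance, which can help in choosing a size-selection window.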

In certain embodiments, the pair of targeted nucleases are CRISPR-Cas9 nucleases. Hence, the upstream and downstream cutting locations are targeted by specifically designed gRNA sequences.

In certain embodiments, the methods described herein do not involve amplifying a DNA fragment surrounding the editing site using a pair of PCR primers. In contrast, the HMW DNA is cut and released from the edited genomic DNA using one or more targeted nucleases, for example, a single targeted nuclease or a pair of targeted nucleases as described herein. Thus, the methods described herein may produce a HMW DNA fragment from a genome edited sample in a faithful and unbiased manner. However, once the HMW DNA fragment has been characterized using methods described herein, the design of appropriate PCR primers could be better informed, and PCR may then be performed as a secondary, confirmatory test for a certain location or indel(s) discovered by the methods described herein.

Methods described herein can be used to characterize a genome edited sample in an unbiased and comprehensive manner to generate full spectrum documentation of DNA variant(s) induced by a genome editing protocol. In certain embodiments, the sample comprises an edited DNA, or an edited cell comprising an edited DNA. In certain embodiments, the sample comprises a cell. In certain embodiments, the method described herein comprises lysing the sample (e.g., a cell) to release the edited genomic DNA (e.g., prior to contacting the DNA with one or more targeted nucleases, such as a single targeted nuclease or a pair of targeted nucleases as described herein).

Any cell lysis method can be used so long as the integrity of the genomic DNA is not compromised during the lysis process. In certain embodiments, cells can be lysed by chemical or biochemical methods. In certain embodiments, lysing comprises contacting the sample cell with a hypotonic solution, an enzyme (e.g., lysozyme or proteinase), and/or a cell membrane disrupting agent such as a detergent (e.g., SDS). In certain embodiments, cells can be lysed by physical or mechanical methods, including but not limited to, sonication, freeze-thawing, or other shearing methods.

In certain embodiments, the sample comprises a prokaryotic cell, or a eukaryotic cell. In certain embodiments, the sample comprises a bacterial cell, yeast cell, insect cell, plant cell, or mammalian cell. In certain embodiments, the sample comprises an E. coli cell. In certain embodiments, the sample comprises an animal cell. In certain embodiments, the sample comprises a mouse cell, a rat cell, a hamster cell, a cow cell, a pig cell, a horse cell, a dog cell, a cat cell, a fish cell, a goat cell, a camelid cell, a sheep cell, or a chicken cell. In certain embodiments, the sample comprises a zebrafish cell. In certain embodiments, the sample comprises a human cell. In certain embodiments, the sample comprises a human stem cell. In certain embodiments, the sample comprises a human somatic cell (e.g., muscle cell, or neuron).

Once a population of cells is subjected to a genome editing protocol, an aliquot of cells is taken from the population to produce a sample for use in the methods described herein, while the rest of the population is reserved for future application or disposal. In certain embodiments, the edited cells are of prophylactic and/or therapeutic use. In certain embodiments, the sample comprises an edited cell that is suitable for being administered into an animal (if the edited cell harbors the desired edits at the editing site and is free of harmful or dangerous DNA variant(s) induced by the genome editing protocol). Thus, the methods described herein provide comprehensive, robust quality control and assurance processes for subsequent applications of edited DNA or cells. In certain embodiments, the methods described herein comprise comparing the sequence of the DNA fragment, such as a HMW DNA fragment (after sequencing the HMW DNA fragment), to one or more reference sequences (e.g., the original sequence of the sample before genome editing, and/or a control sequence having the wildtype sequence of a gene). For example, the comparison may be conducted using a suitable alignment or multiple alignment bioinformatic workflow.
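As a purely illustrative sketch of comparing a sequenced fragment to a reference sequence, the Python example below lists the differences between two short strings. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for a real alignment workflow, which is an assumption made for brevity; a dedicated long-read aligner would be used in practice. The function name `list_indels` and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def list_indels(reference, read):
    """Return (kind, ref_position, sequence) tuples describing differences.

    ref_position is a 0-based offset into the reference. This is a toy
    comparison and not a substitute for a real sequence aligner.
    """
    matcher = SequenceMatcher(None, reference, read, autojunk=False)
    events = []
    for tag, r1, r2, q1, q2 in matcher.get_opcodes():
        if tag == "delete":
            # Bases present in the reference but missing from the read.
            events.append(("deletion", r1, reference[r1:r2]))
        elif tag == "insert":
            # Bases present in the read but absent from the reference.
            events.append(("insertion", r1, read[q1:q2]))
        elif tag == "replace":
            # Reference bases replaced by different read bases.
            events.append(("substitution", r1, read[q1:q2]))
    return events


reference = "AAACCCGGGTTT"
print(list_indels(reference, "AAACCCTTT"))        # read lacking GGG
print(list_indels(reference, "AAACCCGGGATATTT"))  # read with ATA inserted
```

The same comparison could be repeated against more than one reference, for example the pre-editing sequence and a wildtype control sequence.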

In certain embodiments, the methods described herein comprise determining the nature of DNA variant(s) comprised within the DNA fragment (e.g., HMW DNA fragment). For example, in certain embodiments, a DNA variant(s) is determined to be an indel (e.g., insertion or deletion). In certain embodiments, a DNA variant(s) is determined to be a point mutation. In certain embodiments, a DNA variant(s) is determined to lead to a missense substitution that results in replacement of one amino acid with another. In certain embodiments, a DNA variant(s) is determined to lead to a nonsense substitution that results in a premature stop codon and a shortened protein. In certain embodiments, a DNA variant(s) is determined to lead to a frameshift. In certain embodiments, a DNA variant(s) is determined to be part of a rearrangement or translocation event (e.g., as a result of chromothripsis). In certain embodiments, a DNA variant(s) is determined to be part of a duplication or inversion event. A duplication occurs when a stretch of one or more nucleotides in a gene is copied and repeated (e.g., next to the original DNA sequence). An inversion changes more than one nucleotide in a gene by replacing the original sequence with the same sequence in reverse order. On the other hand, some regions of DNA contain short sequences of nucleotides (e.g., trinucleotide or tetranucleotide) that are repeated a number of times in a row. In certain embodiments, a DNA variant(s) is determined to be part of a repeat expansion event that increases the number of times that a short DNA sequence (e.g., trinucleotide or tetranucleotide) is repeated. Thus, depending on the target or purpose of a genome-editing protocol, once the nature of the DNA variant(s) comprised within the DNA fragment (e.g., HMW DNA fragment) is determined, it is evident to people skilled in the art whether a population of cells subjected to the same genome editing protocol as the sample cells tested in the methods described herein should be reserved only for future study or discarded, or is suitable for subsequent application when the HMW DNA is generally free of detrimental DNA variant(s).
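By way of a non-limiting illustration, some of the variant categories above (insertion, deletion, frameshift, missense, nonsense) can be distinguished with simple rules. The Python sketch below assumes the variant lies within a protein-coding sequence, that the codon reading frame is known, and that the standard genetic code applies; the function names `classify_length_change` and `classify_codon_change` are hypothetical and do not describe a particular tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_length_change(ref_allele, alt_allele):
    """Classify a variant by the change in length between two alleles."""
    diff = len(alt_allele) - len(ref_allele)
    if diff == 0:
        return "substitution"
    kind = "insertion" if diff > 0 else "deletion"
    # In coding sequence, a net length change that is not a multiple of
    # three shifts the reading frame.
    frame = "in-frame" if diff % 3 == 0 else "frameshift"
    return f"{kind} ({frame})"


def classify_codon_change(ref_codon, alt_codon):
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"


print(classify_length_change("ACGT", "A"))   # 3 bp deleted
print(classify_length_change("A", "ACG"))    # 2 bp inserted
print(classify_codon_change("GAG", "GTG"))   # Glu to Val
print(classify_codon_change("CAG", "TAG"))   # Gln to stop
```

Larger events such as rearrangements, inversions, or repeat expansions are outside the scope of this sketch and would require alignment-based analysis.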

Accordingly, certain embodiments of the invention provide methods of treatment or a method of medical therapy for a disease. In certain embodiments, the methods described herein further comprise administering into an animal a population of cells, wherein the administered population of cells and the sample cell were previously edited in the same genome editing protocol (e.g., administered cells, and the sample cells used for quality control were edited in the same genome-editing batch/process).

In certain embodiments, the sample comprises a stem cell. In certain embodiments, the sample comprises a hematopoietic stem cell. In certain embodiments, the sample comprises an induced pluripotent stem cell (iPSC). In certain embodiments, the sample comprises a patient derived iPSC. In certain embodiments, the sample comprises a pluripotent cell. In certain embodiments, the sample comprises a progenitor cell. In certain embodiments, the sample comprises a blood cell. In certain embodiments, the sample comprises an immune cell. In certain embodiments, the sample comprises a T cell (e.g., CAR-T cell). In certain embodiments, the sample comprises a dendritic cell. In certain embodiments, the sample comprises a Natural Killer cell. In certain embodiments, the sample comprises a B cell. In certain embodiments, the sample comprises a cancer cell.

In certain embodiments, the disease is a hereditary disease. In certain embodiments, the disease is a blood disorder. In certain embodiments, the disease is sickle cell disease. In certain embodiments, the disease is thalassemia (e.g., beta thalassemia). In certain embodiments, the disease is cancer. In certain embodiments, the disease is an immune disorder. In certain embodiments, the disease is a neuronal disorder (e.g., fronto-temporal dementia). In certain embodiments, the disease is a muscular disorder (e.g., muscular dystrophy).

In certain embodiments, the edited cells may be of biomanufacturing use. In certain embodiments, the cell is a human embryonic kidney (HEK) 293 cell. In certain embodiments, the cell is a 293F cell. In certain embodiments, the cell is a 293T cell. In certain embodiments, the cell is a human embryonic retinal (PER.C6) cell. In certain embodiments, the cell is a HT-1080 cell. In certain embodiments, the cell is a Huh-7 cell. In certain embodiments, the cell is a Monkey kidney epithelial (Vero) cell. In certain embodiments, the cell is a Chinese Hamster Ovary (CHO) cell. In certain embodiments, the cell is a baby hamster kidney (BHK) cell. In certain embodiments, the sample comprises a hybridoma cell.

Sample preparation, Generation of HMW DNA fragment, and DNA Isolation Methods

In certain embodiments, methods described herein comprise electrophoresing.

In certain embodiments, a sample (e.g., cell) is introduced into a loading compartment of a device or a gel that is suitable for a size selection process (e.g., electrophoresis). In certain embodiments, the loading compartment comprises a solution. In certain embodiments, sample cells are pipetted into the loading compartment of the device or gel. In certain embodiments, the sample is lysed in situ within the loading compartment, releasing the edited genomic DNA from the sample cell into the loading compartment. However, in certain embodiments, sample cells are not lysed in situ within the loading compartment of the device or gel. It is understood by a person skilled in the art that in certain embodiments, such sample preparations are conducted in a suitable container (e.g., a tube) and then introduced into the loading compartment.

Alternatively, in certain embodiments, sample cells are encapsulated within a gel matrix. In certain embodiments, the sample is lysed in situ within the gel matrix, releasing the edited genomic DNA from sample cell into the gel matrix.

To optionally clarify the content in the loading compartment or gel matrix, in certain embodiments, the method further comprises one or more pretreatment steps to digest and/or elute lipid, protein, RNA, cellular metabolites, etc. For example, an initial electrophoresis step is conducted to elute smaller cellular content released from the sample cell, while ultra-large genomic DNA is unable to migrate through the gel under the electrophoretic field.

With or without the optional clarification step(s), in certain embodiments, the genomic DNA is contacted with one or more targeted nucleases (e.g., one targeted nuclease or a pair of targeted nucleases) as described herein and incubated for a period of time (e.g., about 15-45 minutes) to generate the DNA fragment (e.g., HMW DNA fragment) comprising the editing site and DNA variant(s).

In certain embodiments, the genomic DNA released from an edited sample (e.g., an edited cell) is contacted with one or more targeted nucleases (e.g., one targeted nuclease or a pair of targeted nucleases) in a liquid solution (e.g., in liquid phase within a container, or within loading compartment of device or gel).

In certain embodiments, the genomic DNA from an edited sample is contacted with one or more targeted nucleases (e.g., one targeted nuclease or a pair of targeted nucleases) in a gel matrix (e.g., agarose gel).

In certain embodiments, the genomic DNA from an edited sample is contacted with the pair of downstream and upstream targeted nucleases simultaneously or sequentially. In certain embodiments, the genomic DNA is contacted with one or the pair of targeted nuclease(s) for about 5 minutes to 6 hours, 10 minutes to 3 hours, 15 minutes to 2 hours, 20 minutes to 1.5 hours, 30 minutes to 1 hour, or 40 minutes to 50 minutes. In certain embodiments, the genomic DNA from an edited sample is contacted with one or the pair of targeted nuclease(s) for at least about 5 minutes, 10 minutes, 15 minutes, 20 minutes, 25 minutes, 30 minutes, 40 minutes, 50 minutes, 1 hour, 2 hours or 3 hours.

In certain embodiments, the targeted nucleases are optionally inactivated to stop further enzymatic activities.

It is apparent to people skilled in the art that there are a variety of ways to purify genomic DNA from a genome-edited sample cell and then contact the DNA with one or more targeted nucleases (e.g., one targeted nuclease or a pair of targeted nucleases). For example, in certain embodiments, genomic DNA is purified using any suitable technique and then contacted with one or more targeted nucleases for incubation within a tube or any suitable container. The resultant genomic DNA mixture, including the released DNA fragment (e.g., HMW DNA fragment), is then transferred into the loading compartment of a device or gel for separation (e.g., via electrophoresis).

In certain embodiments, the method comprises contacting the resultant genomic DNA mixture with a detergent (e.g., SDS). Without wanting to be bound by theory, this step may improve electrophoresis efficiency, separate certain DNA binding proteins from genomic DNA, and/or change the charge level of the genomic DNA or fragment.

In certain embodiments, the method comprises an isolating step that isolates the DNA fragment (e.g., HMW DNA fragment) from the genomic DNA mixture. Any DNA isolation technology that isolates, purifies, or separates DNA fragment including high molecular weight (HMW) DNA fragment may be used for methods described herein. In certain embodiments, the DNA isolation technology involves separating DNA molecules or fragments based on size.

In certain embodiments, the DNA isolating step comprises electrophoresing. In certain embodiments, the DNA isolating step comprises one dimensional electrophoresing.

In certain embodiments, the DNA isolating step comprises electrophoresing the DNA fragment, such as HMW DNA fragment (e.g., for a first period of time in a first direction). In certain embodiments, the DNA isolating step further comprises electrophoresing the DNA fragment, such as HMW DNA fragment, for a second period of time in a second direction. Accordingly, in certain embodiments, the DNA isolating step comprises two-dimensional electrophoresing.

In certain embodiments, the isolating step is conducted in a device suitable for one-dimensional or two-dimensional electrophoresis. As a non-limiting example, in certain embodiments, the isolating step may be conducted in a SageHLS™ device/protocol as disclosed in U.S. Patent Application 2020/0041449, which is incorporated by reference herein for all purposes. In certain embodiments using a SageHLS™ device, the HMW DNA fragment is electrophoresed for a first period of time in one direction for separation by size and then electrophoresed for a second period of time in another direction (e.g., a perpendicular direction) for elution from the gel and then isolated into a collection chamber.

However, it is also apparent to people skilled in the art, that there are a variety of ways to conduct the isolating step in any electrophoresis gel/device or protocol that is suitable for isolating HMW DNA fragment by size. In certain embodiments, a DNA ladder is used to help locate the HMW DNA fragment. At the end of the isolating step, in certain embodiments, the HMW DNA fragment is retrieved by cutting the gel cube containing the HMW DNA fragment, followed by dissolving the gel to release the HMW DNA fragment. In certain embodiments, the HMW DNA fragment is eluted from the gel for collection.

Deep Sequencing Methods

In certain embodiments, isolated DNA fragments (e.g., HMW DNA fragments) are sequenced to provide a sequence result readout. Any DNA sequencing technology that can provide a sequence result over a high molecular weight (HMW) DNA fragment may be used for methods described herein.

Seminal sequencing technology such as Sanger sequencing has a read-length limit of about 500bp to 800bp. Modern sequencing technologies have since overcome the read-length limit of early generation technology and continue to evolve/improve. Distinct from Sanger sequencing, Next Generation Sequencing (NGS) methods are suitable for methods described herein. People skilled in the art are familiar with a variety of modern sequencing technologies and platforms capable of sequencing HMW DNA fragments. Specific terms described herein are merely non-limiting exemplary sequencing technologies or terms suitable for methods described herein.

In certain embodiments, the sequencing method is a high-throughput sequencing method, for example, a massively parallel signature sequencing (MPSS) method. In certain embodiments, the sequencing method is a deep sequencing method. In certain embodiments, the sequencing method is a shotgun sequencing method. In certain embodiments, the sequencing method is a short-read sequencing method. In certain embodiments, the sequencing method is a pyrosequencing method.

In certain embodiments, the sequencing method is a long-read sequencing method. In certain embodiments, the sequencing method is a Nanopore DNA sequencing method. In certain embodiments, the sequencing method is a single molecule real time (SMRT) sequencing method.

Certain Definitions

The term "nucleic acid" refers to deoxyribonucleotides or ribonucleotides and polymers thereof in either single- or double-stranded form, composed of monomers (nucleotides) containing a sugar, phosphate and a base which is either a purine or pyrimidine. Unless specifically limited, the term encompasses nucleic acids containing known analogs of natural nucleotides that have similar binding properties as the reference nucleic acid and are metabolized in a manner similar to naturally occurring nucleotides. Unless otherwise indicated, a particular nucleic acid sequence also implicitly encompasses conservatively modified position thereof (e.g., degenerate codon substitutions) and complementary sequences as well as the sequence explicitly indicated. Specifically, degenerate codon substitutions may be achieved by generating sequences in which the third position of one or more selected (or all) codons is substituted with mixed-base and/or deoxyinosine residues (Batzer et al., Nucl. Acids Res., 19:508 (1991); Ohtsuka et al., JBC, 260:2605 (1985); Rossolini et al., Mol. Cell. Probes, 8:91 (1994)). A "nucleic acid fragment" is a fraction of a given nucleic acid molecule. Deoxyribonucleic acid (DNA) in the majority of organisms is the genetic material while ribonucleic acid (RNA) is involved in the transfer of information contained within DNA into proteins. The term "nucleotide sequence" refers to a polymer of DNA or RNA that can be single- or double-stranded, optionally containing synthetic, non-natural or altered nucleotide bases capable of incorporation into DNA or RNA polymers. The terms "nucleic acid," "nucleic acid molecule," "nucleic acid fragment," "nucleic acid sequence or segment," or "polynucleotide" may also be used interchangeably with gene, cDNA, DNA and RNA encoded by a gene.

By “portion” or “fragment,” as it relates to a nucleic acid molecule, sequence or segment of the invention, when it is linked to other sequences for expression, is meant a sequence having at least 80 nucleotides, more specifically at least 150 nucleotides, and still more specifically at least 400 nucleotides. If not employed for expressing, a “portion” or “fragment” means at least 9, specifically 12, more specifically 15, even more specifically at least 20, consecutive nucleotides, e.g., probes and primers (oligonucleotides), corresponding to the nucleotide sequence of the nucleic acid molecules of the invention.

The invention encompasses isolated or substantially purified nucleic acid or protein compositions. In the context of the present invention, an "isolated" or "purified" DNA molecule or an "isolated" or "purified" polypeptide is a DNA molecule or polypeptide that exists apart from its native environment and is therefore not a product of nature. An isolated DNA molecule or polypeptide may exist in a purified form or may exist in a non-native environment such as, for example, outside a host cell. For example, an "isolated" or "purified" nucleic acid molecule or protein, or biologically active portion thereof, is substantially free of other cellular material, or culture medium when produced by recombinant techniques, or substantially free of chemical precursors or other chemicals when chemically synthesized. In one embodiment, an "isolated" nucleic acid is free of sequences that naturally flank the nucleic acid (i.e., sequences located at the 5' and 3' ends of the nucleic acid) in the genomic DNA of the organism from which the nucleic acid is derived. For example, in various embodiments, the isolated nucleic acid molecule can contain less than about 5 kb, 4 kb, 3 kb, 2 kb, 1 kb, 0.5 kb, or 0.1 kb of nucleotide sequences that naturally flank the nucleic acid molecule in genomic DNA of the cell from which the nucleic acid is derived. A protein that is substantially free of cellular material includes preparations of protein or polypeptide having less than about 30%, 20%, 10%, or 5% (by dry weight) of contaminating protein. When the protein of the invention, or biologically active portion thereof, is recombinantly produced, culture medium may represent less than about 30%, 20%, 10%, or 5% (by dry weight) of chemical precursors or non-protein-of-interest chemicals. Fragments and variants of the disclosed nucleotide sequences and proteins or partial-length proteins encoded thereby are also encompassed by the present invention. By "fragment" or "portion" is meant a full length or less than full length of the nucleotide sequence encoding, or the amino acid sequence of, a polypeptide or protein.

"Naturally occurring" is used to describe an object that can be found in nature as distinct from being artificially produced. For example, a protein or nucleotide sequence present in an organism (including a virus), which can be isolated from a source in nature and which has not been intentionally modified by man in the laboratory, is naturally occurring.

“Recombinant DNA molecule” is a combination of DNA sequences that are joined together using recombinant DNA technology and procedures used to join together DNA sequences as described, for example, in Sambrook and Russell, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press (3rd edition, 2001).

The terms "heterologous DNA sequence," "exogenous DNA segment" or "heterologous nucleic acid," each refer to a sequence that originates from a source foreign to the particular host cell or, if from the same source, is modified from its original form. Thus, a heterologous gene in a host cell includes a gene that is endogenous to the particular host cell but has been modified. The terms also include non-naturally occurring multiple copies of a naturally occurring DNA sequence. Thus, the terms refer to a DNA segment that is foreign or heterologous to the cell, or homologous to the cell but in a position within the host cell nucleic acid in which the element is not ordinarily found. Exogenous DNA segments are expressed to yield exogenous polypeptides.

A "homologous" DNA sequence is a DNA sequence that is naturally associated with a host cell into which it is introduced.

"Wild-type" refers to the normal gene, or organism found in nature without any known mutation.

“Genome” refers to the complete genetic material of an organism.

A “vector” is defined to include, inter alia, any plasmid, cosmid, phage or binary vector in double or single stranded linear or circular form which may or may not be self-transmissible or mobilizable, and which can transform a prokaryotic or eukaryotic host either by integration into the cellular genome or by existing extrachromosomally (e.g., autonomous replicating plasmid with an origin of replication).

"Regulatory sequences" and "suitable regulatory sequences" each refer to nucleotide sequences located upstream (5' non-coding sequences), within, or downstream (3' non-coding sequences) of a coding sequence, and which influence the transcription, RNA processing or stability, or translation of the associated coding sequence. Regulatory sequences include enhancers, promoters, translation leader sequences, introns, and polyadenylation signal sequences. They include natural and synthetic sequences as well as sequences that may be a combination of synthetic and natural sequences. As is noted above, the term "suitable regulatory sequences" is not limited to promoters. However, some suitable regulatory sequences useful in the present invention will include, but are not limited to constitutive promoters, tissue-specific promoters, development-specific promoters, inducible promoters and viral promoters.

"5' non-coding sequence" refers to a nucleotide sequence located 5' (upstream) to the coding sequence. It is present in the fully processed mRNA upstream of the initiation codon and may affect processing of the primary transcript to mRNA, mRNA stability or translation efficiency (Turner et al., Mol. Biotech., 3:225 (1995).

"3' non-coding sequence" refers to nucleotide sequences located 3' (downstream) to a coding sequence and include polyadenylation signal sequences and other sequences encoding regulatory signals capable of affecting mRNA processing or gene expression. The polyadenylation signal is usually characterized by affecting the addition of polyadenylic acid tracts to the 3 ' end of the mRNA precursor.

"Promoter" refers to a nucleotide sequence, usually upstream (5') to its coding sequence, which controls the expression of the coding sequence by providing the recognition for RNA polymerase and other factors required for proper transcription. "Promoter" includes a minimal promoter that is a short DNA sequence comprised of a TATA- box and other sequences that serve to specify the site of transcription initiation, to which regulatory elements are added for control of expression. "Promoter" also refers to a nucleotide sequence that includes a minimal promoter plus regulatory elements that is capable of controlling the expression of a coding sequence or functional RNA. This type of promoter sequence consists of proximal and more distal upstream elements, the latter elements often referred to as enhancers. Accordingly, an "enhancer" is a DNA sequence that can stimulate promoter activity and may be an innate element of the promoter or a heterologous element inserted to enhance the level or tissue specificity of a promoter. Promoters may be derived in their entirety from a native gene, or be composed of different elements derived from different promoters found in nature, or even be comprised of synthetic DNA segments. A promoter may also contain DNA sequences that are involved in the binding of protein factors that control the effectiveness of transcription initiation in response to physiological or developmental conditions.

The "initiation site" is the position surrounding the first nucleotide that is part of the transcribed sequence, which is also defined as position +1. With respect to this site all other sequences of the gene and its controlling regions are numbered. Downstream sequences (i.e. further protein encoding sequences in the 3' direction) are denominated positive, while upstream sequences (mostly of the controlling regions in the 5' direction) are denominated negative.

Promoter elements, particularly a TATA element, that are inactive or that have greatly reduced promoter activity in the absence of upstream activation are referred to as "minimal or core promoters." In the presence of a suitable transcription factor, the minimal promoter functions to permit transcription. A “minimal or core promoter” thus consists only of all basal elements needed for transcription initiation, e.g., a TATA box and/or an initiator.

The following terms are used to describe the sequence relationships between two or more sequences (e.g., nucleic acids, polynucleotides or polypeptides): (a) "reference sequence," (b) "comparison window," (c) "sequence identity," (d) "percentage of sequence identity," and (e) "substantial identity."

(a) As used herein, "reference sequence" is a defined sequence used as a basis for sequence comparison. A reference sequence may be a subset or the entirety of a specified sequence; for example, as a segment of a full length cDNA, gene sequence or peptide sequence, or the complete cDNA, gene sequence or peptide sequence.

(b) As used herein, "comparison window" makes reference to a contiguous and specified segment of a sequence, wherein the sequence in the comparison window may comprise additions or deletions (i.e., gaps) compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. Generally, the comparison window is at least 20 contiguous nucleotides in length, and optionally can be 30, 40, 50, 100, or longer. Those of skill in the art understand that to avoid a high similarity to a reference sequence due to inclusion of gaps in the sequence a gap penalty is typically introduced and is subtracted from the number of matches.
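As a simple illustration of how a gap penalty is subtracted from the number of matches within a comparison window, the Python sketch below scores two already-aligned sequences of equal length in which "-" marks a gap. The function name `window_score`, the use of "-" as the gap symbol, and the default penalty of 1 per gap position are assumptions made for illustration and are not values taken from any particular alignment program.

```python
def window_score(aligned_ref, aligned_query, gap_penalty=1.0):
    """Return (matches, gap_positions, score) for two aligned sequences.

    Both inputs must already be aligned to the same length, with "-"
    marking a gap. score = matches - gap_penalty * gap_positions.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    gap_positions = 0
    for ref_base, query_base in zip(aligned_ref, aligned_query):
        if ref_base == "-" or query_base == "-":
            gap_positions += 1
        elif ref_base == query_base:
            matches += 1
    return matches, gap_positions, matches - gap_penalty * gap_positions


# One gap on each side of the alignment, seven matching positions.
print(window_score("ACGT-ACGT", "ACGTTAC-T"))
```

Without the penalty, an alignment could be made to look arbitrarily similar by inserting gaps wherever residues disagree; subtracting a penalty per gap position discourages that.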

Methods of alignment of sequences for comparison are well known in the art. Thus, the determination of percent identity between any two sequences can be accomplished using a mathematical algorithm. Non-limiting examples of such mathematical algorithms are the algorithm of Myers and Miller, CABIOS, 4:11 (1988); the local homology algorithm of Smith et al., Adv. Appl. Math., 2:482 (1981); the homology alignment algorithm of Needleman and Wunsch, JMB, 48:443 (1970); the search-for-similarity-method of Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85:2444 (1988); the algorithm of Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 87:2264 (1990), modified as in Karlin and Altschul, Proc. Natl. Acad. Sci. USA, 90:5873 (1993).

Computer implementations of these mathematical algorithms can be utilized for comparison of sequences to determine sequence identity. Such implementations include, but are not limited to, CLUSTAL in the PC/Gene program (available from Intelligenetics, Mountain View, California); the ALIGN program (Version 2.0) and GAP, BESTFIT, BLAST, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Version 8 (available from Genetics Computer Group (GCG), 575 Science Drive, Madison, Wisconsin, USA). Alignments using these programs can be performed using the default parameters. The CLUSTAL program is well described by Higgins et al., Gene, 73:237 (1988); Higgins et al., CABIOS, 5:151 (1989); Corpet et al., Nucl. Acids Res., 16:10881 (1988); Huang et al., CABIOS, 8:155 (1992); and Pearson et al., Meth. Mol. Biol., 24:307 (1994). The ALIGN program is based on the algorithm of Myers and Miller, supra. The BLAST programs of Altschul et al., JMB, 215:403 (1990); Nucl. Acids Res., 25:3389 (1997), are based on the algorithm of Karlin and Altschul supra.

Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (available on the world wide web at ncbi.nlm.nih.gov/). This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold. These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always > 0) and N (penalty score for mismatching residues; always < 0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction are halted when the cumulative alignment score falls off by the quantity X from its maximum achieved value, the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments, or the end of either sequence is reached.
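The extension step described above can be illustrated with a minimal Python sketch. It is not the NCBI implementation: it extends an exact-match seed in one direction only, without gaps, and the function name, argument names, and the drop-off value X = 20 are assumptions chosen for illustration. The match reward and mismatch penalty default to M=5 and N=-4.

```python
def extend_right(query, subject, q_start, s_start, seed_len,
                 match=5, mismatch=-4, x_drop=20):
    """Ungapped rightward extension of a word hit with an X drop-off rule.

    Returns (best_score, best_length), where best_length counts aligned
    positions from the start of the seed. Illustrative sketch only.
    """
    score = seed_len * match          # assumes the seed is an exact match
    best_score, best_len = score, seed_len
    i = seed_len
    while q_start + i < len(query) and s_start + i < len(subject):
        same = query[q_start + i] == subject[s_start + i]
        score += match if same else mismatch
        i += 1
        if score > best_score:
            best_score, best_len = score, i
        # Halt when the score reaches zero or falls X below its maximum.
        if score <= 0 or best_score - score >= x_drop:
            break
    return best_score, best_len


# A single mismatch is tolerated because later matches recover the score.
print(extend_right("ACGTAACGT", "ACGTCACGT", 0, 0, 4))
```

A full search would run the same loop leftward from the seed and keep the best-scoring segment pair.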

In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences. One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a test nucleic acid sequence is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid sequence to the reference nucleic acid sequence is less than about 0.1, more specifically less than about 0.01, and most specifically less than about 0.001.

To obtain gapped alignments for comparison purposes, Gapped BLAST (in BLAST 2.0) can be utilized as described in Altschul et al., Nucleic Acids Res. 25:3389 (1997). Alternatively, PSI-BLAST (in BLAST 2.0) can be used to perform an iterated search that detects distant relationships between molecules. See Altschul et al., supra. When utilizing BLAST, Gapped BLAST, or PSI-BLAST, the default parameters of the respective programs (e.g., BLASTN for nucleotide sequences, BLASTX for proteins) can be used. The BLASTN program (for nucleotide sequences) uses as defaults a wordlength (W) of 11, an expectation (E) of 10, a cutoff of 100, M=5, N=-4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a wordlength (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix. See the world wide web at ncbi.nlm.nih.gov. Alignment may also be performed manually by visual inspection.

For purposes of the present invention, comparison of sequences for determination of percent sequence identity to another sequence may be made using the BlastN program (version 1.4.7 or later) with its default parameters or any equivalent program. By "equivalent program" is intended any sequence comparison program that, for any two sequences in question, generates an alignment having identical nucleotide or amino acid residue matches and an identical percent sequence identity when compared to the corresponding alignment generated by the preferred program.

(c) As used herein, "sequence identity" or "identity" in the context of two nucleic acid or polypeptide sequences makes reference to a specified percentage of residues in the two sequences that are the same when aligned for maximum correspondence over a specified comparison window, as measured by sequence comparison algorithms or by visual inspection. When percentage of sequence identity is used in reference to proteins it is recognized that residue positions which are not identical often differ by conservative amino acid substitutions, where amino acid residues are substituted for other amino acid residues with similar chemical properties (e.g., charge or hydrophobicity) and therefore do not change the functional properties of the molecule. When sequences differ in conservative substitutions, the percent sequence identity may be adjusted upwards to correct for the conservative nature of the substitution. Sequences that differ by such conservative substitutions are said to have "sequence similarity" or "similarity." Means for making this adjustment are well known to those of skill in the art. Typically this involves scoring a conservative substitution as a partial rather than a full mismatch, thereby increasing the percentage sequence identity. Thus, for example, where an identical amino acid is given a score of 1 and a non-conservative substitution is given a score of zero, a conservative substitution is given a score between zero and 1. The scoring of conservative substitutions is calculated, e.g., as implemented in the program PC/GENE (Intelligenetics, Mountain View, California).

(d) As used herein, "percentage of sequence identity" means the value determined by comparing two optimally aligned sequences over a comparison window, wherein the portion of the sequence in the comparison window may comprise additions or deletions (i.e., gaps) as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. The percentage is calculated by determining the number of positions at which the identical nucleic acid base or amino acid residue occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison, and multiplying the result by 100 to yield the percentage of sequence identity.
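The calculation in (d) can be written out as a short Python sketch. It assumes the two sequences have already been optimally aligned and padded with a gap character so that they have equal length; the function name and the "-" gap symbol are illustrative choices, not part of any particular alignment program.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of sequence identity over a comparison window.

    Matched positions are divided by the total number of positions in
    the window (gap positions included) and multiplied by 100.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b)
        if a == b and a != gap        # a gap never counts as a match
    )
    return 100.0 * matches / len(aligned_a)


# Four matching positions in a six-position window.
print(round(percent_identity("ACGT-A", "ACCTTA"), 1))
```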

(e)(i) The term "substantial identity" of sequences means that a polynucleotide comprises a sequence that has at least 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, or 79%, at least 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, or 89%, at least 90%, 91%, 92%, 93%, or 94%, or at least 95%, 96%, 97%, 98%, or 99% sequence identity, compared to a reference sequence using one of the alignment programs described using standard parameters. One of skill in the art will recognize that these values can be appropriately adjusted to determine corresponding identity of proteins encoded by two nucleotide sequences by taking into account codon degeneracy, amino acid similarity, reading frame positioning, and the like. Substantial identity of amino acid sequences for these purposes normally means sequence identity of at least 70%, at least 80%, at least 90%, or at least 95%.

Another indication that nucleotide sequences are substantially identical is if two molecules hybridize to each other under stringent conditions (see below). Generally, stringent conditions are selected to be about 5°C lower than the thermal melting point (T_m) for the specific sequence at a defined ionic strength and pH. However, stringent conditions encompass temperatures in the range of about 1°C to about 20°C lower than the T_m, depending upon the desired degree of stringency as otherwise qualified herein. Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the polypeptides they encode are substantially identical. This may occur, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code. One indication that two nucleic acid sequences are substantially identical is when the polypeptide encoded by the first nucleic acid is immunologically cross reactive with the polypeptide encoded by the second nucleic acid.

(e)(ii) The term "substantial identity" in the context of a peptide indicates that a peptide comprises a sequence with at least 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, or 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, or 89%, at least 90%, 91%, 92%, 93%, or 94%, or 95%, 96%, 97%, 98% or 99%, sequence identity to the reference sequence over a specified comparison window. Optimal alignment is conducted using the homology alignment algorithm of Needleman and Wunsch, J. Mol. Biol. 48:443 (1970). An indication that two peptide sequences are substantially identical is that one peptide is immunologically reactive with antibodies raised against the second peptide. Thus, a peptide is substantially identical to a second peptide, for example, where the two peptides differ only by a conservative substitution.

For sequence comparison, typically one sequence acts as a reference sequence to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input into a computer, subsequence coordinates are designated if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters.

As noted above, another indication that two nucleic acid sequences are substantially identical is that the two molecules hybridize to each other under stringent conditions. The phrase "hybridizing specifically to" refers to the binding, duplexing, or hybridizing of a molecule only to a particular nucleotide sequence under stringent conditions when that sequence is present in a complex mixture (e.g., total cellular DNA or RNA). “Bind(s) substantially” refers to complementary hybridization between a probe nucleic acid and a target nucleic acid and embraces minor mismatches that can be accommodated by reducing the stringency of the hybridization media to achieve the desired detection of the target nucleic acid sequence.

"Stringent hybridization conditions" and "stringent hybridization wash conditions" in the context of nucleic acid hybridization experiments such as Southern and Northern hybridizations are sequence dependent, and are different under different environmental parameters. Longer sequences hybridize specifically at higher temperatures. The thermal melting point (T_m) is the temperature (under defined ionic strength and pH) at which 50% of the target sequence hybridizes to a perfectly matched probe. Specificity is typically the function of post-hybridization washes, the critical factors being the ionic strength and temperature of the final wash solution. For DNA-DNA hybrids, the T_m can be approximated from the equation of Meinkoth and Wahl, Anal. Biochem., 138:267 (1984): T_m = 81.5°C + 16.6 (log M) + 0.41 (%GC) - 0.61 (% form) - 500/L; where M is the molarity of monovalent cations, %GC is the percentage of guanosine and cytosine nucleotides in the DNA, % form is the percentage of formamide in the hybridization solution, and L is the length of the hybrid in base pairs. T_m is reduced by about 1°C for each 1% of mismatching; thus, T_m, hybridization, and/or wash conditions can be adjusted to hybridize to sequences of the desired identity. For example, if sequences with >90% identity are sought, the T_m can be decreased 10°C. Generally, stringent conditions are selected to be about 5°C lower than the T_m for the specific sequence and its complement at a defined ionic strength and pH. However, severely stringent conditions can utilize a hybridization and/or wash at 1, 2, 3, or 4°C lower than the T_m; moderately stringent conditions can utilize a hybridization and/or wash at 6, 7, 8, 9, or 10°C lower than the T_m; low stringency conditions can utilize a hybridization and/or wash at 11, 12, 13, 14, 15, or 20°C lower than the T_m.
Using the equation, hybridization and wash compositions, and desired temperature, those of ordinary skill will understand that variations in the stringency of hybridization and/or wash solutions are inherently described. If the desired degree of mismatching results in a temperature of less than 45°C (aqueous solution) or 32°C (formamide solution), it is preferred to increase the SSC concentration so that a higher temperature can be used. An extensive guide to the hybridization of nucleic acids is found in Tijssen, Laboratory Techniques in Biochemistry and Molecular Biology Hybridization with Nucleic Acid Probes, part I chapter 2 "Overview of principles of hybridization and the strategy of nucleic acid probe assays" Elsevier, New York (1993). Generally, highly stringent hybridization and wash conditions are selected to be about 5°C lower than the T_m for the specific sequence at a defined ionic strength and pH.
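As a worked illustration of the Meinkoth and Wahl approximation and the 1°C-per-1%-mismatch adjustment, the following Python sketch may be helpful. The function names are hypothetical, "log" is taken to be the base-10 logarithm, and the input values in the usage line are arbitrary examples rather than conditions used in the Example below.

```python
import math


def melting_temp_c(monovalent_molar, gc_percent, formamide_percent, length_bp):
    """Approximate T_m (deg C) of a DNA-DNA hybrid (Meinkoth and Wahl)."""
    return (81.5
            + 16.6 * math.log10(monovalent_molar)   # log taken as base 10
            + 0.41 * gc_percent
            - 0.61 * formamide_percent
            - 500.0 / length_bp)


def tm_for_identity(tm_c, min_identity_percent):
    """Lower T_m by about 1 deg C for each 1% of tolerated mismatch."""
    return tm_c - (100.0 - min_identity_percent)


# 1 M monovalent cation, 50% GC, 50% formamide, 500 bp hybrid.
tm = melting_temp_c(1.0, 50.0, 50.0, 500)
print(tm, tm_for_identity(tm, 90.0))
```

With these inputs the approximation gives 70.5°C for a perfect match and 60.5°C when sequences with 90% identity are sought.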

An example of highly stringent wash conditions is 0.15 M NaCl at 72°C for about 15 minutes. An example of stringent wash conditions is a 0.2X SSC wash at 65°C for 15 minutes (see, Sambrook, infra, for a description of SSC buffer). Often, a high stringency wash is preceded by a low stringency wash to remove background probe signal. An example medium stringency wash for a duplex of, e.g., more than 100 nucleotides, is 1X SSC at 45°C for 15 minutes. An example low stringency wash for a duplex of, e.g., more than 100 nucleotides, is 4-6X SSC at 40°C for 15 minutes. For short probes (e.g., about 10 to 50 nucleotides), stringent conditions typically involve salt concentrations of less than about 1.5 M, more specifically about 0.01 to 1.0 M, Na ion concentration (or other salts) at pH 7.0 to 8.3, and the temperature is typically at least about 30°C for short probes and at least about 60°C for long probes (e.g., >50 nucleotides). Stringent conditions may also be achieved with the addition of destabilizing agents such as formamide. In general, a signal to noise ratio of 2X (or higher) than that observed for an unrelated probe in the particular hybridization assay indicates detection of a specific hybridization. Nucleic acids that do not hybridize to each other under stringent conditions are still substantially identical if the proteins that they encode are substantially identical. This occurs, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code.

Very stringent conditions are selected to be equal to the T_m for a particular probe. An example of stringent conditions for hybridization of complementary nucleic acids which have more than 100 complementary residues on a filter in a Southern or Northern blot is 50% formamide, e.g., hybridization in 50% formamide, 1 M NaCl, 1% SDS at 37°C, and a wash in 0.1X SSC at 60 to 65°C. Exemplary low stringency conditions include hybridization with a buffer solution of 30 to 35% formamide, 1 M NaCl, 1% SDS (sodium dodecyl sulphate) at 37°C, and a wash in 1X to 2X SSC (20X SSC = 3.0 M NaCl/0.3 M trisodium citrate) at 50 to 55°C. Exemplary moderate stringency conditions include hybridization in 40 to 45% formamide, 1.0 M NaCl, 1% SDS at 37°C, and a wash in 0.5X to 1X SSC at 55 to 60°C.

The invention will now be illustrated by the following non-limiting Examples.

EXAMPLE 1

Full Spectrum Characterization of Indels Induced by a Genome Editing Protocol

A genome editing protocol was implemented in this Example for correction of the sickle mutation in human hematopoietic stem cells, using a Cas9 ribonucleoprotein targeting cleavage of the β-globin gene (HBB) near the mutation site, and an oligonucleotide serving as a donor template for homology-directed repair (HDR) of cleavage and correction of the disease-causing mutation. This protocol drives correction of more than about 20% of HBB alleles in human hematopoietic stem cells. However, the remainder of edited alleles in the cell population are repaired by NHEJ. Indels induced by the genome editing protocol may produce null alleles equivalent to β-thalassemia mutations, which is a major safety concern. The invention as described herein was partly designed to detect and quantify longer indels, up to tens of kb in size, that may be produced by repair of Cas9-induced DSBs.

In this Example, to capture a large molecular weight fragment of DNA encompassing the HBB editing site with the very large regions surrounding it, the SageHLS™ CATCH method was used with SageHLS™ apparatus and protocol for the extraction of high molecular weight DNA. Subsequently, the fragment was deep sequenced to detect large-scale indels.

Human hematopoietic stem cells that had been edited with a CRISPR/Cas9 based protocol were loaded onto the SageHLS™ chip and then were further treated with a pair of targeted nucleases (Cas9/guide RNA complexes) that cleave chromosome 11 approximately 100kb upstream and downstream of the editing site. Cleavage liberated a fragment of approximately 200kb spanning the editing site for the comprehensive detection of a full spectrum of indels, including large deletions (up to 200kb) that may be present in the edited region. The liberated region was separated by gel electrophoresis and eluted from the gel using the SageHLS™ apparatus.

The recovered DNA was fragmented and cloned into an Illumina® sequencing library using standard procedures and sequenced in one lane of an Illumina® Novaseq® apparatus to produce paired end 150 bp reads at a depth of 955,551,306 read pairs.

Reads were processed with a bioinformatic pipeline that identifies structural variations (specifically large deletions) by identifying breakpoints that bracket the cut site in the upstream and downstream regions. Sequence reads (150bp PE reads) were aligned to the human reference genome (hg38) using BWA. Reads were then labeled and extracted if they were split reads (a single read whose ends map to non-adjacent regions) or reads that were part of a pair that maps discordantly (farther from each other in the genome than would be expected for the average insert size). Breakpoints with ends that map to distant regions identify the boundaries of large deletions. The reads were then analyzed using structural variation callers and events that span the targeted cut site locus were extracted as putative large deletion events.
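The read-selection logic of this pipeline can be sketched in plain Python. This is an illustration only, not the pipeline used in this Example: the ReadPair record, its field names, the 1,000 bp thresholds, and all coordinates are hypothetical, and a real implementation would read these values from the BWA alignment output rather than from hand-built records.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReadPair:
    """Hypothetical, simplified view of one aligned read pair."""
    name: str
    chrom: str
    start: int                                    # position of read 1
    mate_chrom: str
    mate_start: int                               # position of read 2
    split_to: Optional[Tuple[str, int]] = None    # second segment, if split


def is_discordant(pair: ReadPair, max_insert: int) -> bool:
    """Mates map farther apart than the expected insert size allows."""
    return (pair.chrom != pair.mate_chrom
            or abs(pair.mate_start - pair.start) > max_insert)


def is_split(pair: ReadPair, min_gap: int) -> bool:
    """A single read whose ends map to non-adjacent regions."""
    if pair.split_to is None:
        return False
    chrom, pos = pair.split_to
    return chrom != pair.chrom or abs(pos - pair.start) > min_gap


def candidate_reads(pairs: List[ReadPair], max_insert: int = 1000,
                    min_gap: int = 1000) -> List[ReadPair]:
    """Keep split reads and discordant pairs as breakpoint evidence."""
    return [p for p in pairs
            if is_split(p, min_gap) or is_discordant(p, max_insert)]


def spans_site(left_bp: int, right_bp: int, site: int) -> bool:
    """True when a deletion's breakpoints bracket the given site."""
    return left_bp < site < right_bp


pairs = [
    ReadPair("normal", "chr11", 10_000, "chr11", 10_350),
    ReadPair("discordant", "chr11", 10_000, "chr11", 16_000),
    ReadPair("split", "chr11", 10_000, "chr11", 10_300, ("chr11", 15_200)),
]
print([p.name for p in candidate_reads(pairs)])
```

In this toy input only the second and third records are retained, since the first pair maps within the assumed insert size and has no split segment.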

This analysis revealed the presence of a 5,143 bp deletion that was located approximately 66kb upstream of the β-globin gene and not spanning the cut site. Analysis of read depth indicated that the deletion is heterozygous. To validate the presence of the deletion at the detected breakpoints, PCR primers were designed to amplify the genomic interval spanning the deletion breakpoint and amplify both the deleted and the normal alleles. Sequencing of the amplified alleles confirmed the presence of the deletion in unedited cells and validated the capability of the invention described herein to detect large deletions.
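One simple way to reason about zygosity from read depth is to compare the mean depth inside the deleted interval with the mean depth of flanking sequence. The Python sketch below illustrates that idea under stated assumptions: it is not the analysis performed in this Example, the function names and the 0.2 tolerance are hypothetical, and the depth values are invented for illustration. In a mixed cell population the ratio reflects the fraction of alleles carrying the deletion rather than a single-cell genotype.

```python
from statistics import mean


def depth_ratio(inside_depths, flank_depths):
    """Mean depth inside the deletion divided by mean flanking depth."""
    flank_mean = mean(flank_depths)
    if flank_mean == 0:
        raise ValueError("flanking depth is zero")
    return mean(inside_depths) / flank_mean


def classify_deletion(ratio, tol=0.2):
    """Map a depth ratio to a coarse call (thresholds are assumptions)."""
    if ratio <= tol:
        return "homozygous deletion"
    if abs(ratio - 0.5) <= tol:
        return "heterozygous deletion"
    if ratio >= 1 - tol:
        return "no deletion"
    return "ambiguous"


ratio = depth_ratio([50, 52, 48], [100, 100, 100])
print(ratio, classify_deletion(ratio))
```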

These data show that the invention described herein characterizes the full spectrum of insertions/deletions produced by editing with a gene editing protocol. The invention described herein detects short or long insertions/deletions at the cut site and in the full genomic interval containing the editing site targeted by CRISPR/Cas9.

Although the foregoing specification and examples fully disclose and enable the present invention, they are not intended to limit the scope of the invention, which is defined by the claims appended hereto.

All publications, patents and patent applications are incorporated herein by reference. While in the foregoing specification this invention has been described in relation to certain embodiments thereof, and many details have been set forth for purposes of illustration, it will be apparent to those skilled in the art that the invention is susceptible to additional embodiments and that certain of the details described herein may be varied considerably without departing from the basic principles of the invention.

The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms “comprising,” “having,” “including,” and “containing” are to be construed as open-ended terms (i.e., meaning “including, but not limited to”) unless otherwise noted. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.

Embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. Variations of those embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the invention to be practiced otherwise than as specifically described herein. Accordingly, this invention includes all modifications and equivalents of the subject matter recited in the claims appended hereto as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the invention unless otherwise indicated herein or otherwise clearly contradicted by context.

With respect to ranges of values, the invention encompasses each intervening value between the upper and lower limits of the range to at least a tenth of the lower limit's unit, unless the context clearly indicates otherwise. Further, the invention encompasses any other stated intervening values. Moreover, the invention also encompasses ranges excluding either or both of the upper and lower limits of the range, unless specifically excluded from the stated range.

Further, all numbers expressing quantities of ingredients, reaction conditions, % purity, polypeptide and polynucleotide lengths, and so forth, used in the specification and claims, are modified by the term "about," unless otherwise indicated. Accordingly, the numerical parameters set forth in the specification and claims are approximations that may vary depending upon the desired properties of the present invention. At the very least, and not as an attempt to limit the application of the doctrine of equivalents to the scope of the claims, each numerical parameter should at least be construed in light of the number of reported significant digits, applying ordinary rounding techniques. Nonetheless, the numerical values set forth in the specific examples are reported as precisely as possible. Any numerical value, however, inherently contains certain errors from the standard deviation of its experimental measurement. Unless defined otherwise, the meanings of all technical and scientific terms used herein are those commonly understood by one of skill in the art to which this invention belongs. One of skill in the art will also appreciate that any methods and materials similar or equivalent to those described herein can also be used to practice or test the invention. Further, all publications mentioned herein are incorporated by reference in their entireties.

Claims

WHAT IS CLAIMED IS:

1. A method of identifying in a sample a DNA variant induced by a genome editing protocol, comprising: contacting a genomic DNA of the genome edited sample with one or more targeted nucleases that is capable of excising a DNA fragment from the genomic DNA; isolating the DNA fragment; and sequencing the isolated DNA fragment; wherein the genomic DNA comprises an editing site targeted by the genome editing protocol, and the DNA fragment comprises the editing site and the DNA variant.

2. The method of claim 1, wherein the DNA fragment is a high molecular weight (HMW) DNA fragment that is at least 10kb in length.

3. The method of any one of claims 1-2, wherein the DNA variant is an indel.

4. The method of any one of claims 1-3, wherein the DNA fragment is at least 100kb in length.

5. The method of any one of claims 1-4, wherein the DNA fragment is at least 200kb in length.

6. The method of any one of claims 1-5, wherein the one or more targeted nucleases comprises a single targeted nuclease, the genomic DNA is a linear DNA comprising two ends, and the DNA fragment further comprises one end of the genomic DNA.

7. The method of claim 6, wherein the targeted nuclease cuts the genomic DNA at least about 10kb away from the end of the genomic DNA.

8. The method of claim 6, wherein the targeted nuclease cuts the genomic DNA at least about 100kb away from the end of the genomic DNA.

9. The method of claim 6, wherein the targeted nuclease cuts the genomic DNA at least about 200kb away from the end of the genomic DNA.

10. The method of any one of claims 1-5, wherein the one or more targeted nucleases comprise a pair of targeted nucleases.

11. The method of claim 10, wherein one targeted nuclease of the pair cuts the genomic DNA at least 5kb downstream of the editing site, and/or the other targeted nuclease of the pair cuts the genomic DNA at least 5kb upstream of the editing site.

12. The method of claim 10, wherein one targeted nuclease of the pair cuts the genomic DNA at least 50kb downstream of the editing site, and/or the other targeted nuclease of the pair cuts the genomic DNA at least 50kb upstream of the editing site.

13. The method of claim 10, wherein one targeted nuclease of the pair cuts the genomic DNA at least 100kb downstream of the editing site, and/or the other targeted nuclease of the pair cuts the genomic DNA at least 100kb upstream of the editing site.

14. The method of any one of claims 1-13, wherein the DNA variant is located at least 500bp away from the editing site.

15. The method of any one of claims 1-14, wherein the DNA variant is located at least 1kb away from the editing site.

16. The method of any one of claims 1-15, wherein the DNA variant is located at least 5kb away from the editing site.

17. The method of any one of claims 1-16, wherein the method is capable of detecting a DNA variant located at least 10kb away from the editing site.

18. The method of any one of claims 1-17, wherein the DNA variant is an indel that is at least 100bp in length.

19. The method of any one of claims 1-18, wherein the DNA variant is an indel that is at least 2kb in length.

20. The method of any one of claims 1-19, wherein the method is capable of detecting a DNA variant that is at least 5kb in length.

21. The method of any one of claims 1-20, wherein the one or more targeted nucleases comprises a CRISPR-Cas nuclease, a transcription activator-like effector nuclease (TALEN), a zinc-finger nuclease (ZFN), or a meganuclease.

22. The method of any one of claims 1-21, wherein the one or more targeted nucleases comprise a pair of CRISPR-Cas9 nucleases.

23. The method of any one of claims 1-22, wherein contacting comprises contacting the genomic DNA with the one or more targeted nucleases in a solution.

24. The method of any one of claims 1-23, wherein contacting comprises contacting the genomic DNA with the one or more targeted nucleases in a gel matrix.

25. The method of any one of claims 1-24, wherein isolating comprises electrophoresing.

26. The method of any one of claims 1-25, wherein isolating comprises two-dimensional electrophoresing.

27. The method of any one of claims 1-26, wherein sequencing comprises deep sequencing the DNA fragment.

28. The method of any one of claims 1-26, wherein sequencing comprises long-read sequencing the DNA fragment.

29. The method of any one of claims 1-28, wherein the sample comprises a bacterial cell, yeast cell, plant cell, or mammalian cell.

30. The method of any one of claims 1-29, wherein the sample comprises a mammalian cell.

31. The method of any one of claims 1-30, wherein the sample comprises a stem cell.

32. The method of any one of claims 1-31, wherein the sample comprises a hematopoietic stem cell.

33. The method of any one of claims 1-32, wherein the sample comprises an immune cell.

34. The method of any one of claims 29-33, further comprises lysing the sample cell to release the genomic DNA.

35. The method of any one of claims 1-34, wherein the sample was edited with a genome editing protocol selected from the group consisting of CRISPR-Cas based protocol, TALEN based protocol, and ZFN based protocol.

36. The method of any one of claims 29-35, further comprises administering into an animal a population of cells, wherein the administered population of cells and the sample cell were previously edited in the same genome editing protocol.