WO2024123789A1 - Prediction of indel frequencies - Google Patents

Prediction of indel frequencies

Info

Publication number
WO2024123789A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleic acid
nuclease
sequence
sequencing reads
guide
Prior art date
Application number
PCT/US2023/082543
Other languages
English (en)
Inventor
Sourav Roy CHOUDHURY
Eugenia LYASHENKO
Youngji NA
Tess TORREGROSA
Original Assignee
Sanofi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sanofi
Publication of WO2024123789A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing

Definitions

  • This disclosure relates to quantification of insertions and/or deletions in the vicinity of cleavage sites within a polynucleotide target sequence.
  • Nucleic acid-guided nucleases can be used to edit polynucleotide sequences, for example a genome of an organism, at targeted locations with high precision. Nucleases are enzymes capable of cleaving the phosphodiester bonds between nucleotides of nucleic acids. Genome editing methods include use of CRISPR (clustered regularly interspaced short palindromic repeats)-associated proteins or similar nucleic acid-guided nucleases to induce DNA double-strand breaks (DSBs) at predictable genomic positions relative to the user-designated target sequence. DNA DSBs are repaired by intracellular machinery, for example, by non-homologous end joining (NHEJ).
  • the repair process can result in sequence variants including, for example, insertions and deletions (indels).
  • Quantification of the frequency of indels at the target cleavage site within an edited cell population is important for evaluating the efficacy of a nucleic-acid guided nuclease editing system.
  • nucleic-acid guided nucleases are now well known, with additional naturally-occurring and engineered nucleases being discovered and characterized.
  • Novel nucleases may be poorly characterized, i.e. their cleavage site and editing window relative to the user-designated target sequence being unknown.
  • Existing indel quantification data analysis pipelines for Next Generation Sequencing (NGS) rely on the assumption that cleavage site and editing window are established, which is not the case for novel poorly characterized nucleases.
  • Existing computational tools can undercount indels at the target cleavage site within an edited cell population. Few computational methods have been developed so far to address this need.
  • the present disclosure is based, in part, on the discovery that a data analysis pipeline can be configured to quantify insertion and deletions of a target polynucleotide sequence that has been cleaved by a nucleic-acid guided nuclease, even if the cleavage site of the nuclease relative to the user-designated target sequence is not known.
  • the methods disclosed herein include a computational pipeline for indel quantification for polynucleotide sequences that have been cleaved by nucleic-acid guided nucleases, including uncharacterized nucleases.
  • The methods disclosed herein include parameters for alignment of next-generation sequencing (NGS) reads from targeted amplicon sequencing (TAS) of nuclease-edited or control samples that have been obtained and aligned to reference amplicon sequences.
  • Disclosed herein are exemplary experimental and computational data demonstrating that the methods can be used to accurately quantify indels of novel uncharacterized nucleases, for example, nucleases having unknown cleavage sites relative to the user-designated target sequence.
  • provided herein are methods for quantifying insertions and/or deletions in a polynucleotide sequence caused by cleavage of a target polynucleotide sequence by a nucleic acid-guided nuclease.
  • the methods include, in a computer system, receiving sample sequence data comprising a plurality of sequencing reads; filtering the plurality of sequencing reads; aligning the plurality of sequencing reads to a reference sequence; defining a window based on a sequence location within a nucleic acid guide sequence and the locations of the ends of the nucleic acid guide sequence; determining, based on the alignment of each sequencing read of the plurality of sequencing reads to the reference sequence, the number of sequencing reads comprising an insertion or deletion within the window relative to the reference sequence; estimating, based on the number of sequencing reads comprising an insertion or deletion, the quantity of insertions and/or deletions in the polynucleotide sequence mediated by a nucleotide-directed nuclease.
  • The nucleic acid-guided nuclease is a Class 2 nuclease. In some embodiments, the nucleic acid-guided nuclease is a type II nuclease. In some embodiments, the nucleic acid-guided nuclease is SpCas9. In some embodiments, the nucleic acid-guided nuclease is AsCas12a. In some embodiments, the nucleic acid-guided nuclease is a type V nuclease. In some embodiments, the nucleic acid guide is a guide RNA (gRNA). In some embodiments, the nucleic acid guide is a single guide RNA (sgRNA).
  • the center of the window is located at the center of the nucleic acid guide sequence. In some embodiments, the center of the window is located at the site where the nuclease cleaved the polynucleotide sequence. In some embodiments, the length of the window is equivalent to the length of the nucleic acid guide sequence. In some embodiments, the 5' end of the window extends 50 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 50 basepairs 3' to the 3' end of the nucleic acid guide.
  • the 5' end of the window extends 40 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 40 basepairs 3' to the 3' end of the nucleic acid guide. In some embodiments, the 5' end of the window extends 30 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 30 basepairs 3' to the 3' end of the nucleic acid guide. In some embodiments, the 5' end of the window extends 20 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 20 basepairs 3' to the 3' end of the nucleic acid guide.
  • the 5' end of the window extends 10 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 10 basepairs 3' to the 3' end of the nucleic acid guide. In some embodiments, the 5' end of the window extends 5 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 5 basepairs 3' to the 3' end of the nucleic acid guide.
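The window embodiments above can be illustrated with a minimal sketch. It assumes the guide sequence occurs exactly once on the forward strand of the reference amplicon and that the same flank length is applied on both sides. The names `Window`, `guide_window`, and `flank_bp` are hypothetical and are not taken from the disclosure.

```python
from typing import NamedTuple


class Window(NamedTuple):
    start: int  # 0-based, inclusive reference coordinate
    end: int    # 0-based, exclusive reference coordinate


def guide_window(reference: str, guide: str, flank_bp: int = 0) -> Window:
    """Return a quantification window spanning the guide plus flank_bp on each side.

    Assumption: the guide is found verbatim on the forward strand of the
    reference; reverse-strand guides are not handled in this sketch.
    """
    guide_start = reference.find(guide)
    if guide_start < 0:
        raise ValueError("guide sequence not found in reference")
    guide_end = guide_start + len(guide)
    # Clamp the window to the amplicon boundaries.
    start = max(0, guide_start - flank_bp)
    end = min(len(reference), guide_end + flank_bp)
    return Window(start, end)


if __name__ == "__main__":
    guide = "ACGTACGTACGTACGTACGT"
    reference = "T" * 30 + guide + "C" * 30
    print(guide_window(reference, guide, flank_bp=10))
```

With `flank_bp=0` the window has the same length as the guide; larger values (5, 10, 20, 30, 40, or 50) correspond to the wider windows listed above, which do not require knowing where the nuclease cuts relative to the guide.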
  • the method further includes trimming of adapter sequences from the plurality of sequencing reads.
  • the plurality of sequencing reads are generated from targeted amplicon sequencing.
  • the plurality of sequencing reads are generated from targeted amplicon sequencing of DNA isolated from cells edited by the nucleic-acid guided nuclease.
  • the plurality of sequencing reads are paired-end sequencing reads.
  • the method further includes read merging of the paired-end sequencing reads to produce a single read for alignment to the reference sequence.
  • the read merging of the paired-end reads further includes applying a minimum paired-end read overlap score to the plurality of sequencing reads.
  • the minimum paired-end read overlap score is 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100. In some embodiments, the minimum paired-end read overlap score is 10. In some embodiments, the read merging of paired-end reads further comprises applying a maximum paired-end read overlap score to the plurality of sequencing reads. In some embodiments, the maximum paired-end read overlap score is 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300. In some embodiments, the maximum paired-end read overlap score is 100.
  • the filtering step further comprises applying a minimum average read quality score to the plurality of sequencing reads. In some embodiments, the minimum average read quality score is 0, 5, 10, 15, 20, 25, 30, 35, or 40. In some embodiments, the filtering step further comprises applying a minimum single basepair score to the plurality of sequencing reads. In some embodiments, the minimum single basepair score is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20. In some embodiments, the aligning step further comprises applying an amplicon minimum alignment score to the plurality of sequencing reads. In some embodiments, the amplicon minimum alignment score is 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100.
  • the nucleic acid-guided nuclease is a FokI nuclease. In some embodiments, the FokI nuclease is fused to a transcription activator-like (TAL) protein. In some embodiments, the nucleic acid-guided nuclease is a zinc-finger nuclease.
  • a computer program product tangibly embodied on a computer-readable medium, comprising instructions that when executed by one or more processors are configured to: receive sample sequence data comprising a plurality of sequencing reads; filter the plurality of sequencing reads; align the plurality of sequencing reads to a reference sequence; define a window based on a sequence location within a nucleic acid guide sequence and the locations of the ends of the nucleic acid guide sequence; determine, based on the alignment of each sequencing read of the plurality of sequencing reads to the reference sequence, the number of sequencing reads comprising an insertion or deletion within the window relative to the reference sequence; and estimate, based on the number of sequencing reads comprising an insertion or deletion, the quantity of insertions and/or deletions in the polynucleotide sequence mediated by a nucleotide-directed nuclease.
  • FIG. 1A is a cartoon schematic of a Type II nuclease, featuring an sgRNA interacting with a target DNA sequence.
  • PAM (NGG) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
  • FIG. 1B is a cartoon schematic of a Type V nuclease, featuring a gRNA interacting with a target DNA sequence.
  • PAM (TTTN) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
  • FIG. 2A is a diagram of an upstream molecular biology workflow for the methods disclosed herein. Steps include: design primers to amplify editing target; isolate cellular DNA; PCR-amplify editing target with adapters; add next-generation sequencing (NGS) adapters to PCR products; pool barcoded amplicons; bead-purify amplicons; spike in diversity DNA; perform NGS on sequencing instrument.
  • FIG. 2B is a diagram of the computational pipeline disclosed herein. Steps include quality control of received sequencing reads; alignment of sequencing reads to a target sequence; and quantification of indels according to the methods disclosed herein.
  • FIG. 3A is a schematic representation of received sequencing reads of a nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
  • the nuclease is Cas9, a Type II nuclease with a known cleavage site relative to the guide sequence. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. The previously known cleavage site is indicated by a rectangular box.
  • FIG. 3B is a schematic representation of received sequencing reads of a nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
  • the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Several indels are missed by standard computational processing methods.
  • FIG. 3C is a schematic representation of received sequencing reads of a nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
  • the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Custom data processing methods capture all indels generated by cleavage by the nuclease.
  • FIG. 4A is a plot comparing a standard data processing pipeline for a known nuclease, Cas9, with the data processing pipeline disclosed herein for Cas9.
  • FIG. 4B is a schematic of a dilution experiment to validate the data processing pipeline disclosed herein, wherein nucleic acid-guided nuclease-edited target sequence is serially diluted with non-edited target sequence in order to evaluate the efficacy of the data processing methods disclosed herein.
  • FIG. 4C is a plot of expected indel percentage (x-axis) vs. observed indel percentage (y-axis) for the experiment depicted in the schematic of FIG. 4B.
  • FIG. 5 is a diagram of computer system components that can be used to implement a computational pipeline for indel quantification for polynucleotide sequences that have been cleaved by nucleic-acid guided nucleases, including uncharacterized nucleases.
  • FIG. 6 is a plot of indel percentage (y-axis) for two nucleases (x-axis): Cas9 (reference) and a novel Type V nuclease, with escalating doses of the novel nuclease using two different guide sequences, A1 and F2.
  • FIG. 7A is a schematic of an experiment where frequency of a 21-nucleotide insertion was quantified by the computational pipeline disclosed herein.
  • FIG. 7B is a schematic representation of received sequencing reads of the amplicons depicted in FIG. 7A, comprising a plurality of sequencing reads, a subset of which contain the 21-nucleotide insertion in a spike-in dilution experiment.
  • FIG. 7C is a schematic representation of received sequencing reads of the amplicons depicted in FIG. 7A, comprising a plurality of sequencing reads, a subset of which contain the 21-nucleotide insertion in a spike-in dilution experiment.
  • FIG. 7D is a schematic representation of received sequencing reads of the amplicons depicted in FIG. 7A, comprising a plurality of sequencing reads, a subset of which contain the 21-nucleotide insertion in a spike-in dilution experiment.
  • As used herein, the terms “nucleic acid,” “polynucleotide,” and “oligonucleotide” are interchangeable and refer to a deoxyribonucleotide or ribonucleotide polymer, in linear or circular conformation, and in either single- or double-stranded form. For the purposes of the present disclosure, these terms are not to be construed as limiting with respect to the length of the polymer.
  • The terms can encompass known analogues of natural nucleotides, as well as nucleotides that are modified in the base, sugar and/or phosphate moieties (e.g., phosphorothioate backbones). In general and unless otherwise specified, an analogue of a particular nucleotide has the same base-pairing specificity; i.e., an analogue of A will base-pair with T.
  • CRISPR refers to clustered regularly interspaced short palindromic repeats or any of the DNA loci that serve to direct CRISPR-associated proteins or similar nucleotide-directed nucleases. It also describes man-made, constructed, or selected systems derived using these frameworks or proteins. CRISPR systems and the related proteins vary among the currently described type I, type II and type III systems, though it is possible other analogous systems have yet to be described.
  • CRISPR system refers collectively to transcripts and other elements involved in the expression of or directing the activity of CRISPR-associated (“Cas”) genes, including sequences encoding a Cas gene, a tracr (trans-activating CRISPR) sequence (e.g., tracrRNA or an active partial tracrRNA), a tracr-mate sequence (encompassing a "direct repeat” and a tracrRNA-processed partial direct repeat in the context of an endogenous CRISPR system), a guide sequence (also referred to as a "spacer” in the context of an endogenous CRISPR system), and other sequences and transcripts from a CRISPR locus.
  • One or more tracr mate sequences operably linked to a guide sequence can also be referred to as precrRNA (pre-CRISPR RNA) before processing or crRNA after processing by a nuclease.
  • CRISPR systems can also include modified, swapped or engineered guide, tracr or chimeric RNA sequences and the protein with which they interact (for example, Briner, et al., Mol Cell 56(2):333-9 (2014)).
  • the methods disclosed herein may also be applicable to other, non-CRISPR nucleotide-directed nucleases.
  • The term “guide sequence” refers to the portion of, for example, a guide RNA (gRNA) or single guide RNA (sgRNA) that confers the specificity of the nucleic acid-guided nuclease to its target, and that mediates the formation of the RNA-DNA duplex between the targeting RNA and the target DNA sequence.
  • The targeting specificity of a CRISPR-Cas9 complex is determined by the approximately 20-nt sequence at the 5' end of the gRNA.
  • The length of a guide sequence is typically between 17 and 24 bp.
  • “center of the guide sequence” refers to the midpoint of the guide sequence.
  • cleavage refers to the breakage of the covalent backbone of a nucleic acid molecule. Cleavage can be initiated by a variety of methods including, but not limited to, enzymatic or chemical hydrolysis of a phosphodiester bond.
  • cleavage refers to the double-stranded cleavage between nucleic acids within a double-stranded DNA or RNA chain.
  • genomic region or “genomic segment”, as used interchangeably herein, denote a contiguous length of nucleotides in a genome of an organism.
  • a genomic region may be of a length as small as a few kb (e.g., at least 5 kb, at least 10 kb or at least 20 kb), up to an entire chromosome or more.
  • nucleotide sequences are provided using character representations recommended by the International Union of Pure and Applied Chemistry (IUPAC) or a subset thereof.
  • the set {A, C, G, T, U} for adenosine, cytidine, guanosine, thymidine, and uridine respectively.
  • the set {A, C, G, T, U, I, X, W} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, and pseudouridine respectively.
  • the set of characters is {A, C, G, T, U, I, X, P, R, Y, N} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, pseudouridine, unspecified purine, unspecified pyrimidine, and unspecified nucleotide respectively.
  • the modified sequences, non-natural sequences, or sequences with modified binding, may be in the genomic, the guide or the tracr sequences.
  • Nucleotide and/or amino acid sequence identity percent is understood as the percentage of nucleotide or amino acid residues that are identical with nucleotide or amino acid residues in a candidate sequence in comparison to a reference sequence when the two sequences are aligned. To determine percent identity, sequences are aligned and if necessary, gaps are introduced to achieve the maximum percent sequence identity. Sequence alignment procedures to determine percent identity are well known to those of skill in the art. Often publicly available computer software such as BLAST, BLAST2, ALIGN2 or MEGALIGN (DNASTAR) software is used to align sequences. Those skilled in the art can determine appropriate parameters for measuring alignment, including any algorithms needed to achieve maximal alignment over the full-length of the sequences being compared.
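As a rough illustration of the percent identity definition above, the following sketch computes identity from two sequences that have already been aligned, with `-` marking gaps. The denominator used here (all alignment columns that are not gap-versus-gap) is one common convention and is an assumption; the disclosure does not fix a formula, and tools such as BLAST may report identity differently. The function name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Inputs must be pre-aligned and of equal length; '-' denotes a gap.
    Assumption: columns where both sequences have a gap are ignored, and
    every other column counts toward the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b)
        for a, b in zip(aligned_a, aligned_b)
        if not (a == "-" and b == "-")
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACGT-A", "ACCTTA"))
```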
  • mutation encompasses any change in a DNA, RNA, or protein sequence from the wild type sequence or some other reference, including without limitation point mutations, transitions, insertions, transversions, translocations, deletions, inversions, duplications, recombinations, or combinations thereof.
  • The term “insertion” is used when the polynucleotide sequence has one or more extra bases compared with the polynucleotide sequence before cleavage by the RNA-guided nuclease occurred.
  • The term “deletion” is used when the polynucleotide sequence has one or more missing bases compared with the polynucleotide sequence before cleavage by the RNA-guided nuclease occurred.
  • The term “indel” indicates either insertions or deletions. Cleavage by an RNA-guided nuclease can result in multiple indels, multiple insertions, multiple deletions or combinations of insertions of one or more nucleotides and deletions of one or more nucleotides.
  • The CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats)/Cas system has been adapted for gene editing (silencing, enhancing or changing specific genes) in eukaryotes (see, for example, Cong, Science, 339(6121):819-823 (2013) and Jinek, et al., Science, 337(6096):816-21 (2012)).
  • By transfecting a cell with the required elements including a cas gene and specifically designed CRISPRs, a polynucleotide sequence can be cut and modified at virtually any desired location by unique targeting by, for example, a guide RNA that confers specificity to the nuclease.
  • a number of methods exist for introducing the guide strand and Cas protein into cells including viral transduction, injection or micro-injection, nano-particle or other delivery, uptake of proteins, uptake of RNA or DNA, uptake of combination of protein and RNA or DNA. Combinations of methods can also be used, simultaneously or in sequence.
  • RNA, DNA or protein can occur with or without further protein expression.
  • Methods of preparing compositions for use in genome editing using the CRISPR/Cas systems are described in detail in WO 2013/176772 and WO 2014/018423, which are specifically incorporated by reference herein in their entireties.
  • the nuclease for use in the methods described herein is a Class 2 Cas nuclease.
  • the nuclease has double-strand endonuclease activity.
  • the nuclease comprises a Cas nuclease, such as a Class 2 Cas nuclease (which may be, e.g., a Cas nuclease of Type II, V, or VI).
  • Class 2 Cas nucleases include, for example, Cas9, Cpf1, C2c1, C2c2, and C2c3 proteins and modifications thereof. Examples of Cas9 nucleases include those of the type II CRISPR systems of S. pyogenes, S.
  • FIG. 1A shows a Type II nuclease, featuring an sgRNA interacting with a target DNA sequence.
  • PAM (NGG) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
  • Cas nucleases include a Csm or Cmr complex of a type III CRISPR system or the Cas10, Csm1, or Cmr2 subunit thereof; and a Cascade complex of a type I CRISPR system, or the Cas3 subunit thereof.
  • FIG. 1B shows a Type V nuclease, featuring a gRNA interacting with a target DNA sequence. PAM (TTTN) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
  • The Cas nuclease may be from a Type-IIA, Type-IIB, or Type-IIC system.
  • the RNA-guided DNA binding agent is a Cas nickase, e.g. a Cas9 nickase.
  • the RNA-guided DNA binding agent is an S. pyogenes Cas9 nuclease.
  • Non-limiting exemplary species that the nuclease can be derived from include Streptococcus pyogenes, Streptococcus thermophilus, Streptococcus sp., Staphylococcus aureus, Listeria innocua, Lactobacillus gasseri, Francisella novicida, Wolinella succinogenes, Sutterella wadsworthensis, Gammaproteobacterium, Neisseria meningitidis, Campylobacter jejuni, Pasteurella multocida, Fibrobacter succinogenes, Rhodospirillum rubrum, Nocardiopsis rougevillei, Streptomyces pristinaespiralis, Streptomyces viridochromogenes, Streptosporangium roseum, AU
  • The Cas nuclease is the Cas9 nuclease from Streptococcus pyogenes. In some embodiments, the Cas nuclease is the Cas9 nuclease from Streptococcus thermophilus. In some embodiments, the Cas nuclease is the Cas9 nuclease from Neisseria meningitidis. In some embodiments, the Cas nuclease is the Cas9 nuclease from Staphylococcus aureus. In some embodiments, the Cas nuclease is the Cpf1 nuclease from Francisella novicida.
  • The Cas nuclease is the Cpf1 nuclease from Acidaminococcus sp. In some embodiments, the Cas nuclease is the Cpf1 nuclease from Lachnospiraceae bacterium ND2006.
  • The Cas nuclease is the Cpf1 nuclease from Francisella tularensis, Lachnospiraceae bacterium, Butyrivibrio proteoclasticus, Peregrinibacteria bacterium, Parcubacteria bacterium, Smithella, Acidaminococcus, Candidatus Methanoplasma termitum, Eubacterium eligens, Moraxella bovoculi, Leptospira inadai, Porphyromonas crevioricanis, Prevotella disiens, or Porphyromonas macacae.
  • The Cas nuclease is a Cpf1 nuclease from an Acidaminococcus or Lachnospiraceae.
  • Wild type Cas9 has two nuclease domains: RuvC and HNH.
  • the RuvC domain cleaves the non-target DNA strand
  • the HNH domain cleaves the target strand of DNA.
  • the Cas9 nuclease comprises more than one RuvC domain and/or more than one HNH domain.
  • the Cas9 nuclease is a wild type Cas9.
  • the Cas9 is capable of inducing a double strand break in target DNA.
  • the Cas nuclease can cleave one or both strands of dsDNA.
  • the Cas nuclease can cleave a single strand of DNA.
  • the Cas nuclease may not have DNA nickase activity.
  • chimeric Cas nucleases are used, where one domain or region of the protein is replaced by a portion of a different protein.
  • A Cas nuclease domain may be replaced with a domain from a different nuclease such as FokI.
  • a Cas nuclease may be a modified nuclease, wherein the polypeptide sequence of the nuclease has been modified to confer, in some examples, advantageous properties to the nuclease.
  • the cleavage site of the nuclease relative to the location of the user-designated target sequence is unknown.
  • The Cas nuclease may be from a Type-I CRISPR/Cas system. In some embodiments, the Cas nuclease may be a component of the Cascade complex of a Type-I CRISPR/Cas system. In some embodiments, the Cas nuclease may be a Cas3 protein. In some embodiments, the Cas nuclease may be from a Type-III CRISPR/Cas system. In some embodiments, the Cas nuclease may have an RNA cleavage activity.
  • a data processing pipeline including the steps of receiving sample sequence data; applying quality control filters to the sample sequence data; aligning sequencing reads of the sample sequence data to a reference sequence; defining a window based on a sequence location within a nucleic acid guide and the locations of the ends of the nucleic acid guide sequence; and quantifying indels in the sample sequence data.
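The final quantification step of the pipeline described above can be sketched as follows. This is a simplified, hypothetical illustration and not the CRISPResso-based implementation referenced in the disclosure: it assumes each read has already been aligned to the reference and is represented by a 0-based reference start position and a SAM-style CIGAR string. The names `has_indel_in_window` and `indel_frequency` are invented for this example, and the choice to count an insertion lying exactly on a window edge is an assumption.

```python
import re

# SAM-style CIGAR operations: a length followed by a one-letter operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def has_indel_in_window(ref_start: int, cigar: str, win_start: int, win_end: int) -> bool:
    """True if the alignment has an insertion or deletion touching [win_start, win_end)."""
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "I":
            # An insertion sits between reference bases ref_pos - 1 and ref_pos.
            # Assumption: insertions on either window edge are counted.
            if win_start <= ref_pos <= win_end:
                return True
        elif op == "D":
            # A deletion removes reference bases [ref_pos, ref_pos + n).
            if ref_pos < win_end and ref_pos + n > win_start:
                return True
            ref_pos += n
        elif op in "M=XN":
            ref_pos += n
        # S, H and P consume no reference bases.
    return False


def indel_frequency(alignments, win_start: int, win_end: int) -> float:
    """Percentage of aligned reads carrying at least one indel inside the window."""
    total = len(alignments)
    if total == 0:
        return 0.0
    edited = sum(
        has_indel_in_window(start, cigar, win_start, win_end)
        for start, cigar in alignments
    )
    return 100.0 * edited / total


if __name__ == "__main__":
    reads = [(0, "50M"), (0, "20M3D27M"), (0, "22M2I26M"), (0, "5M1D44M")]
    print(indel_frequency(reads, win_start=15, win_end=35))
```

In this example the deletion at reference position 5 lies outside the window and is not counted, which mirrors the idea that only indels within the guide-based window contribute to the estimate.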
  • the plurality of sequencing reads is obtained from a next-generation sequencing (NGS) instrument.
  • The NGS instrument is an Illumina MiSeq™ machine.
  • sequencing and amplification adapters are trimmed by default, and so the returned NGS reads data do not have adapter sequences in the reads.
  • sequencing and amplification adapters are trimmed as a part of the data processing pipeline of the methods disclosed herein.
  • FIG. 2A shows steps of an upstream molecular biology workflow that can be performed to generate data that is subsequently processed by the data processing pipeline disclosed herein.
  • Steps of the upstream molecular biology workflow can include designing primers to amplify an editing target; isolating cellular DNA; PCR-amplifying the editing target with adapters; adding next-generation sequencing (NGS) adapters to PCR products; pooling barcoded amplicons; bead-purifying amplicons; spiking in diversity DNA; performing next-generation sequencing (NGS) on a sequencing instrument.
  • FIG. 2B shows steps of the computational pipeline disclosed herein. Steps can include quality control of received sequencing reads; alignment of sequencing reads to a target sequence; and quantification of indels according to the methods disclosed herein. These steps are described in further detail below.
  • FIG. 3A is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
  • the nuclease is Cas9, a Type II nuclease with a known cleavage site relative to the guide sequence.
  • the number of sequencing reads comprising that indel is indicated at right.
  • the previously known cleavage site is indicated by a rectangular box.
  • FIG. 3B is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
  • the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Several indels are missed by standard computational processing methods.
  • FIG. 3C is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
  • the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Custom data processing methods capture all indels generated by cleavage by the nuclease.
  • the data processing pipeline disclosed herein includes one or more modules executed by the package of CRISPResso software tools (Clement K, et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat Biotechnol. 2019 Mar;37(3):224-226., and Canver MC, et al. Integrated design, execution, and analysis of arrayed and pooled CRISPR genome-editing experiments. Nat Protoc. 2018 May;13(5):946-986., incorporated herein by reference in their entirety).
  • one or more read filtering parameters are applied to sample sequence data using CRISPResso in order to remove potentially false-positive indels from the sample sequence data, in order to improve the accuracy of the estimation of the frequency of indels in the sample sequence data.
  • Read filtering parameters are described in further detail below.
  • read filtering is performed based on PHRED quality scores, which are described in Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998 Mar;8(3):186-94, which is incorporated by reference herein in its entirety.
  • PHRED quality scores measure the quality of the identification of nucleotide base calls in sequencing reads generated by automated DNA sequence instruments.
  • minimum average read quality (“q” or “min_average_read_quality”) is applied to filter sample sequence data, in order to remove potentially false-positive indels.
  • This parameter allows for the specification of the minimum average quality score for inclusion of a read in subsequent analysis.
  • the PHRED score represents the confidence in the assignment of a particular nucleotide in a read.
  • the maximum score of 40 corresponds to an error rate of 0.01%. This average quality of a read is useful to filter out low-quality reads.
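The relationship between a PHRED score and an error rate stated above follows the standard definition Q = -10 log10(P). A minimal sketch of the conversion, with function names chosen here purely for illustration:

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Return the base-call error probability P for a PHRED score Q (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Return the PHRED score Q for an error probability P (Q = -10 * log10(P))."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Print the error rate, as a percentage, for a few common score thresholds.
    for q in (10, 20, 30, 40):
        print(q, f"{phred_to_error_probability(q):.4%}")
```

Under this definition a score of 40 corresponds to an error probability of 1 in 10,000, which matches the 0.01% figure given in the text.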
  • a “min_average_read_quality” value of 0, 5, 10, 15, 20, 25, 30, 35, or 40, is applied to sample sequence data.
  • minimum single basepair score (“s” or “min_single_bp_quality”) is applied to filter sample sequence data, in order to remove potentially false-positive indels.
  • This parameter allows for the specification of the minimum single-bp score for inclusion of a read in subsequent analysis. This parameter provides for more-stringent filtering; any read with a single-bp quality below the threshold will be discarded.
  • a “min_single_bp_quality” value of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20, is applied to sample sequence data.
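The two quality filters described above can be sketched as a single predicate over a read's quality string. This is an illustrative sketch only, not the CRISPResso implementation: it assumes Phred+33 (Sanger/Illumina 1.8+) FASTQ encoding, and the function names are hypothetical.

```python
from statistics import mean


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def passes_quality_filters(
    quality_string: str,
    min_average_read_quality: int = 0,
    min_single_bp_quality: int = 0,
) -> bool:
    """Keep a read only if its mean score and its lowest single-base score
    both meet the respective thresholds."""
    scores = decode_phred33(quality_string)
    if not scores:
        return False  # an empty read carries no usable evidence
    return (
        mean(scores) >= min_average_read_quality
        and min(scores) >= min_single_bp_quality
    )
```

A read with one very poor base can pass the average filter yet fail the single-bp filter, which is why the text describes the latter as the more stringent of the two.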
  • an amplicon minimum alignment score (“amas” or “amplicon_min_alignment_score”) is applied to filter sample sequence data. After reads are aligned to a reference sequence, the homology is calculated as the number of basepairs they have in common. This is useful for filtering erroneous reads that do not align to the target sequence, for example arising from alternate primer locations.
  • an “amplicon_min_alignment_score” value of 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100, is applied to sample sequence data.
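As a rough illustration of the homology calculation described above, the sketch below scores a read that has already been aligned to the reference (gaps written as "-"). Expressing the score as a percentage of alignment columns is an assumption made here because the listed thresholds range from 0 to 100; the function name is hypothetical.

```python
def homology_score(aligned_read: str, aligned_ref: str) -> float:
    """Percentage of alignment columns in which read and reference carry the same base.

    Both inputs must be gapped strings of equal length, as produced by a
    pairwise aligner; gap columns never count as matches.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_read:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_read, aligned_ref) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_read)


def passes_alignment_filter(aligned_read: str, aligned_ref: str, min_score: float) -> bool:
    """Keep a read only if its homology score reaches the chosen threshold."""
    return homology_score(aligned_read, aligned_ref) >= min_score
```

A read amplified from an off-target primer site would share few columns with the reference and fall below a threshold such as 60, while a read carrying a short indel would still score well above it.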
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score,” further in combination with a window size based on a sequence location within a guide sequence, is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
  • combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score,” further in combination with a window size based on a sequence location of a known insertion site, is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease and insertion of a polynucleotide sequence.
  • combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
  • the data processing pipeline disclosed herein includes one or more modules executed by the package of FLASH software tools (Magoc T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011 Nov 1;27(21):2957-63., incorporated by reference herein in its entirety).
  • FLASH is a rapid and accurate software tool to merge paired-end reads from next-generation sequencing experiments, designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads. The resulting longer reads can significantly improve genome assemblies.
  • FLASH calculates a mismatch ratio within two overlapped regions.
  • FLASH determines that the reads are an incorrect overlap.
  • paired-end reads are merged using FLASH in order to produce single reads for alignment to the target reference sequence and to reduce sequencing errors that may be present at the ends of sequencing reads.
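The idea of scoring a candidate overlap by its mismatch ratio can be sketched as follows. This is a simplified illustration and not FLASH's actual algorithm: it ignores base qualities, keeps read 1's bases throughout the overlap, and uses hypothetical function names and an illustrative mismatch threshold.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatch_ratio(read1: str, read2_rc: str, overlap: int) -> float:
    """Fraction of mismatching positions when the last `overlap` bases of read1
    are laid over the first `overlap` bases of the reverse-complemented read2."""
    tail, head = read1[-overlap:], read2_rc[:overlap]
    return sum(a != b for a, b in zip(tail, head)) / overlap


def merge_pair(read1, read2, min_overlap=10, max_overlap=100, max_mismatch_ratio=0.25):
    """Merge a read pair at the overlap with the lowest mismatch ratio.

    Returns the merged sequence, or None when no overlap between min_overlap
    and max_overlap has a mismatch ratio at or below the threshold.
    """
    read2_rc = reverse_complement(read2)
    longest = min(max_overlap, len(read1), len(read2_rc))
    best = None  # (ratio, overlap); ties resolve to the longer overlap
    for overlap in range(longest, min_overlap - 1, -1):
        ratio = mismatch_ratio(read1, read2_rc, overlap)
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, overlap)
    if best is None:
        return None
    return read1 + read2_rc[best[1]:]
```

The `min_overlap` and `max_overlap` arguments play the role that the minimum and maximum paired-end overlap parameters play in the pipeline: they bound which overlaps are considered at all.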
  • a maximum paired-end reads overlap (“max_paired_end_reads_overlap”) is applied to paired-end sample sequence data for the FLASH read merging step of the data processing pipeline. This parameter represents the maximum overlap length expected in approximately 90% of read pairs.
  • a “max_paired_end_reads_overlap” value of 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 is applied to sample sequence data.
  • a minimum paired-end reads overlap (“min_paired_end_reads_overlap”) is applied to paired-end sample sequence data for the FLASH read merging step of the data processing pipeline. This parameter represents the minimum required overlap length between two reads to provide a confident overlap.
  • a “min_paired_end_reads_overlap” value of 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100, is applied to sample sequence data.
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” further in combination with a window size based on a sequence location within a guide sequence is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
  • combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
  • a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” further in combination with a window size based on a sequence location of a known insertion site is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease and insertion of a polynucleotide sequence.
  • combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
  • the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 5' to the 5' end of the guide sequence. In some embodiments, the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 3' to the 3' end of the guide sequence. In some embodiments, the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 5' to a known insertion site. In some embodiments, the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 3' to a known insertion site. In some embodiments, the quantification window is 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, or 500 nucleotides in length.
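A quantification window defined relative to the guide sequence can be sketched as a pair of amplicon coordinates. The sketch assumes the guide occurs on the forward strand of the amplicon and uses 0-based, half-open coordinates; the function names and the convention of representing an insertion as a 1-bp interval at its anchor position are choices made here for illustration.

```python
def quantification_window(
    amplicon: str, guide: str, flank_5p: int = 10, flank_3p: int = 10
) -> tuple[int, int]:
    """Return (start, end) of a window spanning the guide plus the given flanks.

    The window is clipped to the amplicon boundaries.
    """
    start = amplicon.find(guide)
    if start < 0:
        raise ValueError("guide sequence not found on the forward strand of the amplicon")
    end = start + len(guide)
    return max(0, start - flank_5p), min(len(amplicon), end + flank_3p)


def indel_overlaps_window(indel_start: int, indel_end: int, window: tuple[int, int]) -> bool:
    """True when the half-open indel interval intersects the half-open window."""
    window_start, window_end = window
    return indel_start < window_end and indel_end > window_start
```

Because the window covers the whole guide and its flanks instead of a few bases around a presumed cut site, it does not require the cleavage position to be known in advance, which is the situation described for uncharacterized nucleases.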
  • the indel to be detected is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more nucleotides in length. In some embodiments, the indel to be detected is about 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more nucleotides in length. In some embodiments, the indel comprises a CRISPR-mediated donor insertion. In some embodiments, the indel is the result of homology directed repair (HDR). In some embodiments, the indel is the result of insertion by a CRISPR-associated transposase (CAST). In some embodiments, the insertion comprises sequences from a genomic library.
  • FIG. 5 is a diagram of computer system 500 components that can be used to implement a computational pipeline for indel quantification for polynucleotide sequences that have been cleaved by nucleic-acid guided nucleases, including uncharacterized nucleases.
  • Computer system 500 can be used to implement methods that include parameters for alignment of next-generation sequencing (NGS) reads from targeted amplicon sequencing (TAS) of nuclease-edited or control samples that have been obtained and aligned to reference amplicon sequences.
  • Computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
  • Computing device 550 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. Additionally, computing device 500 or 550 can include Universal Serial Bus (USB) flash drives.
  • USB flash drives can store operating systems and other applications.
  • the USB flash drives can include input/output components, such as a wireless transmitter or USB connector that can be inserted into a USB port of another computing device.
  • the components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the methods and compositions described and/or claimed in this document.
  • Computing device 500 includes a processor 502, memory 504, a storage device 506, a high-speed interface 508 connecting to memory 504 and high-speed expansion ports 510, and a low speed interface 512 connecting to low speed bus 514 and storage device 506.
  • Each of the components 502, 504, 506, 508, 510, and 512 are interconnected using various busses, and can be mounted on a common motherboard or in other manners as appropriate.
  • the processor 502 can process instructions for execution within the computing device 500, including instructions stored in the memory 504 or on the storage device 506 to display graphical information for a GUI on an external input/output device, such as display 516 coupled to high speed interface 508.
  • multiple processors and/or multiple buses can be used, as appropriate, along with multiple memories and types of memory.
  • multiple computing devices 500 can be connected, with each device providing portions of the necessary operations, e.g., as a server bank, a group of blade servers, or a multi-processor system.
  • the memory 504 stores information within the computing device 500.
  • the memory 504 is a volatile memory unit or units.
  • the memory 504 is a non-volatile memory unit or units.
  • the memory 504 can also be another form of computer-readable medium, such as a magnetic or optical disk.
  • the storage device 506 is capable of providing mass storage for the computing device 500.
  • the storage device 506 can be or contain a computer- readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations.
  • a computer program product can be tangibly embodied in an information carrier.
  • the computer program product can also contain instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 504, the storage device 506, or memory on processor 502.
  • the high-speed controller 508 manages bandwidth-intensive operations for the computing device 500, while the low speed controller 512 manages lower bandwidth intensive operations. Such allocation of functions is only an example.
  • the high-speed controller 508 is coupled to memory 504, display 516, e.g., through a graphics processor or accelerator, and to high-speed expansion ports 510, which can accept various expansion cards (not shown).
  • low-speed controller 512 is coupled to storage device 506 and low-speed expansion port 514.
  • the low-speed expansion port, which can include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), can be coupled to one or more input/output devices, such as a keyboard, a pointing device, microphone/speaker pair, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
  • the computing device 500 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a standard server 520, or multiple times in a group of such servers. It can also be implemented as part of a rack server system 524. In addition, it can be implemented in a personal computer such as a laptop computer 522.
  • components from computing device 500 can be combined with other components in a mobile device (not shown), such as device 550.
  • Each of such devices can contain one or more of computing device 500, 550, and an entire system can be made up of multiple computing devices 500, 550 communicating with each other.
  • Computing device 550 includes a processor 552, memory 564, and an input/output device such as a display 554, a communication interface 566, and a transceiver 568, among other components.
  • the device 550 can also be provided with a storage device, such as a micro-drive or other device, to provide additional storage.
  • Each of the components 550, 552, 564, 554, 566, and 568 are interconnected using various buses, and several of the components can be mounted on a common motherboard or in other manners as appropriate.
  • the processor 552 can execute instructions within the computing device 550, including instructions stored in the memory 564.
  • the processor can be implemented as a chipset of chips that include separate and multiple analog and digital processors. Additionally, the processor can be implemented using any of a number of architectures.
  • the processor 552 can be a CISC (Complex Instruction Set Computer) processor, a RISC (Reduced Instruction Set Computer) processor, or a MISC (Minimal Instruction Set Computer) processor.
  • the processor can provide, for example, for coordination of the other components of the device 550, such as control of user interfaces, applications run by device 550, and wireless communication by device 550.
  • Processor 552 can communicate with a user through control interface 558 and display interface 556 coupled to a display 554.
  • the display 554 can be, for example, a TFT (Thin-Film-Transistor Liquid Crystal Display) display or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
  • the display interface 556 can comprise appropriate circuitry for driving the display 554 to present graphical and other information to a user.
  • the control interface 558 can receive commands from a user and convert them for submission to the processor 552.
  • an external interface 562 can be provided in communication with processor 552, so as to enable near area communication of device 550 with other devices.
  • External interface 562 can provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces can also be used.
  • the memory 564 stores information within the computing device 550.
  • the memory 564 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
  • Expansion memory 574 can also be provided and connected to device 550 through expansion interface 572, which can include, for example, a SIMM (Single In Line Memory Module) card interface.
  • expansion memory 574 can provide extra storage space for device 550, or can also store applications or other information for device 550.
  • expansion memory 574 can include instructions to carry out or supplement the processes described above, and can also include secure information.
  • expansion memory 574 can be provided as a security module for device 550, and can be programmed with instructions that permit secure use of device 550.
  • secure applications can be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
  • the memory can include, for example, flash memory and/or NVRAM memory, as discussed below.
  • a computer program product is tangibly embodied in an information carrier.
  • the computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 564, expansion memory 574, or memory on processor 552 that can be received, for example, over transceiver 568 or external interface 562.
  • Device 550 can communicate wirelessly through communication interface 566, which can include digital signal processing circuitry where necessary. Communication interface 566 can provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication can occur, for example, through radio-frequency transceiver 568. In addition, short-range communication can occur, such as using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 570 can provide additional navigation- and location-related wireless data to device 550, which can be used as appropriate by applications running on device 550.
  • Device 550 can also communicate audibly using audio codec 560, which can receive spoken information from a user and convert it to usable digital information. Audio codec 560 can likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 550. Such sound can include sound from voice telephone calls, can include recorded sound, e.g., voice messages, music files, etc. and can also include sound generated by applications operating on device 550.
  • the computing device 550 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a cellular telephone 580. It can also be implemented as part of a smartphone 582, personal digital assistant, or other similar mobile device.
  • implementations of the systems and methods described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations of such implementations.
  • These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • the systems and techniques described here can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • the systems and techniques described here can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here, or any combination of such back end, middleware, or front end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • a number of embodiments have been described. Nevertheless, it will be understood that various modifications can be made without departing from the spirit and scope of the invention.
  • the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results.
  • other steps can be provided, or steps can be eliminated, from the described flows, and other components can be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.
  • Embodiments of the disclosure and all of the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the methods and compositions can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus.
  • the computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
  • data processing apparatus encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program does not necessarily correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few.
  • Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • embodiments of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • Embodiments of the disclosure can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the methods, or any combination of one or more such back end, middleware, or front end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. While this specification contains many specifics, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
  • In each instance where an HTML file is mentioned, other file types or formats may be substituted. For instance, an HTML file may be replaced by an XML, JSON, plain text, or other type of file. Moreover, where a table or hash table is mentioned, other data structures (such as spreadsheets, relational databases, or structured files) may be used.
  • EXAMPLE 1: Benchmarking the indel quantification method using a well-characterized nuclease
  • The data processing pipeline is applied to data generated by next-generation sequencing of mammalian cells edited by SpCas9, a Type II nuclease with a known cleavage site and editing pattern.
  • The percentage of cells estimated to comprise an indel at the target DNA sequence after editing is determined both with standard CRISPResso2 processing parameters and with the methods disclosed herein. As shown in FIG.
  • the methods disclosed herein perform similarly to the standard CRISPResso2 processing parameters, indicating that the method neither over- nor underestimates the percentage of cells comprising an indel at the target DNA sequence after editing with a well-characterized nuclease with a known cleavage site and editing pattern.
  • The data processing pipeline was applied to data generated by next-generation sequencing of mammalian cells edited by a Type V nuclease with an unknown cleavage site in a serial dilution experiment.
  • DNA isolated from mammalian cells edited by the Type V nuclease was serially diluted with DNA isolated from non-edited cells in proportions of, for example, 0% non-edited / 100% edited; 25% non-edited / 75% edited; 50% non-edited / 50% edited; 75% non-edited / 25% edited; 100% non-edited / 0% edited (FIG. 4B).
  • The serially diluted DNA mixtures were sequenced to produce sequencing reads comprising the edited target sequence.
  • [Figure: observed indel percentages (y-axis) plotted against expected indel percentages (x-axis).]
  • The data processing pipeline is applied to data generated by next-generation sequencing of mammalian cells edited by a novel Type V nuclease at escalating concentrations and with two different guide sequences, A1 and F2.
  • The data processing pipeline accurately estimates indel percentages of editing by the novel Type V nuclease at nuclease concentrations of 5 pM, 11 pM, and 22 pM for guide sequences A1 and F2.
  • The data processing pipeline is applied to data generated by next-generation sequencing of mammalian cells edited by the nuclease AsCas12a to validate the efficacy of the indel quantification methods disclosed herein.
  • The data processing pipeline was applied to data generated by next-generation sequencing of mammalian cells in which a sequence of known length, 21 nucleotides, was introduced at a particular insertion site in a serial dilution spike-in experiment (see FIG. 7A).
  • DNA isolated from mammalian cells comprising the insertion was serially diluted with DNA isolated from cells not comprising the insertion in proportions of, for example, 100% insertion / 0% no insertion; 25% insertion / 75% no insertion; 50% insertion / 50% no insertion; 75% insertion / 25% no insertion; 0% insertion / 100% no insertion (FIG. 4B).
  • The serially diluted DNA mixtures were sequenced to produce sequencing reads comprising the insertion site. As shown in FIG. 7B, for the experiment where the DNA comprising the insertion was spiked in at 25%, the spiked-in sequence was detected at approximately 23%. As shown in FIG.
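
The serial dilution and spike-in experiments above compare an observed percentage with the percentage expected from the mixing proportions. A minimal Python sketch of that comparison follows. It assumes that the unmodified DNA contributes no indels or insertions and that the undiluted modified sample is 100% modified. The names `DilutionPoint`, `expected_pct`, and `recovery` are hypothetical and are not taken from the disclosed pipeline. Only the 25% spike-in point, detected at approximately 23%, comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DilutionPoint:
    """One mixture in a dilution series (hypothetical structure)."""
    modified_fraction: float  # share of DNA from edited / insertion-bearing cells, 0..1
    observed_pct: float       # percentage reported by the quantification pipeline


def expected_pct(modified_fraction: float, undiluted_pct: float) -> float:
    """Expected percentage after mixing.

    Assumption: the unmodified DNA contributes no indels or insertions,
    so the expected signal scales linearly with the modified fraction.
    """
    if not 0.0 <= modified_fraction <= 1.0:
        raise ValueError("modified_fraction must lie in [0, 1]")
    return modified_fraction * undiluted_pct


def recovery(point: DilutionPoint, undiluted_pct: float) -> float:
    """Ratio of observed to expected percentage; 1.0 means exact agreement."""
    expected = expected_pct(point.modified_fraction, undiluted_pct)
    if expected == 0.0:
        raise ValueError("recovery is undefined when the expected percentage is 0")
    return point.observed_pct / expected


# Spike-in at 25%, detected at approximately 23% (values from the text).
# Assumption: the undiluted insertion-bearing sample is 100% insertion.
spike_in = DilutionPoint(modified_fraction=0.25, observed_pct=23.0)
print(f"expected: {expected_pct(spike_in.modified_fraction, 100.0):.1f}%")
print(f"recovery: {recovery(spike_in, 100.0):.2f}")
```

Under these assumptions the 25% spike-in point has an expected value of 25% and a recovery of 0.92. A recovery near 1.0 across all mixtures would indicate that the pipeline neither over- nor underestimates the modified fraction.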

Abstract

The present invention relates to methods for estimating the frequency of insertions and/or deletions mediated by RNA-guided nucleases. The invention relates to methods and systems for processing, by a computational pipeline, sequencing reads of edited target polynucleotide sequences.
PCT/US2023/082543 2022-12-07 2023-12-05 Prediction of indel frequencies WO2024123789A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263430827P 2022-12-07 2022-12-07
US63/430,827 2022-12-07
EP23315141 2023-05-02
EP23315141.4 2023-05-02

Publications (1)

Publication Number Publication Date
WO2024123789A1 true WO2024123789A1 (fr) 2024-06-13

Family

ID=89507456

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/082543 WO2024123789A1 (fr) 2022-12-07 2023-12-05 Prediction of indel frequencies

Country Status (1)

Country Link
WO (1) WO2024123789A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013176772A1 (fr) 2012-05-25 2013-11-28 The Regents Of The University Of California Methods and compositions for RNA-directed target DNA modification and for RNA-directed modulation of transcription
WO2014018423A2 (fr) 2012-07-25 2014-01-30 The Broad Institute, Inc. Inducible DNA binding proteins and genome perturbation tools and applications thereof
US20160312198A1 (en) 2015-03-03 2016-10-27 The General Hospital Corporation Engineered CRISPR-CAS9 NUCLEASES WITH ALTERED PAM SPECIFICITY
US20160312199A1 (en) 2015-03-03 2016-10-27 The General Hospital Corporation Engineered CRISPR-CAS9 Nucleases with Altered PAM Specificity

Non-Patent Citations (12)

* Cited by examiner, † Cited by third party
Title
BRINER ET AL., MOL CELL, vol. 56, no. 2, 2014, pages 333 - 9
CANVER MC ET AL.: "Integrated design, execution, and analysis of arrayed and pooled CRISPR genome-editing experiments", NAT PROTOC., vol. 13, no. 5, May 2018 (2018-05-01), pages 946 - 986, XP055730891, DOI: 10.1038/nprot.2018.005
CLEMENT K ET AL.: "CRISPResso2 provides accurate and rapid genome editing sequence analysis", NAT BIOTECHNOL., vol. 37, no. 3, March 2019 (2019-03-01), pages 224 - 226, XP036900605, DOI: 10.1038/s41587-019-0032-3
CONG, SCIENCE, vol. 339, no. 6121, 2013, pages 819 - 823
EWING B, GREEN P: "Base-calling of automated sequencer traces using phred. II. Error probabilities", GENOME RES, vol. 8, no. 3, March 1998 (1998-03-01), pages 186 - 94, XP000915053
JINEK ET AL., SCIENCE, vol. 337, no. 6096, 2012, pages 816 - 21
KURGAN GAVIN ET AL: "CRISPAltRations: A validated cloud-based approach for interrogation of double-strand break repair mediated by CRISPR genome editing", MOLECULAR THERAPY- METHODS & CLINICAL DEVELOPMENT, vol. 21, 1 June 2021 (2021-06-01), GB, pages 478 - 491, XP093140503, ISSN: 2329-0501, DOI: 10.1016/j.omtm.2021.03.024 *
LABUN KORNEL: "In silico design and analysis of targeted genome editing with CRISPR", 1 January 2020 (2020-01-01), XP093141039, Retrieved from the Internet <URL:https://bora.uib.no/bora-xmlui/handle/1956/21443> [retrieved on 20240313] *
MAGOC T, SALZBERG SL: "FLASH: fast length adjustment of short reads to improve genome assemblies", BIOINFORMATICS, vol. 27, no. 21, 1 November 2011 (2011-11-01), pages 2957 - 63, XP055332486, DOI: 10.1093/bioinformatics/btr507
MAKAROVA ET AL., NAT. REV. MICROBIOL, vol. 13, 2015, pages 722 - 36
MAKAROVA ET AL., NAT. REV. MICROBIOL., vol. 9, 2011, pages 467 - 477
SHMAKOV ET AL., MOLECULAR CELL, vol. 60, 2015, pages 385 - 397

Similar Documents

Publication Publication Date Title
US12116571B2 (en) Compositions and methods for detecting nucleic acid regions
EP3565907B1 (fr) Methods for assessing nuclease cleavage
Kebschull et al. Sources of PCR-induced distortions in high-throughput sequencing data sets
EP3149168B1 (fr) High-throughput assembly of genetic elements
EP3724214A1 (fr) Systems and methods for predicting repair outcomes in genetic engineering
US20220333186A1 (en) Method and system for targeted nucleic acid sequencing
US20230056763A1 (en) Methods of targeted sequencing
Maxwell et al. A detailed cell-free transcription-translation-based assay to decipher CRISPR protospacer-adjacent motifs
EP3018604B1 (fr) Method for assigning target-enriched sequence reads to a genomic location
US20130123117A1 (en) Capture probe and assay for analysis of fragmented nucleic acids
Marinov On the design and prospects of direct RNA sequencing
Kramme et al. MegaGate: A toxin-less gateway molecular cloning tool
WO2024123789A1 (fr) Prediction of indel frequencies
JP2022515085A (ja) Method for synthesizing single-stranded DNA
CN106319033B (zh) Method for detecting chromosomal abnormalities and DNA sequences of recombination sites
US20240182951A1 (en) Methods for targeted nucleic acid sequencing
US20230122979A1 (en) Methods of sample normalization
Selinger et al. CRISPR-MIP replaces PCR and reveals GC and oversampling bias in pooled CRISPR screens
JP2023538537A (ja) Methods for targeted removal of nucleic acids
WO2024157194A1 (fr) Methods and assays for off-target analysis
Maxwell et al. Original publication
WO2023137292A1 (fr) Methods and compositions for transcriptome analysis
McDiarmid et al. Diversified, miniaturized and ancestral parts for mammalian genome engineering and molecular recording
Jakimo Precise and expansive genomic positioning for CRISPR edits
Mighell et al. Cas12a-Capture: a novel, low-cost, and scalable method for targeted sequencing

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23837105

Country of ref document: EP

Kind code of ref document: A1