WO2024123789A1 - Prediction of indel frequencies - Google Patents
- Publication number
- WO2024123789A1 (application PCT/US2023/082543)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- nucleic acid
- nuclease
- sequence
- sequencing reads
- guide
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
Definitions
- This disclosure relates to quantification of insertions and/or deletions in the vicinity of cleavage sites within a polynucleotide target sequence.
- Nucleic acid-guided nucleases can be used to edit polynucleotide sequences, for example a genome of an organism, at targeted locations with high precision. Nucleases are enzymes capable of cleaving the phosphodiester bonds between nucleotides of nucleic acids. Genome editing methods include use of CRISPR (clustered regularly interspaced short palindromic repeats)-associated proteins or similar nucleic acid-guided nucleases to induce DNA double-strand breaks (DSBs) at predictable genomic positions relative to the user-designated target sequence. DNA DSBs are repaired by intracellular machinery, for example, by non-homologous end joining (NHEJ).
- The repair process can result in sequence variants including, for example, insertions and deletions (indels).
- Quantification of the frequency of indels at the target cleavage site within an edited cell population is important for evaluating the efficacy of a nucleic acid-guided nuclease editing system.
- Nucleic acid-guided nucleases are now well known, with additional naturally occurring and engineered nucleases being discovered and characterized.
- Novel nucleases may be poorly characterized; that is, their cleavage site and editing window relative to the user-designated target sequence may be unknown.
- Existing indel quantification data analysis pipelines for Next Generation Sequencing (NGS) rely on the assumption that the cleavage site and editing window are established, which is not the case for novel, poorly characterized nucleases.
- Existing computational tools can therefore undercount indels at the target cleavage site within an edited cell population. Few computational methods have been developed so far to address this need.
- The present disclosure is based, in part, on the discovery that a data analysis pipeline can be configured to quantify insertions and deletions in a target polynucleotide sequence that has been cleaved by a nucleic acid-guided nuclease, even if the cleavage site of the nuclease relative to the user-designated target sequence is not known.
- The methods disclosed herein include a computational pipeline for indel quantification for polynucleotide sequences that have been cleaved by nucleic acid-guided nucleases, including uncharacterized nucleases.
- The methods disclosed herein include parameters for aligning next-generation sequencing (NGS) reads, obtained by targeted amplicon sequencing (TAS) of nuclease-edited or control samples, to reference amplicon sequences.
- Disclosed herein are exemplary experimental and computational data demonstrating that the methods can be used to accurately quantify indels of novel uncharacterized nucleases, for example, nucleases having unknown cleavage sites relative to the user-designated target sequence.
- Provided herein are methods for quantifying insertions and/or deletions in a polynucleotide sequence caused by cleavage of a target polynucleotide sequence by a nucleic acid-guided nuclease.
- In some embodiments, the methods include, in a computer system: receiving sample sequence data comprising a plurality of sequencing reads; filtering the plurality of sequencing reads; aligning the plurality of sequencing reads to a reference sequence; defining a window based on a sequence location within a nucleic acid guide sequence and the locations of the ends of the nucleic acid guide sequence; determining, based on the alignment of each sequencing read of the plurality of sequencing reads to the reference sequence, the number of sequencing reads comprising an insertion or deletion within the window relative to the reference sequence; and estimating, based on the number of sequencing reads comprising an insertion or deletion, the quantity of insertions and/or deletions in the polynucleotide sequence mediated by a nucleotide-directed nuclease.
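The determining and estimating steps can be illustrated with a minimal sketch. It is not the disclosed implementation: it assumes each aligned read is summarized as a 0-based reference start position plus a SAM-style CIGAR string, that the window is a 0-based half-open interval on the reference, and that the estimate is the plain fraction of reads with an indel in the window. The function names are hypothetical.

```python
import re

# SAM-style CIGAR operations: a run length followed by a one-letter code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_intervals(ref_start, cigar):
    """Yield (start, end) reference intervals touched by each indel in a read.

    Coordinates are 0-based and half-open. An insertion consumes no reference
    bases, so it is reported as a zero-length interval at its position.
    """
    pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "I":
            yield (pos, pos)
        elif op == "D":
            yield (pos, pos + n)
            pos += n
        elif op in "MN=X":
            pos += n
        # S, H and P do not consume reference bases.


def read_has_indel_in_window(ref_start, cigar, window):
    """Return True if any indel of the read overlaps the window."""
    w_start, w_end = window
    for start, end in indel_intervals(ref_start, cigar):
        if start == end:
            # Insertion: count it if it falls inside or on the window edges.
            if w_start <= start <= w_end:
                return True
        elif start < w_end and end > w_start:
            return True
    return False


def indel_frequency(alignments, window):
    """Fraction of aligned reads carrying at least one indel in the window."""
    if not alignments:
        return 0.0
    edited = sum(
        read_has_indel_in_window(start, cigar, window)
        for start, cigar in alignments
    )
    return edited / len(alignments)


# Hypothetical alignments: (reference start, CIGAR string).
alignments = [
    (0, "100M"),      # unedited read
    (0, "50M3D47M"),  # 3 bp deletion at reference positions 50-52
    (0, "40M2I58M"),  # 2 bp insertion before reference position 40
    (0, "10M1D89M"),  # deletion outside the window
]
frequency = indel_frequency(alignments, window=(30, 70))
```

Counting reads rather than individual indel events means that a read with two indels in the window contributes once, which matches the "number of sequencing reads comprising an insertion or deletion" wording above.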
- In some embodiments, the nucleic acid-guided nuclease is a Class 2 nuclease. In some embodiments, the nucleic acid-guided nuclease is a type II nuclease. In some embodiments, the nucleic acid-guided nuclease is SpCas9. In some embodiments, the nucleic acid-guided nuclease is AsCas12a. In some embodiments, the nucleic acid-guided nuclease is a type V nuclease. In some embodiments, the nucleic acid guide is a guide RNA (gRNA). In some embodiments, the nucleic acid guide is a single guide RNA (sgRNA).
- In some embodiments, the center of the window is located at the center of the nucleic acid guide sequence. In some embodiments, the center of the window is located at the site where the nuclease cleaved the polynucleotide sequence. In some embodiments, the length of the window is equivalent to the length of the nucleic acid guide sequence. In some embodiments, the 5' end of the window extends 50 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 50 basepairs 3' to the 3' end of the nucleic acid guide.
- In some embodiments, the 5' end of the window extends 40 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 40 basepairs 3' to the 3' end of the nucleic acid guide. In some embodiments, the 5' end of the window extends 30 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 30 basepairs 3' to the 3' end of the nucleic acid guide. In some embodiments, the 5' end of the window extends 20 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 20 basepairs 3' to the 3' end of the nucleic acid guide.
- In some embodiments, the 5' end of the window extends 10 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 10 basepairs 3' to the 3' end of the nucleic acid guide. In some embodiments, the 5' end of the window extends 5 basepairs 5' to the 5' end of the nucleic acid guide and the 3' end of the window extends 5 basepairs 3' to the 3' end of the nucleic acid guide.
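The window definitions above reduce to simple coordinate arithmetic. The sketch below assumes the guide is located on the reference amplicon as a 0-based half-open interval and that the window is clipped to the amplicon ends; both conventions, and the function names, are illustrative assumptions rather than details taken from the disclosure.

```python
def guide_window(guide_start, guide_end, flank=0, ref_length=None):
    """Window spanning the guide plus `flank` bases on each side.

    guide_start and guide_end are 0-based half-open coordinates of the guide
    on the reference amplicon. flank=0 gives a window equal to the guide.
    """
    if guide_end <= guide_start:
        raise ValueError("guide_end must be greater than guide_start")
    if flank < 0:
        raise ValueError("flank must be non-negative")
    start = max(0, guide_start - flank)
    end = guide_end + flank
    if ref_length is not None:
        end = min(end, ref_length)
    return start, end


def centered_window(center, length):
    """Window of a given length centered on a reference position.

    The center may be the middle of the guide or, when it is known,
    the cleavage site.
    """
    start = max(0, center - length // 2)
    return start, start + length


# A 20 nt guide at reference positions 100-119 with a 10 bp flank.
window = guide_window(100, 120, flank=10)
```

Because the window is derived only from the guide position, it can be computed without knowing where the nuclease cuts relative to the guide.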
- In some embodiments, the method further includes trimming of adapter sequences from the plurality of sequencing reads.
- In some embodiments, the plurality of sequencing reads are generated from targeted amplicon sequencing.
- In some embodiments, the plurality of sequencing reads are generated from targeted amplicon sequencing of DNA isolated from cells edited by the nucleic acid-guided nuclease.
- In some embodiments, the plurality of sequencing reads are paired-end sequencing reads.
- In some embodiments, the method further includes read merging of the paired-end sequencing reads to produce a single read for alignment to the reference sequence.
- In some embodiments, the read merging of the paired-end reads further includes applying a minimum paired-end read overlap score to the plurality of sequencing reads.
- In some embodiments, the minimum paired-end read overlap score is 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100. In some embodiments, the minimum paired-end read overlap score is 10. In some embodiments, the read merging of paired-end reads further comprises applying a maximum paired-end read overlap score to the plurality of sequencing reads. In some embodiments, the maximum paired-end read overlap score is 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300. In some embodiments, the maximum paired-end read overlap score is 100.
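As a rough illustration of read merging with overlap bounds, the sketch below treats the overlap score as the overlap length in bases and requires an exact match across the overlap. Both are simplifying assumptions: the text does not define how the score is computed, and practical read mergers tolerate mismatches and weigh base qualities. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_overlap=100):
    """Merge a read pair into one sequence, or return None if no merge is found.

    The reverse read is reverse-complemented, then the longest exact
    suffix/prefix overlap between min_overlap and max_overlap bases is used.
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc), max_overlap)
    # An overlap of zero bases cannot be verified, so start at 1 at the least.
    shortest = max(min_overlap, 1)
    for overlap in range(longest, shortest - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None


# Hypothetical 28 bp fragment sequenced with 20 bp reads from each end.
fragment = "ACGTACGTAAGGCCTTAGCATTGACCGT"
forward = fragment[:20]
reverse = reverse_complement(fragment[8:])
merged = merge_pair(forward, reverse)
```

In this example the two reads share 12 bases, so the pair merges with the default minimum of 10 and is rejected if the minimum is raised above 12.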
- In some embodiments, the filtering step further comprises applying a minimum average read quality score to the plurality of sequencing reads. In some embodiments, the minimum average read quality score is 0, 5, 10, 15, 20, 25, 30, 35, or 40. In some embodiments, the filtering step further comprises applying a minimum single basepair score to the plurality of sequencing reads. In some embodiments, the minimum single basepair score is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20. In some embodiments, the aligning step further comprises applying an amplicon minimum alignment score to the plurality of sequencing reads. In some embodiments, the amplicon minimum alignment score is 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100.
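The two read-level quality thresholds can be sketched as follows, assuming quality strings use the common Phred+33 FASTQ encoding and that a read is kept only if it satisfies both thresholds. The default values of 30 and 10 are picked from the listed ranges purely for illustration, and the function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(ch) - offset for ch in quality]


def passes_quality(quality, min_mean=30, min_base=10):
    """Apply a minimum average read quality and a minimum single-base quality."""
    scores = phred_scores(quality)
    if not scores:
        return False
    mean_quality = sum(scores) / len(scores)
    return mean_quality >= min_mean and min(scores) >= min_base


def filter_reads(records, min_mean=30, min_base=10):
    """Keep (name, sequence, quality) records that pass both thresholds."""
    return [
        record
        for record in records
        if passes_quality(record[2], min_mean, min_base)
    ]


# Hypothetical reads: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
records = [
    ("read1", "ACGT", "IIII"),  # mean 40, minimum 40: kept
    ("read2", "ACGT", "III#"),  # mean 30.5, minimum 2: fails single-base check
    ("read3", "ACGT", "5555"),  # mean 20: fails average check
]
kept = filter_reads(records)
```

Setting either threshold to 0 disables that check, which is consistent with 0 appearing among the listed values.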
- In some embodiments, the nucleic acid-guided nuclease is a FokI nuclease. In some embodiments, the FokI nuclease is fused to a transcription activator-like (TAL) protein. In some embodiments, the nucleic acid-guided nuclease is a zinc-finger nuclease.
- Also provided herein is a computer program product tangibly embodied on a computer-readable medium, comprising instructions that when executed by one or more processors are configured to: receive sample sequence data comprising a plurality of sequencing reads; filter the plurality of sequencing reads; align the plurality of sequencing reads to a reference sequence; define a window based on a sequence location within a nucleic acid guide sequence and the locations of the ends of the nucleic acid guide sequence; determine, based on the alignment of each sequencing read of the plurality of sequencing reads to the reference sequence, the number of sequencing reads comprising an insertion or deletion within the window relative to the reference sequence; and estimate, based on the number of sequencing reads comprising an insertion or deletion, the quantity of insertions and/or deletions in the polynucleotide sequence mediated by a nucleotide-directed nuclease.
- FIG. 1A is a cartoon schematic of a Type II nuclease, featuring an sgRNA interacting with a target DNA sequence.
- PAM (NGG) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
- FIG. 1B is a cartoon schematic of a Type V nuclease, featuring a gRNA interacting with a target DNA sequence.
- PAM (TTTN) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
- FIG. 2A is a diagram of an upstream molecular biology workflow for the methods disclosed herein. Steps include: design primers to amplify editing target; isolate cellular DNA; PCR-amplify editing target with adapters; add next-generation sequencing (NGS) adapters to PCR products; pool barcoded amplicons; bead-purify amplicons; spike in diversity DNA; perform NGS on sequencing instrument.
- FIG. 2B is a diagram of the computational pipeline disclosed herein. Steps include quality control of received sequencing reads; alignment of sequencing reads to a target sequence; and quantification of indels according to the methods disclosed herein.
- FIG. 3A is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
- the nuclease is Cas9, a Type II nuclease with a known cleavage site relative to the guide sequence. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. The previously known cleavage site is indicated by a rectangular box.
- FIG. 3B is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
- the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Several indels are missed by standard computational processing methods.
- FIG. 3C is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
- the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Custom data processing methods capture all indels generated by cleavage by the nuclease.
- FIG. 4A is a plot comparing a standard data processing pipeline for a known nuclease, Cas9, with the data processing pipeline disclosed herein for Cas9.
- FIG. 4B is schematic of a dilution experiment to validate the data processing pipeline disclosed herein, wherein nucleic acid-guided nuclease-edited target sequence is serially diluted with non-edited target sequence in order to evaluate the efficacy of the data processing methods disclosed herein.
- FIG. 4C is a plot of expected indel percentage (x-axis) vs. observed indel percentage (y-axis) for the experiment depicted in the schematic of FIG. 4B.
- FIG. 5 is a diagram of computer system components that can be used to implement a computational pipeline for indel quantification for polynucleotide sequences that have been cleaved by nucleic-acid guided nucleases, including uncharacterized nucleases.
- FIG. 6 is a plot of indel percentage (y-axis) for two nucleases (x-axis): Cas9 (reference) and novel Type V nuclease with escalating doses of the novel nuclease using two different guide sequences, A1 and F2.
- FIG. 7A is a schematic of an experiment where frequency of a 21-nucleotide insertion was quantified by the computational pipeline disclosed herein.
- FIG. 7B is a schematic representation of received sequencing reads of the amplicons depicted in FIG. 7A, comprising a plurality of sequencing reads, a subset of which contain the 21-nucleotide insertion in a spike-in dilution experiment.
- FIG. 7C is a schematic representation of received sequencing reads of the amplicons depicted in FIG. 7A, comprising a plurality of sequencing reads, a subset of which contain the 21-nucleotide insertion in a spike-in dilution experiment.
- FIG. 7D is a schematic representation of received sequencing reads of the amplicons depicted in FIG. 7A, comprising a plurality of sequencing reads, a subset of which contain the 21-nucleotide insertion in a spike-in dilution experiment.
- As used herein, the terms “nucleic acid,” “polynucleotide,” and “oligonucleotide” are interchangeable and refer to a deoxyribonucleotide or ribonucleotide polymer, in linear or circular conformation, and in either single- or double-stranded form. For the purposes of the present disclosure, these terms are not to be construed as limiting with respect to the length of the polymer.
- the terms can encompass known analogues of natural nucleotides, as well as nucleotides that are modified in the base, sugar and/or phosphate moieties (e.g., phosphorothioate backbones). In general and unless otherwise specified, an analogue of a particular nucleotide has the same base-pairing specificity; i.e., an analogue of A will base-pair with T.
- CRISPR refers to clustered regularly interspaced short palindromic repeats or any of the DNA loci that serve to direct CRISPR-associated proteins or similar nucleotide-directed nucleases. It also describes man-made, constructed, or selected systems derived using these frameworks or proteins. CRISPR systems and the related proteins vary among the currently described type I, type II and type III systems, though it is possible other analogous systems have yet to be described.
- CRISPR system refers collectively to transcripts and other elements involved in the expression of or directing the activity of CRISPR-associated (“Cas”) genes, including sequences encoding a Cas gene, a tracr (trans-activating CRISPR) sequence (e.g., tracrRNA or an active partial tracrRNA), a tracr-mate sequence (encompassing a "direct repeat” and a tracrRNA-processed partial direct repeat in the context of an endogenous CRISPR system), a guide sequence (also referred to as a "spacer” in the context of an endogenous CRISPR system), and other sequences and transcripts from a CRISPR locus.
- One or more tracr mate sequences operably linked to a guide sequence can also be referred to as precrRNA (pre-CRISPR RNA) before processing or crRNA after processing by a nuclease.
- CRISPR systems can also include modified, swapped or engineered guide, tracr or chimeric RNA sequences and the protein with which they interact (for example, Briner, et al., Mol Cell 56(2):333-9 (2014)).
- the methods disclosed herein may also be applicable to other, non-CRISPR nucleotide-directed nucleases.
- the term “guide sequence” refers to the portion of, for example, a guide RNA (gRNA) or single guide RNA (sgRNA) that confers the specificity of the nucleic acid-guided nuclease to its target, and that mediates the formation of the RNA-DNA duplex between the targeting RNA and the target DNA sequence.
- the targeting specificity of a CRISPR-Cas9 complex is determined by the approximately 20-nt sequence at the 5' end of the gRNA.
- the length of a guide sequence is typically between 17 and 24 bp.
- “center of the guide sequence” refers to the midpoint of the guide sequence.
- cleavage refers to the breakage of the covalent backbone of a nucleic acid molecule. Cleavage can be initiated by a variety of methods including, but not limited to, enzymatic or chemical hydrolysis of a phosphodiester bond.
- cleavage refers to the double-stranded cleavage between nucleic acids within a double-stranded DNA or RNA chain.
- “genomic region” or “genomic segment”, as used interchangeably herein, denote a contiguous length of nucleotides in a genome of an organism.
- a genomic region may be of a length as small as a few kb (e.g., at least 5 kb, at least 10 kb or at least 20 kb), up to an entire chromosome or more.
- nucleotide sequences are provided using character representations recommended by the International Union of Pure and Applied Chemistry (IUPAC) or a subset thereof.
- the set {A, C, G, T, U} for adenosine, cytidine, guanosine, thymidine, and uridine respectively.
- the set {A, C, G, T, U, I, X, Ψ} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, and pseudouridine respectively.
- the set of characters is {A, C, G, T, U, I, X, Ψ, R, Y, N} for adenosine, cytidine, guanosine, thymidine, uridine, inosine, xanthosine, pseudouridine, unspecified purine, unspecified pyrimidine, and unspecified nucleotide respectively.
- the modified sequences, non-natural sequences, or sequences with modified binding, may be in the genomic, the guide or the tracr sequences.
- Nucleotide and/or amino acid sequence identity percent is understood as the percentage of nucleotide or amino acid residues that are identical with nucleotide or amino acid residues in a candidate sequence in comparison to a reference sequence when the two sequences are aligned. To determine percent identity, sequences are aligned and if necessary, gaps are introduced to achieve the maximum percent sequence identity. Sequence alignment procedures to determine percent identity are well known to those of skill in the art. Often publicly available computer software such as BLAST, BLAST2, ALIGN2 or MEGALIGN (DNASTAR) software is used to align sequences. Those skilled in the art can determine appropriate parameters for measuring alignment, including any algorithms needed to achieve maximal alignment over the full-length of the sequences being compared.
- mutation encompasses any change in a DNA, RNA, or protein sequence from the wild type sequence or some other reference, including without limitation point mutations, transitions, insertions, transversions, translocations, deletions, inversions, duplications, recombinations, or combinations thereof.
- “insertion” is used when the polynucleotide sequence has one or more extra bases compared with the polynucleotide sequence before cleavage by the RNA-guided nuclease occurred.
- “deletion” is used when the polynucleotide sequence has one or more missing bases compared with the polynucleotide sequence before cleavage by the RNA-guided nuclease occurred.
- “indel” indicates either insertions or deletions. Cleavage by an RNA-guided nuclease can result in multiple indels, multiple insertions, multiple deletions or combinations of insertions of one or more nucleotides and deletions of one or more nucleotides.
- the CRISPR/Cas system has been adapted for gene editing (silencing, enhancing or changing specific genes) in eukaryotes (see, for example, Cong, Science, 339(6121):819-823 (2013) and Jinek, et al., Science, 337(6096):816-21 (2012)).
- By transfecting a cell with the required elements including a cas gene and specifically designed CRISPRs, a polynucleotide sequence can be cut and modified at virtually any desired location by unique targeting by, for example, a guide RNA that confers specificity to the nuclease.
- a number of methods exist for introducing the guide strand and Cas protein into cells including viral transduction, injection or micro-injection, nano-particle or other delivery, uptake of proteins, uptake of RNA or DNA, uptake of combination of protein and RNA or DNA. Combinations of methods can also be used, simultaneously or in sequence.
- RNA, DNA or protein can occur with or without further protein expression.
- Methods of preparing compositions for use in genome editing using the CRISPR/Cas systems are described in detail in WO 2013/176772 and WO 2014/018423, which are specifically incorporated by reference herein in their entireties.
- the nuclease for use in the methods described herein is a Class 2 Cas nuclease.
- the nuclease has double-strand endonuclease activity.
- the nuclease comprises a Cas nuclease, such as a Class 2 Cas nuclease (which may be, e.g., a Cas nuclease of Type II, V, or VI).
- Class 2 Cas nucleases include, for example, Cas9, Cpf1, C2c1, C2c2, and C2c3 proteins and modifications thereof. Examples of Cas9 nucleases include those of the type II CRISPR systems of S. pyogenes, S.
- FIG. 1A shows a Type II nuclease, featuring an sgRNA interacting with a target DNA sequence.
- PAM (NGG) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
- Cas nucleases include a Csm or Cmr complex of a type III CRISPR system or the Cas10, Csm1, or Cmr2 subunit thereof; and a Cascade complex of a type I CRISPR system, or the Cas3 subunit thereof.
- FIG. 1B shows a Type V nuclease, featuring a gRNA interacting with a target DNA sequence. PAM (TTTN) sequence, guide sequence, target DNA sequence, and cleavage site are indicated.
- the Cas nuclease may be from a Type-IIA, Type-IIB, or Type-IIC system.
- the RNA-guided DNA binding agent is a Cas nickase, e.g. a Cas9 nickase.
- the RNA-guided DNA binding agent is an S. pyogenes Cas9 nuclease.
- Non-limiting exemplary species that the nuclease can be derived from include Streptococcus pyogenes, Streptococcus thermophilus, Streptococcus sp., Staphylococcus aureus, Listeria innocua, Lactobacillus gasseri, Francisella novicida, Wolinella succinogenes, Sutterella wadsworthensis, Gammaproteobacterium, Neisseria meningitidis, Campylobacter jejuni, Pasteurella multocida, Fibrobacter succinogene, Rhodospirillum rubrum, Nocardiopsis rougevillei, Streptomyces pristinaespiralis, Streptomyces viridochromogenes, Streptosporangium roseum, AU
- the Cas nuclease is the Cas9 nuclease from Streptococcus pyogenes. In some embodiments, the Cas nuclease is the Cas9 nuclease from Streptococcus thermophilus. In some embodiments, the Cas nuclease is the Cas9 nuclease from Neisseria meningitidis. In some embodiments, the Cas nuclease is the Cas9 nuclease from Staphylococcus aureus. In some embodiments, the Cas nuclease is the Cpf1 nuclease from Francisella novicida.
- the Cas nuclease is the Cpf1 nuclease from Acidaminococcus sp. In some embodiments, the Cas nuclease is the Cpf1 nuclease from Lachnospiraceae bacterium ND2006.
- the Cas nuclease is the Cpf1 nuclease from Francisella tularensis, Lachnospiraceae bacterium, Butyrivibrio proteoclasticus, Peregrinibacteria bacterium, Parcubacteria bacterium, Smithella, Acidaminococcus, Candidatus Methanoplasma termitum, Eubacterium eligens, Moraxella bovoculi, Leptospira inadai, Porphyromonas crevioricanis, Prevotella disiens, or Porphyromonas macacae.
- the Cas nuclease is a Cpf1 nuclease from an Acidaminococcus or Lachnospiraceae.
- Wild type Cas9 has two nuclease domains: RuvC and HNH.
- the RuvC domain cleaves the non-target DNA strand
- the HNH domain cleaves the target strand of DNA.
- the Cas9 nuclease comprises more than one RuvC domain and/or more than one HNH domain.
- the Cas9 nuclease is a wild type Cas9.
- the Cas9 is capable of inducing a double strand break in target DNA.
- the Cas nuclease can cleave one or both strands of dsDNA.
- the Cas nuclease can cleave a single strand of DNA.
- the Cas nuclease may not have DNA nickase activity.
- chimeric Cas nucleases are used, where one domain or region of the protein is replaced by a portion of a different protein.
- a Cas nuclease domain may be replaced with a domain from a different nuclease such as FokI.
- a Cas nuclease may be a modified nuclease, wherein the polypeptide sequence of the nuclease has been modified to confer, in some examples, advantageous properties to the nuclease.
- the cleavage site of the nuclease relative to the location of the user-designated target sequence is unknown.
- the Cas nuclease may be from a Type-I CRISPR/Cas system. In some embodiments, the Cas nuclease may be a component of the Cascade complex of a Type-I CRISPR/Cas system. In some embodiments, the Cas nuclease may be a Cas3 protein. In some embodiments, the Cas nuclease may be from a Type-III CRISPR/Cas system. In some embodiments, the Cas nuclease may have an RNA cleavage activity.
- a data processing pipeline including the steps of receiving sample sequence data; applying quality control filters to the sample sequence data; aligning sequencing reads of the sample sequence data to a reference sequence; defining a window based on a sequence location within a nucleic acid guide and the locations of the ends of the nucleic acid guide sequence; and quantifying indels in the sample sequence data.
- the plurality of sequencing reads is obtained from a next-generation sequencing (NGS) instrument.
- the NGS instrument is an Illumina MiSeq™ machine.
- sequencing and amplification adapters are trimmed by default, and so the returned NGS reads data do not have adapter sequences in the reads.
- sequencing and amplification adapters are trimmed as a part of the data processing pipeline of the methods disclosed herein.
- FIG. 2A shows steps of an upstream molecular biology workflow that can be performed to generate data that is subsequently processed by the data processing pipeline disclosed herein.
- Steps of the upstream molecular biology workflow can include designing primers to amplify an editing target; isolating cellular DNA; PCR-amplifying the editing target with adapters; adding next-generation sequencing (NGS) adapters to PCR products; pooling barcoded amplicons; bead-purifying amplicons; spiking in diversity DNA; performing next-generation sequencing (NGS) on a sequencing instrument.
- FIG. 2B shows steps of the computational pipeline disclosed herein. Steps can include quality control of received sequencing reads; alignment of sequencing reads to a target sequence; and quantification of indels according to the methods disclosed herein. These steps are described in further detail below.
- FIG. 3A is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
- the nuclease is Cas9, a Type II nuclease with a known cleavage site relative to the guide sequence.
- the number of sequencing reads comprising that indel is indicated at right.
- the previously known cleavage site is indicated by a rectangular box.
- FIG. 3B is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
- the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Several indels are missed by standard computational processing methods.
- FIG. 3C is a schematic representation of received sequencing reads of nucleic-acid guided nuclease edited sample, comprising a plurality of sequencing reads, a subset of which contain indels as a result of cleavage by the nucleic-acid guided nuclease.
- the nuclease is a novel Type V nuclease. For each type of indel, the number of sequencing reads comprising that indel is indicated at right. Custom data processing methods capture all indels generated by cleavage by the nuclease.
- the data processing pipeline disclosed herein includes one or more modules executed by the package of CRISPResso software tools (Clement K, et al. CRISPResso2 provides accurate and rapid genome editing sequence analysis. Nat Biotechnol. 2019 Mar;37(3):224-226., and Canver MC, et al. Integrated design, execution, and analysis of arrayed and pooled CRISPR genome-editing experiments. Nat Protoc. 2018 May;13(5):946-986., incorporated herein by reference in their entirety).
- one or more read filtering parameters are applied to sample sequence data using CRISPResso in order to remove potentially false-positive indels from the sample sequence data, in order to improve the accuracy of the estimation of the frequency of indels in the sample sequence data.
- Read filtering parameters are described in further detail below.
- read filtering is performed based on PHRED quality scores, which are described in Ewing B, Green P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998 Mar;8(3):186-94., which is incorporated by reference herein in its entirety.
- PHRED quality scores measure the quality of the identification of nucleotide base calls in sequencing reads generated by automated DNA sequence instruments.
- minimum average read quality (“q” or “min_average_read_quality”) is applied to filter sample sequence data, in order to remove potentially false-positive indels.
- This parameter allows for the specification of the minimum average quality score for inclusion of a read in subsequent analysis.
- the PHRED score represents the confidence in the assignment of a particular nucleotide in a read.
- the maximum score of 40 corresponds to an error rate of 0.01%. This average quality of a read is useful to filter out low-quality reads.
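- The relationship between a PHRED score and an error rate can be illustrated with a short sketch. It uses the standard definition P = 10^(-Q/10) and assumes the common PHRED+33 ASCII encoding of FASTQ quality strings; the function names are illustrative and are not part of any tool named in this disclosure.

```python
def phred_to_error_probability(q):
    """Convert a PHRED quality score Q to a base-call error probability.

    Uses the standard definition P = 10 ** (-Q / 10), so Q = 40 gives 0.0001 (0.01%).
    """
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming PHRED+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def mean_read_quality(quality_string):
    """Average PHRED score across all bases of one read."""
    scores = decode_phred33(quality_string)
    return sum(scores) / len(scores)
```

Under this encoding the character `I` corresponds to a score of 40, so a read whose quality string consists only of `I` has an average quality of 40.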
- a “min_average_read_quality” value of 0, 5, 10, 15, 20, 25, 30, 35, or 40, is applied to sample sequence data.
- minimum single basepair score (“s” or “min_single_bp_quality”) is applied to filter sample sequence data, in order to remove potentially false-positive indels.
- This parameter allows for the specification of the minimum single-bp score for inclusion of a read in subsequent analysis. This parameter provides for more-stringent filtering; any read with a single-bp quality below the threshold will be discarded.
- a “min_single_bp_quality” value of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20, is applied to sample sequence data.
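- The two read-level quality filters described above can be sketched as a single predicate. This is a minimal illustration that assumes per-base PHRED scores have already been decoded into a list of integers; it is not the CRISPResso implementation, and the function name is hypothetical.

```python
def passes_quality_filters(scores, min_average_read_quality=0, min_single_bp_quality=0):
    """Return True if a read passes both quality thresholds.

    scores: per-base PHRED scores for one read (list of numbers).
    A read fails if its mean score is below min_average_read_quality,
    or if any single base scores below min_single_bp_quality.
    """
    if not scores:
        return False  # an empty read carries no usable information
    if sum(scores) / len(scores) < min_average_read_quality:
        return False
    if min(scores) < min_single_bp_quality:
        return False
    return True
```

A read with scores `[40, 40, 2]` has a mean of about 27.3, so it passes an average threshold of 20 but is discarded once a single-bp threshold of 5 is applied, which illustrates why the single-bp filter is the more stringent of the two.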
- an amplicon minimum alignment score (“amas” or “amplicon_min_alignment_score”) is applied to filter sample sequence data. After reads are aligned to a reference sequence, the homology is calculated as the number of basepairs they have in common. This is useful for filtering erroneous reads that do not align to the target sequence, for example arising from alternate primer locations.
- an “amplicon_min_alignment_score” value of 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100, is applied to sample sequence data.
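- One way to picture the alignment-score filter is as a percentage of matching positions between an aligned read and the aligned reference. The sketch below assumes the read and reference have already been pairwise aligned into equal-length strings with `-` marking gaps, and that the score is expressed on a 0 to 100 scale; the exact scoring used by CRISPResso may differ, and the function names are illustrative.

```python
def alignment_identity_score(aligned_read, aligned_ref):
    """Percentage of alignment columns where read and reference carry the same base.

    Both inputs are equal-length strings from a pairwise alignment; '-' marks a gap.
    Gap columns never count as matches.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_ref:
        return 0.0
    matches = sum(1 for r, t in zip(aligned_read, aligned_ref) if r == t and r != "-")
    return 100.0 * matches / len(aligned_ref)


def passes_alignment_filter(aligned_read, aligned_ref, amplicon_min_alignment_score=60):
    """Keep a read only if its identity score reaches the chosen threshold."""
    return alignment_identity_score(aligned_read, aligned_ref) >= amplicon_min_alignment_score
```

The default threshold of 60 here is a placeholder chosen from the listed values, not a recommendation from the disclosure.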
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score,” further in combination with a window size based on a sequence location within a guide sequence, is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
- amplicon_min_alignment_score is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
- combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, and an “amplicon_min_alignment_score,” further in combination with a window size based on a sequence location of a known insertion site, is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease and insertion of a polynucleotide sequence.
- combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
- the data processing pipeline disclosed herein includes one or more modules executed by the package of FLASH software tools (Magoc T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011 Nov 1;27(21):2957-63., incorporated by reference herein in its entirety).
- FLASH is a rapid and accurate software tool to merge paired-end reads from next-generation sequencing experiments, designed to merge pairs of reads when the original DNA fragments are shorter than twice the length of reads. The resulting longer reads can significantly improve genome assemblies.
- FLASH calculates a mismatch ratio within two overlapped regions.
- FLASH determines that the reads are an incorrect overlap.
- paired-end reads are merged using FLASH in order to produce single reads for alignment to the target reference sequence and to reduce sequencing errors that may be present at the ends of sequencing reads.
- a maximum paired-end reads overlap (“max_paired_end_reads_overlap”) is applied to paired-end sample sequence data for the FLASH read merging step of the data processing pipeline. This parameter represents the maximum overlap length expected in approximately 90% of read pairs.
- a “max_paired_end_reads_overlap” value of 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, or 300 is applied to sample sequence data.
- a minimum paired-end reads overlap (“min_paired_end_reads_overlap”) is applied to paired-end sample sequence data for the FLASH read merging step of the data processing pipeline. This parameter represents the minimum required overlap length between two reads to provide a confident overlap.
- a “min_paired_end_reads_overlap” value of 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, or 100, is applied to sample sequence data.
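- The idea of merging a read pair by overlap can be sketched as follows. This is a simplified illustration and not the FLASH algorithm: it ignores base qualities, treats the maximum overlap as a hard cap (whereas the parameter above is described as the overlap expected in approximately 90% of pairs), and uses a hypothetical `max_mismatch_ratio` argument to stand in for the mismatch ratio check.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string over the alphabet A, C, G, T, N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq))


def merge_pair(read1, read2, min_overlap=10, max_overlap=100, max_mismatch_ratio=0.25):
    """Merge a forward read and a reverse read into one sequence, or return None.

    Every overlap length from max_overlap down to min_overlap is tried; the
    overlap with the lowest mismatch ratio wins, with ties going to the longer
    overlap. Pairs whose best ratio exceeds max_mismatch_ratio are not merged.
    """
    r2 = reverse_complement(read2)
    upper = min(max_overlap, len(read1), len(r2))
    best = None  # (mismatch_ratio, overlap_length)
    for overlap in range(upper, max(min_overlap, 1) - 1, -1):
        tail, head = read1[-overlap:], r2[:overlap]
        ratio = sum(a != b for a, b in zip(tail, head)) / overlap
        if ratio <= max_mismatch_ratio and (best is None or ratio < best[0]):
            best = (ratio, overlap)
    if best is None:
        return None
    # Keep read1 in full and append the part of read2 beyond the overlap.
    return read1 + r2[best[1]:]
```

For a 30-nt fragment sequenced as two 20-nt reads, the true overlap is 10 nt, and the merged output reconstructs the original fragment when that overlap is the best-scoring candidate.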
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” further in combination with a window size based on a sequence location within a guide sequence is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
- combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease.
- a combination of a “min_average_read_quality” score, a “min_single_bp_quality” score, a “min_paired_end_reads_overlap,” and a “max_paired_end_reads_overlap” further in combination with a window size based on a sequence location of a known insertion site is applied to filter sample sequence data in order to improve the estimation of indel frequency in sample sequence data after cleavage of a target polynucleotide sequence by a nuclease and insertion of a polynucleotide sequence.
- combinations of these parameters may be used to enhance the accuracy of indel quantification according to the methods disclosed herein.
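- As an illustration of how such a combination of parameters might be passed to CRISPResso, the sketch below assembles a command line as a Python list. The option names mirror the parameter names quoted above with a leading `--`, as documented for CRISPResso2; their availability may vary between versions and should be checked against the installed release. The helper name, file names, sequences, and default values are placeholders, not values taken from this disclosure.

```python
def build_crispresso_command(fastq_r1, fastq_r2, amplicon_seq, guide_seq,
                             min_average_read_quality=30,
                             min_single_bp_quality=10,
                             amplicon_min_alignment_score=60,
                             min_paired_end_reads_overlap=10,
                             max_paired_end_reads_overlap=100):
    """Assemble (but do not run) a CRISPResso command with the filtering parameters."""
    return [
        "CRISPResso",
        "--fastq_r1", fastq_r1,
        "--fastq_r2", fastq_r2,
        "--amplicon_seq", amplicon_seq,
        "--guide_seq", guide_seq,
        "--min_average_read_quality", str(min_average_read_quality),
        "--min_single_bp_quality", str(min_single_bp_quality),
        "--amplicon_min_alignment_score", str(amplicon_min_alignment_score),
        "--min_paired_end_reads_overlap", str(min_paired_end_reads_overlap),
        "--max_paired_end_reads_overlap", str(max_paired_end_reads_overlap),
    ]
```

The returned list can be handed to `subprocess.run` on a system where CRISPResso is installed; building the list separately makes it easy to log or inspect the exact parameters used for each sample.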
- the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 5' to the 5' end of the guide sequence. In some embodiments, the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 3' to the 3' end of the guide sequence. In some embodiments, the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 5' to a known insertion site. In some embodiments, the quantification window extends 50, 40, 30, 20, 10, or 5 nucleotides 3' to a known insertion site. In some embodiments, the quantification window is 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, or 500 nucleotides in length.
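- The window definition and the read-counting step can be sketched together. The example assumes the guide lies on the forward strand of the reference, that coordinates are 0-based with half-open intervals, and that each read has already been pairwise aligned to the reference with `-` marking gaps. All function names are illustrative; this is a sketch of the general approach, not the implementation used in the disclosed pipeline.

```python
def quantification_window(guide_start, guide_end, flank_5p=10, flank_3p=10, ref_length=None):
    """Half-open [start, end) reference interval spanning the guide plus flanks.

    guide_start/guide_end: 0-based, half-open guide coordinates on the reference.
    flank_5p/flank_3p: nucleotides added 5' of the 5' end and 3' of the 3' end.
    """
    start = max(0, guide_start - flank_5p)
    end = guide_end + flank_3p
    if ref_length is not None:
        end = min(end, ref_length)
    return start, end


def indel_positions(aligned_read, aligned_ref):
    """List (kind, reference_position) for every gap column in an alignment."""
    events, ref_pos = [], 0
    for r, t in zip(aligned_read, aligned_ref):
        if t == "-":
            # Gap in the reference: the read has an extra base before ref_pos.
            events.append(("insertion", ref_pos))
        else:
            if r == "-":
                # Gap in the read: the reference base at ref_pos is missing.
                events.append(("deletion", ref_pos))
            ref_pos += 1
    return events


def read_has_indel_in_window(aligned_read, aligned_ref, window):
    start, end = window
    return any(start <= pos < end for _, pos in indel_positions(aligned_read, aligned_ref))


def indel_percentage(alignments, window):
    """Percentage of aligned reads carrying at least one indel inside the window."""
    if not alignments:
        return 0.0
    edited = sum(read_has_indel_in_window(r, t, window) for r, t in alignments)
    return 100.0 * edited / len(alignments)
```

Because the window is anchored to the ends of the guide rather than to a predicted cut site, indels are counted wherever they fall within the guide and its flanks, which is the property that matters when the cleavage position of a nuclease is not known in advance.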
- the indel to be detected is about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more nucleotides in length. In some embodiments, the indel to be detected is about 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more nucleotides in length. In some embodiments, the indel comprises a CRISPR-mediated donor insertion. In some embodiments, the indel is the result of homology directed repair (HDR). In some embodiments, the indel is the result of insertion by a CRISPR-associated transposase (CAST). In some embodiments, the insertion comprises sequences from a genomic library.
- FIG. 5 is a diagram of computer system 500 components that can be used to implement a computational pipeline for indel quantification for polynucleotide sequences that have been cleaved by nucleic-acid guided nucleases, including uncharacterized nucleases.
- Computer system 500 can be used to implement methods that include parameters for alignment of next-generation sequencing (NGS) reads from targeted amplicon sequencing (TAS) of nuclease-edited or control samples that have been obtained and aligned to reference amplicon sequences.
- Computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
- Computing device 550 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. Additionally, computing device 500 or 550 can include Universal Serial Bus (USB) flash drives.
- USB flash drives can store operating systems and other applications.
- the USB flash drives can include input/output components, such as a wireless transmitter or USB connector that can be inserted into a USB port of another computing device.
- the components shown here, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the methods and compositions described and/or claimed in this document.
- Computing device 500 includes a processor 502, memory 504, a storage device 506, a high-speed interface 508 connecting to memory 504 and high-speed expansion ports 510, and a low speed interface 512 connecting to low speed bus 514 and storage device 506.
- Each of the components 502, 504, 506, 508, 510, and 512 are interconnected using various busses, and can be mounted on a common motherboard or in other manners as appropriate.
- the processor 502 can process instructions for execution within the computing device 500, including instructions stored in the memory 504 or on the storage device 506 to display graphical information for a GUI on an external input/output device, such as display 516 coupled to high speed interface 508.
- multiple processors and/or multiple buses can be used, as appropriate, along with multiple memories and types of memory.
- multiple computing devices 500 can be connected, with each device providing portions of the necessary operations, e.g., as a server bank, a group of blade servers, or a multi-processor system.
- the memory 504 stores information within the computing device 500.
- the memory 504 is a volatile memory unit or units.
- the memory 504 is a non-volatile memory unit or units.
- the memory 504 can also be another form of computer-readable medium, such as a magnetic or optical disk.
- the storage device 506 is capable of providing mass storage for the computing device 500.
- the storage device 506 can be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, a tape device, a flash memory or other similar solid-state memory device, or an array of devices, including devices in a storage area network or other configurations.
- a computer program product can be tangibly embodied in an information carrier.
- the computer program product can also contain instructions that, when executed, perform one or more methods, such as those described above.
- the information carrier is a computer- or machine-readable medium, such as the memory 504, the storage device 506, or memory on processor 502.
- the high-speed controller 508 manages bandwidth-intensive operations for the computing device 500, while the low speed controller 512 manages lower bandwidth intensive operations. Such allocation of functions is only an example.
- the high-speed controller 508 is coupled to memory 504, display 516, e.g., through a graphics processor or accelerator, and to high-speed expansion ports 510, which can accept various expansion cards (not shown).
- low-speed controller 512 is coupled to storage device 506 and low-speed expansion port 514.
- the low-speed expansion port, which can include various communication ports, e.g., USB, Bluetooth, Ethernet, or wireless Ethernet, can be coupled to one or more input/output devices, such as a keyboard, a pointing device, a microphone/speaker pair, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
- the computing device 500 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a standard server 520, or multiple times in a group of such servers. It can also be implemented as part of a rack server system 524. In addition, it can be implemented in a personal computer such as a laptop computer 522.
- components from computing device 500 can be combined with other components in a mobile device (not shown), such as device 550.
- Each of such devices can contain one or more of computing device 500, 550, and an entire system can be made up of multiple computing devices 500, 550 communicating with each other.
- Computing device 550 includes a processor 552, memory 564, and an input/output device such as a display 554, a communication interface 566, and a transceiver 568, among other components.
- the device 550 can also be provided with a storage device, such as a micro-drive or other device, to provide additional storage.
- Each of the components 550, 552, 564, 554, 566, and 568 are interconnected using various buses, and several of the components can be mounted on a common motherboard or in other manners as appropriate.
- the processor 552 can execute instructions within the computing device 550, including instructions stored in the memory 564.
- the processor can be implemented as a chipset of chips that include separate and multiple analog and digital processors. Additionally, the processor can be implemented using any of a number of architectures.
- the processor 552 can be a CISC (Complex Instruction Set Computer) processor, a RISC (Reduced Instruction Set Computer) processor, or a MISC (Minimal Instruction Set Computer) processor.
- the processor can provide, for example, for coordination of the other components of the device 550, such as control of user interfaces, applications run by device 550, and wireless communication by device 550.
- Processor 552 can communicate with a user through control interface 558 and display interface 556 coupled to a display 554.
- the display 554 can be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display), an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
- the display interface 556 can comprise appropriate circuitry for driving the display 554 to present graphical and other information to a user.
- the control interface 558 can receive commands from a user and convert them for submission to the processor 552.
- an external interface 562 can be provided in communication with processor 552, so as to enable near area communication of device 550 with other devices.
- External interface 562 can provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces can also be used.
- the memory 564 stores information within the computing device 550.
- the memory 564 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
- Expansion memory 574 can also be provided and connected to device 550 through expansion interface 572, which can include, for example, a SIMM (Single In Line Memory Module) card interface.
- expansion memory 574 can provide extra storage space for device 550, or can also store applications or other information for device 550.
- expansion memory 574 can include instructions to carry out or supplement the processes described above, and can also include secure information.
- expansion memory 574 can be provided as a security module for device 550, and can be programmed with instructions that permit secure use of device 550.
- secure applications can be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
- the memory can include, for example, flash memory and/or NVRAM memory, as discussed below.
- a computer program product is tangibly embodied in an information carrier.
- the computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
- the information carrier is a computer- or machine-readable medium, such as the memory 564, expansion memory 574, or memory on processor 552 that can be received, for example, over transceiver 568 or external interface 562.
- Device 550 can communicate wirelessly through communication interface 566, which can include digital signal processing circuitry where necessary. Communication interface 566 can provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication can occur, for example, through radio-frequency transceiver 568. In addition, short-range communication can occur, such as using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 570 can provide additional navigation- and location-related wireless data to device 550, which can be used as appropriate by applications running on device 550.
- Device 550 can also communicate audibly using audio codec 560, which can receive spoken information from a user and convert it to usable digital information. Audio codec 560 can likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 550. Such sound can include sound from voice telephone calls, can include recorded sound, e.g., voice messages, music files, etc. and can also include sound generated by applications operating on device 550.
- the computing device 550 can be implemented in a number of different forms, as shown in the figure. For example, it can be implemented as a cellular telephone 580. It can also be implemented as part of a smartphone 582, personal digital assistant, or other similar mobile device.
- implementations of the systems and methods described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations of such implementations.
- These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
- the systems and techniques described here can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- the systems and techniques described here can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the systems and techniques described here, or any combination of such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), and the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- a number of embodiments have been described. Nevertheless, it will be understood that various modifications can be made without departing from the spirit and scope of the invention.
- the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results.
- other steps can be provided, or steps can be eliminated, from the described flows, and other components can be added to, or removed from, the described systems. Accordingly, other embodiments are within the scope of the following claims.
- Embodiments of the disclosure and all of the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
- Embodiments of the methods and compositions can be implemented as one or more computer program products, e.g., one or more modules of computer program instructions encoded on a computer readable medium for execution by, or to control the operation of, data processing apparatus.
- the computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
- data processing apparatus encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
- the apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
- a propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus.
- a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
- a computer program does not necessarily correspond to a file in a file system.
- a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code).
- a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
- the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
- the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
- processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
- a processor will receive instructions and data from a read only memory or a random access memory or both.
- the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
- a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks.
- a computer need not have such devices.
- a computer can be embedded in another device, e.g., a tablet computer, a mobile telephone, a personal digital assistant (PDA), a mobile audio player, a Global Positioning System (GPS) receiver, to name just a few.
- Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks.
- the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
- embodiments of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
- Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
- Embodiments of the disclosure can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the methods, or any combination of one or more such back end, middleware, or front end components.
- the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
- the computing system can include clients and servers.
- a client and server are generally remote from each other and typically interact through a communication network.
- the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
- While this specification contains many specifics, these should not be construed as limitations on the scope of the invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
- In each instance where an HTML file is mentioned, other file types or formats may be substituted. For instance, an HTML file may be replaced by an XML, JSON, plain text, or other type of file. Moreover, where a table or hash table is mentioned, other data structures (such as spreadsheets, relational databases, or structured files) may be used.
- EXAMPLE 1 Benchmarking the indel quantification method using a well-characterized nuclease
- the data processing pipeline is applied to data generated by next-generation sequencing of mammalian cells edited by SpCas9, a Type II nuclease with a known cleavage site and editing pattern.
- the percentage of cells estimated to comprise an indel at the target DNA sequence after editing is determined by standard CRISPResso2 processing parameters, and by the methods disclosed herein. As shown in FIG.
- the methods disclosed herein perform similarly to the standard CRISPResso2 processing parameters, indicating that the method does not over- or underestimate the percentage of cells estimated to comprise an indel at the target DNA sequence after editing with a well-characterized nuclease with a known cleavage site and editing pattern.
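As a minimal sketch of what a per-read indel percentage could look like, the Python below counts aligned reads whose CIGAR string carries an insertion or deletion that touches a window on the reference amplicon. This is not the disclosed pipeline and not CRISPResso2: the function names, the `(position, CIGAR)` input shape, and the window parameters are all assumptions made for illustration. For a nuclease with an unknown cleavage site, one might simply widen the window.

```python
import re

# Matches one CIGAR operation, e.g. "50M" or "3D" (SAM-style operation codes).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def has_indel_in_window(pos, cigar, window_start, window_end):
    """Return True if an insertion or deletion touches [window_start, window_end).

    pos is the 0-based reference coordinate of the first aligned base.
    The window bounds are hypothetical parameters, not values from the source.
    """
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "D":
            # A deletion spans reference positions [ref, ref + length).
            if ref < window_end and ref + length > window_start:
                return True
            ref += length
        elif op == "I":
            # An insertion sits immediately before reference position ref.
            if window_start <= ref <= window_end:
                return True
        elif op in "MN=X":
            # These operations consume reference bases without being indels.
            ref += length
        # S, H and P consume no reference bases.
    return False


def percent_indel(alignments, window_start, window_end):
    """Percentage of aligned reads with an indel in the window."""
    if not alignments:
        return 0.0
    edited = sum(
        has_indel_in_window(pos, cigar, window_start, window_end)
        for pos, cigar in alignments
    )
    return 100.0 * edited / len(alignments)


# Illustrative alignments only: (0-based start, CIGAR string).
example_alignments = [
    (0, "100M"),       # unedited
    (0, "48M3D49M"),   # deletion inside the window
    (0, "50M2I48M"),   # insertion inside the window
    (0, "10M1D89M"),   # deletion far from the window, ignored
]
example_percent = percent_indel(example_alignments, 40, 60)
```

Restricting the count to a window keeps indels far from the edit site, which are more likely sequencing or PCR artifacts, out of the estimate. The cost is that the window must be wide enough to cover wherever the nuclease actually cuts.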
- the data processing pipeline was applied to data generated by next-generation sequencing of mammalian cells edited by a Type V nuclease with an unknown cleavage site in a serial dilution experiment.
- DNA isolated from mammalian cells edited by the Type V nuclease was serially diluted with DNA isolated from non-edited cells in proportions of, for example, 0% non-edited / 100% edited; 25% non-edited / 75% edited; 50% non-edited / 50% edited; 75% non-edited / 25% edited; 100% non-edited / 0% edited (FIG. 4B).
- serially diluted DNA mixtures were sequenced to produce sequencing reads comprising the edited target sequence.
- Observed indel percentages (y-axis) are plotted against expected indel percentages (x-axis).
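The serial-dilution comparison of observed against expected indel percentages can be sketched as follows. This is an illustration under stated assumptions, not the analysis used in the application: it assumes the expected value at each dilution is the edited fraction times the indel percentage of the undiluted edited sample, and it summarizes agreement with an ordinary least-squares line. The 80% undiluted value and the observed values are made-up numbers, and the function names are hypothetical.

```python
def expected_indel_percent(edited_fraction, undiluted_indel_percent):
    """Expected indel % after mixing edited and non-edited DNA.

    Assumes both DNA samples contribute target alleles in proportion
    to their mixing fraction.
    """
    return edited_fraction * undiluted_indel_percent


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Edited fractions matching the dilution series described in the text.
edited_fractions = [1.0, 0.75, 0.5, 0.25, 0.0]

# Hypothetical values for illustration only.
undiluted_indel_percent = 80.0
observed = [79.0, 61.0, 39.5, 21.0, 0.5]

expected = [
    expected_indel_percent(f, undiluted_indel_percent) for f in edited_fractions
]
slope, intercept, r_squared = linear_fit(expected, observed)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.4f}")
```

A slope near 1 with an intercept near 0 would indicate that the quantification neither over- nor underestimates editing across the dilution range, while the intercept alone gives a rough view of the background indel signal in non-edited DNA.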
- the data processing pipeline is applied to data generated by next-generation sequencing of mammalian cells edited by a novel Type V nuclease, at escalating concentrations and with two different guides: guide sequences A1 and F2.
- the data processing pipeline accurately estimates indel percentages of editing by the novel Type V nuclease at nuclease concentrations of 5 pM, 11 pM, and 22 pM for guide sequences A1 and F2.
- the data processing pipeline is applied to data generated by next-generation sequencing of mammalian cells edited by nuclease AsCas12a to validate the efficacy of the indel quantification methods disclosed herein.
- the data processing pipeline was applied to data generated by next-generation sequencing of mammalian cells where a sequence of known length, 21 nucleotides, was introduced at a particular insertion site in a serial dilution spike-in experiment (see FIG. 7A).
- DNA isolated from mammalian cells comprising the insertion was serially diluted with DNA isolated from cells not comprising the insertion in proportions of, for example, 100% insertion / 0% no insertion; 25% insertion / 75% no insertion; 50% insertion / 50% no insertion; 75% insertion / 25% no insertion; 0% insertion / 100% no insertion (FIG. 4B).
- the serially diluted DNA mixtures were sequenced to produce sequencing reads comprising the insertion site. As shown in FIG. 7B, for the experiment where the DNA comprising the insertion was spiked in at 25%, the spiked in sequence was detected at approximately 23%. As shown in FIG.
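To make the spike-in comparison concrete, the sketch below computes the percentage of reads carrying a known insert and the recovery relative to the spiked-in proportion. Exact substring matching is a deliberately simple stand-in for the disclosed pipeline: it ignores sequencing errors and reverse-complement reads. The 21-nt insert, the flanking sequences, and the function names are hypothetical. Only the 25% expected and approximately 23% observed values come from the text above.

```python
def insertion_percent(reads, insert_seq):
    """Percentage of reads that contain insert_seq as an exact substring."""
    if not reads:
        return 0.0
    hits = sum(insert_seq in read for read in reads)
    return 100.0 * hits / len(reads)


def recovery_percent(observed_percent, expected_percent):
    """Observed signal as a percentage of the expected (spiked-in) signal."""
    return 100.0 * observed_percent / expected_percent


# Hypothetical 21-nt insert and flanks, for illustration only.
insert_seq = "ACGTTGCAAGCTTGGATCCAT"
left_flank, right_flank = "GGCTAAC", "TTAGCCG"

toy_reads = [
    left_flank + insert_seq + right_flank,  # read carrying the insertion
    left_flank + right_flank,               # reads without the insertion
    left_flank + right_flank,
    left_flank + right_flank,
]
toy_percent = insertion_percent(toy_reads, insert_seq)

# Values reported in the text: spiked in at 25%, detected at about 23%.
reported_recovery = recovery_percent(23.0, 25.0)
absolute_error = 25.0 - 23.0  # in percentage points
```

Under these definitions the reported result corresponds to a recovery of 92% and an absolute difference of 2 percentage points.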
Landscapes
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Genetics & Genomics (AREA)
- Biotechnology (AREA)
- Biophysics (AREA)
- Chemical & Material Sciences (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Bioinformatics & Computational Biology (AREA)
- Analytical Chemistry (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention relates to methods for estimating the frequency of insertions and/or deletions mediated by RNA-guided nucleases. The invention relates to methods and systems for processing, by a computational pipeline, sequencing reads of edited target polynucleotide sequences.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263430827P | 2022-12-07 | 2022-12-07 | |
US63/430,827 | 2022-12-07 | ||
EP23315141 | 2023-05-02 | ||
EP23315141.4 | 2023-05-02 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024123789A1 (fr) | 2024-06-13 |
Family
ID=89507456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/082543 WO2024123789A1 (fr) | Predicting indel frequencies | 2022-12-07 | 2023-12-05 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024123789A1 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013176772A1 (fr) | 2012-05-25 | 2013-11-28 | The Regents Of The University Of California | Methods and compositions for RNA-directed target DNA modification and for RNA-directed modulation of transcription |
WO2014018423A2 (fr) | 2012-07-25 | 2014-01-30 | The Broad Institute, Inc. | Inducible DNA binding proteins and genome perturbation tools and applications thereof |
US20160312198A1 (en) | 2015-03-03 | 2016-10-27 | The General Hospital Corporation | Engineered CRISPR-CAS9 NUCLEASES WITH ALTERED PAM SPECIFICITY |
- 2023-12-05 WO PCT/US2023/082543 patent/WO2024123789A1/fr unknown
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013176772A1 (fr) | 2012-05-25 | 2013-11-28 | The Regents Of The University Of California | Methods and compositions for RNA-directed target DNA modification and for RNA-directed modulation of transcription |
WO2014018423A2 (fr) | 2012-07-25 | 2014-01-30 | The Broad Institute, Inc. | Inducible DNA binding proteins and genome perturbation tools and applications thereof |
US20160312198A1 (en) | 2015-03-03 | 2016-10-27 | The General Hospital Corporation | Engineered CRISPR-CAS9 NUCLEASES WITH ALTERED PAM SPECIFICITY |
US20160312199A1 (en) | 2015-03-03 | 2016-10-27 | The General Hospital Corporation | Engineered CRISPR-CAS9 Nucleases with Altered PAM Specificity |
Non-Patent Citations (12)
Title |
---|
BRINER ET AL., MOL CELL, vol. 56, no. 2, 2014, pages 333 - 9 |
CANVER MC ET AL.: "Integrated design, execution, and analysis of arrayed and pooled CRISPR genome-editing experiments", NAT PROTOC., vol. 13, no. 5, May 2018 (2018-05-01), pages 946 - 986, XP055730891, DOI: 10.1038/nprot.2018.005 |
CLEMENT K ET AL.: "CRISPResso2 provides accurate and rapid genome editing sequence analysis", NAT BIOTECHNOL., vol. 37, no. 3, March 2019 (2019-03-01), pages 224 - 226, XP036900605, DOI: 10.1038/s41587-019-0032-3 |
CONG, SCIENCE, vol. 339, no. 6121, 2013, pages 819 - 823 |
EWING B, GREEN P: "Base-calling of automated sequencer traces using phred. II. Error probabilities", GENOME RES, vol. 8, no. 3, March 1998 (1998-03-01), pages 186 - 94, XP000915053 |
JINEK ET AL., SCIENCE, vol. 337, no. 6096, 2012, pages 816 - 21 |
KURGAN GAVIN ET AL: "CRISPAltRations: A validated cloud-based approach for interrogation of double-strand break repair mediated by CRISPR genome editing", MOLECULAR THERAPY- METHODS & CLINICAL DEVELOPMENT, vol. 21, 1 June 2021 (2021-06-01), GB, pages 478 - 491, XP093140503, ISSN: 2329-0501, DOI: 10.1016/j.omtm.2021.03.024 * |
LABUN KORNEL: "In silico design and analysis of targeted genome editing with CRISPR", 1 January 2020 (2020-01-01), XP093141039, Retrieved from the Internet <URL:https://bora.uib.no/bora-xmlui/handle/1956/21443> [retrieved on 20240313] * |
MAGOC T, SALZBERG SL: "FLASH: fast length adjustment of short reads to improve genome assemblies", BIOINFORMATICS, vol. 27, no. 21, 1 November 2011 (2011-11-01), pages 2957 - 63, XP055332486, DOI: 10.1093/bioinformatics/btr507 |
MAKAROVA ET AL., NAT. REV. MICROBIOL, vol. 13, 2015, pages 722 - 36 |
MAKAROVA ET AL., NAT. REV. MICROBIOL., vol. 9, 2011, pages 467 - 477 |
SHMAKOV ET AL., MOLECULAR CELL, vol. 60, 2015, pages 385 - 397 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12116571B2 (en) | Compositions and methods for detecting nucleic acid regions | |
EP3565907B1 (fr) | Methods of assessing nuclease cleavage | |
Kebschull et al. | Sources of PCR-induced distortions in high-throughput sequencing data sets | |
EP3149168B1 (fr) | High-throughput assembly of genetic elements | |
EP3724214A1 (fr) | Systems and methods for predicting repair outcomes in genetic engineering | |
US20220333186A1 (en) | Method and system for targeted nucleic acid sequencing | |
US20230056763A1 (en) | Methods of targeted sequencing | |
Maxwell et al. | A detailed cell-free transcription-translation-based assay to decipher CRISPR protospacer-adjacent motifs | |
EP3018604B1 (fr) | Method for assigning target-enriched sequence reads to a genomic location | |
US20130123117A1 (en) | Capture probe and assay for analysis of fragmented nucleic acids | |
Marinov | On the design and prospects of direct RNA sequencing | |
Kramme et al. | MegaGate: A toxin-less gateway molecular cloning tool | |
WO2024123789A1 (fr) | Predicting indel frequencies | |
JP2022515085A (ja) | Method for synthesizing single-stranded DNA | |
CN106319033B (zh) | Method for detecting chromosomal abnormalities and the DNA sequences of recombination sites | |
US20240182951A1 (en) | Methods for targeted nucleic acid sequencing | |
US20230122979A1 (en) | Methods of sample normalization | |
Selinger et al. | CRISPR-MIP replaces PCR and reveals GC and oversampling bias in pooled CRISPR screens | |
JP2023538537A (ja) | Methods for targeted removal of nucleic acids | |
WO2024157194A1 (fr) | Methods and assays for off-target analysis | |
Maxwell et al. | Original publication | |
WO2023137292A1 (fr) | Methods and compositions for transcriptome analysis | |
McDiarmid et al. | Diversified, miniaturized and ancestral parts for mammalian genome engineering and molecular recording | |
Jakimo | Precise and expansive genomic positioning for CRISPR edits | |
Mighell et al. | Cas12a-Capture: a novel, low-cost, and scalable method for targeted sequencing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23837105 Country of ref document: EP Kind code of ref document: A1 |