WO2024168196A1 - Systems and methods for enzymatic synthesis of polynucleotides containing non-standard nucleotide base pairs - Google Patents

Systems and methods for enzymatic synthesis of polynucleotides containing non-standard nucleotide base pairs

Info

Publication number
WO2024168196A1
Authority
WO
WIPO (PCT)
Prior art keywords
standard
base
standard nucleotide
nucleotide
dna
Prior art date
Application number
PCT/US2024/015068
Other languages
English (en)
Inventor
Jorge MARCHAND
Hinako KAWABE
Original Assignee
University Of Washington
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University Of Washington
Publication of WO2024168196A1

Links

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N9/00 Enzymes; Proenzymes; Compositions thereof; Processes for preparing, activating, inhibiting, separating or purifying enzymes
    • C12N9/10 Transferases (2.)
    • C12N9/12 Transferases (2.) transferring phosphorus containing groups, e.g. kinases (2.7)
    • C12N9/1241 Nucleotidyltransferases (2.7.7)
    • C12N9/1252 DNA-directed DNA polymerase (2.7.7.7), i.e. DNA replicase
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1093 General methods of preparing gene libraries, not provided for in other subgroups
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12P FERMENTATION OR ENZYME-USING PROCESSES TO SYNTHESISE A DESIRED CHEMICAL COMPOUND OR COMPOSITION OR TO SEPARATE OPTICAL ISOMERS FROM A RACEMIC MIXTURE
    • C12P19/00 Preparation of compounds containing saccharide radicals
    • C12P19/26 Preparation of nitrogen-containing carbohydrates
    • C12P19/28 N-glycosides
    • C12P19/30 Nucleotides
    • C12P19/34 Polynucleotides, e.g. nucleic acids, oligoribonucleotides
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Y ENZYMES
    • C12Y207/00 Transferases transferring phosphorus-containing groups (2.7)
    • C12Y207/07 Nucleotidyltransferases (2.7.7)
    • C12Y207/07007 DNA-directed DNA polymerase (2.7.7.7), i.e. DNA replicase
    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00 Libraries per se, e.g. arrays, mixtures
    • C40B40/04 Libraries containing only organic compounds
    • C40B40/06 Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides

Definitions

  • the name of the XML file containing the sequence listing is 3915-P1293WO.UW_Sequence_Listing.xml.
  • the XML file is 172,291 bytes; was created on February 07, 2024; and is being submitted electronically via Patent Center with the filing of the specification.
  • BACKGROUND [0003] The four-letter standard genetic alphabet of DNA (A, T, G, C) is ubiquitous and one of the defining biomolecular signatures of life on Earth. Organisms’ ability to read, write, and translate this information forms the basis for evolution as an emergent property of nucleic acid heteropolymers. Humanity has learned how to manipulate the standard four letters of DNA, spurring major advancements in biotechnology, information, and healthcare.
  • non-standard nucleotides that are capable of base-pairing with other non-standard nucleotides and/or standard nucleotides.
  • non-standard nucleotide refers to any nucleotide that is not one of the standard four nucleotides of DNA (i.e., A, T, G, C).
  • An example of such a nucleotide includes, but is not limited to, a xenonucleotide (XNA).
  • XNA xenonucleotide
  • the disclosure provides a method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template and is non-base-paired, the method comprising: combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
  • dNTP non-standard deoxyribonucleotide triphosphate
  • the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
  • the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I.
  • the polypeptide sequence comprises a sequence of SEQ ID NO:2.
  • the non-standard nucleotide is B or P.
  • the reaction condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
  • the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
  • the engineered polymerase is a variant of 9°N DNA polymerase.
  • the polypeptide sequence comprises a sequence of SEQ ID NO:3.
  • the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
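The two sets of tailing conditions listed above can be collected into a small lookup table. The following is a minimal sketch, not part of the disclosure: the class name, field names, and polymerase labels are illustrative, and the key `P` is assumed to be the base listed alongside B. The numeric values are the approximate ("about") figures quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TailingCondition:
    """Approximate N+1 tailing conditions quoted in the text (names are illustrative)."""
    polymerase: str            # label only, not a catalog identifier
    temperature_c: float       # reaction temperature in degrees Celsius
    hours_min: float           # lower bound of the quoted reaction time
    hours_max: float           # upper bound of the quoted reaction time
    polymerase_u_per_ul: float # polymerase concentration in U/uL
    dxntp_mm: float            # non-standard dNTP concentration in mM


KF_EXO_MINUS = TailingCondition("KF exo-", 37.0, 1.0, 16.0, 0.71, 1.19)
NINE_N_VARIANT = TailingCondition("9°N variant", 60.0, 4.0, 16.0, 0.29, 1.19)

# Base labels follow the text; "P" is an assumption for the base paired with B in the list.
CONDITIONS = {base: KF_EXO_MINUS for base in ("B", "P")}
CONDITIONS.update(
    {base: NINE_N_VARIANT for base in ("Sn", "Sc", "Z", "Xt", "Kn", "J", "V")}
)


def tailing_condition(base: str) -> TailingCondition:
    """Return the quoted condition for a base, or raise if none is listed."""
    try:
        return CONDITIONS[base]
    except KeyError:
        raise ValueError(f"no tailing condition listed for base {base!r}") from None
```

A table like this only records the quoted values; the text notes that other, less optimized conditions can also be used.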
  • the disclosure provides a method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
  • the method comprises: generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, wherein the second non-standard nucleotide is non-base-paired; and ligating the N+1 tailing product with the second N+1 tailing product to form a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide.
  • the N+1 tailing product comprises a hairpin.
  • the second N+1 tailing product comprises a hairpin.
  • the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end.
  • the method comprises: contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product to generate a blunt-end DNA template that comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
  • can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof.
  • the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base.
  • the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
  • the disclosure provides a dsDNA ligation product. In an aspect, the disclosure provides a further dsDNA ligation product.
  • the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product or the blunt-end dsDNA template, wherein the library polynucleotide sequence comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
  • the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the further dsDNA ligation product or the further blunt-end dsDNA template, wherein the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
  • the library polynucleotide sequence further comprises: a context barcode associated with a sequence context adjacent to a base pair of a non-
  • the disclosure provides a method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide for assignment of an identity to the unknown non-standard nucleotide, the method comprising: sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library to produce the one or more observed current reads; and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library, wherein the ML model is configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
  • ML machine learning
  • the ML model comprises a convolutional long short term memory recurrent neural network (LSTM RNN).
  • LSTM RNN convolutional long short term memory recurrent neural network
  • the disclosure provides a non-transitory computer-readable storage medium having stored thereon at least part of a ML model.
  • the disclosure provides a computational device or computational system comprising the non-transitory computer-readable storage medium.
  • the disclosure provides a nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium.
  • the disclosure provides a method for basecalling a non-standard nucleotide expanded alphabet, the method comprising: sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read; computing, with the computational device or computational system, the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library associated with the subject current read for an association; and computing, based on the association, a structure of the non-standard nucleotide.
  • the disclosure provides a circuitry configured to perform all or part of a method.
  • the disclosure provides a nanopore sequencing kit, device, or system comprising the circuitry.
  • FIGs 1A and 1B show nucleobases for an expanded 12-letter supernumerary DNA alphabet.
  • FIG. 1A Structures of standard purine and pyrimidine nucleobases.
  • FIG. 1B Structures of mutually orthogonal synthetic xenonucleobases that can form the basis of a 12-letter supernumerary DNA. Single letter abbreviations of each base indicated above nucleobase structure.
  • FIGs 2A-2H show XNA tailing and XNA ligation enable a facile means for enzymatic XNA incorporation.
  • FIG. 2A Polymerase XNA tailing activity screened by detection of released 2′-deoxy-xenonucleoside monophosphates (dxNMPs). Hairpin HP-3′PT was used as tailing substrate (Table 2); ‘*’ indicate positions of phosphorothioate bonds.
  • Extracted ion chromatograms for each dNMP and dxNMP in assays indicate dNTP and dxNTP tailing by (FIG. 2B) Klenow Fragment (exo-) and (FIG. 2C) Therminator polymerase.
  • Source data are provided as a Source Data file.
  • FIG. 2D Assay measuring extent of XNA tailing by T4 ligation. Tailed hairpins are not substrates for T4 ligation.
  • FIG. 2E XNA tailing of hairpin using optimized conditions showing XNA tailed hairpin is the major product.
  • (–) is blunt-ended hairpin negative control.
  • G+ is a hairpin synthesized to contain a single nucleotide 3′-G overhang as the positive control (gel representative of 3 experimental replicates; yield estimates are listed in Table 9).
  • FIG. 2F Assay to ligate two DNA hairpins with complementary single nucleotide XNA overhangs. Ligated hairpins are protected from exonucleases as they lack free 5′ and 3′ ends.
  • FIG. 2G XNA ligation of hairpins tailed with complementary purine (pur) and pyrimidine (pyr) XNA bases using optimized reaction conditions. (+) is a positive control that used blunt DNA substrate.
  • (*) is a negative control that used blunt DNA substrate without DNA ligase.
  • FIGs 3A-3D show generation of 12-letter (ATGCBSPZXKJV) nanopore sequencing kmer models.
  • FIG. 3A Overview of construction of NNNNNNN libraries, starting from two synthetic oligo pools (NNN-Pool) that contain blunt, NNN-3′ ends. The 24-nt triplet-barcodes in these hairpins are linked to the 3′-NNN sequence, allowing for proper identification of bases adjacent to XNA inserts. Complementary XNA base pairs are added to the library hairpins using XNA tailing and XNA ligation.
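To give a sense of the library size implied by FIG. 3A, the sketch below enumerates every NNN-X-NNN context for one XNA. It assumes, for illustration only, that the heptamer consists of three standard bases, the XNA at the central position, and three more standard bases; the function name is hypothetical and the barcode-to-triplet mapping is not modeled.

```python
from itertools import product

STANDARD_BASES = "ATGC"


def heptamer_contexts(xna: str):
    """Yield (left_triplet, xna, right_triplet) for every NNN-X-NNN context.

    Assumes the XNA sits at the central position, flanked by standard bases.
    """
    triplets = ["".join(p) for p in product(STANDARD_BASES, repeat=3)]
    for left in triplets:
        for right in triplets:
            yield left, xna, right


contexts = list(heptamer_contexts("P"))
print(len(contexts))  # 64 * 64 = 4096 contexts per XNA
```

Each flanking triplet has 4^3 = 64 possibilities, so a single XNA position is embedded in 4096 distinct sequence contexts under this assumption.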
  • FIGs 4A-4C show construction and end-to-end nanopore sequencing of 6-letter DNA alphabets.
  • FIG. 4A Proof of concept deployment of an XNA-refinement pipeline using 4-nt kmer models measured in this disclosure.
  • Pipeline is used to transform raw commercial nanopore reads into likely XNA basecalls for the sense (+) and antisense (-) strands.
  • FIG. 4C Response
  • FIG. 5 shows enzyme-assisted synthesis and third-generation sequencing of supernumerary 12-letter DNA.
  • the kmer probability density function (observed signal mean ⟨I_z⟩, model mean μ_ki, model standard deviation σ) is used to calculate log-likelihoods while a maximum likelihood with outlier-robust log-likelihood ratios is used to determine base call.
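The likelihood-based base call described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: it assumes a Gaussian kmer density, and it makes the log-likelihood ratios "outlier-robust" by clipping each per-level ratio to a fixed bound, which is one common choice and an assumption here. Function names, the `clip` parameter, and the toy numbers are hypothetical.

```python
import math


def gaussian_loglik(observed_mean, model_mean, model_sd):
    """Log of a normal density evaluated at the observed signal mean."""
    z = (observed_mean - model_mean) / model_sd
    return -0.5 * z * z - math.log(model_sd * math.sqrt(2.0 * math.pi))


def call_base(observed_means, models, clip=5.0):
    """Pick the candidate base with the highest summed, clipped log-likelihood ratio.

    observed_means: signal means for the kmers covering the position.
    models: dict mapping candidate base -> list of (mean, sd), one pair per level.
    The first key in `models` serves as the reference hypothesis (score 0).
    """
    reference = next(iter(models))
    scores = {}
    for base, levels in models.items():
        total = 0.0
        for obs, (mu, sd), (ref_mu, ref_sd) in zip(
            observed_means, levels, models[reference]
        ):
            llr = gaussian_loglik(obs, mu, sd) - gaussian_loglik(obs, ref_mu, ref_sd)
            # Clipping bounds the influence of any single outlier level (assumption).
            total += max(-clip, min(clip, llr))
        scores[base] = total
    return max(scores, key=scores.get), scores


# Toy numbers, for illustration only.
toy_models = {"G": [(100.0, 2.0), (90.0, 2.0)], "P": [(110.0, 2.0), (95.0, 2.0)]}
best, toy_scores = call_base([109.5, 95.2], toy_models)
```

Clipping is only one way to make the ratio robust; the disclosure does not specify which scheme is used.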
  • FIG. 6A shows an overview of an example non-templated N+1 tailing reaction. Tailing of blunt-end hairpin DNA substrates (N) can lead to complete formation of XNA-tailed hairpin products (N+1 major).
  • PPi release from tailing leads to slow background rate of pyrophosphorolysis, which acts in the reverse direction of nucleotide tailing (3′-exo). Pyrophosphorolysis is mitigated by adding YiPP to tailing reactions and balancing reaction duration and reaction rates.
  • the over tailing of products to generate (N+2) hairpins is also considered in optimization for tailing reactions.
  • N+1 tailing is generally thought to occur at a first-order reaction rate, 2 orders of magnitude slower than templated polymerization.
  • N+2 addition rates are polymerase specific and are thought to occur at first order rates 2 orders of magnitude slower than N+1 product formation. End abbreviations: 3′ indicates 3′-OH, 5′- indicates 5′-PO4.
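The trade-off between incomplete tailing and over-tailing can be illustrated with a consecutive first-order model, N → N+1 → N+2. The sketch below is a simplification: it ignores pyrophosphorolysis (the reverse reaction) and uses hypothetical rate constants, with the N+2 rate set 100-fold lower than the N+1 rate to mirror the "2 orders of magnitude" statement above.

```python
import math


def tailing_fractions(t, k1, k2):
    """Fractions of N, N+1, and N+2 at time t for irreversible N -> N+1 -> N+2.

    k1 and k2 are first-order rate constants in matching time units.
    """
    if k1 == k2:
        raise ValueError("closed form below assumes k1 != k2")
    n0 = math.exp(-k1 * t)
    n1 = k1 / (k2 - k1) * (math.exp(-k1 * t) - math.exp(-k2 * t))
    n2 = 1.0 - n0 - n1
    return n0, n1, n2


def time_of_max_n1(k1, k2):
    """Time at which the N+1 fraction peaks in this model."""
    return math.log(k1 / k2) / (k1 - k2)


# Hypothetical rate constants (per hour); only the 100-fold ratio reflects the text.
k1, k2 = 1.0, 0.01
t_peak = time_of_max_n1(k1, k2)
n0, n1, n2 = tailing_fractions(t_peak, k1, k2)
```

In this toy model the N+1 product dominates near the peak time, and running the reaction much longer mainly converts N+1 into N+2, which is why reaction duration is balanced against rate.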
  • N A, T, G, C
  • T4 ligation assay: A 5′-phosphorylated hairpin oligo with a 3′-blunt end was purchased from IDT (5′Phos-15HP; Table 2). Oligos are first refolded by incubating 20 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes then cooling at 0.1 °C/s until reaching 20 °C. All subsequent tailing reactions used 16 µM 5′Phos-15HP (blunt-end with 15 nt in the hairpin region), 1.19 mM dNTP (with dNTP used specified on lane figure panel), and tailed for 1 h at the specified temperature using the specified polymerases.
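Two quantities follow directly from the numbers in this protocol: the duration of the cooling ramp and the molar excess of dNTP over hairpin. The helper names below are illustrative; only the input values come from the text.

```python
def ramp_seconds(start_c, end_c, rate_c_per_s):
    """Duration of a linear temperature ramp, in seconds."""
    return abs(start_c - end_c) / rate_c_per_s


def molar_excess(dntp_mm, oligo_um):
    """Fold excess of dNTP (given in mM) over oligo (given in uM)."""
    return dntp_mm * 1000.0 / oligo_um


cooling_s = ramp_seconds(90.0, 20.0, 0.1)  # about 700 s, roughly 11.7 min
excess = molar_excess(1.19, 16.0)          # roughly 74-fold dNTP over hairpin
```

The large nucleotide excess is consistent with treating tailing as pseudo-first-order in hairpin, though that interpretation is an assumption rather than a statement from the source.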
  • T4 ligation reactions were performed with 11.2 µM of oligo for 1 h using T4 DNA Ligase Reaction Buffer which contains 1 mM ATP.
  • FIG. 6BA Tailing screen for Taq polymerase (0.25 U/µL, 72 °C) and Klenow Fragment (exo-; KF) polymerase (0.68 U/µL, 37 °C) followed by high concentration T4 ligation.
  • FIG. 6BB Tailing screen for Deep Vent (exo-; DV) polymerase (0.1 U/µL, 72 °C) and Therminator (Therm) polymerase (0.1 U/µL, 72 °C) followed by high concentration T4 ligation.
  • FIGs 6CA-6CM show UPLC/QTOF validation of tailing activity for all dNTPs and dxNTPs by Klenow Fragment (exo-).
  • FIG. 2B Full set of controls for the data shown in FIG. 2B.
  • Extracted ion chromatograms (EIC) show relative abundance of either dNMP or dxNMP release when corresponding dNTPs/dxNTPs are used as a substrate for polymerase (KF exo-) tailing. Chromatogram scales are normalized for comparison of runs within each panel. dNTP or dxNTP used in each reaction shown in panel legend.
  • FIGs 6DA-6DM show UPLC/QTOF validation of tailing activity for all dNTPs and dxNTPs by Therminator.
  • FIG. 2C Full set of controls for the data shown in FIG. 2C.
  • Extracted ion chromatograms (EIC) show relative abundance of either dNMP or dxNMP release when corresponding dNTPs/dxNTPs are used as a substrate for polymerase (Therminator; Therm) tailing. Chromatogram scales are normalized for comparison of runs within each panel. dNTP or dxNTP used in each reaction shown in
  • FIGs 6EA-6EE show screening and optimization of XNA tailing conditions. All tailing reactions used 11.9 µM 5′Phos-11HP, 1.19 mM of specified dNTP/dxNTP, and tailed at the specified temperature for the specified times using either Klenow Fragment (KF exo-; 0.71 U/µL) or Therminator (Therm; 0.29 U/µL). Tailing completeness was measured via T4 ligation assays.
  • FIG. 6EA XNA tailing screen using KF exo- and Therm for 8 h.
  • FIG. 6EB XNA tailing screen using KF and Therm for 8 h.
  • FIG. 6EC Additional Sc tailing screen using Therm for 8 or 16 h.
  • FIG. 6F shows addition of yeast inorganic pyrophosphatase (YiPP) leads to slight improvements in XNA tailing reaction yield.
  • 5′-phosphorylated hairpin oligos with either a 3′-blunt end or 3′-single nucleotide (-G, or -C) overhangs were purchased from IDT (5′-Phos-11HP; Table 2). Separately, 11.4 µM of 3′-blunt end oligos were tailed with 1.14 mM of dCTP or dGTP, Klenow Fragment (exo-; KF; 0.68 U/µL), and either 0.009 U/µL of YiPP or no YiPP at 37 °C for 4 h.
  • Ligation reactions were performed using 2.6 µM of two oligos with complementary overhang bases, either enzymatically tailed (G, C) or synthesized overhangs (G*, C*). Ligation reactions were incubated for 15 min at 16 °C using T7 DNA ligase (272 U/µL) and carried out in 1X of NEB StickTogether™ buffer which contains 7.5% (w/v) PEG 6000. Blunt-end hairpins (-/-) serve as a negative ligation control as the short reaction time prevents blunt end ligation.
  • Unligated materials were digested using exonuclease I (2.7 U/µL), exonuclease III (13.3 U/µL) and exonuclease VII (1.33 U/µL) for 1 h at 37 °C. Exonuclease reactions were heat inactivated by incubation at 95 °C for 10 min and then at 80 °C for 10 min.
  • Exo VII was used, which has a higher heat inactivation temperature than the Exo VIII (truncated) used in other aspects of this disclosure. It was also found that Exo VII would result in incomplete digestion (lower band) and required different buffer conditions. In subsequent screening work, Exo VIII (truncated) was used instead in the exonuclease treatment steps. Positive control with G* and C* shows ligation of hairpins with G and C synthetic overhangs. Gel representative of a single experimental replicate.
  • [0044] FIG. 6G shows enzymatic tailing does not lead to measurable differences in ligation when compared to ligation using fully synthetic hairpin with N+1 tails.
  • over-tailed product i.e., more than one nucleotide added to the blunt 3′-end
  • N+1 tailed hairpin would result in dsDNA that contains a gap of one or more nucleotides.
  • the gap region exposes a 3′ and 5′ end that would make this product susceptible to exonuclease degradation. Therefore, one way to test whether over-tailing was a problem was to compare how much ligated product was observed (as measured by agarose gel band intensity) when hairpins were tailed enzymatically versus made synthetically.
  • 5′-phosphorylated hairpin oligos with either a 3′-blunt end or 3′-single nucleotide (-G, or -C) overhangs were purchased from IDT (Table 2). Oligos were first folded using previously described methods. Blunt end oligo 5′Phos-11HP was then tailed with dCTP using conditions listed in Table 8. Subsequent ligation reactions were performed using T7 or T4 DNA ligase. Either the dCTP-tailed oligo (Tailed) or 5′Phos-HP-3′C (Synth) was ligated to 5′Phos-HP-3′G.
  • For T7 ligation reactions, 2.7 µM of each oligo were incubated with 272 U/µL of T7 DNA ligase and StickTogether™ DNA ligase buffer at 16 °C for 15 min, after which the ligase was heat inactivated at 65 °C for 10 min.
  • 4.2 µM of each oligo were incubated with 80 U/µL of T4 DNA ligase and T4 DNA ligase buffer at 16 °C for 2 h, after which the ligase was heat inactivated at 65 °C for 10 min.
  • FIGs 6HA-6HQ show high resolution LC/MS of oligo showing N+1 tailing as major product.
  • FIG. 6HA Hairpin oligo, 5′Phos-ScaI-HP (Table 2) was tailed
  • FIG. 6I shows an overview of T3 DNA ligase, T4 DNA ligase, and T7 DNA ligase products. (top) Major products formed from T3 ligation and T4 ligation assays between hairpins generated in this disclosure. (bottom) Major and minor products formed for T7 ligation assays in this disclosure.
  • T7 ligase preferentially ligates hairpins with a cohesive nucleotide overhang and has minimal blunt-end ligation activity.
  • T7 ligase has been observed to perform blunt end ligation though to a lesser extent than T3 ligase and T4 ligase.
  • Full hairpin sequences used in this disclosure can be found in Table 2. Nucleic acid end abbreviation: 3′ indicates 3′-OH, 5P′- indicates 5′-PO4.
  • FIG. 6J shows an overview of XNA ligation products from XNA tailed hairpins. XNA ligation reactions were optimized making the following considerations of possible side products.
  • FIGs 6KA-6KE show screening and optimization of ligation conditions across all XNA bases. All tailing reactions used conditions listed in Table 8 unless otherwise specified.
  • Ligation reactions were performed using 4.7 µM of one oligo or 2.4 µM of two oligos with complementary tailed bases. Ligation reactions were incubated for 16 h at 16 °C using the specified ligase and carried out in 1X of NEB StickTogether™ buffer which contains 7.5% (w/v) PEG 6000. Improperly ligated
  • FIGs 6LA-6LC show results from screening T3 ligase, T4 ligase, T7 ligase for JV, XtKn, and BSc XNA ligation.
  • Two blunt end hairpins that create a restriction enzyme site upon blunt ligation were purchased from IDT (5′Phos-NdeI-HP-1 and 5′Phos-NdeI-HP-2; Table 2). Blunt-end ligated hairpins create an NdeI restriction site, while successfully tailed and ligated hairpins do not.
  • FIG. 6LA T3 ligase assay (272 U/µL);
  • FIG. 6LB T4 ligase assay (36 U/µL);
  • FIG. 6LC T7 ligase assay (272 U/µL) for reactions containing single hairpins or mixture of two hairpins (as indicated).
  • FIGs 6MA-6MC show full gels of XNA tailing and XNA ligation using optimized conditions. All assays were done with a 5′-phosphorylated hairpin oligo with a 3′-blunt end, purchased from IDT (5′-Phos-11HP; Table 2). Each DNA/XNA base was tailed using conditions from Table 8. (FIG. 6MA) Full gel for optimized XNA tailing conditions from FIG. 2E. Tailing completeness was measured via T4 ligation.
  • FIGs 6NA-6NF show a proof of concept for XNA tailing and XNA ligation cycling to insert two consecutive P·Z base pairs.
  • FIG. 6NA Agarose gel showing steps in consecutive XNA insertion.
  • FIG. 6NB A hairpin containing an MlyI restriction site adjacent to the site of XNA ligation is used (donor hairpin, HPD).
  • MlyI is a type IIS restriction enzyme (5′-GAGTCNNNNN↓-3′) that leaves a blunt end after cutting.
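Because a type IIS enzyme cuts at a fixed offset outside its recognition site, the cut position can be located with a simple scan. The sketch below uses the site and offset given above (GAGTC followed by five unspecified nucleotides, then a blunt cut). It only scans the strand it is given and does not search the reverse complement, and the function name and example sequence are hypothetical.

```python
MLYI_SITE = "GAGTC"
MLYI_OFFSET = 5  # nucleotides between the end of the site and the blunt cut


def mlyi_cut_positions(seq: str):
    """Return 0-based cut indices on the given strand.

    A returned index i means the cut falls between seq[i - 1] and seq[i].
    Sites too close to the 3' end to be cut within the sequence are skipped.
    """
    seq = seq.upper()
    cuts = []
    start = seq.find(MLYI_SITE)
    while start != -1:
        cut = start + len(MLYI_SITE) + MLYI_OFFSET
        if cut <= len(seq):
            cuts.append(cut)
        start = seq.find(MLYI_SITE, start + 1)
    return cuts


example = "TTGAGTCAAAAAGG"  # hypothetical sequence, not from Table 2
cut_sites = mlyi_cut_positions(example)
left, right = example[: cut_sites[0]], example[cut_sites[0]:]
```

Placing the site so that the cut lands immediately after an inserted base pair is what allows the digested product to serve as a new blunt-end template for another round of tailing.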
  • a donor hairpin with an MlyI site and an acceptor hairpin were tailed with P and Z respectively (generating HPD-P, HPA-Z), ligated and treated with exonucleases following the optimized conditions described in this disclosure, and then purified (lane 1).
  • the purified construct contains a single P·Z base pair insertion.
  • site was prepared by XNA tailing (HPP-P).
  • XNA ligation followed by MlyI and exonuclease treatment does not result in formation of a ligation product (lane 3).
  • FIG. 6ND In a second round, reaction product mixture from lane 2 was tailed with Z to produce Z-tailed donor hairpin (HPD-Z) and Z-tailed PZ-acceptor hairpin (HPA-ZZ).
  • XNA ligation followed by MlyI and exonuclease treatment does not result in formation of a ligation product (lane 4).
  • FIGs 6OA-6OB show examples of basecalling XNA sequences with guppy.
  • FIG. 6OA ONT guppy was trained to basecall sequences composed of standard nucleic acids (A, T, G, or C).
  • A, T, G, or C standard nucleic acids
  • FIG. 6PB Complete NNNNNNN library products for all XNA base pairs and blunt end ligation library sequenced in this disclosure.
  • FIG. 6PC Self-ligation for library hairpins to check for incomplete tailing and pyrophosphorolysis products. Library hairpins were tailed with the listed XNA using conditions listed in Table 8, and 4.7 µM of each hairpin (except B* and Sc at 2.6 µM) was ligated to itself using the conditions listed in Table 10.
  • FIGs 6QA-6QI show examples of variance minimization for segmentation steps of signal-to-sequence mapping.
  • Signal-to-sequence mapping was performed using Tombo. Tombo uses an informed kmer model to improve the accuracy of signal-to-sequence mapping. Without a prior model, segmentation requires assigning each XNA to a standard base. Improper segmentation leads to inaccurate model parameter estimates. To minimize bias in segmentation, each XNA was assigned to the standard base that minimized the total variance in observed kmer signal levels.
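The variance-minimization rule described above can be sketched as a small selection function. This is an illustration only: it assumes segmentation has already been run once per candidate standard base and that the resulting signal levels are grouped by kmer. The data layout, function names, and toy values are hypothetical and do not reflect Tombo's actual output format.

```python
from statistics import pvariance


def total_variance(levels_by_kmer):
    """Sum of per-kmer population variances of the observed signal levels."""
    return sum(pvariance(levels) for levels in levels_by_kmer.values() if levels)


def best_standard_substitute(candidates):
    """Return the standard base whose segmentation gives the lowest total variance.

    candidates: dict mapping standard base -> dict mapping kmer -> list of levels
    observed when the XNA was treated as that base during segmentation.
    """
    return min(candidates, key=lambda base: total_variance(candidates[base]))


# Toy levels, for illustration only.
toy_candidates = {
    "A": {"ACG": [80.0, 95.0, 110.0], "TAC": [70.0, 90.0]},
    "G": {"GCG": [99.0, 100.0, 101.0], "TGC": [85.0, 86.0]},
}
chosen = best_standard_substitute(toy_candidates)
```

The intuition is that a poor substitute misplaces segment boundaries, so levels assigned to the same kmer scatter more widely than they would under a well-matched substitute.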
  • FIGs 6RA-6RE show example traces of signal deviation from the standard model.
  • FIG. 6S shows an example xenomorph preprocessing pipeline.
  • Xenomorph preprocess integrates basecalling, raw multi-to-single fast5 conversion, reference sequence fasta conversion, segmentation, and level assignment into a single command.
  • Level extracted output files from xenomorph preprocess are inputs to basecalling through alternative hypothesis testing using xenomorph morph. Separating the preprocessing steps from alternative hypothesis testing allows users to experiment with basecalling using various model parameter settings or with alternative models without having to rerun the slower signal extraction steps.
  • xenomorph preprocess uses guppy for initial basecalling, minimap2 for initial basecall-reference alignment, and ONT Tombo for signal normalization and signal-to-sequence alignment.
  • FIGs 6TA-6TC show PCR amplification and sequencing of a DNA template with a P·Z base pair.
  • FIG. 6TA Synthetic template DNA containing a P·Z base pair was amplified with Taq polymerase in a pH 8.0 buffer with varying concentrations of dxNTP and dNTP (Tables 22, 23). PCR products were sequenced on a MinION nanopore flow cell then basecalled for PZ detection. Read fractions that basecalled to (FIG. 6TB) P and (FIG. 6TC) Z for each condition are shown. PCR conditions differ by concentration of dxNTP and dNTPs used. The remaining fraction for each base corresponds to G and C basecalls (the most likely standard mutation for P and Z), respectively.
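The read fractions described for FIGs 6TB and 6TC amount to counting per-read basecalls at the XNA position. A minimal sketch, assuming each read yields a single call at that position; the function name and the example counts are hypothetical and are not data from the disclosure.

```python
from collections import Counter


def basecall_fractions(calls):
    """Return the fraction of reads assigned to each base at one position."""
    counts = Counter(calls)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no basecalls provided")
    return {base: n / total for base, n in counts.items()}


# Hypothetical calls at a P position: reads either retain P or read as G.
example_calls = ["P"] * 9 + ["G"]
fractions = basecall_fractions(example_calls)
```

Under the two-outcome framing in the text, the G fraction at a P position (or the C fraction at a Z position) is simply one minus the retained XNA fraction.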
  • FIGs 6UA-6UB show construction of 12-letter DNA for nanopore sequencing. All assays were performed using 12-letter DNA construction oligos as
  • FIG. 6V shows an example workflow from sequencing to heptamer classification.
  • FIGs 6WA-6WB and 6XA-6XB show an example method for generating a defined non-standard nucleotide base pair library that uses a Type IIS restriction enzyme and a context barcode (“Barcode”) associated with a sequence context and a pool barcode (“Pool-Barcode”) associated with a non-standard nucleotide, as well as steps for sequencing and machine learning (ML) model training. Randomer region indicated.
  • FIGs 6YA-6YF show example process flows for training ML models for processing read data obtained by nanopore sequencing of polynucleotide sequences containing non-standard nucleotides (FIGs 6YA-6YD), as well as base calling using trained ML models for quantification of XNA retention in PCR reactions (FIG. 6YE) and quantification of XNA transcription errors from in vivo transcription (FIG. 6YF).
  • the present disclosure provides an array of breakthrough approaches for synthesizing polynucleotide (e.g., DNA) sequences containing at least one non-standard nucleotide.
  • the non-standard nucleotide can include a hydrogen bonding pattern that is consistent or compatible with a hydrogen bonding pattern of a standard or existing nucleotide (e.g., C, G, T, A).
  • the present disclosure also provides breakthrough approaches for synthesizing polynucleotide sequences containing one or more non-standard nucleotides, optionally using next-generation sequencing (NGS) platforms, such as nanopore sequencing.
  • NGS next-generation sequencing
  • the disclosure also enables non-standard nucleotides to be integrated into a wide range of technologies, such as biological computing and information storage systems, therapeutics, aptamers, biosensors, and the like.
  • Methods of synthesizing polynucleotides containing one or more non-standard nucleotides make use of an N+1 tailing reaction of a suitable DNA polymerase. Accordingly, in an aspect, the disclosure provides a method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template, such that the non-standard nucleotide is non-base-paired.
  • dsDNA double-stranded DNA
  • the method comprises combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to facilitate a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
  • the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
  • the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I, as further described herein.
  • the polypeptide sequence comprises a sequence of SEQ ID NO:2.
  • a variety of XNAs can be incorporated into DNA using methods of the present disclosure; however, it was found that improving or optimizing the reaction conditions allows the N+1 tailing reaction to proceed at an acceptable rate.
  • the non-standard nucleotide being added is B or P, and the reaction condition proceeds at about 37°C for between about 1-16 hours and comprises about 0.71
  • the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP. While these or similar conditions were found to be effective for the disclosed reaction, other conditions, including less-than-optimal or non-improved conditions, can be implemented in embodiments without departing from the scope and spirit of the disclosure.
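  • As an illustration only, the reaction conditions recited above could be recorded in a small lookup table. The sketch below assumes that the base written as "p" in the 37°C condition denotes P, and it leaves the enzyme and dxNTP amounts for that condition unset because the text is truncated there. The names `TailingCondition` and `condition_for` are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TailingCondition:
    temperature_c: float                   # approximate incubation temperature
    hours: Tuple[float, float]             # (minimum, maximum) incubation time
    polymerase_u_per_ul: Optional[float]   # None where the text gives no complete value
    dxntp_mm: Optional[float]              # None where the text gives no complete value


# Values copied from the text. The 37 C entry is incomplete in the source,
# so its enzyme and dxNTP amounts are left as None rather than guessed.
LOW_TEMP = TailingCondition(37.0, (1.0, 16.0), None, None)
HIGH_TEMP = TailingCondition(60.0, (4.0, 16.0), 0.29, 1.19)

CONDITIONS = {}
for base in ("B", "P"):                    # assumption: "p" in the text means P
    CONDITIONS[base] = LOW_TEMP
for base in ("Sn", "Sc", "Z", "Xt", "Kn", "J", "V"):
    CONDITIONS[base] = HIGH_TEMP


def condition_for(base: str) -> TailingCondition:
    """Return the recorded tailing condition for a non-standard base."""
    try:
        return CONDITIONS[base]
    except KeyError:
        raise ValueError(f"no tailing condition recorded for {base!r}") from None
```

  • For example, `condition_for("Kn")` returns the 60°C entry, while `condition_for("B")` returns the 37°C entry with its unknown amounts still unset.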
  • while the KF exo- of DNA polymerase I can be used in embodiments, it is not the only DNA polymerase that was surprisingly and unexpectedly found to have the ability to add non-standard nucleotides to a dsDNA template in an N+1 tailing reaction.
  • the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
  • the engineered polymerase is a variant of 9°N DNA polymerase.
  • the polypeptide sequence comprises a sequence of SEQ ID NO:3 (e.g., Therminator™).
  • the disclosure provides a method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
  • the base pair is comprised of one non-standard nucleotide base paired with one standard nucleotide.
  • the base pair is comprised of a first non-standard nucleotide base paired with a second non-standard nucleotide.
  • Creation of a base pair that is comprised of two non-standard nucleotides can be implemented with a method that comprises generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, such that the second non-standard nucleotide is non-base-paired.
  • the second N+1 tailing product can be generated based on the same or a similar reaction as the N+1 tailing product (of the first N+1 tailing reaction).
  • the method can further include ligating the N+1 tailing product with the second N+1 tailing product, which forms a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide, as further described herein.
  • the N+1 tailing product can be linear or, in embodiments, can comprise a hairpin.
  • the second N+1 tailing product can be linear or, in embodiments, can comprise a hairpin.
  • the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end and is fully resistant to exonucleases.
  • Additional non-standard nucleotides can be added iteratively and/or sequentially, such that two or more non-standard nucleotides can be added or inserted to a polynucleotide. This can be achieved by cleaving the dsDNA ligation product and exposing the non-standard base pair. The resultant blunt-end DNA template then becomes a template for a subsequent N+1 tailing reaction.
  • the method comprises contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition that is conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product, which generates a blunt-end DNA template.
  • the resultant blunt-end DNA template comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
  • the method can be performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of the further dsDNA ligation product.
  • the method comprises contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
  • the method is modular and can be repeated any number of times for addition of any number of non-standard nucleotides, either with non-standard nucleotides added in a continuous manner or in a manner such that the non-standard nucleotides are interspersed with, or interrupted by, one or more standard nucleotides, for example.
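  • The tail, ligate, and cleave cycle described above can be followed with a toy string model. This is a bookkeeping sketch under stated assumptions, not the disclosed chemistry: the pairing table (B with S, P with Z, X with K, J with V) is assumed rather than taken from this section, the partner duplex sequence is a placeholder, and the Type IIS cut is modeled simply as a blunt cut immediately after the newly formed base pair. All function names are hypothetical.

```python
from typing import List, Tuple

# Assumed orthogonal pairing partners (not stated in this section).
PAIR = {"B": "S", "S": "B", "P": "Z", "Z": "P",
        "X": "K", "K": "X", "J": "V", "V": "J"}


def tail(blunt_top: str, base: str) -> Tuple[str, str]:
    """N+1 tailing: return (top strand of the duplex, unpaired 3' base)."""
    if base not in PAIR:
        raise ValueError(f"{base!r} is not a non-standard base in this toy model")
    return blunt_top, base


def ligate(left: Tuple[str, str], right: Tuple[str, str]) -> str:
    """Join two tailed products whose single-base tails are complementary.

    The left tail sits on the top strand; the right tail sits on the bottom
    strand and pairs with it. Only the top strand of the product is returned.
    """
    (left_top, left_tail), (right_top, right_tail) = left, right
    if PAIR[left_tail] != right_tail:
        raise ValueError("tails are not complementary")
    return left_top + left_tail + right_top


def cleave_blunt(product_top: str, junction: int) -> str:
    """Model the Type IIS cut as a blunt cut just after the junction base."""
    return product_top[: junction + 1]


def add_pairs(seed: str, additions: List[Tuple[str, str]]) -> str:
    """Apply one tail/ligate/cleave cycle per (base, partner duplex) entry."""
    template = seed
    for base, partner_top in additions:
        product = ligate(tail(template, base), tail(partner_top, PAIR[base]))
        template = cleave_blunt(product, junction=len(template))
    return template
```

  • Calling `add_pairs("ACGT", [("P", "TTGCA"), ("B", "TTGCA")])` walks through two cycles and leaves a blunt template whose top strand ends in the two added bases, mirroring how each cleaved product becomes the template for the next tailing reaction.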
  • a quantity of non-standard nucleotides added to a polynucleotide with a method of the disclosure is selected from the group including, but not necessarily limited to, the set of integers defined by the range of 1 to 10,000,000,000, inclusive.
  • a quantity of standard nucleotides added to a polynucleotide with a method of the present disclosure is selected from the group including, but not necessarily limited to, the set of integers defined by the range of 1 to 10,000,000,000, inclusive.
  • the non-standard nucleotide comprises an epigenetic modification, a modified sugar, a phosphate backbone, a nucleobase, a nucleobase that can hydrogen bond to a second base, a nucleobase that can base pair (without hydrogen bonding) to a second base, a nucleobase that relies on steric exclusion for base pairing, a nucleobase that relies on hydrophobic interactions for base pairing, a nucleobase that relies on a transition metal complex for base pairing, a chemical modification, or any combination thereof.
  • the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base. In other example embodiments, the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base. In embodiments, the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
  • the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
  • the disclosure also contemplates products, and in at least some instances, intermediates, of methods herein as also being within the scope of the disclosure.
  • the disclosure provides a dsDNA ligation product that can comprise a non-standard nucleotide.
  • the disclosure provides a further dsDNA ligation product that can comprise two or more non-standard nucleotides.
  • the disclosure contemplates defined libraries of non-standard nucleotide base pairs, in any of a variety of nucleotide contexts, produced by the methods
  • the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product or the blunt-end dsDNA template.
  • the library polynucleotide sequence comprises a base pair between a non-standard nucleotide and a second non-standard nucleotide.
  • a plurality of base pairs can be incorporated into one or more defined libraries.
  • the disclosure provides a defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of a further dsDNA ligation product or a further blunt-end dsDNA template, such that the library polynucleotide sequence comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
  • a library polynucleotide sequence further comprises a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence, and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both.
  • These or similar barcodes can be comprised of standard or otherwise sequence-able nucleotides, such that the identities of the non-standard nucleotides and the contexts can be known with a high degree of confidence. This facilitates correlation between the empirical data and the non-standard nucleotide bases being observed.
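  • A minimal sketch of how the two barcodes could be used to label a read is shown below. The barcode sequences, the lookup tables, and the names `LibraryRead` and `resolve` are hypothetical; a real design would take these tables from the library layout. The letter N in a context marks the position occupied by the non-standard nucleotide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LibraryRead:
    context_barcode: str   # standard-base barcode tied to the sequence context
    pool_barcode: str      # standard-base barcode tied to the non-standard nucleotide


# Hypothetical lookup tables for illustration only.
CONTEXT_BY_BARCODE = {"ACGTAC": "AGTNCCT", "TGCATG": "CCANGTA"}
XNA_BY_POOL_BARCODE = {"AAGG": "B", "CCTT": "P"}


def resolve(read: LibraryRead) -> Tuple[str, str]:
    """Return (non-standard base, full context with that base filled in)."""
    context = CONTEXT_BY_BARCODE[read.context_barcode]
    xna = XNA_BY_POOL_BARCODE[read.pool_barcode]
    return xna, context.replace("N", xna, 1)
```

  • Because both barcodes are read as ordinary A, T, G, C sequence, the label attached to each current trace does not depend on already being able to call the non-standard base.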
  • Machine learning can be used with one or more methods for facilitation of sequence data analysis.
  • the disclosure provides a method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide, for assignment of an identity to the unknown non-standard nucleotide.
  • Such a method comprises sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library to produce the one or more observed current reads, and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library.
  • the ML model can be configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
  • the ML model comprises a convolutional long short-term memory recurrent neural network (LSTM RNN); however, other ML models can be implemented in embodiments.
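  • To make the train-then-assign flow concrete, the sketch below swaps in a deliberately simple nearest-centroid classifier on mean current level. It is not the convolutional LSTM RNN named above and is offered only as an example of another model that follows the same steps; the function names and the numeric current values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Iterable, List, Tuple


def train(labelled_reads: Iterable[Tuple[str, List[float]]]) -> Dict[str, float]:
    """Learn one mean current level per known non-standard nucleotide.

    labelled_reads yields (known identity, current samples) pairs taken from
    a defined library, where the identity is known from the barcodes.
    """
    levels = defaultdict(list)
    for identity, samples in labelled_reads:
        levels[identity].append(fmean(samples))
    return {identity: fmean(means) for identity, means in levels.items()}


def assign(model: Dict[str, float], samples: List[float]) -> str:
    """Assign the identity whose learned level is closest to the observed read."""
    level = fmean(samples)
    return min(model, key=lambda identity: abs(model[identity] - level))
```

  • With a model trained on labelled reads for B and P, a new read is assigned to whichever learned level lies nearer to its mean current, which is the same "known identity to unknown read" mapping that a neural network would learn with far more capacity.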
  • the disclosure also contemplates computer memory, computer products, computer devices, computer systems, and the like, that implement all or part of one or more methods of the disclosure as being within the scope of the disclosure.
  • the disclosure provides a non-transitory computer-readable storage medium having stored thereon at least part of a ML model.
  • the disclosure provides a computational device or computational system comprising the non-transitory computer-readable storage medium.
  • the disclosure provides a nanopore sequencing kit, device, or system comprising the non-transitory computer-readable storage medium, optionally further including instructional materials for use of the kit.
  • the disclosure provides novel and innovative tools for use in synthesizing and sequencing polynucleotides containing non-standard nucleotides. Accordingly, in an aspect, the disclosure provides a method for basecalling a non-standard nucleotide expanded alphabet.
  • the method comprises sequencing, with a nanopore sequencing method, a subject polynucleotide sequence that comprises a non-standard nucleotide to generate a subject current read, computing, with the computational device or computational system, an association between the subject current read and the known identity of the defined non-standard nucleotide of the defined non-standard nucleotide base pair library, and computing, based on the association, a structure of the non-standard nucleotide.
  • the structure of the non-standard nucleotide can include, correspond, or relate to an identity of the non-standard nucleotide.
  • circuitry includes dedicated hardware having electronic circuitry configured to perform operations or computations on a dedicated basis, without any use of microprocessors, central processing units, or software or firmware or processor-executable instructions.
  • circuitry includes, among other things, one or more computing devices such as one or more processors (e.g., microprocessor(s)), one or more central processing units (CPU), one or more digital signal processors (DSP), one or more application-specific integrated circuits (ASIC), one or more field-programmable gate arrays (FPGA), or the like, or any variations or combinations thereof, and can include discrete digital and/or analog circuit elements or electronics, or combinations thereof.
  • circuitry includes combinations of circuits and computer program products having software or firmware processor-executable instructions stored on one or more computer readable memories, e.g., non-transitory computer-readable storage mediums, that work together to cause a device or system to perform one or more methodologies or technologies described herein.
  • circuitry includes circuits, such as, for example, microprocessors or portions of microprocessors, that require software, firmware, and the like for operation.
  • circuitry includes an implementation comprising one or more processors or portions thereof and accompanying software, firmware, hardware, and the like.
  • circuitry includes a baseband integrated circuit or applications processor integrated circuit or a similar integrated circuit in a server, a cellular network device, other network device, or other computing device.
  • circuitry includes one or more remotely located components.
  • remotely located components e.g., server, server cluster, server farm, virtual private network, etc.
  • non-remotely located components e.g., desktop computer, workstation, mobile device, controller, etc.
  • remotely located components are operatively connected via one or more receivers, transmitters, transceivers, or the like.
  • Embodiments include one or more data stores that, for example, store instructions and/or data.
  • Non-limiting examples of one or more data stores include volatile memory (e.g., Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), or the like), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM), or the like), persistent memory, or the like. Further non-limiting examples of one or more data stores include Erasable Programmable Read-Only Memory (EPROM), flash memory, or the like.
  • the one or more data stores can be connected to, for example, one or more computing devices by one or more instructions, data, or power buses.
  • circuitry includes one or more computer-readable media drives, interface sockets, Universal Serial Bus (USB) ports, memory card slots, or the like, and one or more input/output components such as, for example, a graphical user
  • circuitry includes one or more user input/output components that are operatively connected to at least one computing device to control (electrical, electromechanical, software-implemented, firmware-implemented, or other control, or combinations thereof) one or more aspects of the embodiment.
  • circuitry includes a computer-readable media drive or memory slot configured to accept signal-bearing medium (e.g., computer-readable memory media, computer-readable recording media, or the like).
  • a program for causing a system to execute any of the disclosed methods can be stored on, for example, a computer-readable recording medium (CRMM), a signal-bearing medium, or the like.
  • signal-bearing media include a recordable type medium such as any form of flash memory, magnetic tape, floppy disk, a hard disk drive, a Compact Disc (CD), a Digital Video Disk (DVD), Blu-Ray Disc, a digital tape, a computer memory, or the like, as well as transmission type medium such as a digital and/or an analog communication medium (e.g., a fiber optic cable, a waveguide, a wired communications link, or a wireless communication link (e.g., transmitter, receiver, transceiver, transmission logic, reception logic, etc.)).
  • signal-bearing media include, but are not limited to, DVD-ROM, DVD-RAM, DVD+RW, DVD-RW, DVD-R, DVD+R, CD-ROM, Super Audio CD, CD-R, CD+R, CD+RW, CD-RW, Video Compact Discs, Super Video Discs, flash memory, magnetic tape, magneto-optic disk, MINIDISC, non-volatile memory card, EEPROM, optical disk, optical storage, RAM, ROM, system memory, web server, or the like.
  • the present application can include references to directions, such as “vertical,” “horizontal,” “front,” “rear,” “left,” “right,” “top,” and “bottom,” etc. These references, and other similar references in the present application, are intended to assist in helping describe and understand the particular embodiment (such as when the embodiment is positioned for use) and are not intended to limit the present disclosure to these directions or locations.
  • The present application can also reference quantities and numbers. Unless specifically stated, such quantities and numbers are not to be considered restrictive, but examples of the possible quantities or numbers associated with the present application.
  • “about” refers to the stated value and a range that includes values 11% above the stated value, 12% above the stated value, 13% above the stated value, 14% above the stated value, 15% above the stated value, 16% above the stated value, 17% above the stated value, 18% above the stated value, 19% above the stated value, 20% above the stated value, 21% above the stated value, 22% above the stated value, 23% above the stated value, 24% above the stated value, or 25% above the stated value.
  • a range is stated, e.g., the range of 1-16, the stated range includes every value between the lower and upper limits as well as the lower and upper limits of the stated range, themselves, as stated values.
  • the approximately stated range includes every value between the lower and upper limits as well as the lower and upper limits of the stated range, themselves, as stated values (e.g., 1 and 16 are each stated values), including those non-stated values that are near to or approximate the stated values according to practicable ranges as would be recognized by those skilled in the art or as otherwise described herein.
  • the phrase “at least one of A, B, and C,” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C), including all further possible permutations when greater than three elements are listed.
  • the term “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C), including all further possible permutations when greater than three elements are listed.
  • the term “or” is an inclusive “or”, and the phrase “A or B” means (A), (B), or (A and B).
  • the term “and” requires both elements; for example, the phrase “A and B” means (A and B).
  • the term “comprising”, is inclusive or open-ended and does not exclude additional, unrecited elements or method steps.
  • Example 1 Enzymatic Synthesis and Nanopore Sequencing of 12-letter Supernumerary DNA
  • Abstract. The 4-letter DNA alphabet (A, T, G, C) is an elegant, yet non-exhaustive solution to the problem of storage, transfer, and evolution of biological information. This example provides strategies for both writing and reading DNA with expanded alphabets composed of up to 12 letters (A, T, G, C, B, S, P, Z, X, K, J, V).
  • an enzymatic strategy is devised for inserting a singular, orthogonal xenonucleic acid (XNA) base pair into standard DNA sequences using 2′-deoxy-xenonucleoside triphosphates as substrates. Integrating this strategy with combinatorial oligos generated on a chip, libraries are constructed containing single XNA bases for parameterizing kmer basecalling models for nanopore sequencing. These elementary steps are combined to synthesize and sequence DNA containing 12 letters – the upper limit of what is accessible within the electroneutral, canonical base pairing framework.
  • the 4-letter standard genetic alphabet of DNA (A, T, G, C) is ubiquitous and one of the defining biomolecular signatures of life on Earth. The ability to read, write, and translate this information forms the basis for life as an emergent property of nucleic acid heteropolymers. Humanity has learned how to manipulate the 4 letters of DNA, spurring major advances in biotechnology, information storage, and healthcare.
  • the standard nucleic acids can be components for diagnostic tests to screen for disease or detect toxins, therapeutics that create immune responses, and even as a molecular system for long-term storage of digital information.
  • Parameters of biomolecular compatibility of expanded non-canonical hydrogen bonding base pairings include stability in the DNA double helix, the ability to be replicated by DNA polymerases, transcribed by RNA polymerases, reverse transcribed by reverse transcriptases, and even translated by the ribosome. These xenonucleotides are at the forefront of nucleic acids research since they significantly expand DNA’s chemical, structural, and binding repertoire.
  • methods for sequencing of xenonucleic acids are decades behind that of DNA and RNA, and rely on low-throughput, non-multiplexed measurements, such as gel-shift assays, mass spectrometry, and selective conversion of XNAs to standard bases followed by Sanger sequencing.
  • XNA sequencing technology is lower throughput, less sensitive, and less generalizable than the methods Sanger and Coulson developed in the 1970s and has no service-oriented solution.
  • ATGC-sequencing technology is in its ‘third generation.’
  • One possible solution is to adapt existing first-, second-, or third-generation DNA sequencing technology to work with more DNA letters.
  • Nanopore sequencing has the ability to sequence non-canonical bases such as epigenetic and epitranscriptomic modifications.
  • nanopore sequencing can be used for sequencing 8-letter hachimoji DNA (A, T, G, C, B, Sc, P, Z) using the Hel308 motor protein with an MspA pore.
  • third-generation (high throughput, multiplexable, single molecule, real-time) sequencing of supernumerary DNA is possible despite the “k-mer explosion” in possible current signals induced by an expanded DNA alphabet.
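  • The scale of the "k-mer explosion" follows directly from the alphabet size: an alphabet of n letters has n^k possible k-mers. The short sketch below tabulates this for 4-, 8-, and 12-letter alphabets. The choice of k values is illustrative, and `kmer_count` is a hypothetical helper name.

```python
def kmer_count(alphabet_size: int, k: int) -> int:
    """Number of distinct k-mers over an alphabet of the given size."""
    if alphabet_size < 1 or k < 0:
        raise ValueError("alphabet_size must be >= 1 and k must be >= 0")
    return alphabet_size ** k


if __name__ == "__main__":
    for letters in (4, 8, 12):
        counts = [kmer_count(letters, k) for k in (4, 5, 6)]
        print(f"{letters}-letter alphabet: k=4,5,6 -> {counts}")
```

  • Moving from 4 to 12 letters multiplies the number of possible 4-mers by 3^4 = 81, which is why restricting a model to k-mers that contain a single non-standard base keeps the parameter count manageable.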
  • previous efforts in this regard did not attempt to build models for decoding the nanopore current signals to nucleic acid sequences.
  • Non-standard bases can be classified using commercial nanopores (e.g., GridION, ONT). This can show that commercial nanopore sequencing platforms are indeed capable of sequencing chemically modified nucleobases including 2,4-diamino-purine, 5-nitro-indole, and 5-octadiynyldeoxyuracil.
  • phosphoramidite synthesis – commercial access is both limited and costly, standing as a major barrier to entry.
  • standard phosphoramidite synthesis costs for non-standard bases average around $100-400 USD/nt – or over 1000 times more expensive than A, T, G, C synthesis ($0.04-0.40 USD/nt).
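  • The "over 1000 times" figure can be checked against the quoted price ranges. The sketch below computes the smallest and largest possible fold difference and the ratio of the range midpoints; using midpoints as a stand-in for the average is an assumption, and `fold_range` and `midpoint_fold` are hypothetical helper names.

```python
from typing import Tuple


def fold_range(xna: Tuple[float, float], std: Tuple[float, float]) -> Tuple[float, float]:
    """Smallest and largest cost ratio given (low, high) USD/nt price ranges."""
    xna_low, xna_high = xna
    std_low, std_high = std
    return xna_low / std_high, xna_high / std_low


def midpoint_fold(xna: Tuple[float, float], std: Tuple[float, float]) -> float:
    """Ratio of range midpoints, used here as a rough proxy for the average."""
    return (sum(xna) / 2) / (sum(std) / 2)


XNA_COST = (100.0, 400.0)   # USD/nt, from the text
STD_COST = (0.04, 0.40)     # USD/nt, from the text
```

  • With these inputs the fold difference spans roughly 250 to 10,000, and the midpoint ratio is a little over 1,100, which is consistent with the stated average of more than 1000-fold.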
  • next-generation synthesis methods that have transformed the ability to explore sequence space (pooled synthesis, synthesis-on-a-chip, enzymatic synthesis) are not commercially available for orthogonal base pairs.
  • Enzymes like terminal deoxynucleotidyl transferase (TdT) can catalyze non-templated addition of a wide range of modified nucleotide building blocks on ssDNA, and can do so at neutral pH.
  • enzymes precludes them from being used for sequence-defined addition of dNTPs. Moreover, TdT-based enzymatic synthesis of nucleic acids would require specially protected building blocks or polymerase-nucleotide conjugates that are not commercially available.
  • Lacking a suitable alternative, it was necessary to develop an enzymatic synthesis strategy that would be flexible enough to handle all desired xenonucleobases using 2′-deoxynucleoside triphosphates as the universal building block and be specific enough to catalyze a non-processive N+1 addition.
  • the 2′-deoxy-xenonucleoside triphosphates of the remaining bases were chemically synthesized: dXtTP, dKnTP, dJTP, dVTP (FIGs 6BA-6BE).
  • a sensitive liquid chromatography/mass spectrometry (UPLC/QTOF) assay was developed for detecting tailing activity.
  • the hairpin design of the substrates generates a desired dsDNA ligation product that lacks a free 5′ or 3′ end, making it fully resistant to exonucleases. Subsequent treatment of the ligation reaction with exonucleases therefore allows one to remove unreacted starting material and partially ligated products.
  • the ideal dsDNA ligase should be able to ligate DNA strands with single nucleotide overhangs and have relaxed specificity for both the overhanging nucleotide
  • phage ligases (T3 DNA ligase, T4 DNA ligase, and T7 DNA ligase)
  • modified and non-standard nucleotide substrates (FIG. 6I)
  • a negative control can be performed in which hairpins are incubated individually in the presence of the respective ligases (FIG. 6J). In these single hairpin reactions, any ligation product would indicate either blunt-end ligation, from incomplete XNA tailing, or formation of a self-ligation (mismatch ligation) product.
  • Nanopore sequencing (from Oxford Nanopore Technologies®) has features that make it adaptable for sequencing supernumerary DNA: it can sequence single DNA molecules without amplification, without the requirement for fluorescently labeled building blocks, and with high throughput (100k-10M reads per run). In nanopore sequencing, an ion current signal is generated as single-stranded DNA is threaded through a protein nanopore. Conversion of signal-to-sequence, or basecalling, is performed computationally by either statistical or machine learning models. However, since commercial nanopore basecalling algorithms were empirically trained on standard 4-letter DNA (A, T, G, C), they are unable to decode xenonucleobases (B, Sn, Sc, P, Z, Xt, Kn, J, V; FIGs 6OA-6OB).
  • With this in mind, one can build and measure diverse DNA-XNA libraries that can be used to construct de novo ground-up models for sequencing single xenonucleotides within a natural DNA context.
  • NNNNNNN library was sequenced independently for model building, generating between 150k – 800k raw reads per library (Tables 14-15). Signals were then segmented and aligned to each barcoded reference sequence while filtering reads that aligned to possible ligation side products (FIGs 3B, 6J and 6QA-6QI). From these signal-to-sequence alignments, XNA-heptamer
  • Example kmer signal distributions can be generated. Mean signal currents spanning all 2,304 xenonucleotide-containing kmers, $\mu_k$, are shown in FIG. 3C and comparisons can be made to the most similar standard bases.
  • Basecalling single xenonucleotide substitutions. Next, one can apply this model to predict signals emitted by sequences that contain a single xenonucleotide (B, Sn, Sc, P, Z, Xt, Kn, J, or V).
  • the expected signal is found by decomposition of a heptamer sequence into its constitutive kmers, then using measured kmer means to model current transitions (e.g., $\text{AGTBCCT} \rightarrow [\mu_{\text{AGTB}}, \mu_{\text{GTBC}}, \mu_{\text{TBCC}}, \mu_{\text{BCCT}}]$).
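  • The count of 2,304 kmers and the heptamer decomposition can both be reproduced with a few lines of code, assuming 4-nt kmers that each contain exactly one of the nine xenonucleotides flanked by standard bases (9 x 4 positions x 4^3 flanks = 2,304). The function names and the table of kmer means passed to `expected_levels` are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

STANDARD = ("A", "T", "G", "C")
XNA = ("B", "Sn", "Sc", "P", "Z", "Xt", "Kn", "J", "V")
K = 4  # kmer length assumed from the 4-nt kmer model

Kmer = Tuple[str, ...]


def xna_kmers() -> List[Kmer]:
    """All 4-mers holding exactly one xenonucleotide and three standard bases."""
    kmers = []
    for x in XNA:
        for pos in range(K):
            for flank in product(STANDARD, repeat=K - 1):
                kmer = list(flank)
                kmer.insert(pos, x)   # place the xenonucleotide at this position
                kmers.append(tuple(kmer))
    return kmers


def decompose(heptamer: Sequence[str]) -> List[Kmer]:
    """Sliding 4-mers of a sequence given as a list of letter tokens."""
    return [tuple(heptamer[i:i + K]) for i in range(len(heptamer) - K + 1)]


def expected_levels(heptamer: Sequence[str], means: Dict[Kmer, float]) -> List[float]:
    """Look up the measured mean current for each constitutive kmer."""
    return [means[kmer] for kmer in decompose(heptamer)]
```

  • For the heptamer AGTBCCT, `decompose(list("AGTBCCT"))` yields the four kmers AGTB, GTBC, TBCC, and BCCT, each of which contains the central xenonucleotide. Sequences are passed as token lists so that two-character letters such as Sn or Kn remain single positions.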
  • FIG. 3D shows examples of signal-level predictions generated by an example model (XNA model) overlaid on observations of that library sequence and the most similar standard-bases model (DNA model).
  • the modeled probability density function can be used to calculate the likelihood that an observed set of signal levels was emitted from a particular sequence.
  • the correct basecall should be the one that has the maximum likelihood of observation.
  • the modularity of the 4-nt kmer model allows one to make a diverse set of comparisons between a xenonucleotide and 1) a standard base (e.g., P vs. G), 2) any of the standard bases (e.g., P vs. A, T, G, C), or 3) any of the full supernumerary letters (e.g., P vs. A, T, G, C, B, Sc, Z, Xt, Kn, J, V).
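  • The maximum-likelihood basecall and the choice of comparison set described above can be sketched as follows. The text specifies only a modeled probability density function; treating each kmer level as an independent Gaussian is an assumption made here for illustration, and the model values in the usage note are invented. `log_likelihood` and `basecall` are hypothetical names.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple

# One model per candidate letter: (kmer means, kmer standard deviations).
Model = Tuple[Sequence[float], Sequence[float]]


def log_likelihood(observed: Sequence[float], means: Sequence[float],
                   sds: Sequence[float]) -> float:
    """Sum of independent Gaussian log-densities, one term per kmer level."""
    if not (len(observed) == len(means) == len(sds)):
        raise ValueError("observed, means, and sds must have equal length")
    total = 0.0
    for x, mu, sd in zip(observed, means, sds):
        total += -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)
    return total


def basecall(observed: Sequence[float], models: Dict[str, Model],
             candidates: Iterable[str]) -> str:
    """Return the candidate letter with the highest likelihood of the signal."""
    return max(candidates, key=lambda c: log_likelihood(observed, *models[c]))
```

  • The `candidates` argument selects the comparison: passing `["P", "G"]` compares a xenonucleotide with a single standard base, passing P plus A, T, G, C compares it with all standard bases, and passing every letter in `models` compares it with the full supernumerary alphabet.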
  • XNA tailing and XNA ligation to enzymatically synthesize a new validation library composed of contextually diverse sequences.
  • in this library, the nucleotide sequences adjacent to the XNA-containing heptamer can be further diversified, making them further removed in sequence space from those used to build the 4-nt kmer models.
  • This validation library can be built
  • each set of hairpins can contain 10 unique sequences.
  • the 20 bp at the 3′-end of each hairpin can be designed by randomly selecting standard bases from a uniform probability distribution.
  • Individual hairpin sets can be tailed with XNA bases using XNA tailing.
  • Two sets of hairpins with complementary tails can be ligated, producing a library of 100 possible sequences (10 x 10), with each sequence containing a single XNA base pair. These ligated hairpin libraries can be pooled together and sequenced for benchmarking (FIGs 4B-4C).
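  • The combinatorics of the validation library (two sets of 10 hairpins giving 10 x 10 = 100 ligated sequences) can be sketched as below. Only the uniformly random 20-base region is modeled; the rest of each hairpin is omitted, the tails P and Z are an example pair chosen for illustration, and the function names are hypothetical.

```python
import random
from itertools import product
from typing import List, Tuple


def random_hairpin_set(n: int, length: int, rng: random.Random) -> List[str]:
    """Draw n unique sequences with bases sampled uniformly from A, T, G, C."""
    seqs = set()
    while len(seqs) < n:
        seqs.add("".join(rng.choice("ATGC") for _ in range(length)))
    return sorted(seqs)


def ligated_library(set_a: List[str], tail_a: str,
                    set_b: List[str], tail_b: str) -> List[Tuple[str, str, str, str]]:
    """Pair every tailed hairpin of set A with every tailed hairpin of set B."""
    return [(a, tail_a, tail_b, b) for a, b in product(set_a, set_b)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    left = random_hairpin_set(10, 20, rng)
    right = random_hairpin_set(10, 20, rng)
    print(len(ligated_library(left, "P", right, "Z")))
```

  • Each entry records the two flanking 20-base regions together with the single non-standard base pair that joins them, so the library size is simply the product of the two set sizes.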
  • the elementary tailing and ligation synthesis steps can be coupled with an additional Golden Gate ligation to generate two proof-of-concept 12-letter supernumerary dsDNA hairpins: Scuper-12 and Snuper-12 (FIGs 6UA-6UB, Tables 7, 12, and 13).
  • exonucleases can be added to remove intermediary DNA products, generating the desired 244 bp 12-letter dsDNA product.
  • basecalling can be performed in two different ways: 1) by comparing the XNA base at a position against a model that contains all 12 possible nucleobases, and 2) by comparing the XNA base at a position against a model that contains the XNA and the most similar standard nucleobase. Even when all 12 letters are present in the model, the presently disclosed basecalling model is able to properly decode XNAs in Scuper-12 with 39-89% per-read recall (FIG. 5, Tables 25, 26). In an example experiment, for the Snuper-12 sequence, all but one XNA were properly decoded in the 12-letter model, with the exception being Kn (per-read recall of 14%).
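  • Per-read recall, as used above, can be read as the fraction of reads covering a known XNA position that are called as that XNA. A minimal sketch under that interpretation follows; the function name and the example calls are hypothetical and do not reproduce the reported data.

```python
from typing import Sequence


def per_read_recall(true_base: str, calls: Sequence[str]) -> float:
    """Fraction of per-read basecalls at one position that match the true base."""
    if not calls:
        raise ValueError("at least one read is required")
    return sum(1 for call in calls if call == true_base) / len(calls)
```

  • For instance, if 14 of 100 reads at a Kn position are called Kn and the rest are called as other letters, the per-read recall is 0.14.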
  • a general strategy is described for incorporating up to four additional orthogonal base pairs into standard DNA, and these methods can be used to build openly accessible models for sequencing XNAs (B, Sn, Sc, P, Z, Xt, Kn, J, V) in a standard DNA context (A, T, G, C) on commercial nanopore devices.
  • the enzymatic synthesis strategy developed utilizes unmodified 2′-deoxy-xenonucleoside triphosphates as the elementary building blocks, avoiding the use of phosphoramidites or caged-triphosphates.
  • Nanopore sequencing of XNAs can be performed using a nanopore sequencing device. This significantly expands the accessibility of sequencing XNAs. As history in sequencing progress has shown, additional widespread adoption and collection of XNA nanopore sequencing data can help further catalyze the improvement of sequencing models with newer basecalling algorithms, including data- intensive deep learning models. As these methods improve and adoption widens, strategies for synthesis and sequencing of higher complexity nucleic acids are possible.
  • an additional base pair enables site-specific incorporation of chemically modified groups, including the addition of nucleobases such as Z that can act as a Brønsted base.
  • Adenosine triphosphate sodium salt (ATP; A6419-5G), acetonitrile (A955-4; LC/MS-grade), formic acid (A118P-500), ammonium acetate (A637-500), ammonium carbonate (207861-25G), Tris base (10708976001), 5 M betaine solution (B0300-1VL), 6 N hydrochloric acid (1430071000), GelGreen (SCT124), and sodium chloride (S3014-5KG) were purchased from Sigma-Aldrich (St. Louis, MO).
  • AMPure XP beads (A63880) were purchased from Beckman Coulter (Brea, CA).
  • T4 DNA ligase and high-concentration T4 DNA ligase (M0202M, M0202L), T7 DNA ligase (M0318L), T3 DNA ligase (M0317S), yeast inorganic pyrophosphatase (YiPP; M2403L), thermolabile proteinase K (P8111S), Exo III (M0206L), thermolabile Exo I (M0568L), Exo I (M0293L), Exo VII (M0379L), Exo VIII (truncated; M0545S), Klenow Fragment (exo-; M0212L), Taq polymerase (M0267L), Bsu polymerase (M0330S), Deep Vent (exo-) polymerase (M0259S), Bst polymerase (M0275S), Sulfolobus DNA polymerase IV (M0327S), Therminator polymerase (M0261L), NEBNext® Ultra™ II End Repair
  • Xenonucleoside triphosphates dS c TP, dPTP, dZTP, dBTP (dSTP-401S, dPTP-201, dZTP-101, dBTP-301P) were purchased from FireBird Biomolecular Sciences LLC (Alachua, FL).
  • Xenonucleoside triphosphate dS n TP (M-1015) was purchased from TriLink
  • the eluted oligo was then folded in 100 mM NaCl and 10 mM Tris-HCl (pH 8.2) buffer by incubating at 90 °C for 3 minutes, then cooling at 0.1 °C/s until reaching 20 °C. 15 µL of this refolded oligo was incubated with 0.17 mM dNTP or dxNTP, 300 units of Exo III and either KF (exo-) with rCutSmart™ buffer or Therminator with ThermoPol® buffer for 16 h. For reactions using KF, the reaction was incubated with 15 units of KF at 37 °C.
  • oligos are first refolded by incubating 40 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes then cooling at 0.1 °C/s until reaching 20 °C.
  • the refolded oligos are then tailed by incubating 23.8 µM of oligo in the presence of dNTP or dxNTP (1.19 mM or 2.38 mM), YiPP (0.005 U/µL; except for the dATP tailing reaction which did not contain YiPP), polymerase (0.71 U/µL Klenow Fragment (KF exo-), 0.29 U/µL Therminator polymerase, or 0.71 U/µL Taq polymerase), and polymerase buffer (either rCutSmart™ or ThermoPol buffer). Full conditions are tabulated in Table 8.
  • KF exo- reactions were terminated by heat inactivation at 72 °C for 20 min.
  • Therminator and Taq reactions were terminated by addition of 1X rCutSmart™ buffer and 0.005 U/µL of thermolabile proteinase K at 37 °C for 15 min, followed by heat inactivation at 72 °C for 20 min.
  • hairpins were refolded.
  • 19.8 µM of oligo was incubated with 1.8 U/µL of ScaI-HF at 37 °C for 2 h, followed by heat inactivation at 80 °C for 20 min.
  • oligos are first refolded by incubating 20 µM of oligo in a 100 mM NaCl, 10 mM Tris-HCl buffer (pH 8.2) at 90 °C for 3 minutes then cooling at 0.1 °C/s until reaching 20 °C.
  • the refolded oligos are then tailed by incubating 11.9 µM of oligo in the presence of dNTP or dxNTP (1.19 mM or 2.38 mM), YiPP (0.005 U/µL; except for the dATP tailing reaction which did not contain YiPP), polymerase (0.71 U/µL Klenow Fragment (KF exo-), 0.29 U/µL Therminator polymerase, or 0.71 U/µL Taq polymerase), and polymerase buffer (either rCutSmart™ or ThermoPol buffer).
  • Reactions were either incubated for 8 h at 37 °C (KF exo-); 1, 4, 8, or 16 h at 60 °C (Therminator); or 1 h at 60 °C (Taq). Following incubation, KF exo- reactions were terminated by heat inactivation at 72 °C for 20 min. Therminator and Taq reactions were terminated by addition of 0.005 U/µL of thermolabile proteinase K at 37 °C for 15 min, followed by heat inactivation at 72 °C for 20 min. Following either set of heat inactivation steps, hairpins were refolded.
  • Resulting hairpins contained a mixture of product (tailed hairpins) and unreacted starting material (3′-blunt end hairpins).
  • T4 DNA ligase was then used to screen reactions for remaining unreacted 3′-blunt ends by adding 80 U/µL of T4 DNA ligase alongside 1X T4 DNA ligase reaction buffer. These T4 ligation reactions were incubated at 16 °C for 2 h, after which T4 ligase was heat inactivated at 65 °C for 10 min.
  • a synthetic oligo hairpin with a 3′-G overhang (5′Phos-HP-3′G, Table 2) was used in the T4 ligation reaction.
  • the starting material (5′Phos-11HP) was used in the T4 ligation reaction. Reaction products were run on a 2% (w/v) agarose gel, stained with GelGreen,
  • Exonuclease reactions were heat inactivated by incubation at either 80 °C for 20 min (for reactions containing Exo I) or at 70 °C for 20 min (for reactions containing thermolabile Exo I). Reaction products were run on a 2% (w/v) agarose gel, stained with GelGreen, and visualized using a blue light transilluminator.
  • Consecutive insertion of XNA base pairs using MlyI type IIS restriction enzyme: 5′-phosphorylated hairpin oligos were purchased from IDT (5′Phos-11HP, 5′Phos-15HP, and 5′Phos-ScaI-HP; Table 2). 5′Phos-15HP contains an MlyI restriction site adjacent to the site of XNA ligation.
  • MlyI is a type IIS restriction enzyme (5′-GAGTCNNNNN^-3′, where ^ marks the cut position) that leaves a blunt end after cutting.
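  • As an illustration of the cut geometry described above, the following minimal Python sketch locates MlyI recognition sites on the top strand of a sequence and reports the blunt cut position five nucleotides downstream of GAGTC. The function name and the example sequence are hypothetical, and sites on the reverse strand are deliberately ignored in this sketch.

```python
def mlyi_cut_positions(seq: str) -> list[int]:
    """Return 0-based indices at which MlyI would cut the top strand.

    MlyI recognizes GAGTC and cuts after the fifth downstream base
    (GAGTCNNNNN^), leaving a blunt end. Only top-strand sites are
    searched; reverse-strand sites are out of scope for this sketch.
    """
    site, spacer = "GAGTC", 5
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cut = start + len(site) + spacer
        if cut <= len(seq):  # the site needs five downstream bases
            cuts.append(cut)
        start = seq.find(site, start + 1)
    return cuts


# Hypothetical example sequence with one site starting at index 2.
example = "AAGAGTCACGTATTGG"
cut = mlyi_cut_positions(example)[0]
left_fragment, right_fragment = example[:cut], example[cut:]
```

  • Returning indices rather than fragments keeps the sketch usable for sequences with several sites; slicing at each index yields the blunt-ended products.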
  • 5′Phos-15HP donor hairpin with MlyI site; abbreviated HPD
  • 5′Phos-11HP acceptor hairpin; abbreviated HPA
  • These two hairpins were then ligated and subsequently treated with exonuclease following the optimized conditions described in “XNA ligation conditions and reaction components.” This material was purified using Zymo’s DNA Clean and Concentrator and eluted in 30 µL of elution buffer.
  • the purified construct contains a single P:Z base pair insertion and was digested using 1.24 U/µL of MlyI and 1X rCutSmart™ buffer at 37 °C for 2 h then heat inactivated at 65 °C for 20 min. MlyI digestion results in a hairpin with a terminal P:Z,
  • 5′-phosphorylated oligo pools (purchased as oPools™ from Integrated DNA Technologies) were designed to form blunt-end hairpins with two barcodes: a 24 nt Triplet-barcode [NNN-BC] and an 8 nt pool-barcode [Pool-BC] (FIG. 3A, Tables 3-5).
  • the Triplet-barcode is linked to the NNN sequence at the 3′-blunt end of the hairpin, while the pool-barcode is used to decode which dxNTP/dNTP was tailed (Table 12).
  • Each Triplet-barcode maps 1:1 with a corresponding NNN sequence adjacent to an XNA base.
  • Ligation reactions for libraries generate combinations with two different pool barcodes. Restriction enzyme cut sites were included upstream of Triplet-barcodes to remove hairpins following ligation reactions and prepare DNA for nanopore sequencing. Full hairpin sequences in each library can be produced based on the present disclosure.
  • Val-20 validation library design: 5′-phosphorylated oligo pools (purchased as oPools™ from Integrated DNA Technologies) were designed to form blunt-ended hairpins with a variable 20 nt region at the end (Tables 3, 6).
  • variable 20 nt region was designed computationally by randomization with a uniform prior probability for each base.
  • Candidate sequences were passed through IDT oligo analyzer tool to remove sequences that might form secondary structures that could disrupt hairpin formation.
  • Each validation oligo pool contained 10 unique sequences (six total pools: Val_A-F; Table 6) and was synthesized at a scale of 50 pmol/oligo.
  • Two different validation oligo pools can be tailed with a dxNTP. Ligating two pools together (with complementary N+1 tails) results in a library with 100 possible sequences (10 x 10 combinations). Restriction enzyme cut sites were included upstream of these variable regions for nanopore library preparation following ligation.
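  • The validation library design above can be sketched in a few lines of Python: each variable region is drawn base by base from a uniform distribution, and pairing two pools of 10 sequences gives 100 possible ligation products. The function names and the fixed seed are illustrative assumptions, and the secondary-structure screening step performed with the IDT tool is not modeled here.

```python
import itertools
import random

BASES = "ATGC"


def random_region(length: int, rng: random.Random) -> str:
    """Draw each base independently with uniform probability."""
    return "".join(rng.choice(BASES) for _ in range(length))


def make_pool(n_seqs: int, length: int, rng: random.Random) -> list[str]:
    """Build a pool of unique random variable regions."""
    pool: set[str] = set()
    while len(pool) < n_seqs:
        pool.add(random_region(length, rng))
    return sorted(pool)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pool_a = make_pool(10, 20, rng)
pool_b = make_pool(10, 20, rng)

# Every pairing of one hairpin from each pool is a possible ligation product.
library = list(itertools.product(pool_a, pool_b))
```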
  • the assembled product contains two different restriction sites for hairpin removal, 5′-GATATC-3′ (EcoRV) and 5′-AGTACT-3′ (ScaI).
  • The asymmetric placement of restriction sites on the hairpins allows a single hairpin to be removed, generating a blunt end on the assembled product.
  • the resulting dsDNA contains a single 3′- and 5′-end.
  • Subsequent library preparation and sequencing of dsDNA results in reads where both sense and antisense strands, containing all 12 nucleobases, can be read in a single sequencing event (S c uper-12 and S n uper-12; FIG. 5, FIGs 6UA-6UB).
  • NNNNNNN library, validation library, and 12-letter DNA preparation by XNA tailing and XNA ligation: oligos or oligo pools were first refolded by incubating 20 µM of oligo pool in a 100 mM NaCl, 10 mM Tris-HCl (pH 8.2) buffer at 90 °C for 3 minutes then cooling at 0.1 °C/s until reaching 20 °C.
  • oligos or oligo pools were tailed with a corresponding dxNTP using tailing conditions listed in Table 8. Reactions tailed with KF exo- were heat inactivated, while those tailed with Therminator were inactivated by thermolabile proteinase K treatment. Following inactivation of polymerase, oligos were refolded. Tailed oligo or oligo pools with complementary 3′-ends were then ligated with either T4 DNA ligase, T3 DNA ligase, or T7 DNA ligase using ligation conditions listed in Table 10. As a negative control for tailing, the starting material 3′-blunt end oligo or oligo pool (e.g.
  • Purified NNN-oligo pools were then digested for 1 h at 37 °C using 1 U/µL of BbsI-HF and rCutSmart™ buffer, then purified again using AMPure XP with a 2:1 bead-to-sample ratio and eluted in 30 µL of nuclease-free water. Purified NNNNNNN library samples were then prepared for nanopore sequencing following the details in the Nanopore sample preparation section.
  • ligated validation oligo pool reactions were purified using AMPure XP with a 3:1 bead-to-sample ratio and eluted in 30 µL of elution buffer (10 mM Tris-HCl, pH 8.2), then combined to a final concentration of 0.2 µM/pool before enzymatic digestion for 1 h at 37 °C using 1 U/µL of BbsI-HF and 1X rCutSmart™ buffer.
  • Each ligated oligo set was then combined at a final equimolar concentration of 0.05 or 0.075 µM/oligo before proceeding to a Golden Gate ligation with the addition of 1 U/µL of BbsI-HF, 20 U/µL of T4 DNA ligase, 1X rCutSmart™ buffer, and 1X T4 DNA Ligase Reaction Buffer (FIG. 6UA).
  • the Golden Gate ligation included 60 cycles of 1) 37 °C for 5 min and 2) 16 °C for 5 min, finalized by a step at 37 °C for 10 min and a heat inactivation step at 65 °C for 20 min.
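  • For planning purposes, the Golden Gate program above can be written down as data and its total hold time computed. The sketch below is illustrative only: the step labels ("digest", "ligate") are an interpretive assumption not stated in the text, and ramp times between temperatures are ignored.

```python
# Thermocycler program from the text, as (label, temperature in °C, minutes).
# The labels are assumptions added for readability.
CYCLE = [("digest", 37, 5), ("ligate", 16, 5)]
N_CYCLES = 60
FINAL_STEPS = [("final digest", 37, 10), ("heat inactivation", 65, 20)]


def total_minutes(cycle, n_cycles, final_steps):
    """Sum hold times only; ramp times are not included."""
    per_cycle = sum(minutes for _, _, minutes in cycle)
    tail = sum(minutes for _, _, minutes in final_steps)
    return n_cycles * per_cycle + tail


program_minutes = total_minutes(CYCLE, N_CYCLES, FINAL_STEPS)
program_hours = program_minutes / 60
```

  • Under these assumptions the program holds for 60 × 10 + 30 = 630 min, or 10.5 h.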
  • the reaction was further digested to remove incomplete ligation products by the addition of 0.45 U/µL of BbsI-HF, 0.45 U/µL of thermolabile Exo I, 2.27 U/µL of Exo III, and 0.23 U/µL of Exo VIII (truncated), incubating at 37 °C for 1 h, followed by a heat inactivation step at 70 °C for 20 min.
  • This reaction was then purified using AMPure XP with a 1.8:1 bead-to-sample ratio and eluted in 30 µL of nuclease-free water.
  • the hairpin on either end of the complete, desired product was removed by splitting the reaction in half and adding 1X rCutSmart™ and 2.78 U/µL of either ScaI-HF or EcoRV-HF. These reactions were incubated at 37 °C for 1 h, followed by a heat inactivation step at 80 °C for 20 min. The split samples were then
  • Nanopore sample preparation and data acquisition: Nanopore sample preparation followed the standard Flongle or MinION Genomic DNA by Ligation protocol (available on the ONT community) using the SQK-LSK110 preparation kit with the following modifications.
  • the NEBNext FFPE Repair Mix was omitted to avoid potential XNA removal by repair enzymes.
  • the volume of the repair mix was replaced by nuclease-free water.
  • The AMPure XP bead-to-sample ratio was increased to 2:1 for the NNNNNNN library and 3:1 for the validation library.
  • Signal-to-sequence mapping uses the Tombo (github.com/nanoporetech/tombo, ONT) pipeline.
  • raw multi FAST5 files are split into single FAST5 using the ont-fast5-api (github.com/nanoporetech/ont_fast5_api, ONT) command multi_to_single_fast5.
  • Single FAST5 files are then basecalled using guppy (version 6.1.5+446c355, ONT) with the high accuracy configuration settings (dna_r9.4.1_450bps_hac.cfg).
  • FASTQ basecalls passing default guppy quality score settings are assigned to their corresponding single FAST5 files using the Tombo command tombo preprocess annotate_raw_with_fastqs.
  • Tombo uses a reference FASTA file that contains ground-truth sequences.
  • the reference FASTA file was generated programmatically by considering every possible combination of ligation product including mismatch homo-ligation (e.g. P1-A+P1-A, see Table 12), blunt-end ligations leading to a gap (e.g. P1-P2, P1-P1, P2-P2), or pyrophosphorolysis ligation products.
  • Full reference alignment files are deposited in the SRA (Table 31).
  • the ground truth XNA (B, S n , S c , P, Z, J, V, X t , K n ) base needs to be substituted for a canonical base (A, T, G, C) for processing in a FASTA format.
  • XNAs in reference sequences were replaced with the canonical bases that minimized observed variance in kmer levels, determined empirically (B → A; S n → A; S c → A; P → G; Z → C; X → A; K → G; J → C; V → G).
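  • The substitution step can be sketched as a lookup table applied to a tokenized reference. This is a minimal illustration, not the Xenomorph implementation: the token spellings ("Sn", "Sc", "Xt", "Kn") and the function name are assumptions, and the input is taken as a list of base symbols so that two-character XNA names stay unambiguous.

```python
# Substitutions listed in the text (XNA -> canonical placeholder).
# "Xt" and "Kn" correspond to the X and K entries of that list.
XNA_TO_CANONICAL = {
    "B": "A", "Sn": "A", "Sc": "A", "P": "G", "Z": "C",
    "Xt": "A", "Kn": "G", "J": "C", "V": "G",
}
CANONICAL = {"A", "T", "G", "C"}


def to_canonical(tokens):
    """Replace each XNA token with its canonical placeholder.

    Returns the FASTA-compatible string plus the (index, XNA) pairs,
    so the true base can be restored after alignment.
    """
    out, xna_sites = [], []
    for i, tok in enumerate(tokens):
        if tok in CANONICAL:
            out.append(tok)
        elif tok in XNA_TO_CANONICAL:
            out.append(XNA_TO_CANONICAL[tok])
            xna_sites.append((i, tok))
        else:
            raise ValueError(f"unknown base symbol: {tok!r}")
    return "".join(out), xna_sites


fasta_seq, sites = to_canonical(["A", "T", "P", "G", "Sn", "C"])
```

  • Keeping the XNA positions alongside the substituted string is a design choice of this sketch; it lets downstream steps look up which kmers need an XNA model.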
  • Substituted bases are in general agreement with observations from basecalling XNA-containing reads with guppy (FIGs 6OA-6OB and 6QA-6QI). Signal-to-sequence mapping then proceeds using Tombo resquiggle.
  • the Tombo resquiggle command uses mappy (minimap2 version 2.22-r1101 with ONT configuration) to first assign each single FAST5 read to a reference FASTA sequence based on the given FASTQ basecall. Following sequence assignment, Tombo uses dynamic programming for signal segmentation and proceeds to perform per-read signal normalization. As a general comment on the limitations of segmentation-based basecalling, Tombo is sensitive to the reference canonical base chosen for signal assignment.
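  • The preprocessing chain described above (split, basecall, annotate, resquiggle) can be summarized as an ordered list of command lines. The sketch below only builds the argument lists and does not run anything. The directory and file arguments are hypothetical placeholders, and the option names are taken from the public help of each tool as best understood, so they should be checked against the installed versions.

```python
def preprocessing_commands(multi_dir, single_dir, basecall_dir, fastq, reference):
    """Return argv lists for the four preprocessing steps, in order."""
    return [
        # 1) split multi-read FAST5 files into single-read FAST5 files
        ["multi_to_single_fast5",
         "--input_path", multi_dir, "--save_path", single_dir],
        # 2) basecall with the high-accuracy R9.4.1 configuration
        ["guppy_basecaller",
         "--input_path", single_dir, "--save_path", basecall_dir,
         "--config", "dna_r9.4.1_450bps_hac.cfg"],
        # 3) attach FASTQ basecalls to their single FAST5 files
        ["tombo", "preprocess", "annotate_raw_with_fastqs",
         "--fast5-basedir", single_dir, "--fastq-filenames", fastq],
        # 4) signal-to-sequence mapping against the substituted reference
        ["tombo", "resquiggle", single_dir, reference],
    ]


commands = preprocessing_commands(
    "fast5_multi", "fast5_single", "basecalls", "reads.fastq", "reference.fa"
)
```

  • Each list can be passed to subprocess.run once the tools are installed; the source wraps the equivalent steps in a single xenomorph preprocess command.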
  • the per-read, median normalized level signal for each base is then extracted using the Tombo resquiggle results through the Tombo Python API. Details regarding how Tombo performs mapping, matching, and normalization, along with the Tombo Python API usage, can be found in the Tombo documentation (nanoporetech.github.io/tombo/).
  • the resulting preprocessed and normalized signal-extracted data is exported to a CSV file for downstream processing (Tables 17, 18).
  • the entire data preprocessing steps, including command groups and parameter settings, are wrapped into a single command (xenomorph preprocess) and available on the Xenomorph repository.
  • XNA kmer model parameterization: NNNNNNN libraries for a given XNA base pair are prepared as previously described in “NNNNNNN library, validation library, and 12-letter DNA preparation by XNA tailing and XNA ligation” and sequenced
  • Signal-to-sequence mapping is then performed using the previously described pipeline in “Raw nanopore data preprocessing and signal-to-sequence mapping” with the following specifications. Reads that do not fully map with full coverage of triplet-barcodes and pool-barcodes of the XNA position are filtered out. Likewise, reads with a q-score < 9 and signal match score > 3 are not used in the model building. Signal-to-sequence mapping is also carried out with blunt-end ligation products (i.e. NNNNNN, or no XNA insertion), such that sequences that map better to blunt-end ligation products are not used.
  • the 4-nt kmer was chosen in this disclosure as a proof of concept since reasonable kmer coverage could be obtained for the full NNNNNNN library (512 kmers per XNA base pair insertion) in a single Flongle flow cell run.
  • each kmer consists of four nucleotide bases centered around the 0th position nucleotide, as exemplified in Table 16. Therefore, each heptamer sequence (NNNNNNN) is composed of four 4-nt kmers (i.e. +2 pos NNNN, +1 pos NNNN, 0 pos NNNN, -1 pos NNNN).
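  • The kmer bookkeeping above can be checked with a short Python sketch: a 7-base window around an XNA yields four overlapping 4-nt kmers that all contain the central base, and enumerating every 4-nt kmer that carries one XNA base gives 256 kmers per XNA, or 512 per base pair. The function names are hypothetical, single-letter symbols are assumed for the XNA bases, and the mapping of each window to the position labels of Table 16 is not reproduced.

```python
from itertools import product


def heptamer_kmers(heptamer, k=4):
    """Return every k-length window of a 7-base sequence, 5' to 3'."""
    if len(heptamer) != 7:
        raise ValueError("expected a 7-base sequence")
    return [heptamer[i:i + k] for i in range(len(heptamer) - k + 1)]


def xna_kmers(xna, k=4, alphabet="ATGC"):
    """Enumerate k-mers with one XNA base and canonical bases elsewhere."""
    kmers = set()
    for pos in range(k):
        for flank in product(alphabet, repeat=k - 1):
            bases = list(flank)
            bases.insert(pos, xna)
            kmers.add("".join(bases))
    return kmers


windows = heptamer_kmers("ACGPTGA")           # four windows, each containing P
pair_kmers = xna_kmers("P") | xna_kmers("Z")  # kmers for one XNA base pair
```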
  • Observed kmer levels are modeled as normal distributions parameterized with a mean ($\mu_k$) and standard deviation ($\sigma_k$). These parameters are used to describe observed kmer signal level probability density functions: $$P(x \mid k) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)$$ where $P(x \mid k)$ is the probability that the observed level $x$ came from kmer $k$; $\mu_k$ is the normalized kmer level mean for kmer $k$; $\sigma_k$ is the standard deviation of median normalized kmer levels for kmer $k$; and $x$ is the observed median normalized kmer level.
  • level model means were approximated using the following kmer-specific bandwidth selection (Silverman's rule of thumb): $$BW_k = 0.9 \cdot \min\left(\sigma_k, \frac{IQR_k}{1.34}\right) \cdot n_k^{-1/5}$$ where $IQR_k$ is the interquartile range of kmer levels for kmer $k$; $\sigma_k$ is the standard deviation of median normalized kmer levels for kmer $k$; $n_k$ is the number of observations (measurements) of kmer $k$; and $BW_k$ is the bandwidth used for the kernel density estimate. [0153] For practical purposes detailed in the Tombo documentation (github.com/nanoporetech/tombo), one can set a global standard deviation taken as the average observed standard deviation across all kmers in the model (i.e.
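  • A minimal sketch of the bandwidth rule, using only the Python standard library, is given below. It computes Silverman's rule-of-thumb bandwidth for one kmer's levels and then evaluates a Gaussian kernel density estimate on a grid. Taking the location of the density maximum as the level estimate is an assumption of this sketch, as are the function names, the grid resolution, and the example levels.

```python
import math
import statistics


def silverman_bandwidth(levels):
    """0.9 * min(sigma, IQR / 1.34) * n ** (-1/5) for one kmer's levels."""
    n = len(levels)
    sigma = statistics.stdev(levels)
    q1, _, q3 = statistics.quantiles(levels, n=4)
    return 0.9 * min(sigma, (q3 - q1) / 1.34) * n ** (-0.2)


def kde_peak(levels, bandwidth, grid_points=201):
    """Grid location where a Gaussian KDE of the levels is largest."""
    lo, hi = min(levels), max(levels)
    best_x, best_density = lo, -1.0
    for i in range(grid_points):
        x = lo + (hi - lo) * i / (grid_points - 1)
        density = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in levels)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


# Hypothetical normalized levels for one kmer, with a single outlier.
levels = [0.9, 1.0, 1.0, 1.1, 1.0, 0.95, 1.05, 3.0]
bw = silverman_bandwidth(levels)
peak = kde_peak(levels, bw)
```

  • In this example the outlier at 3.0 pulls the arithmetic mean upward, while the density peak stays near 1.0. A bandwidth of zero (for example when all levels are identical) would need separate handling.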
  • kmer models Documentation for model building and code used to generate kmer models can be found in the Xenomorph repository (github.com/xenobiolab/xenomorph). For quality control, the entire experimental and computational procedure, from building libraries to generating 4-nt kmer models, was performed in duplicate. Models were built from data collected in a single run. The
  • For each heptamer sequence (NNNNNNN), a set of mapping kmer sequences (NNNN, NNNN, NNNN, NNNN) and observed signal levels (I NNNN , I NNNN , I NNNN , I NNNN ), denoted ($x_1$, $x_2$, $x_3$, $x_4$), are extracted. See Table 16 for additional information on numbering nomenclature of kmer sequences within a heptamer region.
  • the kmer probability density function described previously in “XNA kmer model parameterization” is used to estimate the probability that each observed level (e.g., $x_1$) came from the corresponding kmer (e.g.
  • Log-likelihood ratio (LLR)
  • An LLR > 0 is used as the default criterion for deciding whether the XNA model is more likely than an alternative model for a given observed sequence of signals.
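  • The decision rule can be illustrated with a short sketch. Because the exact LLR expression is not fully legible in the text, the form used here is an assumption: the log of the normal density is summed over the four kmers of a heptamer under each model, with a shared global standard deviation, and the two sums are subtracted. All names and numeric values are hypothetical.

```python
import math


def log_pdf(x, mu, sigma):
    """Log of the normal density N(mu, sigma^2) evaluated at x."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)


def llr(levels, xna_means, alt_means, sigma):
    """Summed log-likelihood ratio over the kmers spanning one position.

    Positive values favor the XNA model, negative values the alternative.
    Kmer observations are treated as independent (an assumption).
    """
    total = 0.0
    for x, mu_xna, mu_alt in zip(levels, xna_means, alt_means):
        total += log_pdf(x, mu_xna, sigma) - log_pdf(x, mu_alt, sigma)
    return total


# Hypothetical observed levels and model means for four kmers.
observed = [0.10, -0.42, 0.88, 0.31]
xna_model = [0.12, -0.40, 0.90, 0.30]
alt_model = [0.50, -0.10, 0.60, 0.05]
score = llr(observed, xna_model, alt_model, sigma=0.25)
call_is_xna = score > 0
```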
  • ORLLR is a modified LLR test statistic that is nominally more robust towards outliers.
  • the ORLLR test statistic is defined in terms of the following quantities: the median normalized kmer level for each of the two kmers being compared, a scale difference term, and the global standard deviation of median normalized kmer levels [equation not recoverable from the source text].
  • Consensus recall and specificity perform sequence-level assignments in calculations (rather than per-read level). Specificity of kmer models was calculated by alternative hypothesis testing on sequences that did not contain any XNAs. The definition of each statistic is provided below.
  • $$\text{recall} = \frac{TP}{TP + FN}$$ where TP = true positive and FN = false negative.
  • $$\text{specificity} = 1 - FDR = 1 - \frac{FP}{FP + TN}$$ where FP = false positive, TN = true negative, and FDR = false discovery rate.
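  • As a worked instance of the two definitions above, the following sketch computes both statistics from raw counts. The counts are hypothetical and are not results from this disclosure.

```python
def recall(tp, fn):
    """Fraction of true XNA positions that were called as XNA."""
    return tp / (tp + fn)


def specificity(fp, tn):
    """One minus the fraction of XNA-free sequences called as XNA."""
    return 1 - fp / (fp + tn)


# Hypothetical counts for illustration only.
example_recall = recall(tp=89, fn=11)
example_specificity = specificity(fp=5, tn=95)
```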
  • Receiver operating characteristic: Receiver operating characteristic (ROC) curves were generated using the roc_curve function from the scikit-learn Python library.
  • contained XNA bases flanked by 20 randomly chosen canonical bases. Recall on the validation set was calculated at the per-read and consensus level as described previously in “Recall and specificity calculations.”
  • PCR amplification and basecalling of P:Z template DNA: Two complementary oligos containing P and Z (PCR_Template_P, PCR_Template_Z, Table 22) were synthesized by Firebird Biomolecular Sciences (Alachua, FL) and hybridized in a 1:1 molar ratio. 25 ng of this hybridized PZ DNA construct was used as the template for a PCR reaction.
  • PCR reactions contained 0.2 µM of each forward and reverse primer (PCR_Amp_F, PCR_Amp_R1-4, Table 22), 5 U/µL of Taq polymerase in 1X ThermoPol buffer (pH 8.0). Triphosphate concentrations for dxNTPs and dNTPs varied by condition (no dxNTP, limiting, equimolar, optimal) and are tabulated in FIGs 6TA-6TC. The PCR reaction then proceeded with thermocycler conditions tabulated in Table 23. PCR reactions were purified using Zymo DNA Clean and Concentrator and eluted in 30 µL of nuclease-free water.
  • the Xenomorph XNA sequencing pipeline: One of the goals of this disclosure was to build a publicly available end-to-end pipeline for validation of XNA incorporation in target sequences. As a proof of concept, one can create a tool in Python called “Xenomorph”, a pipeline consisting of two steps: 1) preprocessing (xenomorph preprocess) and 2) alternative hypothesis testing (xenomorph morph).
  • Xenomorph runs raw FAST5 data through the preprocessing pipeline with an additional FASTA handling modification that allows users to input reference sequences with XNA base pairs. Outputs for preprocessing steps are provided in a .csv file (see Table 17 for header description), which is used as an input for xenomorph morph.
  • Xenomorph uses the XNA base pairs found in the input reference sequence to perform LLR or ORLLR testing against user-defined alternatives. For example, for a sequence containing A, T, G, C, B, S n base pairs, users can calculate the most likely base at the XNA position against the most similar canonical base (e.g. B vs A), purines/pyrimidines, canonical bases (e.g. B vs A, T, G, C), or all bases (e.g. B vs A, T, G, C, S n ).
  • Alternative hypothesis testing can be performed on a per-read basis or a global basis.
  • XNA kmer models generated in this disclosure are built in and can be viewed using xenomorph models. Model compilation is performed ad hoc, allowing users to experiment with kmer models.
  • Outputs for alternative hypothesis testing are provided as a .csv file (see Table 18 for header description).
  • kmer models are inherently independent (i.e. signal observations of NNNBNNN are independent of NNNSNNN observations) and therefore modular.
  • Xenomorph was built to be flexible, allowing users to add more kmer models or modify them, and straightforward, requiring two commands to go from raw nanopore data to XNA-refined sequences.
  • A graphical overview of the preprocessing pipeline can be found in FIG. 6S.
  • Xenomorph can be found in the Xenomorph repository (github.com/xenobiolab/xenomorph) alongside all code, documentation, and parameters used in this disclosure.
  • Data for model building and basecalling can be downloaded from the SRA Bioproject PRJNA932328 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA932328]. Additional overview of how the Xenomorph pipeline performs XNA basecalling is found in Note 1. [0164] Data availability: Models measured in this disclosure and used for basecalling are provided in Data Table 1, and can also be found on the Xenomorph GitHub repository (github.com/xenobiolab/xenomorph/tree/main/models).
  • the raw nanopore sequences (FAST5) and guppy basecalls (FASTQ) used in this disclosure to build models, validate models, and test 12-letter DNA sequencing have been deposited in the Sequence Read Archive (SRA) under Bioproject PRJNA932328 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA932328] and can be accessed without restriction (Table 31).
  • Raw nanopore data for PZ PCR amplification experiments (FIGs 6TA-6TC) are available under restricted access, as this data was collected in a pooled nanopore run and contains additional data. Full sequences for hairpin libraries purchased for this work can be produced based on this disclosure. Additional source data can be produced based on this disclosure.
  • Code availability: Code for end-to-end processing of nanopore reads and basecalling xenonucleotides described in this example can be produced based on this disclosure.
  • Information for Example 1 Enzymatic Synthesis and Nanopore Sequencing of 12-letter Supernumerary DNA
  • Methods [0168] Organic synthesis of dX t TP: 8-(2′-Deoxy-β-D-erythro-pentofuranosyl)imidazo[1,2-a]-s-triazin-2,4-dione 5′-triphosphate.
  • xenonucleobases (B, S n , S c , P, Z, X t , K n , J, V) are integrated for selection.
  • the pipeline, as built, also allows users to generate their own models.
  • Basecalling can be performed either per-read or per-sequence (global). In per-read basecalling, individual reads are basecalled while in per-sequence, the signal of all reads that match a sequence are averaged before determining a global call.
  • the per-read consensus is defined as the most frequent basecall among all reads that match a certain sequence.
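  • The two calling modes can be contrasted in a few lines. In this sketch, which uses hypothetical function names and example values, the per-read consensus is the most frequent call among reads mapped to one sequence, while the per-sequence (global) mode first averages the signal of all matching reads and leaves the single call to a downstream test.

```python
from collections import Counter
from statistics import mean


def per_read_consensus(read_calls):
    """Most frequent basecall among the reads mapped to one sequence."""
    return Counter(read_calls).most_common(1)[0][0]


def per_sequence_level(read_levels):
    """Average the signal of all matching reads before one global call."""
    return mean(read_levels)


# Hypothetical per-read calls and normalized levels at one XNA position.
consensus = per_read_consensus(["P", "P", "G", "P", "G"])
global_level = per_sequence_level([0.25, 0.5, 0.75])
```

  • Ties between equally frequent calls are not resolved explicitly here; a real pipeline would need a stated tie-breaking rule.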
  • 4-nt kmer models are parameterized with a kmer mean ($\mu_k$) and a kmer variance ($\sigma_k$). Users have the choice of setting experimentally measured signal means, signal medians, or means from kernel density estimates as $\mu_k$. Options for $\sigma_k$ values are either the kmer-specific measured variance or a fixed global variance. The choice of bases to use in the model can also be specified. As described, basecalling in this disclosure uses signal means for $\mu_k$ and global average kmer variance for $\sigma_k$. [0207] Full code and documentation of Xenomorph is available on GitHub. Sample data, such as the FAST5 data generated in this disclosure, can be found in the SRA under Bioproject PRJNA932328 (Table 31). [0208] Note 2.
  • Each hairpin pool contains 10 unique sequences. Ligating two hairpin pools together generates a final library of 100 possible sequence combinations (10 x 10).
  • the table shows constant regions for all oligos in each pool (black), with regions in brackets (blue, bold) being replaced with their corresponding sequence elements from Tables 4-6. ‘-F’ and ‘-R’ are used to note forward and reverse sequences of different components after the hairpin is folded.
  • NNN denotes the 3 randomized bases at the end of the hairpins
  • [NNN-BC] i.e., Triplet-barcode
  • [Pool-BC] i.e., Pool-barcode
  • Regions highlighted in red denote restriction site sequence difference between HP_v1 and HP_v2, HP1 and HP2. All sequences are shown in the 5′ to 3′ direction.
  • Full hairpin sequences purchased for this disclosure can be produced based on this disclosure.
  • Triplet-barcodes sequences Sequences of the Triplet-barcodes and NNN sequences they are assigned to.
  • the Triplet-barcode is a 24 nt sequence that is distal to the 3′-NNN end in each hairpin and is used to assign the true identity of the 3′-NNN bases that flank XNA insertions (FIG. 3A).
  • N = A, T, G, or C; 64 NNN combinations
  • Barcode sequences were chosen from Oxford Nanopore Technologies list of barcodes for long-read sequencing.
  • Barcode sequences are shown in 5′ to 3′ direction.
  • the Triplet-barcode (abbreviated as [NNN-BC]) and NNN sequences used to construct HP_v1-NNN-[Pool-ID] and HP_v2-NNN-[Pool-ID] hairpin sequences, shown in Table 3, by insertion into [NNN-BC] and [NNN] regions, respectively.
  • Full sequences of all hairpins used for model generation can be produced based on this disclosure.
  • Validation pool sequences were randomly generated and intended to provide a sequence diversity (+/- 20 nt surrounding an XNA nt) much greater than what is present in the model training NNN-pools.
  • the smaller library size (100 sequences per ligated pool) and richer sequence diversity made it possible to multiplex all the validation sets while still obtaining sufficient coverage for calculating appropriate statistics.
  • Validation pool sequences are a subset of HP1-[VAL-ID] and HP2-[VAL-ID] hairpin sequences shown in Table 3. Sequences are shown in 5′ to 3′ direction. Full sequences of hairpins ordered, alongside ligation products generated, can be produced based on this disclosure.
  • Table shows barcodes for each oligo that links to the variable 3 nt sequence on the 3′-end and the xenonucleotide tailed on the 3′-end (bold), as well as restriction site sequences (red, bold). Sequences are shown in 5′ to 3′ direction.
  • Primer sequences are used to amplify the template: each condition used a different barcoded reverse primer (PCR_Amp_R1: Equimolar; PCR_Amp_R2: Optimal; PCR_Amp_R3: No dxNTP; PCR_Amp_R4: Limiting). All conditions used the same forward primer (PCR_Amp_F). Sequences are shown in 5′ to 3′ direction.
  • Table shows: (left) fraction of base called at each xenonucleotide position using the full 12-letter supernumerary model; (right) base called using the model with simplified priors, distinguishing the xenonucleotide at the position called from the most similar standard base called instead. Box highlights the base pair chosen by picking the most likely nucleobase among any purine or pyrimidine set, then fixing the complementary base. Base called: S c uper-12.
  • Table shows: (left) fraction of base called at each xenonucleotide position using the full 12-letter supernumerary model; (right) base called using the model with simplified priors, distinguishing the xenonucleotide at the position called from the most similar standard base called instead. Box highlights the base pair chosen by picking the most likely nucleobase among any purine or pyrimidine set, then fixing the complementary base. Base called: S n uper-12.
  • Table 27: Tabulation of per-read recall from simulated signal levels for the standard genetic code (A, T, G, C). Information regarding read simulation can be found in the Note section.
  • Table 28: Tabulation of per-read recall from simulated signal levels for the isoG/isoC code (A, T, G, C, B, S n ).
  • Embodiment 1. A method for generating an N+1 tailing product comprising a non-standard nucleotide that is covalently bound with a 3’ end of a precursor double-stranded DNA (dsDNA) template and is non-base-paired, the method comprising: combining the precursor dsDNA template with a DNA polymerase and a non-standard deoxyribonucleotide triphosphate (dNTP) under a reaction condition conducive to a blunt-end N+1 addition of the non-standard nucleotide to the 3’ end of the precursor dsDNA template by the DNA polymerase.
  • Embodiment 2. The method of Embodiment 1 or any other Embodiment, wherein the non-standard nucleotide is a xenonucleotide (XNA) and the non-standard dNTP is a deoxy-xeno-ribonucleotide triphosphate (dxNTP).
  • Embodiment 3. The method of Embodiment 1 or any other Embodiment, wherein the DNA polymerase comprises a polypeptide sequence of a small Klenow Fragment (KF exo-) of DNA Polymerase I.
  • Embodiment 4. The method of Embodiment 3 or any other Embodiment, wherein the polypeptide sequence comprises a sequence of SEQ ID NO:2.
  • Embodiment 5. The method of any of Embodiments 3-4 or any other Embodiment, wherein the non-standard
  • Embodiment 6. The method of Embodiment 1 or any other Embodiment, wherein the DNA polymerase comprises a polypeptide sequence of an engineered polymerase from a hyperthermophilic marine archaeon.
  • Embodiment 7. The method of Embodiment 6 or any other Embodiment, wherein the engineered polymerase is a variant of 9°N DNA polymerase.
  • Embodiment 9. The method of any of Embodiments 6-8 or any other Embodiment, wherein the non-standard nucleotide is selected from Sn, Sc, Z, Xt, Kn, J, and V, and the reaction condition proceeds at about 60°C for between about 4-16 hours and comprises about 0.29 U/µL of the DNA polymerase and about 1.19 mM of the non-standard dNTP.
  • Embodiment 10. A method for generating a base pair of two nucleotides of a polynucleotide, wherein at least one nucleotide of the two nucleotides is a non-standard nucleotide.
  • Embodiment 11. The method of Embodiment 10 or any other Embodiment, comprising the method of any of Embodiments 1-9 or any other Embodiment.
  • Embodiment 12. The method of any of Embodiments 10-11 or any other Embodiment, comprising: generating a second N+1 tailing product comprising a second non-standard nucleotide that is base-pair complementary with the non-standard nucleotide, wherein the second non-standard nucleotide is non-base-paired; and ligating the N+1 tailing product with the second N+1 tailing product to form a dsDNA ligation product that comprises a base pair between the non-standard nucleotide and the second non-standard nucleotide.
  • Embodiment 13 The method of any of Embodiments 10-12 or any other Embodiment, wherein the N+1 tailing product comprises a hairpin.
  • Embodiment 14 The method of any of Embodiments 10-13 or any other Embodiment, wherein the second N+1 tailing product comprises a hairpin.
  • Embodiment 15 The method of Embodiment 14 or any other Embodiment, wherein the dsDNA ligation product does not comprise a free 5’ end or a free 3’ end.
  • Embodiment 16 The method of any of Embodiments 12-15 or any other Embodiment, comprising: contacting the dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the dsDNA ligation product to generate a blunt-end DNA template that comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
  • Embodiment 17 The method of Embodiment 16 or any other Embodiment, wherein the method is performed a plurality of times for creation of a plurality of base pairs between a plurality of non-standard nucleotides and a plurality of second non-standard nucleotides as sequence elements of a further dsDNA ligation product.
  • Embodiment 18 The method of Embodiment 17, comprising: contacting the further dsDNA ligation product with a type IIS restriction enzyme under a reaction condition conducive for the type IIS restriction enzyme to cleave the further dsDNA ligation product to generate a further blunt-end DNA template that comprises the plurality of base pairs between the plurality of non-standard nucleotides and the plurality of second non-standard nucleotides.
  • Embodiment 20 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide is the nucleobase that is configured to hydrogen bond to the second base and the second base is a standard base or a non-standard base.
  • Embodiment 21 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide is the nucleobase that can base pair (without hydrogen bonding) to the second base and the second base is a standard base or a non-standard base.
  • Embodiment 22 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide comprises an epigenetic modification or is 4-methyl-cytosine, 5-methyl cytosine, 6-methyl adenosine, 5-hydroxymethyl cytosine, 7-methylguanosine, or N6-methyladenosine.
  • Embodiment 23 The method of Embodiment 19 or any other Embodiment, wherein the non-standard nucleotide comprises the chemical modification and the chemical modification comprises a fluorophore, a biotin, a terminal alkyne, an azide, a cyclooctyne, a tetrazine, a terminal alkene, a phosphine, a halo-alkane, an aldehyde, a thiol, a transition metal complex, another reactive handle, or any combination thereof.
  • Embodiment 24 A dsDNA ligation product produced by the method of any of Embodiments 12-23 or any other Embodiment.
  • Embodiment 25 A further dsDNA ligation product produced by the method of any of Embodiments 17-23 or any other Embodiment.
  • Embodiment 26 A blunt-end dsDNA template produced by the method of any of Embodiments 16-23 or any other Embodiment.
  • Embodiment 27 A further blunt-end dsDNA template produced by the method of any of Embodiments 18-23 or any other Embodiment.
  • Embodiment 28 A defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the dsDNA ligation product of Embodiment 24 or any other Embodiment or the blunt-end dsDNA template of Embodiment 26 or any other Embodiment, wherein the library polynucleotide sequence comprises the base pair between the non-standard nucleotide and the second non-standard nucleotide.
  • Embodiment 29 A defined non-standard nucleotide base pair library comprising a library polynucleotide sequence of the further dsDNA ligation product of Embodiment 25 or any other Embodiment or the further blunt-end dsDNA template of Embodiment 27 or any other Embodiment, wherein the library polynucleotide sequence
  • Embodiment 30 The defined non-standard nucleotide base pair library of any of Embodiments 28-29 or any other Embodiment, wherein the library polynucleotide sequence further comprises: a context barcode associated with a sequence context adjacent to a base pair of a non-standard nucleotide and a second non-standard nucleotide of the library polynucleotide sequence; and a pool barcode associated with the non-standard nucleotide, the second non-standard nucleotide, or both.
  • Embodiment 31 A method for generating a machine learning (ML) model that correlates one or more observed current reads with an unknown non-standard nucleotide for assignment of an identity to the unknown non-standard nucleotide, the method comprising: sequencing, with a nanopore sequencing method, the defined non-standard nucleotide base pair library of any of Embodiments 28-30 or any other Embodiment to produce the one or more observed current reads; and training, with a ML algorithm, the ML model to associate the one or more observed current reads with a known identity of a defined non-standard nucleotide of the defined non-standard nucleotide base pair library of any of Embodiments 28-30 or any other Embodiment, wherein the ML model is configured to assign the identity to the unknown non-standard nucleotide based on the known identity of the defined non-standard nucleotide.
  • Embodiment 32 The method of Embodiment 31 or any other Embodiment, wherein the ML model comprises a convolutional long short term memory recurrent neural network (LSTM RNN).
  • Embodiment 33 A non-transitory computer-readable storage medium having stored thereon at least part of a ML model produced by any of Embodiments 31-32 or any other Embodiment.
  • Embodiment 34 A computational device or computational system comprising the non-transitory computer-readable storage medium of Embodiment 33 or any other Embodiment.
  • Embodiment 35 Embodiment 35.
  • Embodiment 36 A method for basecalling a non-standard nucleotide expanded alphabet, the method comprising: sequencing, with a nanopore sequencing
  • Embodiment 37 A circuitry configured to perform all or part of the method of Embodiment 36 or any other Embodiment.
  • Embodiment 38 A circuitry configured to perform all or part of the method of Embodiment 36 or any other Embodiment.
  • Embodiment 39 A nanopore sequencing kit, device, or system comprising the circuitry of Embodiment 37 or any other Embodiment.
  • Embodiment 40 A nanopore sequencing kit, device, or system comprising the circuitry of Embodiment 38 or any other Embodiment.
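
As a worked illustration of the reaction condition recited in Embodiment 9 (about 0.29 U/µL of DNA polymerase and about 1.19 mM of non-standard dNTP), component volumes for a tailing reaction can be computed from the dilution relation C1·V1 = C2·V2. The following Python sketch is only an example: the function name, the 20 µL reaction volume, and the stock concentrations (2 U/µL enzyme, 10 mM dNTP) are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class TailingMix:
    enzyme_ul: float   # volume of polymerase stock to add
    dntp_ul: float     # volume of non-standard dNTP stock to add
    other_ul: float    # remaining volume (template, buffer, water)


def tailing_mix(total_ul: float,
                enzyme_stock_u_per_ul: float,
                dntp_stock_mm: float,
                enzyme_final_u_per_ul: float = 0.29,
                dntp_final_mm: float = 1.19) -> TailingMix:
    """Apply C1*V1 = C2*V2 to each component.

    Default final concentrations follow Embodiment 9; the stock
    concentrations are supplied by the caller and are assumptions.
    """
    enzyme_ul = enzyme_final_u_per_ul * total_ul / enzyme_stock_u_per_ul
    dntp_ul = dntp_final_mm * total_ul / dntp_stock_mm
    other_ul = total_ul - enzyme_ul - dntp_ul
    if other_ul < 0:
        raise ValueError("stocks are too dilute for this reaction volume")
    return TailingMix(enzyme_ul, dntp_ul, other_ul)


# Hypothetical 20 uL reaction with 2 U/uL enzyme and 10 mM dNTP stocks.
mix = tailing_mix(total_ul=20.0, enzyme_stock_u_per_ul=2.0, dntp_stock_mm=10.0)
print(mix)
```

The remaining volume is reported as a single number because the embodiment does not specify template or buffer amounts.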

Abstract

Systems and methods for generating defined base pairs in a sequence-defined library format that includes at least one non-standard nucleotide in a base pair. Non-standard base pairs can be created by using a nucleic acid polymerase (e.g., a DNA polymerase, an RNA polymerase, a terminal deoxynucleotidyl transferase) for blunt-end tailing addition of the non-standard base, which can then be ligated to another nucleotide end. Nucleotide sequences containing non-standard base pairs can be used to generate libraries for models used for basecalling the non-standard base with next-generation sequencing (NGS) platforms and for sequencing non-standard nucleotide sequences, including xenonucleotides (XNA).
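
The tail-then-ligate workflow summarized in the abstract can be illustrated with a toy string model. This is a sketch under stated assumptions, not the applicants' implementation: strands are plain 5'→3' strings, hairpins and the distal ends of each duplex are ignored, the function names are hypothetical, and J and V (both named among the non-standard nucleotides in the embodiments) are treated as a complementary pair purely for illustration.

```python
from typing import Tuple

# Assumption: J pairs with V; the standard pairs are A:T and G:C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G", "J": "V", "V": "J"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, also written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def tail(strand: str, nonstandard: str) -> str:
    """Model N+1 tailing: add one unpaired nucleotide to the 3' end."""
    if len(nonstandard) != 1 or nonstandard not in COMPLEMENT:
        raise ValueError("expected a single known nucleotide")
    return strand + nonstandard


def ligate(tailed_a: str, tailed_b: str) -> Tuple[str, str]:
    """Join two tailed duplexes through their complementary 3' tails.

    Each argument is the tailed strand of one duplex. Returns the
    (top, bottom) strands of the ligation product, both 5'->3'.
    """
    x, y = tailed_a[-1], tailed_b[-1]
    if COMPLEMENT.get(x) != y:
        raise ValueError("3' tails are not complementary")
    top = tailed_a + reverse_complement(tailed_b[:-1])
    bottom = tailed_b + reverse_complement(tailed_a[:-1])
    return top, bottom


top, bottom = ligate(tail("ACGT", "J"), tail("GGA", "V"))
print(top, bottom)
```

In this model the two tail nucleotides end up opposite each other in the product, which is the base pair between the non-standard nucleotide and the second non-standard nucleotide described in the embodiments.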
PCT/US2024/015068 2023-02-08 2024-02-08 Systems and methods for enzymatic synthesis of polynucleotides containing non-standard nucleotide base pairs WO2024168196A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202363483926P 2023-02-08 2023-02-08
US63/483,926 2023-02-08

Publications (1)

Publication Number Publication Date
WO2024168196A1 (fr) 2024-08-15

Family

ID=92263538

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2024/015068 WO2024168196A1 (fr) 2023-02-08 2024-02-08 Systems and methods for enzymatic synthesis of polynucleotides containing non-standard nucleotide base pairs

Country Status (1)

Country Link
WO (1) WO2024168196A1 (fr)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150133320A1 (en) * 1997-04-01 2015-05-14 Illumina, Inc. Method of nucleic acid amplification
WO2019081680A1 (fr) * 2017-10-25 2019-05-02 Institut Pasteur Immobilization of nucleic acids using an enzymatic histidine-tag mimic for diagnostic applications
US20200263218A1 (en) * 2017-10-04 2020-08-20 Centrillion Technology Holdings Corporation Method and system for enzymatic synthesis of oligonucleotides
US20200392572A1 (en) * 2017-12-21 2020-12-17 Curevac Ag Linear double stranded dna coupled to a single support or a tag and methods for producing said linear double stranded dna
US10934569B1 (en) * 2018-12-20 2021-03-02 Nicole A Leal Enzymatic processes for synthesizing RNA containing certain non-standard nucleotides
US20210171920A1 (en) * 2015-10-29 2021-06-10 Temple University-Of The Commonwealth System Of Higher Education Modification of 3' Terminal Ends of Nucleic Acids by DNA Polymerase Theta
US20210355519A1 (en) * 2020-05-15 2021-11-18 Codex Dna, Inc. Demand synthesis of polynucleotide sequences

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24754101

Country of ref document: EP

Kind code of ref document: A1