WO2019079802A1 - Methods of encoding and high-throughput decoding of information stored in dna - Google Patents

Methods of encoding and high-throughput decoding of information stored in DNA

Info

Publication number
WO2019079802A1
WO2019079802A1 PCT/US2018/056900 US2018056900W WO2019079802A1 WO 2019079802 A1 WO2019079802 A1 WO 2019079802A1 US 2018056900 W US2018056900 W US 2018056900W WO 2019079802 A1 WO2019079802 A1 WO 2019079802A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotide
nucleotides
digits
sequence
strands
Prior art date
Application number
PCT/US2018/056900
Other languages
French (fr)
Inventor
Henry Hung-yi LEE
Reza Kalhor
George M. Church
Original Assignee
President And Fellows Of Harvard College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by President And Fellows Of Harvard College
Publication of WO2019079802A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G11C13/0009 RRAM elements whose operation depends upon chemical change
    • G11C13/0014 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material
    • G11C13/0019 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material comprising bio-molecules

Definitions

  • the present invention relates in general to methods of using nucleotide transitions to encode information into a nucleotide sequence and high-throughput decoding of information stored in the nucleotide sequence.
  • DNA is a compelling data storage medium given its superior density, stability, energy-efficiency, and longevity compared to currently used electronic media (C. Bancroft, T. Bowler, B. Bloom, C. T. Clelland, Long-term storage of information in DNA. Science. 293, 1763-1765 (2001); V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, W. L. Hughes, Nucleic acid memory. Nat. Mater. 15, 366-370 (2016)). Recent studies have demonstrated that any digital data can be written in DNA, stored, and accurately read (G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337, 1628 (2012), N.
  • the present disclosure provides a method of decoding a nucleotide sequence, the nucleotide sequence encoding a value corresponding to a format of information.
  • the method includes determining the nucleotide sequence, identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and assigning a predetermined value to the identified transition or boundary or edge to create the value encoded in the nucleotide sequence corresponding to the format of information.
  • the nucleotide sequence encodes a series of values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of values encoded in the nucleotide sequence corresponding to the format of information.
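  • As a minimal sketch of the transition-based decoding described above, the following Python collapses each homopolymer run to a single nucleotide and then looks up a value for every transition between non-identical nucleotides. The lookup table `TRANSITION_TO_TRIT` and the function names are assumptions made for illustration; the actual assignment of values to transitions is a design choice that this text does not spell out.

```python
from itertools import groupby

BASES = "ACGT"

# Hypothetical lookup: (previous base, next base) -> trit.
# Illustrative only; not the mapping used in the disclosure.
TRANSITION_TO_TRIT = {
    (prev, nxt): i
    for prev in BASES
    for i, nxt in enumerate(b for b in BASES if b != prev)
}

def collapse_homopolymers(seq):
    """Reduce each run of identical nucleotides to a single nucleotide."""
    return "".join(base for base, _ in groupby(seq))

def decode_transitions(seq):
    """Assign a predetermined value to each transition between non-identical bases."""
    compressed = collapse_homopolymers(seq)
    return [TRANSITION_TO_TRIT[(a, b)] for a, b in zip(compressed, compressed[1:])]

print(decode_transitions("GGGAAACCT"))
```

Because only transitions carry values, the length of each homopolymer run does not affect the decoded series in this sketch.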
  • the value corresponding to the format of information can be obtained from analog, digital, optical, visible or non-visible wavelengths, chemical, or physical input sources.
  • the value is a digital value and the series of values are digital values.
  • the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more.
  • the format of information is selected from the group consisting of text, image, video or audio format, sensor data, and combinations thereof.
  • the different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides. In other embodiments, the different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine. In one embodiment, the nucleotide sequence includes at least one nucleotide homopolymer. In another embodiment, the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence.
  • the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence and wherein the transition between one nucleotide homopolymer to a different or nonidentical nucleotide homopolymer is a single transition or boundary or edge.
  • the series of digital values comprises two different digital values.
  • the series of digital values comprises three different digital values.
  • the series of digital values comprises more than three different digital values.
  • each digital value in the series of digital values represents two, three or more different digital values.
  • each nucleotide transition or boundary or edge is assigned a predetermined digital value.
  • the step of determining the nucleotide sequence is carried out by sequencing methods including nanopore sequencing, sequencing-by-synthesis, sequencing-by-ligation, and sequencing-by-hybridization. In one embodiment, the step of determining the nucleotide sequence is carried out by nucleotides modified with reversible terminators. In another embodiment, the step of determining the nucleotide sequence is carried out by detection of pyrophosphate or hydrogen ions generated during DNA polymerization of a complementary nucleotide strand. In one embodiment, the step of determining the nucleotide sequence is carried out by ligation of fluorescently modified single-stranded nucleotides with complementarity to the nucleotide sequence to be sequenced.
  • the series of digital values includes a corresponding barcode.
  • the method further includes decoding a plurality of nucleotide sequences, each member of the plurality encoding for an identical value corresponding to the format of information, wherein the nucleotide sequence is determined for each member of the plurality, and identifying a transition or boundary or edge between different nucleotides of each member of the plurality and assigning a predetermined value to each identified transition or boundary or edge to create the identical value corresponding to the format of information.
  • each member of the plurality of the nucleotide sequence encodes a series of identical values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of each member of the plurality of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of identical values encoded in each member of the plurality of the nucleotide sequence corresponding to the format of information.
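  • When several sequenced members encode the same value, one simple way to combine them is a majority vote over their transition sequences. The sketch below is an assumption made for illustration (the text does not prescribe a voting rule); `compress` and `consensus_compressed` are hypothetical names.

```python
from collections import Counter
from itertools import groupby

def compress(seq):
    """Collapse homopolymer runs so only transitions remain."""
    return "".join(base for base, _ in groupby(seq))

def consensus_compressed(reads):
    """Return the most frequent transition sequence and its share of the reads."""
    counts = Counter(compress(read) for read in reads)
    strand, count = counts.most_common(1)[0]
    return strand, count / len(reads)

reads = ["GGAACT", "GAAACCT", "GACT", "GAT"]
print(consensus_compressed(reads))
```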
  • the nucleotide sequence is attached to a substrate. In another embodiment, each member of the plurality of nucleotide sequence is attached to a substrate. In one embodiment, the series of digital values is a bit or trit stream and the nucleotide sequence corresponds to a bit or trit sequence within the bit or trit stream.
  • the series of digital values is a bit or trit stream and the bit or trit stream comprises a plurality of bit or trit sequences each having a corresponding barcode to indicate position within the bit or trit stream and with the plurality of bit or trit sequences having a corresponding plurality of nucleotide sequences, wherein each member of the plurality of nucleotide sequences is sequenced, and identifying a plurality of transitions or boundaries or edges between different nucleotides of each member of the plurality and assigning a predetermined bit or trit value to each transition or boundary or edge of the plurality of transitions or boundaries or edges to create the bit or trit sequences corresponding to each member of the plurality.
  • the present disclosure provides a method of decoding a nucleotide sequence encoding for a series of digital values corresponding to a format of information.
  • the method includes determining the nucleotide sequence to identify nucleotide homopolymers and for each homopolymer assigning one or more of the nucleotides based on a predetermined predicted homopolymer length of the nucleotide produced using enzymatic or chemical synthesis, and assigning a particular digital value for each of the one or more nucleotides.
  • the predicted homopolymer length is determined from empirical observation.
  • the predicted homopolymer length is a median, a mean, or a mode.
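  • A minimal sketch of homopolymer-length-based decoding, assuming a table of predicted extension lengths per nucleotide. The values in `PREDICTED_RUN_LENGTH` are invented placeholders; in practice they would come from empirical observation (for example a median), as described above.

```python
from itertools import groupby

# Hypothetical predicted homopolymer lengths per synthesis step (placeholders).
PREDICTED_RUN_LENGTH = {"A": 3.0, "C": 2.0, "G": 4.0, "T": 3.0}

def estimate_nucleotide_counts(seq, predicted=PREDICTED_RUN_LENGTH):
    """Return (base, estimated number of encoded nucleotides) for each homopolymer."""
    estimates = []
    for base, run in groupby(seq):
        run_length = sum(1 for _ in run)
        # Every observed run is taken to encode at least one nucleotide.
        count = max(1, round(run_length / predicted[base]))
        estimates.append((base, count))
    return estimates

print(estimate_nucleotide_counts("AAAAAACCGGGG"))
```

Each estimated nucleotide can then be assigned its digital value with whatever lookup the encoder used.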
  • the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof.
  • the nucleotides comprise natural nucleotides or nonnatural nucleotides.
  • the nucleotides comprise adenine, cytosine, guanine, and thymine.
  • the present disclosure provides a method of sequencing and decoding a plurality of nucleotide sequences representing a format of information wherein each nucleotide sequence encodes a portion of the format of information and wherein each portion of the format of information has more than two corresponding nucleotide sequences.
  • the method includes determining the sequences and decoded series of digital values for the sequences within a first portion of the plurality of nucleotide sequences, translating the series of digital values into the portions of the format of information, and sequencing and decoding in series additional portions into series of digital values and translating the series of digital values into the portions of the format of information until the entire format of information is achieved.
  • the present disclosure provides a method of encoding a series of digital values corresponding to a format of information into a nucleotide sequence.
  • the method includes for each digital value, assigning a corresponding nucleotide to different or nonidentical nucleotide transition to generate the nucleotide sequence, synthesizing the nucleotide sequence, and optionally storing the nucleotide sequence.
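  • The encoding step can be sketched as follows, assuming ternary digits and a hypothetical rule in which each trit selects one of the three nucleotides that differ from the previous one. The rule, the function name `encode_trits`, and the default starting base are illustrative assumptions, not the mapping of the disclosure.

```python
BASES = "ACGT"

def encode_trits(trits, initiator_last_base="G"):
    """Map each trit to a transition to one of the three non-identical bases."""
    sequence = []
    prev = initiator_last_base
    for trit in trits:
        candidates = [b for b in BASES if b != prev]  # never repeat the previous base
        prev = candidates[trit]
        sequence.append(prev)
    return "".join(sequence)

print(encode_trits([0, 0, 2]))
```

Because a nucleotide is never followed by itself, every digit appears as a detectable transition regardless of how many copies of each nucleotide are added during synthesis.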
  • the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more.
  • the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof.
  • nucleotides or different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides. In another embodiment, the nucleotides or different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine.
  • the present disclosure provides a method for high-throughput decoding of a format of information encoded in a plurality of nucleotide sequences.
  • the method includes providing a plurality of nucleotide sequences, the plurality of nucleotide sequences represents a packet of information, the packet comprises at least one unique identifier; sequencing at least one of the plurality of nucleotide sequences using a selective sequencer; storing the sequence and its unique identifier; and preventing, using the selective sequencer, redundant sequencing of the same nucleotide sequence.
  • the step of preventing comprises using the unique identifier to prevent sequencing of additional nucleotide sequence with the same identifier.
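  • The bookkeeping behind preventing redundant sequencing can be sketched in software as a set of identifiers already seen. This is only a simulation: the fixed-length identifier at the start of each read and the function name `selective_sequencing` are assumptions, and rejecting a read here merely stands in for whatever physical mechanism the selective sequencer uses.

```python
def selective_sequencing(read_stream, identifier_length=4):
    """Keep the first read for each identifier and reject later duplicates."""
    stored = {}    # identifier -> stored sequence
    rejected = 0
    for read in read_stream:
        identifier = read[:identifier_length]  # assumed to sit at the start of the read
        if identifier in stored:
            rejected += 1                      # stands in for refusing to sequence the strand
            continue
        stored[identifier] = read
    return stored, rejected

stored, rejected = selective_sequencing(["AAAAGCT", "CCCCTGA", "AAAAGCT", "AAAAGTT"])
print(len(stored), rejected)
```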
  • the selective sequencer is a nanopore sequencer or a sequencer compatible with sequencing-by-synthesis, sequencing-by-ligation and sequencing-by-hybridization methods.
  • the sequence is stored in computer memory.
  • the sequence is decoded into digital values.
  • the unique identifier is a synthetic sequence.
  • the unique identifier is located at the 3' end, the 5' end of the nucleotide sequence, or is interspersed within the nucleotide sequence.
  • the plurality of nucleotide sequences comprises a plurality of unique identifiers.
  • the method further includes sequencing a predetermined number of nucleotide sequences; assembling the packet of information; and analyzing the information to determine if the information is correctly decoded.
  • the method further includes permitting sequencing of any nucleotide sequences that were not correctly decoded.
  • the step of analyzing is performed using a decoding algorithm.
  • the present disclosure provides a method of encoding information using nucleotides.
  • the method includes converting a format of information into a sequence of binary ASCII bits, converting the sequence of binary ASCII bits into a sequence of ternary ASCII bits, converting the sequence of ternary ASCII bits into a corresponding oligonucleotide sequence such that one bit represents a transition between non-identical nucleotides, and synthesizing the corresponding oligonucleotide sequence by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nu
  • the present disclosure provides a method of decoding a format of information from a synthesized oligonucleotide sequence encoding bit sequences of the format of information.
  • the method includes amplifying the oligonucleotide sequence, sequencing the amplified oligonucleotide sequence, converting the oligonucleotide sequence to bit sequences wherein each bit represents a transition between non-identical nucleotides, and converting the bit sequences to the format of information.
  • the oligonucleotide sequence is ligated to a universal adaptor before amplification.
  • the present disclosure provides a method of storing information using nucleotides.
  • the method includes converting a format of information into a sequence of binary ASCII bits, converting the sequence of binary ASCII bits into a sequence of ternary ASCII bits, converting the sequence of ternary ASCII bits into a corresponding oligonucleotide sequence such that one bit represents a transition between non-identical nucleotides, synthesizing the corresponding oligonucleotide sequence by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleo
  • the nucleotide triphosphate comprises dATP, dTTP, dCTP, dGTP, and dUTP.
  • synthesis activity is modulated by the ratio of the amount of TdT to the amount of apyrase.
  • divalent cations comprise magnesium and cobalt.
  • the reaction mixture further comprises additives comprising glycerol, sucrose, PEG8000, betaine, DMSA, Triton X-100 and Tween20.
  • the 3' terminal nucleotide of the initiator oligonucleotide is preferably A, G or T.
  • a polyC tail is added to the end of the corresponding oligonucleotide sequence.
  • a washing step is included between steps (a) and (b).
  • an index is included in the oligonucleotide sequence to specify strand order.
  • the nucleotide sequence is synthesized by a template-independent DNA polymerase.
  • the template-independent DNA polymerase is terminal deoxynucleotidyl transferase (TdT).
  • TdT terminal deoxynucleotidyl transferase
  • the nucleotide sequence is synthesized by a mixture of a template-independent DNA polymerase and an apyrase.
  • the information is stored using a codec model.
  • the codec model is capable of correcting errors accumulated from synthesis, storage and sequencing.
  • the sequencing is streaming nanopore sequencing.
  • Fig. 1 depicts a schematic comparison of the number of steps required for a single coupling in enzymatic DNA synthesis vs phosphoramidite chemistry.
  • Figs. 2A - 2C depict results for optimizing and tuning the TdT:apyrase ratio.
  • Fig. 2A depicts initiator extension with dATP, dCTP, dGTP or dTTP by four different TdT to apyrase ratios.
  • TdT concentration is constant at 1 U/µL; apyrase concentration varies and is marked above each lane. mU is milliunits. Gels are 15% TBE-urea. "L" is ssDNA size marker.
  • Figs. 2B & 2C depict extension of an initiator with various concentrations of dCTP (Fig. 2B) or dGTP (Fig. 2C).
  • Apyrase:TdT ratio, as well as dNTP concentrations are marked above each lane.
  • Gels are 15% TBE-urea.
  • "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional cytosines (Fig. 2B) or guanines (Fig. 2C).
  • Figs. 3A - 3C depict effects of cobalt on TdT:apyrase performance.
  • Fig. 3A depicts an initiator extension with each dNTP by various ratios of TdT to apyrase in the presence of magnesium and the presence or absence of supplemental cobalt.
  • TdT concentration is constant at ⁇ / ⁇
  • apyrase concentration which varies, as well as presence or absence of cobalt are marked above each lane.
  • cobalt is at 250 µM.
  • Gels are 15% TBE-urea.
  • "L" is ssDNA size marker.
  • Fig. 3B depicts an initiator extension with 300 µM dATP in the presence of magnesium and increasing amounts of supplemental cobalt.
  • Cobalt concentrations are marked above each lane.
  • Gel is 15% TBE-urea.
  • "L" is ssDNA size marker.
  • Fig. 3C depicts an initiator extension with each dNTP by TdT:apyrase in magnesium-only or cobalt-only reactions.
  • dNTP concentration is marked above each lane.
  • Gel is 15% TBE-urea.
  • "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional nucleotides of the corresponding base, for example cytosines for the gel with cytosine extension.
  • Figs. 4A - 4C depict buffer and additives optimization for TdT:apyrase.
  • Fig. 4A depicts an initiator extension with dATP by TdT:apyrase with increasing concentration of Enzymatics Green Buffer. Final buffer concentration is marked above each lane. Gels are 15% TBE-urea. "L" is ssDNA size marker.
  • Fig. 4B depicts an initiator extension with a 500 µM mixture of all dNTPs by TdT:apyrase in the presence of various additives in different concentrations. Each lane is labelled with a number; the additive and its concentration in that lane are listed below the gels. Gels are 10% TBE-urea.
  • "L" is an RNA size marker.
  • Fig. 4C depicts an initiator extension with various dCTP concentrations by TdT:apyrase in the optimized buffer and the standard buffer. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional cytosines.
  • Fig. 5 depicts optimization of the polymerase to initiator ratio.
  • Initiator extension with dATP by TdT:apyrase with increasing concentration of TdT. Values above each lane mark the concentration of TdT in units per µL. Apyrase concentration is constant at 1 ⁇ / ⁇. Gel is 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 27 bases long.
  • Fig. 6 depicts TdT:apyrase performance and nucleotide concentration optimization for all sixteen possible combinations of 3' base of the initiator and the incoming nucleotide triphosphate (4 by 4). Each combination is evaluated on five lanes. The concentration of the relevant nucleotide is shown in µM on top of each lane. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 27 bases long.
  • Fig. 7 depicts multiple consecutive rounds of extension using the TdT:apyrase reagent. Two different series of transitions are shown. The nucleotide that is added is marked on top of each lane. All samples that are shown on each gel were aliquots of the same reaction that were sampled after the addition of each nucleotide. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 24 bases long.
  • Figs. 8A - 8C depict schematics for an enzymatic synthesis platform for DNA information storage.
  • Fig. 8A shows a schematic depiction of the synthesis reaction consisting of an oligonucleotide initiator, terminal deoxynucleotidyl transferase (TdT) and apyrase (AP).
  • TdT catalyzes the addition of nucleotides to the 3' end of the initiator, and apyrase degrades nucleotide triphosphates to terminate polymerization. Subsequent nucleotide triphosphates are added for further DNA synthesis. All synthesized strands share the same order of transitions between different nucleotides.
  • Fig. 8B depicts a schematic conversion between DNA and information. Synthesized DNA polymers are processed in silico by extracting transitions, which are then mapped to trits and bits.
  • Fig. 8C depicts the conversion between nucleotide transitions and trits used in this study.
  • Figs. 9A - 9D depict encoding "hello world!" in DNA using enzymatic synthesis.
  • Fig. 9A depicts an overview of the encoding scheme. Each character is represented by its own DNA strand containing a header index. To encode each character, its respective ASCII binary representation is converted to ternary, then to nucleotide transitions according to the mapping in Fig. 8C. DNA is synthesized using the enzymatic strategy disclosed herein, then sequenced as a pool using Illumina or Oxford Nanopore platforms.
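  • The first two conversions in this scheme (character to binary ASCII, then binary to ternary) can be sketched as below. The fixed width of five trits is an assumption that suffices for 7-bit ASCII (values below 3^5 = 243); the header index and the transition mapping of Fig. 8C are not reproduced, and the function names are hypothetical.

```python
def char_to_binary(ch):
    """8-bit binary ASCII representation of a single character."""
    return format(ord(ch), "08b")

def binary_to_ternary(bits, width=5):
    """Convert a binary string to a fixed-width list of ternary digits."""
    value = int(bits, 2)
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 3)
        digits.append(remainder)
    if value:
        raise ValueError("value does not fit in the requested number of trits")
    return digits[::-1]  # most significant trit first

bits = char_to_binary("h")
print(bits, binary_to_ternary(bits))
```

Each trit would then be written as one transition between non-identical nucleotides.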
  • Fig. 9B depicts strand fidelity of each strand by Illumina and Oxford sequencing platforms.
  • Fig. 9C depicts streams of nanopore sequencing data. Each read is represented as a light gray dot. Reads passing the correct number of transitions (dark gray) and those with correct transitions (black) are marked. For each strand, the vertical line marks the time where the correct data can be decoded with a 99.9% confidence from the collected sequences.
  • Fig. 9D depicts data reconstruction using streaming nanopore sequencing compared to batch sequencing-by-synthesis (SBS). For each platform, the point of time at which the entire message can be decoded is marked by a box and an arrow.
  • SBS batch sequencing-by-synthesis
  • Fig. 10 depicts profiling accuracy of each "hello world!" strand at every position. Illumina sequencing output was subjected to run-length encoding.
  • the black line indicates the percentage of reads that contained a nucleotide.
  • the bars indicate percentage of all reads that had a deletion, mismatch, or insertion at each position. As the frequencies of deletions and insertions are small, their bars are not visible in most positions.
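  • Run-length encoding of a read, as mentioned for the Illumina output, can be sketched with the Python standard library. The function name and output format (a list of base and run-length pairs) are illustrative choices, not taken from the source.

```python
from itertools import groupby

def run_length_encode(seq):
    """Return (base, run length) pairs for consecutive identical nucleotides."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(seq)]

print(run_length_encode("GGGAAC"))
```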
  • Fig. 11 depicts the length distribution for each of the twelve synthesized strands. Lengths of all reads are denoted by the black line. Lengths of perfect reads are denoted by the gray shading. As perfect reads are longer, on average, size selection will increase the yield of correctly synthesized strands.
  • Figs. 12A - 12B depict the evaluation of 5-Bromo-dCTP and natural dCTP for TdT:apyrase.
  • 5-Bromo-dCTP as a substitute for natural dCTP is evaluated.
  • "L" is ssDNA size marker and includes the initiator oligonucleotide which is 27 bases long and ends in three cytosines.
  • Fig. 12A depicts that the extension lengths were evaluated over indicated concentration of natural dCTP.
  • Fig. 12B depicts that the extension lengths were evaluated over indicated concentration of 5-Bromo-dCTP (5Br-dCTP).
  • Figs. 13A - 13C depict an enzymatic synthesis strategy for storing information in DNA.
  • Fig. 13A shows a schematic depiction of a series of enzymatic synthesis reactions consisting of an oligonucleotide initiator (N, gray), terminal deoxynucleotidyl transferase (TdT) and apyrase (AP).
  • the initiator is tethered to a solid support.
  • TdT catalyzes the addition of a given nucleotide triphosphate to the 3' end of all initiators while apyrase degrades the added substrate to limit net polymerization.
  • Fig. 13B depicts the DNA strands synthesized for each of eight consecutive synthesis cycles, as shown on a 15% TBE-urea gel. The initiators were not tethered to a solid support and no wash was performed between cycles. The first lane is a single-stranded DNA size marker which includes the 24-nucleotide-long initiator oligonucleotide.
  • Fig. 13C depicts a schema for
  • Raw strands represent enzymatically-synthesized DNA.
  • Compressed strands represent sequences of non-identical nucleotides. Transitions between nucleotides, starting with the last nucleotide of the initiator
  • strands is equivalent to the template sequence, all desired transitions are present and the information stored in DNA is retrieved.
  • Figs. 14A - 14H depict the demonstration of information storage in DNA using enzymatic synthesis.
  • Fig. 14A depicts that the message "hello world!" was encoded in twelve template sequences, H01-H12, each representing one character. Transitions between nucleotides start with the last base of the initiator, which is labeled 'g'. A header index (shaded gray) denotes strand order. Only results from the first five transition sequences are shown (see Fig. 15).
  • To encode each character, its respective ASCII decimal value, prefixed with an address, is represented in base 2 (binary) or in base 3 (ternary) (see Table 1) and mapped to transitions (see Fig. 13C), resulting in template sequences with nucleotides to be synthesized (capitalized).
  • Fig. 14B depicts the extension lengths for each base from (A). Only
  • Fig. 14C depicts the distribution of extension lengths for each nucleotide transition, combined across all positions from all perfect strands.
  • Fig. 14D depicts the stepwise increases in raw strand length with an increasing compressed strand length for all synthesized strands of H01-H12.
  • Fig. 14E depicts the distribution of all raw strand lengths. Distributions are derived via kernel density estimation for all synthesized strands ('all', gray shading) and a subpopulation of strands that contain all desired transitions ('perfect', dotted line).
  • Fig. 14F depicts the bulk error analysis for all synthesized strands of H01-H12.
  • compressed strands were aligned, by Needleman-Wunsch, to their respective template sequences, and the number of mismatches, insertions, and missing nucleotides were tabulated.
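  • A minimal sketch of the alignment-based error tabulation, using a textbook Needleman-Wunsch global alignment followed by a traceback that counts mismatches, insertions, and missing nucleotides. The scoring values (match +1, mismatch -1, gap -1) and the tie-breaking order in the traceback are assumptions; the source does not state the parameters used.

```python
def needleman_wunsch_errors(strand, template, match=1, mismatch=-1, gap=-1):
    """Globally align strand to template and count each error type."""
    n, m = len(strand), len(template)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = strand[i - 1] == template[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    errors = {"mismatches": 0, "insertions": 0, "missing": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = strand[i - 1] == template[j - 1]
            if score[i][j] == score[i - 1][j - 1] + (match if same else mismatch):
                if not same:
                    errors["mismatches"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            errors["insertions"] += 1  # extra nucleotide present in the strand
            i -= 1
        else:
            errors["missing"] += 1     # template nucleotide absent from the strand
            j -= 1
    return errors

print(needleman_wunsch_errors("GCT", "GACT"))
```

When several alignments share the best score, the counts depend on the tie-breaking order, which is one reason ambiguous alignments can arise for strands with missing nucleotides.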
  • Fig. 14G depicts the information retrieval with in silico filtering. Fraction of perfect strands are shown before
  • Fig. 14H depicts the information retrieval by different sequencing platforms. Streaming nanopore sequencing (Oxford) was compared to batch sequencing-by-synthesis (Illumina). Each dot indicates the fraction of the sequencing run at which each strand is robustly retrieved (100% correct with 99.99% probability). The arrow denotes the fraction of the sequencing run at which all data is robustly retrieved using each platform.
  • Fig. 15 depicts the extension lengths for perfect strands of H01-H12. Extension lengths for each nucleotide from perfect strands are displayed as a letter-value plot for each template sequence.
  • Fig. 16 depicts the raw lengths for all and perfect strands of H01-H12. All synthesized strands of H01-H12 were sequenced with Illumina. Length distributions for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands are shown. Distributions are derived via kernel density estimation. The number of all strands to perfect strands for each template sequence are as follows: H01 {all: 399363, perfect: 42337}; H02 {all: 431770, perfect: 62243}; H03 {all: 611804, perfect: 89302}; H04
  • Fig. 17 depicts the synthesis error analysis for all strands of H01-H12. All synthesized raw strands were sequenced with Illumina and transitions of non-identical nucleotides were extracted to form compressed strands. Each of these strands is aligned, by Needleman-Wunsch, to its respective template sequence. For each alignment, the number of mismatches, insertions, and missing nucleotides are tabulated.
  • Figs. 18A - 18B depict the nanopore sequencing and decoding of H01-H12.
  • Nanopore sequencing (Oxford) of synthesized raw strands. For each raw strand, the sequence of non-identical nucleotides is extracted to form a compressed strand. The fraction of perfect compressed strands is plotted out of the set of all compressed strands (filled triangles) or out of the set of the top 3 most abundant compressed strands (open triangles). Compressed strands can be filtered based on the design of the template sequence (Methods).
  • Figs. 19A - 19E depict the coded strand architecture for sequence reconstruction.
  • Fig. 19A depicts a DNA information storage channel. Data is converted to template sequences, synthesized (raw strands), and can be stored in vitro. Retrieval starts with sequencing, then
  • Fig. 19B depicts the coded strand architecture, 'scaffold', enables
  • Fig. 19C depicts a 16-base transition sequence, E0, that is synthesized and sequenced with Illumina. Examples of diverse strands produced by synthesis of E0.
  • Strands are aligned, by Needleman-Wunsch, to the template.
  • Ambiguous alignments can exist depending on the location and number of missing nucleotides within a strand.
  • Fig. 19D depicts the error analysis for purified strands of E0. Synthesized strands were purified in silico, by filtering for raw strands between 32-48 bases in length, and aligned by Needleman-Wunsch to the E0 template. For each alignment, the number of mismatches, insertions, and missing nucleotides were tabulated.
  • Fig. 19E depicts evaluating the diversity of synthesized
  • the number of sequencing reads for each length of strand was tabulated. Diversity was evaluated as the number of unique variants at each length of compressed strand, and the Levenshtein edit distance was computed with respect to the E0 template.
  • the set of 802 purified strands contains 2 perfect strands.
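  • The two diversity measures named above (unique variants per length and Levenshtein edit distance to the template) can be computed as in the following sketch. The function names and the toy strands are hypothetical and serve only to illustrate the calculation.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def diversity_by_length(strands):
    """Number of unique strand variants observed at each strand length."""
    variants = defaultdict(set)
    for strand in strands:
        variants[len(strand)].add(strand)
    return {length: len(seen) for length, seen in sorted(variants.items())}


strands = ["ACG", "ACG", "AGT", "AC"]            # toy data, not from the study
print(diversity_by_length(strands))
print([levenshtein(s, "ACGT") for s in strands])
```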
  • Figs. 20A - 20C depict the synthesis error analyses and diversity of all synthesized strands of E0. All synthesized raw strands of E0 were sequenced with Illumina and transitions of non-identical nucleotides were extracted to form compressed strands. Fig. 20A depicts the length distribution for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands. Distributions are derived via kernel density estimation. The numbers of all strands and perfect strands for the template sequence are as follows: E0 {all: 79192, perfect: 3}.
  • A sequence of non-identical nucleotides was extracted to form a compressed strand, which is then aligned, by Needleman-Wunsch, to its respective template sequence.
  • Fig. 20B depicts that for each alignment, the numbers of mismatches, insertions, and missing nucleotides in the compressed strand are tabulated.
  • Fig. 20C depicts the number of sequencing reads at each length (number of nucleotides of the compressed strand). Diversity is evaluated as the number of unique variants at each compressed-strand length and the Levenshtein edit
  • Strands were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants.
  • Figs. 21A - 21B depict the constraints for valid transitions between nucleotides. As physical processes, both chemical synthesis and enzymatic synthesis have constraints for valid transitions between nucleotides. A transition matrix with no self-transitions (Fig. 21A) and a transition matrix excluding specific transitions (Fig. 21B) are depicted. Based on whether certain transitions are permitted, there exists a fundamental limit for the maximum number of bits per nucleotide that is possible to store. This limit is equal to
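  • For sequences constrained in this way, a standard result from the theory of constrained channels gives the limit as the base-2 logarithm of the largest eigenvalue of the 0/1 matrix of permitted transitions. The sketch below applies that result using power iteration in plain Python. The restricted matrix (forbidding G to C in addition to self-transitions) is a hypothetical example, not the matrix of Fig. 21B, and the code assumes every nucleotide can eventually reach every other.

```python
import math


def capacity_bits_per_nucleotide(allowed, iterations=200):
    """log2 of the largest eigenvalue of a 0/1 transition matrix.

    allowed[i][j] is 1 if nucleotide i may be followed by nucleotide j.
    Assumes the transition graph is strongly connected.
    """
    n = len(allowed)
    # Iterate on (A + I): this avoids oscillation for periodic matrices, and
    # its largest eigenvalue is exactly one more than that of A.
    shifted = [[allowed[i][j] + (1 if i == j else 0) for j in range(n)]
               for i in range(n)]
    vec = [1.0] * n
    eig = 1.0
    for _ in range(iterations):
        nxt = [sum(shifted[i][j] * vec[j] for j in range(n)) for i in range(n)]
        eig = max(nxt)
        vec = [x / eig for x in nxt]
    return math.log2(eig - 1.0)


# No self-transitions among A, C, G, T: every row allows 3 successors.
no_self = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
print(round(capacity_bits_per_nucleotide(no_self), 3))  # 1.585

# Hypothetical extra constraint: also forbid G -> C (row 2, column 1).
restricted = [row[:] for row in no_self]
restricted[2][1] = 0
print(capacity_bits_per_nucleotide(restricted))
```

  • With no self-transitions the result is log2(3), about 1.58 bits per nucleotide; removing further transitions lowers the limit.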
  • Figs. 22A - 22B depict the placement and modulation of information into template sequences.
  • Fig. 22A depicts the placement of information within a template sequence for both experimental and simulated storage systems.
  • For the experimental systems, template sequences contained 8 or 16 nucleotides each.
  • For the simulated systems, template sequences contained 38, 74, or 152 nucleotides each.
  • Each nucleotide in a template sequence either stores 1 trit (blue), 1 bit (red), or is allocated for synchronization (orange).
  • Fig. 22B depicts a modulation scheme to map 16 bits to a sequence of 16 nucleotides. As an intermediate step, 16 bits are converted to a mixture of 8 trits and 4 bits using map M1 (Table 9).
  • Figs. 23A - 23B depict the Markov model for the production of DNA strands.
  • Fig. 23A depicts that a Markov model provides a statistical framework for the production of DNA strands created from a desired template sequence.
  • The k-th state of the Markov model specifies the process for writing the k-th nucleotide in the template sequence.
  • An example is provided for the template sequence (AGCT).
  • the Markov model contains states which include a deletion error
  • Fig. 23B depicts that in the event of synthesis of a strand nucleotide, either a correct write occurs with probability 1 - P_sub, or a write error (mismatch or substituted compressed-strand nucleotide) occurs with total probability P_sub. A specific substitution error occurs with a probability given by a function of (x, y), which mathematically represents the probability for substitutions of different compressed-strand nucleotides.
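  • A simplified simulation of this strand-production model is sketched below. It is an illustration rather than the model of Figs. 23A - 23B: each template nucleotide is independently deleted, substituted, or written correctly, with an optional random insertion beforehand. Drawing the substituted nucleotide uniformly from the other three is an assumption made here, as are the function and parameter names; the default substitution and insertion rates reuse the 5% and 2% values quoted for the simulated systems.

```python
import random

NUCLEOTIDES = "ACGT"


def synthesize_strand(template, p_del, p_sub=0.05, p_ins=0.02, rng=None):
    """Simulate one compressed strand produced from a template sequence."""
    rng = rng or random.Random()
    strand = []
    for target in template:
        if rng.random() < p_ins:
            # Insertion error: an extra random nucleotide is written first.
            strand.append(rng.choice(NUCLEOTIDES))
        r = rng.random()
        if r < p_del:
            continue  # deletion error: the template nucleotide is missing
        if r < p_del + p_sub:
            # Substitution error, drawn uniformly (an assumption).
            strand.append(rng.choice([b for b in NUCLEOTIDES if b != target]))
        else:
            strand.append(target)  # correct write
    return "".join(strand)


rng = random.Random(0)
for _ in range(5):
    print(synthesize_strand("AGCT", p_del=0.2, rng=rng))
```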
  • Figs. 24A - 24E depict reconstruction of a template sequence by MAP estimation.
  • a template sequence may be successfully reconstructed from multiple DNA strands.
  • a template DNA sequence, associated scaffold sequence, and mathematical representation is c
  • The entries of the alpha and beta tables represent alpha forward probabilities and beta backward probabilities, and are computed incrementally and efficiently based on dynamic programming recursions. These alpha and beta probabilities are necessary for the MAP estimation of each nucleotide in the template sequence as illustrated in (Fig. 24D) and (Fig. 24E). Specifically, an example of decoding the fourth nucleotide of the template sequence is provided in (Fig. 24D).
  • This decoding involves determining the following probabilities: P(ATCGCT | **CA*A**), P(ATCGCT | **CT*A**), P(ATCGCT | **CC*A**), and P(ATCGCT | **CG*A**), each representing the fact that either an A, T, C, or G is possible for the fourth nucleotide respectively.
  • The decomposition of the probability P(ATCGCT | **CG*A**) into different cases is given in (Fig. 24E).
  • The result of MAP estimation applied for all nucleotides reveals that a nearly correct reconstruction of the template sequence is possible even with one received DNA strand, and that errors may be localized to their proper positions within the sequence.
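  • The idea of per-nucleotide MAP estimation against a scaffold can be illustrated with a brute-force sketch. It is not the alpha/beta dynamic program of Figs. 24A - 24E: it assumes a deletion-only channel with an independent deletion probability `p_del`, a uniform prior over the unknown (`*`) positions, and it enumerates every candidate template instead of using backward probabilities. Only the likelihood of a received strand given one candidate uses a forward recursion. All names are illustrative.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def likelihood(received, template, p_del):
    """P(received | template) under independent deletions (forward recursion)."""
    # alpha[j]: probability that the template prefix processed so far
    # produced exactly the first j received nucleotides.
    alpha = [1.0] + [0.0] * len(received)
    for base in template:
        new = [alpha[0] * p_del] + [0.0] * len(received)
        for j in range(1, len(received) + 1):
            new[j] = alpha[j] * p_del                      # base was deleted
            if received[j - 1] == base:
                new[j] += alpha[j - 1] * (1.0 - p_del)     # base was written
        alpha = new
    return alpha[-1]


def map_estimate(received, scaffold, p_del=0.2):
    """Posterior over each template position; '*' marks unknown positions."""
    unknown = [i for i, c in enumerate(scaffold) if c == "*"]
    posterior = [dict.fromkeys(NUCLEOTIDES, 0.0) for _ in scaffold]
    total = 0.0
    for fill in product(NUCLEOTIDES, repeat=len(unknown)):
        template = list(scaffold)
        for i, base in zip(unknown, fill):
            template[i] = base
        weight = likelihood(received, template, p_del)
        total += weight
        for k, base in enumerate(template):
            posterior[k][base] += weight
    if total == 0.0:
        raise ValueError("received strand is impossible under this model")
    posterior = [{b: v / total for b, v in pos.items()} for pos in posterior]
    decoded = "".join(max(pos, key=pos.get) for pos in posterior)
    return decoded, posterior


decoded, posterior = map_estimate("ATCGCT", "**C**A**")
print(decoded)
print(posterior[3])  # posterior for the fourth nucleotide
```

  • The enumeration grows as 4 to the power of the number of unknown positions, so this form is only practical for short templates; the dynamic programming recursions described above avoid that cost.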
  • Figs. 25A - 25C depict the coded strand architecture for robust information storage in imperfectly synthesized DNA strands.
  • Fig. 25A depicts that the message "Eureka!” was encoded and partitioned into four template sequences, E1-E4. Each sequence stores a 2-bit address and 14 bits of data, and these bits are mapped to a template sequence of 16 nucleotides, which includes four synchronization nucleotides (dark gray). Synthesis was performed with initiators tethered to beads, and sequencing was performed on the Illumina platform.
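  • The arithmetic behind this partitioning can be checked with a short sketch: seven characters at 8 bits each give 56 bits, which split into four rows of 14 data bits, each prefixed with a 2-bit address. The 8-bit ASCII encoding, the address-first layout, and binary addresses counting from zero are assumptions made for illustration; the mapping from bits to nucleotides is not shown.

```python
def partition_message(message, data_bits=14, address_bits=2):
    """Split a message into rows of address bits followed by data bits."""
    bits = "".join(format(byte, "08b") for byte in message.encode("ascii"))
    rows = []
    for address, start in enumerate(range(0, len(bits), data_bits)):
        payload = bits[start:start + data_bits].ljust(data_bits, "0")
        rows.append(format(address, f"0{address_bits}b") + payload)
    return rows


def reassemble(rows, n_chars, address_bits=2):
    """Order rows by address, strip addresses, and rebuild the text."""
    ordered = sorted(rows, key=lambda row: int(row[:address_bits], 2))
    bits = "".join(row[address_bits:] for row in ordered)[:8 * n_chars]
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode("ascii")


rows = partition_message("Eureka!")
for row in rows:
    print(row[:2], row[2:])
print(reassemble(list(reversed(rows)), n_chars=7))  # Eureka!
```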
  • Fig. 25B depicts retrieving information from E1-E4.
  • Synthesized raw strands were sequenced using the Illumina sequencing-by-synthesis (SBS) platform and purified in silico based on raw length of 32-48 nucleotides (Methods). The decoding accuracy for each sequence is defined as the probability of 100% correct data retrieval for a given number of reads, estimated over 500 decoding trials. Each trial is based on a randomly drawn set of purified strand variants. A 90% decoding accuracy (gray band) is considered sufficient for robust data retrieval, and the accuracy could be further reinforced by other codec modules.
  • Fig. 25C depicts the decoding of E3.
  • a set of 10 DNA strands is decoded as two sets of five n
  • the decoder uses MAP estimation and a scaffold to determine the probability for each of the four nucleotides at every position.
  • the decoded sequence is a probabilistic consensus of the reconstructed sequences from MAP estimation and successfully retrieves the data stored in E3.
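  • One simple way to form a probabilistic consensus is to average the per-position nucleotide probabilities obtained from several reconstructions and pick the most probable nucleotide at each position. The sketch below does exactly that; the averaging rule is an assumption for illustration, since the text does not spell out how the reconstructions are combined, and the input tables are toy values.

```python
def probabilistic_consensus(posteriors):
    """Combine per-position probability tables from several reconstructions.

    posteriors: list of reconstructions, each a list (one entry per position)
    of dicts mapping nucleotide -> probability.
    """
    length = len(posteriors[0])
    consensus = []
    for k in range(length):
        combined = {}
        for table in posteriors:
            for base, prob in table[k].items():
                combined[base] = combined.get(base, 0.0) + prob / len(posteriors)
        consensus.append(max(combined, key=combined.get))
    return "".join(consensus)


tables = [
    [{"A": 0.9, "C": 0.1}, {"G": 0.2, "T": 0.8}],   # toy reconstruction 1
    [{"A": 0.4, "C": 0.6}, {"G": 0.3, "T": 0.7}],   # toy reconstruction 2
]
print(probabilistic_consensus(tables))  # AT
```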
  • Fig. 26 depicts the raw lengths for all and perfect strands for E1-E4. All synthesized strands of E1-E4 were sequenced with Illumina. Length distributions for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands are shown.
  • Distributions are derived via kernel density estimation.
  • The numbers of all strands and perfect strands for each template sequence are as follows: E1 {all: 119677, perfect: 21}; E2 {all: 106983, perfect: 3}; E3 {all: 106793, perfect: 3}; E4 {all: 146710, perfect: 19}.
  • Figs. 27A - 27B depict the synthesis error analysis for all strands and purified strands
  • Each of these strands is aligned, by Needleman-Wunsch, to its respective template sequence. For each alignment, the fraction of strands with the indicated number of mismatches, insertions, and missing nucleotides is tabulated. The set of all strands is evaluated in (Fig. 27A), and the set of purified strands, obtained by filtering the length of the corresponding raw strands to between 32-48 bases, assuming an extension length of 3 to 4 bases per template nucleotide, is evaluated in (Fig. 27B).
  • Figs. 28A - 28B depict the lengths, diversity, and edit distance for all and purified strands for E1-E4. All synthesized raw strands of E1-E4 were sequenced with Illumina and
  • Strands were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants.
  • The number of sequencing reads at each length (number of compressed-strand nucleotides) is tabulated. Diversity is evaluated as the number of unique variants at each length, and the Levenshtein edit distance is computed with respect to the respective template sequence.
  • Fig. 29 depicts the diversity of compressed synthesized strands for E0.
  • Strands obtained for template sequence E0. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E0 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 30 depicts the diversity of compressed synthesized strands for E1. Strands obtained for sequence E1. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E1 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 31 depicts the diversity of compressed synthesized strands for E2. Strands obtained for sequence E2. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E2 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 32 depicts the diversity of compressed synthesized strands for E3. Strands obtained for sequence E3. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E3 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Fig. 33 depicts the diversity of compressed synthesized strands for E4. Strands obtained for sequence E4. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E4 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
  • Figs. 34A - 34H depict the decoding curves for E1-E4 template sequences for "Eureka!". Results for the successful reconstruction of sequences E1-E4 from the in silico size-selected set of DNA strands.
  • All decoding curves illustrate the probability of correct decoding of a sequence vs. the number of purified reads of synthesized DNA strands.
  • the probability of correct decoding is based on 500 decoding trials, each of which involves sampling a set of purified DNA strands according to the target number of total reads. In each decoding trial, the sampled set of DNA strands is filtered further based on the number of reads per strand (between 1 and 5 reads per strand).
  • the 10 strands with the longest length are selected for reconstruction via MAP decoding and consensus.
  • Decoding curves are presented for sequences E1-E4 in (Fig. 34 A), (Fig. 34C), (Fig. 34E), and (Fig. 34G) respectively when applying the different filters based on reads per strand.
  • the best decoding results from the filters are compiled for each datapoint to produce the "Best MAP Decoding" curve in (Fig. 34B), (Fig. 34D), (Fig. 34F), and (Fig. 34H).
  • This curve is compared to decoding with the two-step baseline filter used for H01-H12, which outputs the longest DNA strand that also has the highest number of reads among strands of equal length. Taken together, these results show that decoding accuracy improves substantially when applying MAP decoding and consensus with 10 filtered strands compared to baseline decoding with one filtered strand.
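  • The two-step baseline filter can be written in a few lines. The sketch below follows the description above (longest length first, then highest read count at that length); the function name and the toy reads are illustrative, and ties are broken by first occurrence, which the text does not specify.

```python
from collections import Counter


def baseline_decode(reads):
    """Return the most-read strand among the longest observed strands."""
    counts = Counter(reads)
    longest = max(len(strand) for strand in counts)
    candidates = [strand for strand in counts if len(strand) == longest]
    return max(candidates, key=lambda strand: counts[strand])


reads = ["ACG", "ACG", "ACGT", "ACGA", "ACGA", "AC"]   # toy data
print(baseline_decode(reads))  # ACGA
```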
  • Figs. 35A - 35C depict a roadmap for scaling DNA storage systems.
  • Fig. 35A depicts the efficiency of storage for experimental and simulated systems.
  • Experimental systems are shown in black.
  • Simulated maximum storage systems are shown as white circles.
  • The number of bits stored per sequence depends on the amount of error-correction coding (ECC) applied. Reducing ECC increases the efficiency of storage.
  • The upper bound theoretical limit represents a maximum efficiency of storage of approximately 1.58 bits per transition between non-identical nucleotides.
  • the lower bound theoretical limit represents the minimum number of bits per template sequence that must be stored for addressing only.
  • Fig. 35B depicts that flexible-write storage is enabled by a codec which harnesses diversely synthesized strands. The decoding pipeline supports robust data retrieval from synthesized strands with a significant percentage of errors.
  • Fig. 35C depicts a system architecture for storing information in enzymatically-synthesized DNA.
  • a bitstream is partitioned into rows, each augmented with an address to delineate its order for reassembly.
  • An ECC such as a Bose-Chaudhuri-Hocquenghem (BCH) code can be applied to each row, or an ECC such as a Reed-Solomon (RS) code can be applied across multiple rows, to protect data from errors.
  • Modulation consists of mapping sequences of bits to template sequences, which include synchronization nucleotides. Enzymatic synthesis then produces multiple diverse compressed strands per template sequence. The resulting strands are used for sequence reconstruction based on MAP estimation and probabilistic consensus. Subsequently, the reconstructed sequence is demodulated into bits. Error-correction is applied to ensure data retrieval.
  • Figs. 36A - 36F depict the estimated capacity in bits per template sequence with increased synthesis accuracy for simulated DNA storage systems. Tradeoffs between estimated capacity (bits stored per sequence) vs. synthesis accuracy.
  • Figs. 36A and 36B depict estimated capacity vs. synthesis accuracy, measured in terms of the probability of deletions only (missing nucleotides) (Fig. 36A) or including additional 5% substitution and 2% insertion errors (Fig. 36B).
  • Figs. 36C and 36D depict estimated capacity vs. synthesis accuracy, measured in terms of the probability of deletions only (missing nucleotides) (Fig. 36C) or including additional 5% substitution and 2% insertion errors (Fig. 36D).
  • Figs. 36E and 36F depict estimated capacity vs. synthesis accuracy, measured in terms of the probability of deletions only (missing nucleotides) (Fig. 36E) or including additional 5% substitution and 2% insertion errors (Fig. 36F).
  • the estimated capacity decreases smoothly as synthesis accuracy decreases. The tradeoffs are non-linear. If more compressed strand variants are utilized for decoding, the estimated capacity increases.
  • Figs. 37A - 37F depict the waterfall decoding curves for simulated DNA storage systems. Simulation results for successfully decoding and retrieving information from multiple DNA strands synthesized per sequence. Decoding results are visualized as "waterfall curves", representing the probability of correct retrieval for varying levels of errors tolerated per strand. The boundary of error-tolerance for all displayed systems is between 25-30% per strand, including missing nucleotides (deletions), mismatches (substitutions), and insertion errors. This error tolerance is obtained for decoding with up to 10 diverse strands per sequence.
  • (Fig. 37A) Decoding 23 bits of information stored in template sequences of 38 nucleotides, based on multiple strands containing only missing nucleotides and (Fig. 37B) with the inclusion of mismatches (substitutions) and insertion errors. (Fig. 37C) Decoding 36 bits of information stored in template sequences of 74 nucleotides, based on multiple strands containing only missing nucleotides and (Fig. 37D) with the inclusion of mismatches (substitutions) and insertion errors.
  • Figs. 38A - 38D depict the majority alignment of DNA strands per sequence. Simulation results for decoding sequences using the majority alignment algorithm.
  • Template sequences have (Fig. 38A) 16, (Fig. 38B) 24, (Fig. 38C) 74, and (Fig. 38D) 152 nucleotides respectively. Each template sequence is randomly created per decoding trial. A total of 1000 decoding trials were simulated per datapoint. The production of DNA strands from a template sequence is simulated according to a Markov model with probability of deletion per nucleotide. Sequences are decoded from either 10, 100, or 1000 diverse strands. Majority alignment achieves an increase in decoding accuracy given more strands. However, the decoding accuracy reaches a theoretical limit. The error-tolerance saturates at approximately
  • Figs. 39A - 39B depict the system architecture of a codec for storing information in DNA.
  • Fig. 39A depicts a high-level block diagram of a DNA storage system. Data is represented as bits of information which are encoded into a set of DNA sequences. De novo synthesis (e.g., enzymatic synthesis) of each sequence results in the creation of diverse DNA strands which can be stored at high volumetric density. For random-access retrieval of data, a subset of the DNA strands may be PCR-amplified and then sequenced (e.g., using Illumina or nanopore sequencing technologies). DNA sequencing results in several reads. All reads are clustered, filtered, processed in silico, and provided to a decoder for reconstruction.
  • Fig. 39B depicts a detailed block diagram of a codec for robust storage of digital information in DNA.
  • the encoder first partitions payload data into rows of bits. Each row is prefixed with an address (turquoise) to delineate its order. To recover missing rows of data, an error-correction code (ECC) may be applied per block of rows, resulting in redundant rows of information (purple). Additionally, an ECC may be applied per row/sequence of data, resulting in redundant bits per row (light green).
  • Each row of bits is modulated into a DNA sequence of nucleotides (blue) containing interspersed synchronization nucleotides (orange). Synthesis of each sequence results in diverse compressed strands which may contain nucleotide errors (red).
  • The decoder fully or partially reconstructs DNA sequences using synchronization alignment and consensus algorithms. After demodulation of DNA sequences to rows/sequences of bits, the decoder may apply error-correction decoding per row/sequence to correct remaining bit errors (red). The decoder then orders all rows according to their addresses. If any rows are missing, additional error-correction may be applied across rows using a block ECC. The final step of the decoder is to extract the original payload data from the ordered rows of bits. Overall, the encoding and decoding pipelines ensure the robust storage of data in DNA sequences.
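  • The recovery of a missing row by an ECC applied across rows can be illustrated with the simplest possible block code, a single XOR parity row. This is a deliberately simplified stand-in, not the Reed-Solomon or BCH codes named elsewhere in this disclosure, and it can restore only one missing row per block. The function names and the toy rows are illustrative.

```python
def add_parity_row(rows):
    """Append one redundant row that is the bitwise XOR of all data rows."""
    parity = 0
    for row in rows:
        parity ^= int(row, 2)
    return rows + [format(parity, f"0{len(rows[0])}b")]


def recover_missing_row(received):
    """Rebuild the single missing row (marked None) of a parity-protected block."""
    width = len(next(row for row in received if row is not None))
    value = 0
    for row in received:
        if row is not None:
            value ^= int(row, 2)
    return format(value, f"0{width}b")


block = add_parity_row(["1010", "0110", "1111"])   # toy rows of bits
print(block)                                       # parity row is '0011'
damaged = [block[0], None, block[2], block[3]]     # second row was lost
print(recover_missing_row(damaged))                # 0110
```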
  • Figs. 40A - 40E depict an array-format enzymatic synthesis platform.
  • Fig. 40A depicts that the prototype is comprised of two main parts: a Mantis liquid handler, which has a single robotic arm that can be programmed to dispense one of six reagents at a time, and custom jigs, which were either laser cut (Epilog Legend 36EXT) or machined (gift from Formulatrix) to hold the glass slide acting as a solid support substrate for the DNA.
  • Fig. 40B depicts that the enzymatic mix is dispensed according to programmed coordinates on the treated slide, resulting in a 2D grid of features.
  • Fig. 40C depicts that the Mantis places the enzymatic mix, according to programmed coordinates, in serial to all features on the slide.
  • Fig. 40D depicts that for each synthesis cycle, there are four dispense cycles, one for each of the four nucleotide triphosphates used. The specific nucleotide triphosphate is dispensed only to the desired features (bold).
  • Fig. 40E depicts that the Mantis has a single dispenser and places the nucleotide triphosphate, according to programmed coordinates, in serial to the desired features on the slide.
  • Fig. 41 depicts the raw lengths for all and perfect raw strands for S01-S03.
  • The numbers of all strands and perfect strands for each template sequence are as follows: S01 rep 1 {all: 192989, perfect: 1}, S01 rep 2 {all: 220921, perfect: 684}, S01 rep 3 {all: 153002, perfect: 286}, S02 rep 1 {all: 277897, perfect: 3545}, S02 rep 2 {all: 385615, perfect: 4889}, S02 rep 3 {all: 176680, perfect: 248}, S03 rep 1 {all: 185327, perfect: 464}, S03 rep 2 {all: 169000, perfect: 273}, S03 rep 3 {all: 209018, perfect: 898}. The S01 rep 1 distribution for perfect strands is not visible due to the low number of perfect strands.
  • Figs. 42A - 42B depict the synthesis error analysis for all and purified strands for S01-S03.
  • Figs. 43A - 43B depict the lengths, diversity, and edit distance for all and purified strands for S01-S03. All synthesized raw strands of S01-S03 were sequenced with Illumina and transitions extracted. Run-length compressed strands were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants. The number of sequencing reads at each length (number of compressed-strand nucleotides) is tabulated. Diversity is evaluated as the number of unique variants at each length, and the Levenshtein edit distance is computed with respect to the respective template sequence. These measurements are presented for all synthesized strands (Fig. 43A) or for a set of purified strands (Fig. 43B) obtained by filtering the length of the corresponding raw strands to between 39-52 bases, assuming an extension length of 3 to 4 bases per template nucleotide.
  • Figs. 44A - 44B depict the reagent cost projections for phosphoramidite chemistry and enzymatic synthesis.
  • the minimum feature size is 2.37 nm, which corresponds to the diameter of double-stranded DNA.
  • The prices per megabyte for 1 million features with current feature sizes of 15 microns (gray circle) or 38 microns (gray diamond) are indicated.
  • Embodiments of the present disclosure are directed to methods of decoding a nucleotide sequence.
  • The nucleotide sequence encodes one or more values, or a series of values, corresponding to a format of information. Each value or value point within the nucleotide sequence is represented as a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence.
  • the steps of decoding include determining the nucleotide sequence, identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and assigning a predetermined value to the identified transition or boundary or edge to create the value that was originally encoded in the nucleotide sequence corresponding to the format of information.
  • The step of determining the nucleotide sequence includes sequencing according to methods known to one skilled in the art. In one embodiment, sequencing includes nanopore sequencing.
  • the values are represented by a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of the nucleotide sequence, which can be identified.
  • Each identified transition or boundary or edge is assigned a predetermined value to create the series of values encoded in the nucleotide sequence corresponding to the format of information.
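  • The decoding steps above (determine the sequence, identify each transition between non-identical nucleotides, assign a predetermined value to it) can be sketched as follows. The particular value assignment used here, a ternary digit derived from the cyclic offset between successive nucleotides in the order A, C, G, T, is an assumption for illustration; the disclosure only requires that the assignment be predetermined.

```python
from itertools import groupby

ALPHABET = "ACGT"  # assumed ordering used to define the value of a transition


def decode_transitions(sequence):
    """Map each transition between non-identical nucleotides to a trit (0-2)."""
    # Homopolymer runs carry no extra information: collapse them first.
    compressed = [base for base, _run in groupby(sequence)]
    values = []
    for prev, nxt in zip(compressed, compressed[1:]):
        offset = (ALPHABET.index(nxt) - ALPHABET.index(prev)) % 4  # 1, 2, or 3
        values.append(offset - 1)
    return values


print(decode_transitions("AAAGGCCCTA"))  # [1, 2, 1, 0]
```

  • Because homopolymer runs are collapsed before values are assigned, the number of times a nucleotide is repeated does not change the decoded result.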
  • The value corresponding to the format of information can be obtained from many input sources, including but not limited to analog, digital, optical, visible or non-visible wavelengths, chemical, or physical input sources.
  • the disclosure contemplates digital values.
  • Digital values can include multiple digits according to a specific need.
  • the digital values include two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more digits to accommodate a certain need or application.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more digits.
  • the series of digital values comprises two, three or more different digital values.
  • Each digital value of the series of digital values represents one of the two, three or more different digital values.
  • the disclosure contemplates natural nucleotides or nonnatural nucleotides for information encoding, storage and decoding.
  • The nucleotides can be RNA or DNA.
  • the nucleotides can include adenine, cytosine, guanine, thymine and uridine.
  • Any format of information can be converted into corresponding values and encoded in the nucleotide sequence.
  • a format of information includes but is not limited to text, image, video or audio format, sensor data, and combinations thereof.
  • the present disclosure contemplates the use of nucleotide transitions for information encoding and decoding.
  • the transition can be from a certain nucleotide to another different or nonidentical nucleotide.
  • the transition can also be from a certain nucleotide or nucleotide homopolymer to another different or nonidentical nucleotide or nucleotide homopolymer.
  • the transition between one nucleotide homopolymer to a different or nonidentical nucleotide homopolymer is a single transition or boundary or edge.
  • Each nucleotide transition or boundary or edge is assigned a predetermined digital value.
  • the series of digital values includes a corresponding barcode.
  • the disclosed method further contemplates decoding a plurality of nucleotide sequences.
  • Each member of the plurality encodes for an identical value or series of identical values corresponding to the format of information.
  • the nucleotide sequence or a plurality of nucleotide sequences can be attached to a substrate or solid support.
  • Embodiments of the present disclosure are directed to a method of decoding a nucleotide sequence encoding for a series of digital values corresponding to a format of information.
  • The nucleotide sequence can be determined by sequencing methods known to one skilled in the art to identify nucleotide homopolymers. Each homopolymer is assigned one or more of the nucleotides based on a predetermined predicted homopolymer length of the nucleotide produced using enzymatic synthesis, and a particular digital value is assigned for each of the one or more nucleotides.
  • the predicted homopolymer length can be determined from empirical observation.
  • the predicted homopolymer length is a median, a mean, or a mode based on data collected from empirical observation.
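  • A sketch of this homopolymer-based assignment is given below. The predicted homopolymer length is taken as the median of observed run lengths, and each sequenced run is assigned the nearest whole number of nucleotides (at least one). The rounding rule, the function name, and the toy observations are assumptions for illustration.

```python
import statistics
from itertools import groupby


def nucleotides_per_homopolymer(sequence, predicted_length):
    """Assign a nucleotide count to each homopolymer run in a sequenced strand."""
    calls = []
    for base, run in groupby(sequence):
        run_length = sum(1 for _ in run)
        # Nearest whole number of nucleotides, never fewer than one (assumed rule).
        copies = max(1, round(run_length / predicted_length))
        calls.append((base, copies))
    return calls


observed_run_lengths = [3, 4, 3, 4]                     # toy empirical data
predicted = statistics.median(observed_run_lengths)     # 3.5
print(nucleotides_per_homopolymer("AAAGGGGGGGC", predicted))
# [('A', 1), ('G', 2), ('C', 1)]
```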
  • Embodiments of the present disclosure are directed to a method of sequencing and decoding a plurality of nucleotide sequences representing a format of information wherein each nucleotide sequence encodes a portion of the format of information and wherein each portion of the format of information has more than two corresponding nucleotide sequences.
  • the nucleotide sequences are determined and series of digital values for the sequences within a first portion of the plurality of nucleotide sequences are decoded and translated into the portion of the format of information.
  • the sequencing and decoding are continued in series for additional portions into series of digital values and the series of digital values are translated into the portions of the format of information until the entire format of information is achieved.
  • Embodiments of the present disclosure further provide a method of encoding a series of digital values corresponding to a format of information into a nucleotide sequence.
  • The method includes, for each digital value, assigning a corresponding transition from one nucleotide to a different or nonidentical nucleotide to generate the nucleotide sequence, synthesizing the nucleotide sequence, and optionally storing the nucleotide sequence.
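  • The assignment of a transition to each digital value can be illustrated for ternary digits: given the previous nucleotide, each trit selects one of the three other nucleotides. The specific mapping below (a cyclic offset in the order A, C, G, T) and the fixed starting nucleotide are assumptions for illustration only; synthesis and storage are outside the scope of the sketch.

```python
ALPHABET = "ACGT"  # assumed ordering used to define the transition for each value


def encode_trits(trits, start="A"):
    """Encode ternary digits as transitions between non-identical nucleotides."""
    sequence = [start]  # reference nucleotide; carries no data
    for trit in trits:
        if trit not in (0, 1, 2):
            raise ValueError(f"not a ternary digit: {trit!r}")
        prev = ALPHABET.index(sequence[-1])
        # Offsets of 1, 2, or 3 always land on a different nucleotide.
        sequence.append(ALPHABET[(prev + trit + 1) % 4])
    return "".join(sequence)


print(encode_trits([1, 2, 1, 0]))  # AGCTA
```

  • Since every step moves to a different nucleotide, the generated sequence never contains two identical adjacent nucleotides.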
  • the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more digits.
  • the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more digits.
  • Embodiments of the present disclosure also provide a method for high-throughput decoding of a format of information encoded in a plurality of nucleotide sequences or a plurality of DNA strands.
  • the plurality of nucleotide sequences or DNA strands are separated (packetized) into many packets.
  • each packet includes a plurality of DNA strands.
  • each packet includes a plurality of identical DNA strands.
  • Each nucleotide sequence or DNA strand can include a unique identifier (such as a barcode sequence) corresponding to the specific packet of information.
  • each packet includes a plurality of identical nucleotide sequences (each as an independent DNA strand), thus, sequencing one strand in that packet is sufficient since the remaining strands are considered redundant.
  • each packet includes a plurality of near perfect identical nucleotide sequences (each as an independent DNA strand), due to encoding errors. In this case, an algorithm is designed to sample a predetermined number of nucleotide sequences with redundant identifiers, which leads to decoding of the format of information.
  • the algorithm will dictate for each packet, sequencing and decoding more than one strand with a specific identifier until a certain confidence of correctness is reached, without requiring sequencing of all the strands with the same/redundant identifier.
  • the sequence with its unique identifier is stored. In this manner, redundant sequencing of the same nucleotide sequence is prevented using the selective sequencer.
  • the selective sequencer is a sequencing platform that can prevent or halt redundant sequencing of the nucleotide sequences based on the unique identifier that is associated with the nucleotide sequence.
  • the selective sequencer is a nanopore sequencer that includes the selective functionality.
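  • The packet-level logic of selective sequencing can be simulated in software. In the sketch below, reads arrive as (identifier, payload) pairs; a packet is treated as decoded once the same payload has been seen a set number of times, and later reads carrying that identifier are counted as skipped, standing in for strands a selective sequencer would reject. The agreement threshold as the measure of confidence, the function name, and the toy read stream are assumptions for illustration.

```python
from collections import Counter, defaultdict


def selective_decode(read_stream, required_agreement=2):
    """Decode packets from a stream of reads, skipping already-decoded packets."""
    votes = defaultdict(Counter)   # identifier -> payload -> number of reads
    decoded = {}
    sequenced = skipped = 0
    for identifier, payload in read_stream:
        if identifier in decoded:
            skipped += 1           # redundant strand: would be rejected
            continue
        sequenced += 1
        votes[identifier][payload] += 1
        best, count = votes[identifier].most_common(1)[0]
        if count >= required_agreement:
            decoded[identifier] = best
    return decoded, sequenced, skipped


stream = [("p1", "AGCT"), ("p2", "ACGA"), ("p1", "AGCT"), ("p1", "AGCT"),
          ("p2", "ACGT"), ("p2", "ACGA"), ("p1", "AGT")]   # toy reads
print(selective_decode(stream))
# ({'p1': 'AGCT', 'p2': 'ACGA'}, 5, 2)
```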
  • Embodiments of the disclosure relate to optimizing packet information management to improve data accuracy and increase the content loading speed, which can drive faster internet connections for many types of utilities including cellphones.
  • the information stored in DNA is packetized (separated) into units of DNA strands.
  • each packet can contain multiple copies of representative DNA strands. In decoding or retrieving the stored information, it would be more efficient to sequence one or a few representative DNA strands for each packet.
  • The initial results and simulations shown in Fig. 9D indicated that sequencing time and cost can be reduced by at least 2 fold, which would be a dramatic benefit when scaled to very large datasets.
  • Embodiments of the disclosure are directed to the use of the selective sequencer to optimize packet information management.
  • The selective sequencer has a first feature which can generate DNA sequences on the fly. This is an improvement over the current state of the art sequencer (Illumina being an exemplary case), which must fully sequence the DNA strand that was deposited on the sequencer before the sequence data can be used for further decoding, retrieval or recovery.
  • the Oxford Nanopore sequencer allows each DNA strand to be sequenced and decoded independently. This asynchronous sequencing allows processing and decoding each packet on the fly.
  • the selective sequencer has a second feature such that after a packet is sequenced and decoded, the sequencer moves on to sequence only the strands of the remaining unsequenced packets.
  • the sequencer is able to physically prevent further redundant sequencing of copies of DNA strands of the decoded packets.
  • a unique identifier, such as a barcode or header index, is included in the DNA strands and signals to the sequencer whether the strand has been decoded, so that the sequencer can decide whether to block continued sequencing.
  • Oxford Nanopore's nanopore sequencing platform has the first feature, and there has been a proof-of-concept demonstration for the second feature for sequencing genomes (DNA strands of biological origin, not of synthetic origin). This platform performs the second feature by physically kicking the DNA strand out of the pore after reading just a fraction of the DNA strand.
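The second feature can be pictured with a toy model. The sketch below is an assumption-laden illustration, not a description of any real sequencer interface: the function name `sequence_selectively` is hypothetical, and the identifier is assumed to occupy the first `id_length` bases of each strand so that it is read before the payload.

```python
def sequence_selectively(strands, id_length, reads_needed=1):
    """Read each strand's identifier, then eject strands of finished packets.

    Returns the strands that were read in full and the total bases read.
    """
    reads_per_packet = {}
    bases_read = 0
    kept = []
    for strand in strands:
        identifier = strand[:id_length]
        bases_read += id_length  # the identifier is always read
        if reads_per_packet.get(identifier, 0) >= reads_needed:
            continue  # packet already covered: eject without reading the payload
        bases_read += len(strand) - id_length
        reads_per_packet[identifier] = reads_per_packet.get(identifier, 0) + 1
        kept.append(strand)
    return kept, bases_read
```

Comparing `bases_read` with the summed length of all strands gives a rough sense of how much reading is avoided for a given level of redundancy; the actual saving depends on the platform and on where the identifier sits in the strand.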
  • nanopore sequencing is artificially slowed down to obtain high accuracy reads because it is highly error-prone.
  • Embodiments of the disclosure are thus directed to interspersing the unique identifier throughout the DNA strand to improve accuracy of sequencing using nanopore sequencing. Theoretically, the sequencing rate of nanopore sequencing can increase more than 20 fold, and at this rate, the error-rate will likely be even higher.
  • the sequence information can be stored in a suitable medium including computer memory.
  • the stored sequence information can be further decoded into digital values.
  • Any unique identifier can be used including a synthetic sequence or barcode sequence.
  • the synthetic sequence or barcode sequence is located at the 3' end, the 5' end of the nucleotide sequence, or is interspersed within the nucleotide sequence.
  • a plurality of nucleotide sequences can be labeled with a plurality of unique identifiers.
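One way to see why an interspersed identifier can tolerate read errors is a per-position vote across its copies. The sketch below is illustrative only: the layout (one identifier copy before every `chunk` payload bases), the function names, and the restriction to substitution errors at fixed offsets are all assumptions; insertions and deletions would require alignment first.

```python
from collections import Counter


def intersperse(identifier, payload, chunk):
    """Place a copy of the identifier before every `chunk` bases of payload."""
    parts = []
    for start in range(0, len(payload), chunk):
        parts.append(identifier)
        parts.append(payload[start:start + chunk])
    return "".join(parts)


def recover_identifier(strand, id_length, chunk):
    """Majority-vote each identifier position across all complete copies."""
    step = id_length + chunk
    copies = []
    for start in range(0, len(strand), step):
        copy = strand[start:start + id_length]
        if len(copy) == id_length:
            copies.append(copy)
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))
```

With three copies, a single substituted base in one copy is outvoted by the other two.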
  • the method can further include sequencing a predetermined number of nucleotide sequences; assembling the packet of information; and analyzing the assembled information to determine if the information is correctly decoded.
  • the method further includes permitting sequencing of any nucleotide sequences that were not correctly decoded.
  • the assembled information can be analyzed using a decoding algorithm.
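The assemble-then-verify step can be sketched as follows. The source does not state how correctness is determined, so the one-byte additive checksum used here is purely a stand-in assumption, as are the function names.

```python
def checksum(data: bytes) -> int:
    """Hypothetical integrity check: sum of the bytes modulo 256."""
    return sum(data) % 256


def make_packet(data: bytes) -> bytes:
    """Append the checksum byte to the payload."""
    return data + bytes([checksum(data)])


def is_valid(packet: bytes) -> bool:
    """A packet is valid when its last byte matches the payload checksum."""
    return len(packet) >= 1 and checksum(packet[:-1]) == packet[-1]


def review_packets(packets: dict):
    """Return (message, failed_ids); message is None if any packet failed.

    `packets` maps an integer packet index to its decoded bytes.
    """
    failed = sorted(index for index, packet in packets.items() if not is_valid(packet))
    if failed:
        return None, failed
    message = b"".join(packets[index][:-1] for index in sorted(packets))
    return message, failed
```

The identifiers in `failed_ids` are the ones for which further sequencing would be permitted.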
  • a format of information is first converted to a binary sequence, such as zeros ("0s") and ones ("1s"), and then to a ternary sequence, such as zeros ("0s"), ones ("1s"), and twos ("2s"), although any number can be used.
  • Each digit of the ternary sequence corresponds to a transition of different or non-identical nucleotides according to a conversion scheme.
  • the ternary bit sequence is further converted to a corresponding oligonucleotide sequence.
  • Figs. 8B-8C and Fig. 9A provide an exemplary embodiment of such a conversion scheme.
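The binary-to-ternary-to-transition conversion can be illustrated with a short sketch. The actual conversion scheme is the one shown in the figures, which is not reproduced here; the rotating rule below (each ternary digit selects one of the three nucleotides that differ from the previous one, in the fixed order A, C, G, T), the six-digits-per-byte packing, and the starting nucleotide are assumptions chosen only to make the idea concrete.

```python
NUCLEOTIDES = "ACGT"   # hypothetical fixed order used by the rotating rule
TRITS_PER_BYTE = 6     # 3**6 = 729 >= 256, so six ternary digits cover one byte


def text_to_bits(text):
    """ASCII text -> string of '0'/'1' characters, eight per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_trits(bits):
    """Convert each group of eight bits into six ternary digits."""
    trits = []
    for start in range(0, len(bits), 8):
        value = int(bits[start:start + 8], 2)
        digits = []
        for _ in range(TRITS_PER_BYTE):
            value, remainder = divmod(value, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))  # most significant digit first
    return trits


def trits_to_transitions(trits, start="A"):
    """Each digit picks a nucleotide that differs from the previous one.

    `start` stands for the last nucleotide of the initiator (an assumption).
    """
    previous = start
    strand = []
    for trit in trits:
        index = (NUCLEOTIDES.index(previous) + trit + 1) % 4
        previous = NUCLEOTIDES[index]
        strand.append(previous)
    return "".join(strand)
```

Because the offset `trit + 1` is never 0 modulo 4, no two consecutive nucleotides in the output are identical, which is the property the transition encoding relies on.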
  • the oligonucleotide sequence is synthesized and contains the encoded format of information. Synthesis can be carried out according to methods known to those skilled in the art. Embodiments of the disclosure are directed to enzymatic synthesis of oligonucleotides.
  • a template-independent DNA polymerase such as a terminal deoxynucleotidyl transferase (TdT) is used.
  • an initiator oligonucleotide (a primer/an initiator) immobilized to a solid support is sequentially contacted by a reaction mixture that comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations.
  • any enzymatic, chemical or physical methods or reagents can be used to control the length of the nucleotide extension/polymerization.
  • one or more desired/selected nucleotides are added to the extending oligonucleotide chain until the corresponding oligonucleotide sequence is formed.
  • the nucleotide triphosphate includes dATP, dTTP, dCTP, dGTP, and dUTP.
  • the synthesis activity is modulated by the ratio of the amount of TdT to the amount of apyrase.
  • divalent cations comprising magnesium and cobalt
  • additives comprising glycerol, sucrose, PEG8000, betaine, DMSA, Triton-X100 and Tween20 can also modulate the enzymatic reaction. Since each bit represents a transition between different or non-identical nucleotides, the information can be accurately encoded into oligonucleotide sequences independent of the length of each nucleotide extension/polymerization.
  • the disclosure provides that during each round of nucleotide extension/polymerization, one type of selected nucleotide triphosphate is added. In one embodiment, the excessive nucleotide triphosphate is inactivated by apyrase. This inactivation allows for multiple rounds of nucleotide polymerization that each adds a different nucleotide to the initiator or growing polynucleotide chain.
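The round-by-round extension can be mimicked in a small simulation that shows why the run length added in each round does not matter. This is a sketch under stated assumptions: the function names, the uniform random run length between 1 and `max_run`, and the seed are all illustrative and say nothing about real TdT kinetics.

```python
import random
from itertools import groupby


def synthesize(target_transitions, max_run=4, seed=0):
    """Simulate one extension round per target nucleotide.

    Each round adds between 1 and `max_run` copies of the selected nucleotide,
    standing in for uncontrolled homopolymer extension.
    """
    if any(a == b for a, b in zip(target_transitions, target_transitions[1:])):
        raise ValueError("consecutive target nucleotides must differ")
    rng = random.Random(seed)
    rounds = [base * rng.randint(1, max_run) for base in target_transitions]
    return "".join(rounds)


def collapse_runs(strand):
    """Reduce every homopolymer run to a single nucleotide."""
    return "".join(base for base, _ in groupby(strand))
```

Whatever run lengths the simulation draws, collapsing the runs returns the intended order of transitions.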
  • Embodiments of the present disclosure are directed to a method of decoding a format of information from a synthesized oligonucleotide sequence encoding bit sequences of the format of information.
  • the synthesized oligonucleotide sequence containing the encoded information can be amplified.
  • the amplified oligonucleotide sequence is sequenced and the sequence can be converted to bit sequences according to the encoding scheme wherein each bit represents a transition between different or non-identical nucleotides.
  • the bit sequences can be converted back to the format of information.
  • the oligonucleotide sequence is ligated to a universal adaptor before amplification.
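Decoding reverses the transition encoding. The sketch below assumes a hypothetical rotating scheme (nucleotides ordered A, C, G, T; each ternary digit equals the cyclic offset between consecutive distinct nucleotides, minus one), six ternary digits per ASCII character, and a read that has already been trimmed of initiator and adaptor sequence. None of these specifics come from the source.

```python
from itertools import groupby

NUCLEOTIDES = "ACGT"   # hypothetical order; must match the encoder's
TRITS_PER_BYTE = 6     # hypothetical packing: six ternary digits per character


def collapse_runs(read):
    """Reduce every homopolymer run to one nucleotide."""
    return "".join(base for base, _ in groupby(read))


def transitions_to_trits(strand, start="A"):
    """Recover one ternary digit from each nucleotide transition."""
    trits = []
    previous = start
    for base in strand:
        offset = (NUCLEOTIDES.index(base) - NUCLEOTIDES.index(previous)) % 4
        trits.append(offset - 1)
        previous = base
    return trits


def trits_to_text(trits):
    """Pack ternary digits back into ASCII characters."""
    chars = []
    for start in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[start:start + TRITS_PER_BYTE]:
            value = value * 3 + trit
        chars.append(chr(value))
    return "".join(chars)


def decode_read(read, start="A"):
    """Sequenced read -> transitions -> ternary digits -> text."""
    return trits_to_text(transitions_to_trits(collapse_runs(read), start))
```

Two reads with different homopolymer lengths, such as `CGCTAT` and `CCGGGCTTAAAT`, decode to the same character under this scheme.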
  • Embodiments of the present disclosure are directed to a method of storing information using nucleotides.
  • a format of information is first converted into a sequence of binary ASCII bits, then converted into a ternary sequence, which is further converted into a corresponding oligonucleotide sequence such that one bit of the ternary sequence represents a transition between different or non-identical nucleotides.
  • the corresponding oligonucleotide sequence is synthesized by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleotide of the initiator oligonucleotide, and wherein the apyrase degrades excessive nucleotide triphosphates to inactive diphosphates and monophosphates, and (b) repeating step (a) until the corresponding oligonucleotide sequence is formed, and storing the synthesized corresponding oligonucleotide sequence.
  • the initiator oligonucleotides are immobilized on beads and pre-mixed with reagents that include TdT, apyrase and reaction buffer.
  • the initiator oligonucleotides can also be immobilized on the surface of a solid support such as beads or on the surface of a fluidic channel.
  • Certain embodiments of the disclosure are directed to an initiator that is attached by a cleavable moiety. This mixture is sequentially contacted with one type of the desired nucleotide triphosphates (dNTPs).
  • the ratio of the amount of TdT to the amount of apyrase in the reaction reagents modulates the enzymatic synthesis.
  • the desired or selected nucleotide is a natural nucleotide or any nucleotide analog known to those skilled in the art.
  • the reaction reagent can include a buffer comprising a monovalent salt, a divalent salt, a buffering agent, and a reducing agent at a suitable pH and temperature.
  • the selected concentration of reaction reagents is determined by the selected nucleotide triphosphate present in the reaction reagent.
  • a washing step is included between each round of enzymatic synthesis.
  • the present disclosure provides methods of enzymatic oligonucleotide synthesis which enable rapid and high-accuracy synthesis of custom DNA sequences by the template-independent DNA polymerase terminal deoxynucleotidyl transferase (TdT).
  • the methods according to the present disclosure can be used for synthesis of cheaper, more accurate and longer custom DNA sequences for various biochemical, biomedical, or biosynthetic applications.
  • the methods according to the present disclosure can facilitate the use of DNA as an information storage medium.
  • a solid-phase synthesis device can be used to record digital information in DNA molecules.
  • the method according to the disclosure further comprises releasing the polynucleotide after the desired sequence of nucleotides has been added to the 3' end of the polynucleotide.
  • the method according to the disclosure further comprises releasing the polynucleotide using an enzyme, a chemical, light, heat or other suitable method or reagent.
  • the method according to the disclosure further comprises releasing the polynucleotide, collecting the polynucleotide, amplifying the polynucleotide and sequencing the polynucleotide.
  • nucleotide triphosphate inactivating enzyme is an apyrase.
  • nucleotide triphosphate inactivating enzyme is a nucleotide triphosphate degrading enzyme that degrades nucleotide triphosphates at a rate slower than rate of addition of nucleotides by the error prone or template independent DNA polymerase.
  • the nucleotide triphosphate inactivating enzyme is a nucleotide triphosphate degrading enzyme present at a concentration that degrades nucleotide triphosphates at a rate slower than rate of addition of nucleotides by the present concentration of the error prone or template independent DNA polymerase.
  • the nucleotide triphosphate inactivating enzyme comprises ATP diphosphohydrolase, dNTP pyrophosphatases, dNTPases, and phosphatases.
  • the concentration of nucleotide triphosphate inactivating enzyme is modulated to control addition of one or more nucleotides.
  • the nucleotide triphosphate inactivating enzyme renders free nucleotide triphosphates inactive.
  • the nucleotide inactivating enzyme renders free nucleotide triphosphates inactive by degradation.
  • the nucleotide inactivating enzyme renders free nucleotide triphosphates inactive by polymerizing them with each other.
  • the reaction conditions present a competing reaction between addition of free nucleotide triphosphates to the initiator sequence and degradation of free nucleotide triphosphates.
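The competition between nucleotide addition and nucleotide degradation can be made concrete with a toy kinetic model. Everything in this sketch is an assumption rather than a statement from the source: free nucleotide triphosphate is taken to decay by first-order degradation with rate constant `k_deg`, consumption by the polymerase is neglected, the extension rate is taken as `k_add` times the remaining nucleotide concentration, and units are arbitrary. Under those assumptions the mean extension is L(t) = (k_add * n0 / k_deg) * (1 - exp(-k_deg * t)).

```python
import math


def expected_extension(n0, k_add, k_deg, t):
    """Mean nucleotides added by time t in the toy first-order model.

    n0    : initial free nucleotide triphosphate concentration (arbitrary units)
    k_add : hypothetical addition rate constant (scales with polymerase amount)
    k_deg : hypothetical degradation rate constant (scales with apyrase amount)
    """
    return (k_add * n0 / k_deg) * (1.0 - math.exp(-k_deg * t))


def limiting_extension(n0, k_add, k_deg):
    """Mean extension once all free nucleotide has been degraded."""
    return k_add * n0 / k_deg
```

In this simplified picture the limiting extension depends only on the ratio `k_add / k_deg`, which is one way to read the statement that the ratio of polymerase to apyrase modulates the synthesis.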
  • Polymerases including without limitation error-prone or template-dependent polymerases, modified or otherwise, can be used to create nucleotide polymers having a random or known or desired sequence of nucleotides.
  • Template-independent polymerases whether modified or otherwise, can be used to create the nucleic acids de novo. Ordinary nucleotides are used, such as A, T/U, C or G. Nucleotides may be used which lack chain terminating moieties.
  • a template independent polymerase may be used to make the nucleic acid sequence. Such template independent polymerase may be error-prone which may lead to the addition of more than one nucleotide resulting in a homopolymer.
  • oligonucleotide sequences or polynucleotide sequences are synthesized using an error prone polymerase, such as template independent error prone polymerase, and common or natural nucleic acids, which may be unmodified.
  • Initiator sequences or primers are attached to a substrate, such as a silicon dioxide substrate, at various locations whether known, such as in an addressable array, or random.
  • Reagents including at least a selected nucleotide, a template independent polymerase and other reagents required for enzymatic activity of the polymerase are applied at one or more locations of the substrate where the initiator sequences are located and under conditions where the polymerase adds one or more than one or a plurality of the nucleotide to the initiator sequence to extend the initiator sequence.
  • the nucleotides (“dNTPs") may be applied or flow in periodic applications. Nucleotides with blocking groups or reversible terminators can be used with the dNTPs under reaction conditions that are sufficient to limit or reduce the probability of enzymatic addition of the dNTP to one dNTP, i.e. one dNTP is added using the selected reaction conditions taking into consideration the reaction kinetics.
  • a microfluidic channel or microfluidic channels having an input and an output can be used to deliver reaction fluids including reagents, such as a polymerase, a nucleotide and other appropriate reagents and washes to particular locations on a substrate within the flow cell, such as within a microfluidic channel.
  • reaction conditions will be based on dimensions of the substrate reaction region, reagents, concentrations, reaction temperature, and the structures used to create and deliver the reagents and washes.
  • pH and other reactants and reaction conditions can be optimized for the use of TdT to add a dNTP to an existing nucleotide or oligonucleotide in a template independent manner.
  • reagents and reaction conditions for dNTP addition such as initiator size, divalent cation and pH.
  • TdT was reported to be active over a wide pH range with an optimal pH of 6.85. Methods of providing or delivering dNTP, rNTP or rNDP are useful in making nucleic acids.
  • As used herein, the terms “nucleic acid molecule,” “nucleic acid sequence,” “nucleic acid fragment” and “oligomer” are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides that may have various lengths, including either deoxyribonucleotides or ribonucleotides, or analogs thereof.
  • nucleotide refers to a nucleoside having one or more phosphate groups joined in ester linkages to the sugar moiety. Exemplary nucleotides include nucleoside monophosphates, diphosphates and triphosphates.
  • In general, the terms “nucleic acid molecule,” “nucleic acid sequence,” “nucleic acid fragment,” “oligonucleotide” and “polynucleotide” are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof.
  • an oligonucleotide is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T) (uracil (U) for thymine (T) when the polynucleotide is RNA).
  • deoxynucleotides such as dATP, dCTP, dGTP, dTTP
  • rNTPs ribonucleotide triphosphates
  • rNDPs ribonucleotide diphosphates
  • oligonucleotide sequence is the alphabetical representation of a polynucleotide molecule; alternatively, the term may be applied to the polynucleotide molecule itself.
  • This alphabetical representation can be input into databases in a computer having a central processing unit and used for bioinformatics applications such as functional genomics and homology searching.
  • Oligonucleotides may optionally include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides.
  • the present disclosure contemplates any deoxyribonucleotide or ribonucleotide and chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of the bases, and the like.
  • natural nucleotides are used in the methods of making the nucleic acids. Natural nucleotides lack chain terminating moieties. According to certain aspects, nucleotides with blocking groups or reversible terminators can be used in certain embodiments. Nucleotides with blocking groups or reversible terminators are known to those of skill in the art.
  • nucleotide analog refers to a non-standard nucleotide, including non-naturally occurring ribonucleotides or deoxyribonucleotides.
  • nucleotide analogs are modified at any position so as to alter certain chemical properties of the nucleotide yet retain the ability of the nucleotide analog to perform its intended function.
  • positions of the nucleotide which may be derivatized include the 5 position, e.g., 5-(2-amino)propyl uridine, 5-bromo uridine, 5-propyne uridine, 5-propenyl uridine, etc.; the 6 position, e.g., 6-(2-amino)propyl uridine; the 8-position for adenosine and/or guanosines, e.g., 8-bromo guanosine, 8-chloro guanosine, 8-fluoroguanosine, etc.
  • Nucleotide analogs also include deaza nucleotides, e.g., 7-deaza-adenosine; O- and N-modified (e.g., alkylated, e.g., N6-methyl adenosine, or as otherwise known in the art) nucleotides; and other heterocyclically modified nucleotide analogs such as those described in Herdewijn, Antisense Nucleic Acid Drug Dev., 2000 Aug. 10(4):297-310.
  • Nucleotide analogs may also comprise modifications to the sugar portion of the nucleotides.
  • the 2' OH-group may be replaced by a group selected from H, OR, R, F, Cl, Br, I, SH, SR, NH2, NHR, NR2, COOR, or OR, wherein R is substituted or unsubstituted C1-C6 alkyl, alkenyl, alkynyl, aryl, etc.
  • Other possible modifications include those described in U.S. Pat. Nos. 5,858,988, and 6,291,438.
  • the phosphate group of the nucleotide may also be modified, e.g., by substituting one or more of the oxygens of the phosphate group with sulfur (e.g., phosphorothioates), or by making other substitutions which allow the nucleotide to perform its intended function such as described in, for example, Eckstein, Antisense Nucleic Acid Drug Dev. 2000 Apr. 10(2):117-21, Rusckowski et al. Antisense Nucleic Acid Drug Dev. 2000 Oct. 10(5):333-45, Stein, Antisense Nucleic Acid Drug Dev. 2001 Oct. 11(5):317-25, Vorobjev et al. Antisense Nucleic Acid Drug Dev. 2001 Apr.
  • modified nucleotides include, but are not limited to, diaminopurine, S2T, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D- man
  • Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone.
  • Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexhylacrylamide-dCTP (aha-dCTP) to allow covalent attachment of amine reactive moieties, such as N-hydroxysuccinimide esters (NHS).
  • a nucleic acid used in the invention can also include native or non-native bases.
  • a native deoxyribonucleic acid can have one or more bases selected from the group consisting of adenine, thymine, cytosine or guanine and a ribonucleic acid can have one or more bases selected from the group consisting of uracil, adenine, cytosine or guanine.
  • Exemplary non-native bases that can be included in a nucleic acid, whether having a native backbone or analog structure, include, without limitation, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, 5-methylcytosine, 5-hydroxymethyl cytosine, 2-aminoadenine, 6-methyl adenine, 6-methyl guanine, 2-propyl guanine, 2-propyl adenine, 2-thiouracil, 2-thiothymine, 2-thiocytosine, 15-halouracil, 15-halocytosine, 5-propynyl uracil, 5-propynyl cytosine, 6-azo uracil, 6-azo cytosine, 6-azo thymine, 5-uracil, 4-thiouracil, 8-halo adenine or guanine, 8-amino adenine or guanine, 8-thiol adenine
  • adenine or guanine 5-halo substituted uracil or cytosine, 7-methylguanine, 7-methyladenine, 8-azaguanine, 8-azaadenine, 7-deazaguanine, 7-deazaadenine, 3-deazaguanine, 3-deazaadenine or the like.
  • unique barcode sequences may be attached to each nucleic acid, i.e. DNA or RNA strands. Then adapters and/or primers or other reagents known to those of skill in the art may be used as desired to sequence or amplify the nucleic acid with the unique barcode sequence.
  • polymerases are used to build nucleic acid molecules, such as for representing information which is referred to herein as being recorded in the nucleic acid sequence or the nucleic acid is referred to herein as being storage media.
  • Polymerases are enzymes that produce a nucleic acid sequence, for example, using DNA or RNA as a template. Polymerases that produce RNA polymers are known as RNA polymerases, while polymerases that produce DNA polymers are known as DNA polymerases. Polymerases that incorporate errors are known in the art and are referred to herein as "error-prone polymerases". Template independent polymerases may be error prone polymerases.
  • Error-prone polymerases will either accept a non-standard base, such as a reversible chain terminating base, or will incorporate a different nucleotide, such as a natural or unmodified nucleotide that is selectively given to it as it tries to copy a template.
  • Template-independent polymerases such as terminal deoxynucleotidyl transferase (TdT), also known as DNA nucleotidylexotransferase (DNTT) or terminal transferase create nucleic acid strands by catalyzing the addition of nucleotides to the 3' terminus of a DNA molecule without a template.
  • Cobalt is a cofactor; however, the enzyme also catalyzes the reaction upon Mg and Mn administration in vitro.
  • Nucleic acid initiators may be 4 or 5 nucleotides or longer and may be single stranded or double stranded. Double stranded initiators may have a 3' overhang or they may be blunt ended or they may have a 3' recessed end.
  • TdT, like all DNA polymerases, also requires divalent metal ions for catalysis.
  • TdT is unique in its ability to use a variety of divalent cations such as Co2+, Mn2+, Zn2+ and Mg2+.
  • the extension rate of the primer p(dA)n (where n is the chain length from 4 through 50) with dATP in the presence of divalent metal ions is ranked in the following order: Mg2+ > Zn2+ > Co2+ > Mn2+.
  • each metal ion has different effects on the kinetics of nucleotide incorporation.
  • Mg2+ facilitates the preferential utilization of dGTP and dATP whereas Co2+ increases the catalytic polymerization efficiency of the pyrimidines, dCTP and dTTP.
  • Zn2+ behaves as a unique positive effector for TdT since reaction rates with Mg2+ are stimulated by the addition of micromolar quantities of Zn2+. This enhancement may reflect the ability of Zn2+ to induce conformational changes in TdT that yield higher catalytic efficiencies. Polymerization rates are lower in the presence of Mn2+ compared to Mg2+, suggesting that Mn2+ does not support the reaction as efficiently as Mg2+.
  • TdT is provided in Biochim Biophys Acta., May 2010; 1804(5): 1151-1166, hereby incorporated by reference in its entirety.
  • the nucleotide pulse replaces Mg++ with other cation(s), such as Na+, K+, Rb+, Be++, Ca++, or Sr++
  • the nucleotide can bind but not incorporate, thereby regulating whether the nucleotide will incorporate or not.
  • a pulse of (optional) pre-wash without nucleotide or Mg++ can be provided or then Mg++ buffer without nucleotide can be provided.
  • the incorporation of specific nucleic acids into the polymer can be regulated.
  • these polymerases are capable of incorporating nucleotides independent of the template sequence and are therefore beneficial for creating nucleic acid sequences de novo.
  • the combination of an error-prone polymerase and a primer sequence serves as a writing mechanism for imparting information into a nucleic acid sequence.
  • By controlling the primer/initiator, the nucleotide substrate, or the template independent polymerase, the addition of a nucleotide to an initiator sequence or an existing nucleotide or oligonucleotide can be regulated to produce an oligonucleotide by extension.
  • these polymerases are capable of incorporating nucleotides without a template sequence and are therefore beneficial for creating nucleic acid sequences de novo.
  • polymers such as nucleotide sequences, including DNA strands identified herein may be sequenced by passing the strand through nanopores or nanogaps or nanochannels to determine the individual nucleic acid/nucleotide.
  • Nanopore means a hole or passage having a nanometer scale width.
  • Exemplary nanopores include a hole or passage through a membrane formed by a multimeric protein ring. Typically, the passage is 0.2-25 nm wide.
  • Nanopores may include transmembrane structures that may permit the passage of molecules through a membrane. Examples of nanopores include α-hemolysin (Staphylococcus aureus) and MspA (Mycobacterium smegmatis).
  • Nanopores may be found in the art describing nanopore sequencing or described in the art as pore-forming toxins, such as the β-PFTs Panton-Valentine leukocidin S, aerolysin, and Clostridial Epsilon-toxin, the α-PFTs cytolysin A, the binary PFT anthrax toxin, or others such as pneumolysin or gramicidin.
  • Nanopores have become technologically and economically significant with the advent of nanopore sequencing technology. Methods for nanopore sequencing are known in the art, for example, as described in US 5,795,782, which is incorporated by reference.
  • nanopore detection involves a nanopore-perforated membrane immersed in a voltage-conducting fluid, such as an ionic solution including, for example, KCl, NaCl, NiCl2, LiCl or other ion forming inorganic compounds known to those of skill in the art.
  • a voltage is applied across the membrane, and an electric current results from the conduction of ions through the nanopore.
  • Nanopores within the scope of the present disclosure include solid state nonprotein nanopores known to those of skill in the art and DNA origami nanopores known to those of skill in the art. Such nanopores provide a nanopore width larger than known protein nanopores which allow the passage of larger molecules for detection while still being sensitive enough to detect a change in ionic current when the complex passes through the nanopore.
  • Nanopore sequencing means a method of determining the components of a polymer based upon interaction of the polymer with the nanopore. Nanopore sequencing may be achieved by measuring a change in the conductance of ions through a nanopore that occurs when the size of the opening is altered by interaction with the polymer.
  • the present disclosure envisions the use of a nanogap, which is known in the art as being a gap between two electrodes where the gap is about a few nanometers in width, such as between about 0.2 nm to about 25 nm or between about 2 and about 5 nm. The gap mimics the opening in a nanopore and allows polymers to pass through the gap and between the electrodes.
  • aspects of the present disclosure also envision use of a nanochannel in which electrodes are placed adjacent to the nanochannel through which the polymer passes. It is to be understood that one of skill will readily envision different embodiments of molecule or moiety identification and sequencing based on movement of a molecule or moiety through an electric field and creating a distortion of the electric field representative of the structure passing through the electric field.
  • Methods described herein are capable of generating large amounts of data (billions of bits). Accordingly, high throughput methods of sequencing these nucleic acid molecules, such as that disclosed in Mitra (1999) Nucleic Acids Res. 27(24):e34; pp.1-6, are useful. In preferred embodiments, high throughput methods are used with PCR amplicons or other nucleic acid molecules having lengths of less than 100 bp.
  • PCR amplicons of 100 bp, 110 bp, 120 bp, 130 bp, 140 bp, 150 bp, 160 bp, 170 bp, 180 bp, 190 bp, 200 bp, 250 bp, 300 bp, 350 bp, 400 bp, 450 bp, 500 bp, 550 bp, 600 bp, 650 bp, 700 bp, 750 bp, 800 bp, 850 bp, 900 bp, 950 bp, 1000 bp or more may be used.
  • Sequencing methods useful in the present disclosure include sequencing-by-ligation, sequencing-by-synthesis, and sequencing-by-hybridization, as known to those skilled in the art.
  • Shendure et al., Accurate multiplex polony sequencing of an evolved bacterial genome, Science, vol. 309, p. 1728-32, 2005; Drmanac et al., Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays, Science, vol. 327, p. 78-81, 2009; McKernan et al., Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding, Genome Res., vol. 19, p. 1527-41.
  • Sequencing primers are those that are capable of binding to a known binding region of the target polynucleotide and facilitating ligation of an oligonucleotide probe of the present disclosure. Sequencing primers may be designed with the aid of a computer program such as, for example, DNAWorks, or Gene2Oligo. The binding region can vary in length but it should be long enough to hybridize the sequencing primer. Target polynucleotides may have multiple different binding regions thereby allowing different sections of the target polynucleotide to be sequenced. Sequencing primers are selected to form highly stable duplexes so that they remain hybridized during successive cycles of ligation.
  • Sequencing primers can be selected such that ligation can proceed in either the 5' to 3' direction or the 3' to 5' direction or both. Sequencing primers may contain modified nucleotides or bonds to enhance their hybridization efficiency, or improve their stability, or prevent extension from one terminus or the other.
  • single stranded DNA templates are prepared by PCR amplification to be used with sequencing primers.
  • single stranded template is attached to beads or nanoparticles in an emulsion and amplified through ePCR.
Supports and Attachment
  • one or more oligonucleotide sequences described herein are immobilized on a support (e.g., a solid and/or semi-solid support).
  • an oligonucleotide sequence can be attached to a support using one or more of the phosphoramidite linkers described herein.
  • Suitable supports include, but are not limited to, slides, beads, chips, particles, strands, gels, sheets, tubing, spheres, containers, capillaries, pads, slices, films, plates and the like.
  • a solid support may be biological, nonbiological, organic, inorganic, or any combination thereof.
  • Supports of the present invention can be any shape, size, or geometry as desired. Supports may be made from glass (silicon dioxide), metal, ceramic, polymer or other materials known to those of skill in the art. Supports may be a solid, semi-solid, elastomer or gel.
  • a support is a microarray.
  • Oligonucleotides immobilized on microarrays include nucleic acids that are generated in or from an assay reaction.
  • the oligonucleotides or polynucleotides on microarrays are single stranded and are covalently attached to the solid phase support, usually by a 5'-end or a 3'-end.
  • probes are immobilized via one or more cleavable linkers.
  • a covalent interaction is a chemical linkage between two atoms or radicals formed by the sharing of a pair of electrons (i.e., a single bond), two pairs of electrons (i.e., a double bond) or three pairs of electrons (i.e., a triple bond).
  • Covalent interactions are also known in the art as electron pair interactions or electron pair bonds.
  • Noncovalent interactions include, but are not limited to, van der Waals interactions, hydrogen bonds, weak chemical bonds (i.e., via short-range noncovalent forces), hydrophobic interactions, ionic bonds and the like.
  • affixing or immobilizing nucleic acid molecules to the substrate is performed using a covalent linker that is selected from the group that includes oxidized 3-methyl uridine, an acrylyl group and hexaethylene glycol. In addition to the attachment of linker sequences to the molecules of the pool for use in directional attachment to the support, a restriction site or regulatory element (such as a promoter element, cap site or translational termination signal) is, if desired, joined with the members of the pool.
  • Nucleic acids that have been synthesized on the surface of a support may be removed, such as by a cleavable linker or linkers known to those of skill in the art.
  • Linkers can be designed with chemically reactive segments which are optionally cleavable with agents such as enzymes, light, heat, pH buffers, and redox reagents. Such linkers can be employed to pre-fabricate an in situ solid-phase inactive reservoir of a different solution-phase primer for each discrete feature. Upon linker cleavage, the primer would be released into solution for PCR, perhaps by using the heat from the thermocycling process as the trigger.
  • affixing of nucleic acid molecules to the support is performed via hybridization of the members of the pool to nucleic acid molecules that are covalently bound to the support.
  • reagents and washes are delivered so that the reactants are present at a desired location for a desired period of time to, for example, covalently attach a dNTP to an initiator sequence or an existing nucleotide attached at the desired location.
  • a selected nucleotide reagent liquid is pulsed or flowed or deposited at the reaction site where reaction takes place and then may be optionally followed by delivery of a buffer or wash that does not include the nucleotide.
  • Suitable delivery systems include fluidics systems, microfluidics systems, syringe systems, ink jet systems, pipette systems and other fluid delivery systems known to those of skill in the art.
  • flow cell embodiments or flow channel embodiments or microfluidic channel embodiments are envisioned which can deliver separate reagents or a mixture of reagents or washes using pumps or electrodes or other methods known to those of skill in the art of moving fluids through channels or microfluidic channels through one or more channels to a reaction region or vessel where the surface of the substrate is positioned so that the reagents can contact the desired location where a nucleotide is to be added.
  • a microfluidic device is provided with one or more reservoirs which include one or more reagents which are then transferred via microchannels to a reaction zone where the reagents are mixed and the reaction occurs.
  • Such microfluidic devices and the methods of moving fluid reagents through such microfluidic devices are known to those of skill in the art.
  • Immobilized nucleic acid molecules may, if desired, be produced using a device (e.g., any commercially-available inkjet printer, which may be used in substantially unmodified form) which sprays a focused burst of reagent-containing solution onto a support (see Castellino (1997) Genome Res. 7:943-976, incorporated herein in its entirety by reference).
  • Such a method is currently in practice at Incyte Pharmaceuticals and Rosetta Biosystems, Inc., the latter of which employs "minimally modified Epson Inkjet cartridges" (Epson America, Inc.; Torrance, CA).
  • the method of inkjet deposition depends upon the piezoelectric effect, whereby a narrow tube containing a liquid of interest (in this case, oligonucleotide synthesis reagents) is encircled by an adapter.
  • An electric charge sent across the adapter causes the adapter to expand at a different rate than the tube, and forces a small drop of liquid reagents from the tube onto a coated slide or other support.
  • Reagents can be deposited onto a discrete region of the support, such that each region forms a feature of the array.
  • the feature is capable of generating an anion toroidal vortex as described herein.
  • the desired nucleic acid sequence can be synthesized drop-by-drop at each position, as is true for other methods known in the art. If the angle of dispersion of reagents is narrow, it is possible to create an array comprising many features. Alternatively, if the spraying device is more broadly focused, such that it disperses nucleic acid synthesis reagents in a wider angle, as much as an entire support is covered each time, and an array is produced in which each member has the same sequence (i.e., the array has only a single feature).
  • This example describes an embodiment of using nucleotide transitions to encode a format of information in DNA oligonucleotide sequences synthesized by DNA polymerases.
  • the encoded DNA sequence can be stored or decoded.
  • Such an enzymatic based nucleotide synthesis can catalyze the linkage of naturally occurring deoxynucleotide triphosphates (dNTPs) rapidly, in a single step, and under non-toxic biocompatible conditions, as compared to chemical methods (Fig. 1).
  • the methods used terminal deoxynucleotidyl transferase (TdT), a unique template-independent DNA polymerase which rampantly and indiscriminately adds dNTP substrates to the 3' termini of DNA strands (F. J. Bollum, Thermal conversion of nonpriming deoxyribonucleic acid to primer. J. Biol. Chem. 234, 2733-2734 (1959), F. J. Bollum, Oligodeoxyribonucleotide-primed reactions catalyzed by calf thymus polymerase. J. Biol. Chem. 237, 1945-1949 (1962), L. M. Chang, F. J. Bollum, Molecular biology of terminal transferase.
  • dNTPs are added by TdT before being degraded by apyrase (Figs. 2A-2C, Figs. 3A-3C, Figs. 4A-4C and Fig. 5).
  • the lowest dNTP concentrations required for maximum coupling efficiency were further determined (Fig. 6), such that adding nucleotide substrates in series would result in stepwise increases in DNA length (Fig. 7).
  • Embodiments of the disclosure provide an enzymatic synthesis strategy that is rapid and simple, requiring few components to produce DNA with a given information content (Fig. 8A).
  • Embodiments of the disclosure include a reaction mixture of short oligonucleotide initiators, TdT, and apyrase.
  • the initiators are immobilized on solid supports, such as beads or a surface, to allow removal of reaction byproducts and facilitate downstream processing and amplification.
  • TdT extends the initiators until the substrate is degraded by apyrase, allowing immediate addition of subsequent nucleotide substrates.
  • Adding a series of dNTPs results in a population of DNA strands, all extended by the same order of nucleotides. While extension lengths may vary across strands, the same information content is stored in the whole population as transitions between different or non-identical nucleotides (Fig. 8B).
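The idea that strands of different lengths carry the same transitions can be illustrated with a small simulation. This is only a sketch under stated assumptions: the functions `synthesize` and `transitions` are hypothetical names, extension lengths are drawn uniformly from 1 to 4 (the description gives no distribution), every nucleotide is assumed to be added at least once, and the template GATGTAGA is borrowed from the cyclic-addition experiment described elsewhere in this description.

```python
import random
from itertools import groupby

def synthesize(template, rng, max_extension=4):
    """Extend each template nucleotide by a random run of 1..max_extension copies.

    Assumption for illustration: no additions are skipped, so errors such as
    missing nucleotides are not modeled here.
    """
    return "".join(base * rng.randint(1, max_extension) for base in template)

def transitions(strand):
    """Collapse every homopolymer run to a single nucleotide."""
    return "".join(base for base, _ in groupby(strand))

rng = random.Random(0)      # fixed seed so the sketch is repeatable
template = "GATGTAGA"       # adjacent nucleotides are all non-identical
population = [synthesize(template, rng) for _ in range(5)]

for strand in population:
    # Lengths differ between strands, but the transitions are identical.
    print(len(strand), strand, "->", transitions(strand))
```

Under these assumptions each simulated strand collapses back to the template, which is the sense in which the population stores one message.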
  • trits were used to maximize information capacity, given three possible transitions for each nucleotide (Fig. 8C). Trits are the ternary equivalent of bits; one trit is log(3)/log(2) (about 1.58496) bits of information.
  • the message "hello world!” was encoded and synthesized (Fig. 9A).
  • To encode each character, its binary ASCII representation was first converted to ternary and then to nucleotide transitions (Table 1). Each character was then synthesized as its own DNA strand preceded by a header index to specify strand order. Following synthesis, these strands were ligated to a universal adapter, PCR amplified, and stored as a single pool without additional purification (Materials and Methods). These 12 DNA strands, each with 8 trits, carry the 144 bits of data (Table 1).
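The bit-to-trit-to-transition conversion can be sketched as follows. Several details are assumptions made only for illustration: the 4-bit width of the header index is taken from the later description of the same message, the starting nucleotide `G` stands in for the last nucleotide of an unspecified initiator, and the ordering of the three candidate nucleotides in `trits_to_template` is a placeholder, not the mapping of Fig. 8C or Table 1 (which is not reproduced in this text). The function names are hypothetical.

```python
import math

BASES = "ACGT"

def to_trits(value, n_trits):
    """Write a non-negative integer as n_trits base-3 digits, most significant first."""
    if not 0 <= value < 3 ** n_trits:
        raise ValueError("value does not fit in the requested number of trits")
    digits = []
    for _ in range(n_trits):
        value, digit = divmod(value, 3)
        digits.append(digit)
    return digits[::-1]

def trits_to_template(trits, start="G"):
    """Map each trit to one of the three nucleotides that differ from the previous one.

    The alphabetical ordering of the candidates is a placeholder mapping.
    """
    template, prev = [], start
    for trit in trits:
        choices = [b for b in BASES if b != prev]  # three possible transitions
        prev = choices[trit]
        template.append(prev)
    return "".join(template)

def encode_char(index, char):
    """Prefix the 8-bit ASCII code with a 4-bit index, then encode the 12-bit word."""
    word = (index << 8) | ord(char)
    return trits_to_template(to_trits(word, 8))

message = "hello world!"
templates = [encode_char(i, c) for i, c in enumerate(message)]
total_bits = len(templates) * 12

print(math.log2(3))         # bits of information per trit
print(3 ** 8 >= 2 ** 12)    # 8 trits are enough to hold a 12-bit word
print(total_bits)           # 12 strands x 12 bits
```

Eight trits suffice because 3^8 = 6561 exceeds 2^12 = 4096, which is consistent with twelve strands of 8 trits carrying 144 bits.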
  • the pool of DNA strands was sequenced using both Illumina and Oxford Nanopore platforms, and nucleotide transitions were extracted from each read by performing run-length encoding, a lossless data compression algorithm ubiquitously used in modern communications.
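Run-length encoding of a read, followed by filtering for the expected number of transitions, might look like the sketch below. The reads are invented for illustration and are not data from the disclosure; `run_length_encode` and `extract_transitions` are hypothetical helper names.

```python
from collections import Counter
from itertools import groupby

def run_length_encode(read):
    """Return (nucleotide, run length) pairs for each homopolymer run."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(read)]

def extract_transitions(read):
    """Keep only the order of nucleotides, discarding the run lengths."""
    return "".join(base for base, _ in run_length_encode(read))

# Hypothetical reads: two full-length strands and one missing a nucleotide.
reads = [
    "GGGAATTTTGTAAGGA",
    "GAATGGTTAAAGGGAA",
    "GGATTTTGAAGA",      # the fifth transition (T) is missing
]

expected = 8
filtered = [t for t in map(extract_transitions, reads) if len(t) == expected]
most_common = Counter(filtered).most_common(1)[0]

print(run_length_encode(reads[0]))
print(most_common)
```

Filtering on the number of transitions before counting mirrors the distinction the text draws between all sequences and sequences filtered for the expected number of transitions.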
  • the correct transition was the most abundant species, comprising 88.6%, on average, of sequences filtered for the expected number of transitions and 19%, on average, of all sequences (Fig. 9B).
  • the remainder of the reads largely contained deletions and, to a smaller extent, mismatches and insertions (Figs. 10 and 11).
  • the same pool was next sequenced with an Oxford Nanopore MinION and a similar result was observed (Fig.
  • DNA translocation rates through nanopores may be increased since nucleotide transitions are, in principle, easier to detect (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005), M. Vega, P. Granell, C. Lasorsa, B. Lerner, M. Perez, Automated and inexpensive method to manufacture solid-state nanopores and micropores in robust silicon wafers. J. Phys. Conf. Ser. 687, 012029 (2016), B.
  • the present disclosure contemplates improvements and design optimizations of the nucleotide encoding and decoding methods described herein.
  • the current implementation of the methods results in an approximately 25-fold decrease in information density compared to the maximum possible for DNA which is more than a thousand fold better than electronic storage systems (V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, W. L. Hughes, Nucleic acid memory. Nat. Mater. 15, 366-370 (2016), G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337, 1628 (2012), Y. Erlich, D. Zielinski, DNA Fountain enables a robust and efficient storage architecture. Science.
  • coding systems that are tailored to these biochemical processes may enable the use of all transitions, by considering extension lengths, and provide highly efficient data recovery, saving on synthesis and sequencing costs even with imperfectly synthesized DNA strands.
  • the length of nucleotide extensions per transition may be considered a design optimization and tuned according to application demands, trading density for read-out by specialized nanopore sequencing (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005), M. Vega, P. Granell, C. Lasorsa, B. Lerner, M.
  • TdT to apyrase ratio optimization
  • To obtain a ratio of TdT polymerization activity to apyrase degradation activity that would allow for net positive extension of the initiator, initiator extension by TdT was assessed in the presence of a wide range of apyrase concentrations with every dNTP substrate (Fig. 2A).
  • Each reaction was carried out in 20 µL total volume. All reaction components but the dNTP were assembled in 18 µL while the dNTP was prepared in 2 µL of water.
  • the 18 µL mix was composed such that upon mixing with the 2 µL dNTP solution, the following initial composition would be obtained: 200 µM dNTP, 1X Enzymatics Green Buffer, 0.05 µM f-P5-SBS3 initiator oligo, 1 U/µL TdT, and 4, 2, 1, 0.5, or 0.25 milliunits (mU) of apyrase per microliter.
  • the 18 µL mixture was added to a tube containing the 2 µL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
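The two-part assembly implies simple dilution arithmetic, sketched below with the relation C1 x V1 = C2 x V2. The sketch assumes the volumes are in microliters and the dNTP concentration in micromolar, and that the dNTP is supplied in 2 of 20 volume units; the function name `stock_concentration` and the variable names are hypothetical.

```python
def stock_concentration(final_conc, stock_volume, total_volume):
    """Concentration a stock needs so that stock_volume diluted into
    total_volume gives final_conc (C1 * V1 = C2 * V2)."""
    return final_conc * total_volume / stock_volume

total_ul = 20.0                 # total reaction volume stated in the text
dntp_ul = 2.0                   # volume of the dNTP solution stated in the text
mix_ul = total_ul - dntp_ul     # volume of the enzyme/initiator mix

# dNTP stock needed for a 200 (assumed micromolar) final concentration.
dntp_stock = stock_concentration(200.0, dntp_ul, total_ul)

# Apyrase levels the mix must contain to reach the stated final series.
apyrase_final = [4, 2, 1, 0.5, 0.25]
apyrase_in_mix = [stock_concentration(a, mix_ul, total_ul) for a in apyrase_final]

print(mix_ul, dntp_stock)
print([round(a, 3) for a in apyrase_in_mix])
```

Preparing the dNTP separately at ten times its final concentration keeps the enzymes and substrate apart until the moment of mixing, which matters because apyrase begins degrading the dNTP immediately.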
  • 1X Enzymatics Green Buffer, 0.1 µM 150617_LT2 initiator (AGATCAATTAATACGATACCTGCG) (36), 1 U/µL TdT, and 0.125, 0.25, 0.5, or 1 mU/µL apyrase.
  • the starting final concentration of substrate was varied at 5, 10, 20, 40, or 80 µM for dCTP or at 1.25, 2.5, 5, 10, or 20 µM for dGTP.
  • the 16 µL mixture was added to a tube containing the 4 µL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was incubated at room temperature for at least two minutes at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • apyrase leads to the same level of extension as 10 µM dCTP with 0.5 mU/µL apyrase, 5 µM dCTP with 0.25 mU/µL apyrase, and 2.5 µM dCTP with 0.125 mU/µL apyrase.
  • T. P. Chirpich The effect of different buffers on terminal deoxynucleotidyl transferase activity. Biochim. Biophys. Acta. 518, 535-538 (1978), M. R. Deibel Jr, M. S. Coleman, Biochemical properties of purified human terminal deoxynucleotidyltransferase. J. Biol. Chem. 255, 4206-4212 (1980), L. M. Chang, F. J. Bollum, Multiple roles of divalent cation in the terminal deoxynucleotidyltransferase reaction. J. Biol. Chem. 265, 17436-17440 (1990)).
  • the buffer system as disclosed is based on magnesium as divalent cation with the option of supplementing cobalt.
  • the buffer system as disclosed is based on cobalt as the sole divalent cation.
  • the performance of the TdT:apyrase system was evaluated in all three conditions, namely, magnesium as the only divalent cation, magnesium supplemented with cobalt, and cobalt as the only divalent cation. For that, two experiments were carried out, comparing the magnesium-with-cobalt and cobalt-only conditions separately with the magnesium-only condition (Figs. 3A-3C).
  • each reaction was carried out in 20 µL total volume. All reaction components but the dNTP were assembled in 18 µL while the dNTP was prepared in 2 µL of water. The 18 µL mix was composed such that upon mixing with the 2 µL dNTP solution, the following initial composition would be obtained: 200 µM dNTP, 1X Enzymatics Green Buffer, 0.05 µM f-P5-SBS3 initiator oligo, 250 µM cobalt chloride (if present), 1 U/µL TdT, and 4, 2, 1, 0.5, or 0.25 milliunits (mU) of apyrase per microliter.
  • the 18 µL mixture was added to a tube containing the 2 µL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a
  • the 14 µL mix was prepared as a master mix for all reactions and composed such that upon mixing with the 6 µL dNTP and cobalt solution, the following initial composition would be obtained: 300 µM dATP, 0.05 µM f-P5-SBS3 initiator oligo, 1X Enzymatics Green Buffer, 1 U/µL TdT, 1 mU/µL apyrase and 50, 100, 150, 200, 250, or 300 µM cobalt chloride.
  • the 14 µL mixture was added to a tube containing the 6 µL dATP and cobalt mixture and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • the 16 µL mix was composed such that upon mixing with the 4 µL dNTP solution, the following initial composition would be obtained: 1X Enzymatics Green Buffer (Composition of 10X Green Buffer (B0120) from Enzymatics according to the manufacturer: 200 mM Tris-Acetate, 500 mM Potassium Acetate, 100 mM Magnesium Acetate, pH 7.9 @ 25°C) or 1X Promega TdT buffer (Composition of Terminal Transferase 5X Buffer (M189A) from Promega according to the manufacturer: 500 mM cacodylate buffer (pH 6.8), 5 mM CoCl2 and 0.5 mM DTT), 0.1 µM 150617_LT2 initiator, 1 U/µL TdT, and 1 mU/µL apyrase.
  • the starting final concentration of dNTPs was varied at 25, 50, 100, 200, or 400 µM for dCTP, dATP, and dTTP, or at 12.5, 25, 50, 100, or 200 µM for dGTP.
  • the 16 µL mixture was added to a tube containing the 4 µL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was then incubated at room temperature for at least two minutes at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • extension dynamics with TdT make a few patterns clear.
  • extension with pyrimidines (dCTP and dTTP) is stimulated by cobalt as the divalent cation while extension with purines (dATP and dGTP) is hampered.
  • TdT behavior L. M. Chang, F. J. Bollum, Multiple roles of divalent cation in the terminal deoxynucleotidyltransferase reaction. J. Biol. Chem. 265, 17436-17440 (1990), K. I. Kato, J. M. Goncalves, G. E. Houts, F. J.
  • each reaction was carried out in 20 µL total volume. All reaction components but the dNTP and buffer were assembled in 14 µL while the dNTP and desired amount of buffer were prepared in 6 µL volume.
  • the 14 µL mix was prepared as a master mix for all reactions and composed such that upon mixing with the 6 µL dNTP and buffer solution, the following initial composition would be obtained: 300 µM dATP, 0.05 µM f-P5-SBS3 initiator oligo, 1 U/µL TdT, 1 mU/µL apyrase and 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.2, or 1.4X Enzymatics Green Buffer.
  • the 14 µL mixture was added to a tube containing the 6 µL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Each reaction was composed of 0.1 µM 150617_LT2, 0.7X Enzymatics Green Buffer, 125 µM of each dNTP, 1 U/µL TdT, and the desired amount of the additive.
  • the additives were glycerol at 27% (v/v), sucrose at 20 and 40% (w/v), PEG 8000 at 5 and 10% (w/v), betaine at 0.5 and 1M, DMSO at 5, 10, 20, and 30% (v/v), Triton X-100 at 0.01, 0.1, 0.5, and 1.0% (v/v), and Tween 20 at 0.01, 0.1, 0.5, and 1.0% (v/v).
  • the reactions were carried out at room temperature for 20 minutes and then mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 10% Novex TBE-Urea gel.
  • 0.1 µM 150617_LT2 initiator, 1 U/µL TdT, and 1 mU/µL apyrase.
  • the starting final concentrations of dCTP were 25, 50, 100, 200, or 400 µM.
  • the 16 µL mixture was added to a tube containing the 4 µL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was then incubated at room temperature for at least two minutes at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Consistent and reproducible extension of the initiator upon addition of various nucleotides in the presence of apyrase demands that TdT be at saturating concentrations relative to the initiator. Subsaturating levels of TdT can result in high extension variability, or extension of less than the maximum possible fraction of initiators upon the addition of dNTPs. With the final composition of the reaction having taken shape, it was examined what levels of TdT would be saturating relative to the initiator concentrations that were commonly used.
  • 1X Custom Synthesis Buffer, 0.1 µM (or less) initiator oligo, 1 U/µL TdT (or more), and 1 mU/µL apyrase.
  • nucleotide composition of the initiator at the 3' end is also important (K. I. Kato, J. M. Goncalves, G. E. Houts, F. J. Bollum, Deoxynucleotide-polymerizing enzymes of calf thymus gland.
  • TdT operates in a distributive manner (M. R. Deibel Jr, M. S. Coleman, Biochemical properties of purified human terminal deoxynucleotidyltransferase. J. Biol. Chem. 255, 4206-4212 (1980), E. A. Motea, A. J. Berdis, Terminal deoxynucleotidyl transferase: the story of a misguided DNA polymerase. Biochim. Biophys. Acta. 1804, 1151-1166 (2010)); it does not remain bound to the nascent oligonucleotide and is not processive.
  • the 18 µL mix was composed such that upon mixing with the 2 µL dNTP solution, the following initial composition would be obtained: 1X Custom Synthesis Buffer, 0.1 µM initiator oligo, 1 U/µL TdT, and 0.25 mU/µL apyrase.
  • the initial final concentration of dNTPs was varied at 2, 4, 8, 16, or 32 µM for dCTP, dATP, and dTTP, or at 1, 2, 4, 8, or 16 µM for dGTP.
  • the 18 µL mixture was added to a tube containing the 2 µL dNTP sample and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Template sequences were synthesized using the TdT:apyrase mixture by cyclic addition of nucleotide triphosphates to the reaction.
  • the template sequence GATGTAGA was synthesized (Fig. 7, left) and in another, the template sequence CGCACTCG was synthesized (Fig. 7, right).
  • Each reaction was carried out in ⁇ total volume and was mixed with 2 µL of dNTP at 50X the desired final concentration.
  • the ⁇ mix consisted of: 1X Custom Synthesis Buffer, 0.1 µM initiator oligo, 1 U/µL TdT, and 0.25 mU/µL apyrase.
  • the initial final concentration of dNTP was 40 µM for dATP, 200 µM for dCTP, 20 µM for dGTP, and ⁇ for dTTP.
  • the ⁇ mixture was added to a tube containing 2 µL of the desired dNTP sample and mixed immediately by pipetting. After 1 minute incubation at room temperature, a 2 µL sample of the mix was taken to be run on a gel. The remaining ⁇ was added to another tube containing 2 µL of the next nucleotide, mixed and incubated as before, followed by collection of another 2 µL sample for PAGE analysis. These steps were repeated for 8 cycles without washing, thereby extending the initiator with 8 different dNTPs while using the same enzymatic mix. Afterwards, each of the 2 µL samples that were taken was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • Example III Enzymatic DNA Synthesis for Digital Information Storage
  • a template-independent DNA polymerase for controlled synthesis of sequences with user-defined information was harnessed.
  • retrieval of 144-bits, including addressing, from perfectly synthesized DNA strands using batch Illumina and real-time Oxford Nanopore sequencing was demonstrated.
  • a codec was then developed for data retrieval from populations of diverse but imperfectly synthesized DNA strands, each with ~30% error tolerance. With this codec, a kilobyte-scale design was experimentally validated which stores 1 bit per nucleotide. Simulations of the codec supported reliable and robust storage of information for large-scale systems.
  • a de novo DNA synthesis strategy and a digital codec designed specifically for information storage are provided.
  • while DNA for biological functionality requires single-base precision and accuracy, these demands can be relaxed for DNA for digital information.
  • a template-independent DNA polymerase was used, a protein evolved to rapidly catalyze the linkage of naturally occurring nucleotide triphosphates (dNTPs) under non-toxic biocompatible conditions. Information was encoded in transitions between non-identical nucleotides, rather than single nucleotides. It was demonstrated that enzymatic synthesis and tailored computational tools provide robust information storage, as assessed using batch (Illumina) and real-time (Oxford Nanopore) sequencing. The presently-disclosed enzymatic synthesis strategy is cheaper than phosphoramidite chemistry and may reduce reagent costs by orders of magnitude, facilitating the adoption of DNA as a storage medium.
  • the enzyme terminal deoxynucleotidyl transferase (TdT) is used.
  • TdT is a template-independent DNA polymerase which rampantly and indiscriminately adds dNTPs to the 3' termini of DNA.
  • TdT is largely used in reactions where one nucleotide triphosphate is added to indeterminate lengths.
  • it is sought to leverage apyrase, which degrades nucleotide triphosphates into their TdT-inactive diphosphate and monophosphate precursors. By competing with TdT for nucleotide triphosphates, apyrase effectively limits DNA polymerization.
  • a mixture was thus created and optimized containing a tuned ratio of these two enzymes such that a nucleotide triphosphate is added at least once to each strand by TdT before being degraded by apyrase (Figs. 2A-2C and Fig. 5).
  • the lowest nucleotide triphosphate concentrations required was determined such that adding a series of nucleotides results in stepwise increases in the length of synthesized DNA (Figs. 6-7).
  • the core of the reaction contemplates a mixture of TdT, apyrase, and short oligonucleotide initiators.
  • Upon addition of a nucleotide triphosphate, TdT extends the initiators until all added substrate is degraded by apyrase.
  • the number of polymerized nucleotides was defined as 'extension length'.
  • Subsequent nucleotide triphosphates are added to continue the synthesis process. While the extension length for each added nucleotide triphosphate may vary, the resulting population of synthesized strands all share the same number and sequence of nucleotide transitions (Fig. 13B).
  • information was encoded as transitions between non-identical nucleotides (Fig. 13C). Given three possible transitions for each nucleotide, trits were used to maximize information capacity. Trits are the ternary equivalent of bits; one trit is log(3)/log(2) (about 1.58496) bits of information.
  • To convert information to DNA, information in trits was mapped to a template sequence of non-identical nucleotides, starting with the last nucleotide of the initiator. Enzymatic DNA synthesis of each template sequence produced 'raw strands', or strands_R, which can be physically stored.
  • To retrieve information, strands_R are sequenced and transitions between non-identical nucleotides extracted, resulting in 'compressed strands', or strands_C. If a strand_C is equivalent to the template sequence, the strand (compressed or raw) is considered 'perfect' and the information is retrieved by mapping the sequence of non-identical nucleotides back to trits.
  • "hello world!", a message containing 96 bits of ASCII data (Fig. 1 A), was encoded and synthesized. This message was split into twelve individual 8-bit characters, and each character's bit representation was prefixed with a 4-bit address to denote their order. These 144 total bits of information, including addressing, were also expressed in trits and mapped according to nucleotide transitions (Fig. 13C), resulting in twelve eight-nucleotide template sequences (Table 1). All twelve template sequences (H01-H12) were synthesized in parallel on bead-conjugated initiators, with washing performed every two
  • Illumina sequencing was used to read out the synthesized strands_R and to assess the information stored in corresponding strands_C (Methods).
  • DNA strands synthesized for H01-H12 were first sequenced using an entire MinION flowcell (Oxford Nanopore), and it was observed that the most abundant species, an average of 49.9% of filtered strands, were perfectly synthesized (Fig. 18A). This is largely consistent with results from
  • nanopore sequencing can enable faster and more efficient information retrieval from strands synthesized with the enzymatic strategy.
  • DNA translocation rates are slowed through nanopores for accurate single-base sequencing. This rate may be increased since it is, in principle, easier to detect transitions between non-identical nucleotides, each with extension lengths greater than one (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005); M. Vega, P. Granell, C. Lasorsa, B.
  • Coded strand architecture
  • It has been established that data can be stored in enzymatically-synthesized DNA and retrieved by in silico filtering for perfectly synthesized DNA strands. However, perfect strands_C may not be required for data retrieval; imperfectly synthesized strands may be used to reconstruct template sequences if nucleotide errors occur in different locations. It was thus sought to develop a codec for robust data retrieval which leverages the diversity of imperfectly synthesized strands_C for template sequence reconstruction. The core of the codec relies on three elements: (i) a coded strand architecture which includes synchronization nucleotides to facilitate error localization, (ii) sufficiently diverse strands produced by
  • a key feature of the presently disclosed codec is the addition of synchronization nucleotides which are interspersed between information-encoding nucleotides (Fig. 19B). These nucleotides act as a scaffold to aid reconstruction of a template sequence from imperfectly-synthesized DNA strands that may contain errors as a result of missing, mismatched, and inserted nucleotides.
  • CTCGTGCT template sequence of 8 nucleotides
  • CTCTGC and TCGTCT synthesized DNA strands_C
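The two example strands can be checked against the eight-nucleotide template to see why errors in different locations still allow reconstruction. The sketch below treats each strand as the template with nucleotides missing and finds one possible placement; it is not the codec of the disclosure, the greedy leftmost placement is an assumption (other placements can exist), and `embed` is a hypothetical helper name.

```python
def embed(strand, template):
    """Greedy leftmost placement of strand as a subsequence of template.

    Returns the matched template positions, or None if the strand cannot be
    obtained from the template by deletions alone.
    """
    positions, start = [], 0
    for base in strand:
        idx = template.find(base, start)
        if idx == -1:
            return None
        positions.append(idx)
        start = idx + 1
    return positions

template = "CTCGTGCT"
strands = ["CTCTGC", "TCGTCT"]

covered = set()
for strand in strands:
    positions = embed(strand, template)
    missing = sorted(set(range(len(template))) - set(positions))
    covered.update(positions)
    print(strand, "matches positions", positions, "missing", missing)

# Together the two imperfect strands account for every template position.
print(sorted(covered) == list(range(len(template))))
```

Each strand lacks two template nucleotides, but at different positions, so the pair jointly covers the whole template under this placement.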
  • the codec includes a module for encoding information in template sequences which incorporates synchronization nucleotides.
  • the population of synthesized DNA strands for a desired sequence must be sufficiently diverse. That is, if the same nucleotide is missing systematically across all strands, then it cannot be retrieved without additional forms of error correction. The diversity generated from the synthesis process was thus analyzed by synthesizing a longer 16-nucleotide template sequence (called E0), which contains 12 unique transitions between nucleotides to mitigate ambiguous alignments
  • In silico size selection was performed for strands_R ranging from 32 to 48 bases in length, assuming that each of the 16 template nucleotides was synthesized with an extension length of two to three bases (Fig. 20A). This purified set was analyzed by aligning the corresponding strands to the E0 template, and it was observed that missing nucleotides were predominant, in line with the previous analyses, but could occur in different positions (Fig. 19C, Fig. 19D, Fig. 20B).
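The size-selection window follows directly from the stated assumption: 16 template nucleotides at two to three bases each gives 32 to 48 bases. A minimal sketch of such a filter is shown below; the function name `size_select` and the placeholder reads are hypothetical.

```python
def size_select(reads, n_template=16, min_ext=2, max_ext=3):
    """Keep reads whose length is consistent with every template nucleotide
    having an extension length between min_ext and max_ext."""
    shortest = n_template * min_ext
    longest = n_template * max_ext
    return [read for read in reads if shortest <= len(read) <= longest]

# Placeholder reads of different lengths; the sequence content is irrelevant here.
reads = ["A" * 31, "A" * 32, "A" * 40, "A" * 48, "A" * 49]
selected = size_select(reads)

print([len(read) for read in selected])
```

A length window of this kind removes strands that are far too short or too long, but it cannot by itself tell whether an individual nucleotide is missing, which is why the alignment step follows.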
  • Levenshtein edit distances of strands from the purified set (Fig. 19D, Fig. 20C). It was observed that the median strand_C length was 12 nucleotides and the maximal number of variants occurred at this length.
  • the Levenshtein edit distance was also calculated (V. I. Levenshtein, in Soviet physics doklady (1966), vol. 10, pp. 707-710), which summarizes the number of single-nucleotide edits required to repair a strand to the desired E0 sequence.
  • the median edit distance for these variants was four, indicating that synchronization nucleotides could be placed approximately every three or four nucleotides to recall missing strand nucleotides from diversely synthesized strands. It was thus sought to reconstruct a template sequence from a population of diverse but imperfect strands using statistical inference and mathematical models.
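The Levenshtein edit distance can be computed with the standard dynamic-programming recurrence, sketched below. Because the E0 sequence is not reproduced in this text, the example uses the eight-nucleotide template and the two strands given earlier in this description; the function name `levenshtein` is a generic choice, not a name from the disclosure.

```python
from statistics import median

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # delete ca
                curr[j - 1] + 1,          # insert cb
                prev[j - 1] + (ca != cb), # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

template = "CTCGTGCT"
strands = ["CTCTGC", "TCGTCT", "CTCGTGCT"]
distances = [levenshtein(s, template) for s in strands]

print(distances, median(distances))
```

Each of the two imperfect strands is two edits from the template, corresponding to its two missing nucleotides, while the perfect strand has distance zero.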
  • MAP maximum a posterior
  • Each template sequence contained a 2-bit address to delineate its order, and 14 bits of data. These 16 bits are encoded in a template sequence of 16 nucleotides, which includes four synchronization nucleotides, resulting in 1 bit stored per nucleotide (Fig. 22B). Sequences E1-E4 carry a total of 64 bits of information including addressing, and were synthesized in parallel on beads with a wash every cycle. Following the last synthesis cycle, strands were ligated to a universal adapter, PCR amplified, and stored as a single pool.
  • Sequence E3 required the most sequencing reads for reconstruction, as its synthesized strands contained one extra edit on average in comparison to synthesized strands for other template sequences (Figs. 27A-27B and Figs. 28A-28B). It was also found that MAP estimation was a more robust decoding algorithm than the previous two-step filter for H01-H12, requiring fewer reads for data retrieval (Figs. 34A-34H). These results show that the codec can accurately reconstruct data without requiring perfectly synthesized DNA strands.
  • the experimental results demonstrate that byte- and kilobyte-scale storage systems can be achieved if a sufficient number of strands are synthesized (Fig. 35A).
  • the "hello world!" experiment stored 12 bits per template sequence. This is sufficient for a 256-byte maximum storage system where 11 bits are used for addressing 2,048 total template sequences, each with 1 bit of data.
  • the "Eureka!" experiment stored 16 bits per template sequence. This allows for a 4-kilobyte maximum storage system, where 15 bits are used for addressing 32,768 total template sequences, each with 1 bit of data (Table 7).
  • the scalability of the DNA storage codec was next assessed for gigabyte- and petabyte-scale storage through simulation, assuming that the requisite number of DNA strands for each could be produced.
  • Increased storage capacity requires more nucleotides per template sequence for additional address space, synchronization nucleotides, and data. In one embodiment, 36 bits were stored, including data and address, in a 74-nucleotide template sequence and, similarly, 57 bits in a 152-nucleotide template sequence to simulate gigabyte- and petabyte-scale systems, respectively (Fig. 3 A).
  • the codec is able to resolve several types of errors, including missing nucleotides in synthesized strands, which would otherwise drastically reduce information storage capacities (M. C. Davey, D. J. C. MacKay, Reliable communication over channels with insertions, deletions, and substitutions. IEEE Trans. Inf. Theory 47, 687-698 (2001); M. Mitzenmacher, A survey of results for deletion channels and related synchronization channels. Probab. Surv. 6, 1-33 (2009)).
  • the comprehensive codec architecture consists of encoding and decoding frameworks to extract information from diversely synthesized DNA strands (Fig. 35C, Figs. 39A-39B).
  • the encoder consists of several core components: (i) partitioning of data into ordered rows of bits; (ii) prefixing of rows with addresses; (iii) error correction per row of bits via an error-correction code (ECC) per template sequence (e.g., Bose-Chaudhuri-Hocquenghem code), and error correction per block of rows via a block ECC (e.g., Reed-Solomon or Fountain code); and (iv) modulation to map rows of bits to template sequences. All template sequences are subsequently synthesized enzymatically, resulting in a population of diverse DNA strands. Strands R are read out by sequencing and corresponding strands are input to a decoder.
  • ECC error-correction code
  • the crucial first step of the decoding pipeline is MAP estimation aided by scaffolding, followed by probabilistic consensus. Multiple subsets of strands C can be used for sequence reconstruction. Each reconstructed sequence need not be identical to the template sequence. After demodulation of the reconstructed sequence, the resulting bit sequence can be corrected by bit-level ECCs in the decoding pipeline to reinforce error-free data retrieval.
  • the design harnesses the diversity of enzymatically-synthesized DNA strands and supports a flexible-write approach to provide a functional and robust storage system.
  • extension lengths per template nucleotide may be considered a design optimization and tuned according to application demands, trading density for read-out speed and cost by specialized nanopore sequencing (S. M. H. T. Yazdi,
  • DNA for information storage is synthesized in a high-density array format with proprietary machines.
  • the presently disclosed bead-based process was thus translated to a 2D array-based platform (Figs. 40A-40E).
  • this prototype produced perfectly synthesized strands for each of the three 13-nucleotide template sequences tested herein. Analyses of the synthesized strands indicate similar error and diversity profiles to those observed using the bead-based process, indicating that the codec could be used to store information in DNA synthesized with this platform (Fig. 41, Figs. 42A-42B and Figs. 43A-43B).
  • Synthesis accuracy can be further improved by additional process engineering, e.g., more stringent washing per cycle that reduces carryover of nucleotide triphosphates from previous cycles to further diminish the rate of substituted strand nucleotides. Optimization of reaction conditions to improve mixing or the use of more processive, rather than distributive, TdT mutants may reduce the rate of missing strand nucleotides (M. A. Jensen,
  • the enzymatic DNA synthesis strategy disclosed herein is advantageous in speed and cost relative to phosphoramidite chemistry.
  • reagent costs were compared for both processes as a function of feature size (reagent volume) (Figs. 44A-44B, Table 6).
  • the analyses indicate that the enzymatic synthesis strategy could already be cheaper as a drop-in replacement to phosphoramidite chemistry when using existing automation which synthesizes DNA strands in 15-30-micron features (Figs. 44A-44B).
  • Further miniaturization, together with reductions to enzyme cost through recycling, provides a potential roadmap for overall reduction in reagent costs by several orders of magnitude (Figs. 44A-44B).
  • the increased rate of enzymatic catalysis over chemical coupling and a lack of blocking moieties may shorten the synthesis cycle times compared to phosphoramidite chemistry, improving write speed and reducing equipment amortization time (Table 6).
  • aspects of the present disclosure are directed to an enzymatic synthesis strategy and tailored coding architecture for robust information storage in DNA.
  • This storage solution is an alternative to prior studies which utilized phosphoramidite chemistry to produce DNA for information storage.
  • This approach offers potentially dramatic benefits to the cost and speed of synthesis and sequencing without requiring single-base accuracy. Additionally, this approach may alleviate biosecurity concerns associated with widespread DNA synthesis of genetic information, as genes are unlikely to be produced with this strategy. While this work illustrates DNA information storage in vitro, it could provide a foundation for development of de novo molecular recording systems in vivo (B. M. Zamft et al, Measuring cation dependent DNA polymerase fidelity landscapes by deep sequencing. PLoS One.
  • the phrase "hello world!" was converted to decimal ASCII and then to ternary as shown in Table 1.
  • the ASCII decimal (data) was converted to base 2 (for binary, 8 bits) or to base 3 (for ternary, 5 trits).
  • the addresses were converted from a decimal value to base 2 (for binary, 4 bits) or base 3 (for ternary, 3 trits). Addresses were concatenated to data to form a resulting string of 12 bits or 8 trits.
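The conversion described above can be sketched as follows. This is an illustrative reconstruction only: the function names are hypothetical, the addresses are assumed to start at 0 and to precede the data, and the subsequent mapping of trits to the template sequences of Table 1 is not reproduced here.

```python
def to_base(value: int, base: int, width: int) -> str:
    """Fixed-width digit string of value in the given base (most significant digit first)."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, base)
        digits.append(str(digit))
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))


def encode_message(message: str, base: int, addr_width: int, data_width: int) -> list:
    """One digit string per character: address digits followed by ASCII data digits.
    Assumption: addresses are 0-based positions in the message."""
    return [to_base(index, base, addr_width) + to_base(ord(char), base, data_width)
            for index, char in enumerate(message)]


# Ternary form: 3 address trits + 5 data trits = 8 trits per character.
ternary_rows = encode_message("hello world!", base=3, addr_width=3, data_width=5)
# Binary form: 4 address bits + 8 data bits = 12 bits per character.
binary_rows = encode_message("hello world!", base=2, addr_width=4, data_width=8)
```

Under these assumptions the twelve characters yield twelve rows, one per template sequence, each 8 trits or 12 bits long.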
  • a custom Python script was used to map trits to template sequences H01-H12 shown in Table 1.
  • Nucleotide triphosphates were prepared at the following concentrations: 8mM dATP, 4mM dCTP, 4mM dGTP, and 16mM dTTP.
  • each template sequence (Table 1)
  • the required dNTP volumes corresponding to each transition type were dispensed (Table 3) in a 96-well PCR plate (VWR) using a Mantis liquid handler (Formulatrix), which has a minimum dispense volume of 0.2 µL.
  • initiator-conjugated polystyrene beads for each of the twelve template sequences were suspended in an enzymatic reaction mix, comprised of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with 1 U/µL TdT (Enzymatics) and 1 mU/µL apyrase (NEB).
  • 1x Custom Synthesis Buffer 14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000
  • 1 U/µL TdT Enzymatics
  • 1 mU/µL apyrase (NEB)
  • a universal adapter was ligated to the 3' end of the synthesized strands using a hybridization-based strategy as previously described (C. K. Kwok, Y. Ding, M. E. Sherlock, S. M. Assmann, P. C. Bevilacqua, A hybridization-based approach for quantitative and low-bias single-stranded DNA ligation. Anal. Biochem. 435, 181-186 (2013)).
  • the 5P-rSBS9-GGG adaptor (/5Phos/AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC T/ideoxyU/CCGATCT GGG/3SpC3/) forms a hairpin with a 5' polyG overhang which hybridizes to single-stranded DNA strands ending in polyC.
  • the beads carrying synthesized DNA with polyC tail were resuspended in a reaction composed of ⁇ ⁇ 5P-rSBS9-GGG adaptor, 1X NEB T4 DNA Ligase Buffer, 20% PEG 8K, 500 mM Betaine, and 6 units of T4 DNA Ligase (Enzymatics). The ligation mixture was incubated at 16 °C overnight.
  • Each sample was column purified after amplification.
  • To add the complete Illumina sequencing adapters, amplified strands were diluted and used as a template for a PCR reaction with NEBNext Dual Indexing Primers. Each strand received a different index by real-time cycle-limited PCR for 15 cycles. Barcoded strands were then combined and sequenced single end using Illumina MiSeq v2 150 Micro. Samples were then combined and sequenced using Illumina MiSeq with reagent kit v2. Sequencing was done in one direction, starting from the forward primer in each sample for 150 bases. Oxford Nanopore sequencing
  • each Illumina-barcoded strand was diluted 100-fold in Tris-HCl pH 8.0 with 0.01% Tween-20 and amplified with nested primers, comprising a barcoding primer pair, LWB 01-12 from SQK-LWB001 (Oxford Nanopore), and 50 nM of primers PR2-P5 and 3580F-P7 for 10 cycles.
  • 5 µL of each strand was then pooled (60 µL total) and cleaned with 90 µL of Agencourt Ampure XP beads according to the manufacturer's protocol.
  • 1 µL of the pooled library was diluted with 9 µL.
  • the first step is to filter for the designed number of nucleotides which contain a terminal 'C', used for ligation, in compressed strands.
  • eight of twelve template sequences, specifically H01, H02, H04, H08, H09, H10, and H11, have 9 nucleotides to be synthesized.
  • four of the twelve template sequences, specifically H3, H5, H6, and H7, contain only 8 nucleotides to be synthesized.
  • the second step is to select the most frequently synthesized compressed strand variant.
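The two-step filter can be sketched as below. This is a minimal illustration under stated assumptions: the function names and example reads are hypothetical, "compression" is taken to mean collapsing each run of identical nucleotides to a single nucleotide, and the designed length is assumed to count the terminal 'C'.

```python
from collections import Counter
from itertools import groupby


def compress(read: str) -> str:
    """Collapse runs of identical nucleotides, e.g. 'AACCT' -> 'ACT'."""
    return "".join(base for base, _run in groupby(read))


def two_step_filter(reads, designed_length: int):
    """Step 1: keep compressed strands of the designed length ending in 'C'.
    Step 2: return the most frequent surviving variant (None if none survive)."""
    compressed = [compress(read) for read in reads]
    kept = [s for s in compressed if len(s) == designed_length and s.endswith("C")]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]


# Hypothetical reads for a 5-nucleotide compressed template 'ACTGC'.
reads = ["AACCTTGGC", "ACTTGCC", "ATGC", "ACCTGGGC", "AGTGC"]
best = two_step_filter(reads, designed_length=5)
```

In this toy input, the read with a missing nucleotide is removed by the length filter and the read with a substitution is outvoted by frequency.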
  • Reads in the opposite orientation were not processed. Data retrieval for each sequence was performed as above for Illumina sequencing with a two-step filter. Real-time data reconstruction with nanopore sequencing reads was simulated by applying the two-step data retrieval filter to a subsampled number of shuffled sequencing reads obtained up to a given time point. The 48-hour sequencing run was split into 2-hour increments. For each increment, the timestamps for all reads obtained during the entire sequencing run were shuffled and the number of reads corresponding to the total elapsed sequencing time up to the given increment was randomly sampled. The probability of correct retrieval was assessed by performing 10,000 decoding trials for each increment, and each time interval was expressed as a fraction of total sequencing time.
  • Encoding and decoding pipelines were implemented partly using the C++ programming language, compiled via a g++ compiler on an Ubuntu Linux operating system, and partly via specialized MATLAB (Mathworks) functions.
  • the message "Eureka!", consisting of 7 ASCII characters, equivalent to 56 bits of payload data, was encoded into 4 template sequences E1-E4, each containing 16 nucleotides.
  • the encoding steps consisted of data partitioning, addressing, and modulation of bit sequences to nucleotide sequences with no repeated bases (i.e., self-transitions). Modulation included the placement of synchronization nucleotides within DNA sequences as described herein.
  • sequence E0 was specified and designed for purposes of error analysis.
  • decoding consisted of sequence reconstruction from run-length compressed DNA strands via MAP estimation and consensus. Reconstructed E1-E4 DNA sequences were demodulated into bit sequences, and data were extracted by ordering according to addresses.
  • the initiator oligonucleotide Bio-U-LT2 was conjugated to streptavidin beads (Invitrogen) according to manufacturer instructions at 20% binding capacity and Biotin-14-dCTP was used to bind remaining free streptavidin. Blank beads, which have free streptavidin bound by Biotin-14-dCTP, were also prepared. Prior to use, the initiator-conjugated beads were diluted 10-fold with blank beads and washed with 1x Custom Synthesis Buffer without PEG.
  • Synthesis of E0-E4 was performed similarly to that described above. However, Bromo-dCTP was used instead of dCTP (Figs. 12A-12B) and concentrations of each dNTP regardless of transition type were fixed. The final concentrations of dNTPs for each cycle were as follows: ⁇ dATP, 15 ⁇ Bromo-dCTP, 5 ⁇ dGTP, and 15 ⁇ dTTP. As above, a series of dNTPs were dispensed for each nucleotide of the template sequence in a 96-well PCR plate.
  • initiator-conjugated magnetic beads were suspended in the enzymatic reaction mix, comprised of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with 1 U/µL TdT (Enzymatics) and 0.25 mU/µL apyrase (NEB).
  • 1x Custom Synthesis Buffer 14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000
  • 1 U/µL TdT Enzymatics
  • NEB 0.25 mU/µL apyrase
  • each reaction was pulse vortexed and incubated for 30 seconds at room temperature. Beads were collected by magnet and washed in 1x Custom Synthesis Buffer without PEG and resuspended with fresh enzymatic mix. The reaction mixture was then transferred to the next well containing the next nucleotide substrate. Following the last cycle, each sample was prepared for Illumina sequencing as described above. Complete Illumina sequencing adapters were added by real-time cycle-limited PCR for 12 cycles. Barcoded strands were then combined and sequenced as single-end 175 bp reads using Illumina MiSeq v2 Nano.
  • Sequences were trimmed as before to remove the 5' initiator oligo sequence (Bio-U-LT2) and the 3' universal oligo sequence (5P-rSBS9-GGG). Only reads which presented both sequences for trimming were retained for further analysis.
  • a sequence of non-identical nucleotides for each raw strand was extracted as above. Purified strands were obtained by selecting strands with raw lengths between 32-48 bases, corresponding to average extension lengths of 2 to 3 per template nucleotide. Purified strands were used for analysis of synthesis errors with Needleman-Wunsch and for sequence reconstruction of E1-E4 with the decoding pipeline.
  • DNA strands synthesized for each template sequence E1-E4 were randomly sampled from data according to a target number of reads, and then subjected to a two-step filter. A filter was first applied to include those DNA strands with read counts of either 1, 2, 3, 4, or 5, depending on the target number of reads, to exclude aberrant DNA strands, which could arise from combinations of synthesis and sequencing errors. A second filter was applied to rank DNA strands according to compressed strand lengths. A total of 10 top-ranked DNA strands were selected from all purified and filtered strands. These 10 strands were used to reconstruct each template sequence using MAP estimation and consensus implemented according to equations explained herein. The probability of correct retrieval of each template sequence E1-E4 was assessed by performing 500 decoding trials for each target number of reads. Each trial consisted of a random sampling of purified reads. Simulated large-scale storage systems
  • BCH Bose-Chaudhuri-Hocquenghem
  • the robustness of the codec was next assessed by performing 500 decoding trials for varying levels of synthesis accuracies.
  • a template sequence was randomly generated and ten compressed strands were synthesized by simulation with the Markov model. These compressed strands were used towards reconstruction of the template sequence via MAP estimation and probabilistic consensus.
  • Each reconstructed sequence of K nucleotides was demodulated into B bits, and decoded with a Matlab BCH decoder (Mathworks) to yield the decoded information bits.
  • the probability of correct data retrieval for a specific level of synthesis accuracy was computed as the fraction of successful decoding trials. Results for data retrieval were benchmarked on a multi-core server.
  • This example describes the evaluation of the use of 5-Bromo-dCTP (5Br-dCTP) as an alternative to natural dCTP in the synthesis reactions.
  • 5Br-dCTP 5-Bromo-dCTP
  • reaction components not including the dNTP were assembled in 18 µL while nucleotide triphosphates were prepared in 2 µL of water.
  • the 18 µL mix was composed such that, upon mixing with a 2 µL nucleotide triphosphate solution, the following initial composition would be obtained: 1X Custom Synthesis Buffer, 0.1 ⁇ LT2+3C initiator, 1 U/µL TdT, and 0.25 mU/µL apyrase.
  • the initial final concentration of the dNTP was varied at 2, 4, 8, 16, or 32 ⁇ .
  • the 18 µL mixture was added to a tube containing the 2 µL dNTP sample and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
  • a modular design for encoding and decoding digital information in DNA is presented (Fig. 35C, Figs. 39A-39B). While single monolithic architectures can be more efficient, modular designs allow for the optimization of encoding and decoding blocks separately. Such a distributed approach simplifies the design space considerably. Within individual blocks, error-correcting codes borrowed from traditional communication systems (e.g., Reed-Solomon, Fountain, BCH, LDPC) may be applied to handle multiple types of errors.
  • traditional communication systems e.g., Reed-Solomon, Fountain, BCH, LDPC
  • Information is stored in short sequences of DNA, and must be reassembled by a decoder. Alignment errors (e.g., missing or inserted nucleotides) due to inaccurate DNA synthesis or sequencing are more difficult to correct compared to substitutions or erasures common in communication systems.
  • encoding and decoding frameworks were presented, together defined as a codec, for storing and extracting information from populations of diverse DNA strands.
  • An important part of the encoding strategy is the placement of synchronization patterns which are regularly interspersed throughout data, allowing a decoder to compute accurate alignments from diverse synthesized strands. Synchronization patterns are inserted in the modulation step of the encoding pipeline, which translates rows of bits into DNA sequences which adhere to modulation constraints (Fig. 39B, Figs. 21A-21B).
  • the codec is inclusive of core components such as Reed-Solomon or Fountain codes utilized in prior DNA storage systems (Y.
  • the encoder first partitions data into ordered rows of bits, prefixing an address to each row to delineate its order in reassembly. Error-correction is incorporated within each row of bits, or block of rows to protect against synthesis errors, missing sequences, or low sequencing coverage.
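The partitioning and addressing step can be illustrated with the "Eureka!" parameters stated earlier (56 payload bits, 14 data bits and a 2-bit address per template sequence). This is a minimal sketch, not the disclosed implementation: the function names are hypothetical, addresses are assumed to be 0-based binary prefixes, and error correction and modulation are omitted.

```python
def text_to_bits(text: str) -> str:
    """ASCII text to a bit string, 8 bits per character."""
    return "".join(format(byte, "08b") for byte in text.encode("ascii"))


def bits_to_text(bits: str) -> str:
    """Inverse of text_to_bits."""
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode("ascii")


def partition_with_addresses(bits: str, payload_bits: int, address_bits: int) -> list:
    """Split bits into rows of payload_bits and prefix each row with its binary address."""
    if len(bits) % payload_bits:
        raise ValueError("data length must be a multiple of the payload size")
    rows = [bits[i:i + payload_bits] for i in range(0, len(bits), payload_bits)]
    if len(rows) > 2 ** address_bits:
        raise ValueError("not enough address bits for this many rows")
    return [format(index, f"0{address_bits}b") + row for index, row in enumerate(rows)]


def reassemble(rows, address_bits: int) -> str:
    """Order rows by their address prefix and concatenate the payloads."""
    ordered = sorted(rows, key=lambda row: int(row[:address_bits], 2))
    return "".join(row[address_bits:] for row in ordered)


rows = partition_with_addresses(text_to_bits("Eureka!"), payload_bits=14, address_bits=2)
recovered = bits_to_text(reassemble(list(reversed(rows)), address_bits=2))
```

Because every row carries its own address, the rows can be read back in any order, which is the property the address prefix is meant to provide for a pooled set of strands.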
  • the encoder outputs a book of template sequences, which are written by enzymatic synthesis to DNA strands.
  • the resulting strands can then be stored.
  • the stored strands are read by high-throughput DNA sequencing and transitions extracted to form a sequence of non-identical nucleotides, which is then fed into a decoder.
  • a crucial first step of the decoder is to harness information latent in diverse DNA strands by MAP estimation and probabilistic consensus (Fig. 13B).
  • the decoder is designed to function with minimal sequencing reads.
  • Existing approaches for strand alignments S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970); T. F. Smith, M. S. Waterman, Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981); C. Notredame, D. G. Higgins, J. Heringa, T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol.
  • the encoder (Fig. 39B) first partitions data into ordered rows of bits. Each row of bits is eventually stored in one template sequence of DNA. In subsequent paragraphs, this correspondence is maintained between rows of bits and template sequences of DNA. Each row is prefixed with a unique address to delineate its order in reassembly. Let $B$ denote the total number of bits stored per row, including both payload data and addresses. Let $A$ indicate the number of address bits. With $A$ bits, it is possible to address a total of $2^{A}$ template sequences, in which each template sequence stores $(B - A)$ bits of payload data.
  • the storage capacity is equal to the number of DNA sequences multiplied by the number of bits of payload data stored per template sequence.
  • the storage capacity is maximized by maximizing the total number of DNA template sequences. The following equations specify the storage capacity, and the maximum storage capacity.
  • Storage Capacity $= 2^{A}(B - A)$ bits, for $0 \le A \le B$.
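The capacity relation can be checked numerically against the figures quoted for the two experiments. In this sketch the function and parameter names are hypothetical: `total_bits` is the number of bits per template sequence and `address_bits` is the portion used for addressing.

```python
def storage_capacity_bits(total_bits: int, address_bits: int) -> int:
    """Number of addressable template sequences times payload bits per sequence."""
    if not 0 <= address_bits <= total_bits:
        raise ValueError("address_bits must lie between 0 and total_bits")
    return (2 ** address_bits) * (total_bits - address_bits)


# "hello world!" format: 12 bits per sequence, 11 of them used for addressing.
hello_bytes = storage_capacity_bits(12, 11) // 8
# "Eureka!" format: 16 bits per sequence, 15 of them used for addressing.
eureka_bytes = storage_capacity_bits(16, 15) // 8
# Largest capacity over all possible address widths for a 12-bit sequence.
best_12 = max(storage_capacity_bits(12, a) for a in range(13))
```

With these inputs the sketch reproduces the 256-byte and 4-kilobyte maxima stated above, and shows that the maximum for 12 bits is reached when all but one bit (or, equally, all but two bits) are spent on addressing.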
  • the goal of an encoder and decoder architecture is to recover both the address and payload data correctly. If the address is irretrievable or only partially reconstructed, the order of information is lost. In this sense, it is more critical to recover the address. If the address is correct, it is possible to correct errors in the payload data using redundant information stored in other DNA sequences. However, in the analyses, both the address and payload information (a total of B bits per sequence) are decoded reliably with equal error protection.
  • Reed-Solomon (RS) codes and Fountain codes which may be incorporated within the encoding and decoding architecture are briefly described (Fig. 39B). However, these codes were not explicitly used in the experiments or simulations. If synthesizing thousands or millions of DNA sequences, error-correction across multiple sequences is necessary to protect against the following types of errors: 1) Missing strands for particular sequences; 2) Strands with low sequencing coverage (i.e., too few reads of DNA strands for particular sequences after PCR-amplification and sequencing); 3) Sequences with detected errors after reconstruction from multiple strands; 4) Sequences with undetected errors in either the address or payload after reconstruction from multiple strands.
  • RS Reed-Solomon
  • the error locations within a block of reconstructed sequences are known and may be pinpointed by checking the addresses of all sequences. For example, missing sequences can be identified by their missing addresses which are not available in a block.
  • the error locations within a block of reconstructed sequences are unknown and undetected.
  • the fourth type of error is not accommodated by most Fountain codes which are specialized only to handle erased/missing sequences.
  • Fountain codes were originally applied in packet communication networks for recovering missing packets at a high-level abstraction layer in the communication protocol stack. While RS codes can correct undetected errors in the payload data, they assume that the address per sequence is reconstructed correctly.
  • an RS code is applied in the vertical direction across multiple rows/sequences of bits (Fig. 39B).
  • a total of B bits per row exist prior to RS encoding, and a total of B bits per row exist after RS encoding (Fig. 39B).
  • the organization of information in the horizontal direction is unaltered.
  • the RS code inserts extra rows of redundant parity bits. Each extra row of parity bits contains its own unique address which is utilized by the RS decoder for error-correction.
  • RS($n_{rs}$, $k_{rs}$) code which has a minimum Hamming distance of $(n_{rs} - k_{rs} + 1)$.
  • $k_{rs}$ rows store address and payload information bits, while $(n_{rs} - k_{rs})$ additional rows store RS parity bits.
  • the RS code is able to correct up to $E$ sequences with known error locations within a block of sequences, and $U$ sequences containing undetected errors, where $2U + E \le (n_{rs} - k_{rs})$.
  • the undetected errors cost twice as much in terms of added redundancy required.
  • an RS(255,223) code is specified, which corrects up to 16 sequences with undetected errors within a block of sequences, or corrects up to 32 sequences with known error locations within a block.
  • the RS code may be applied to a block of $n_{rs}$ sequences of bits as a layer of protection for both detected and undetected errors, with the assumption that the address for sequences is known (Fig. 39B).
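The trade-off between known-location and undetected errors can be made concrete with a short check. This sketch only evaluates the standard Reed-Solomon correction condition (twice the number of undetected errors plus the number of known-location errors must not exceed the number of parity rows); the function names are hypothetical and no actual RS encoding or decoding is performed.

```python
def rs_min_distance(n: int, k: int) -> int:
    """Minimum Hamming distance of a Reed-Solomon code with n symbols and k data symbols."""
    return n - k + 1


def rs_can_correct(n: int, k: int, undetected: int, known: int) -> bool:
    """True if a block with this many undetected errors and known-location
    errors (erasures) is within the correction capability of the code."""
    return 2 * undetected + known <= n - k


# RS(255,223): 32 parity rows per block of 255 sequences.
only_undetected = rs_can_correct(255, 223, undetected=16, known=0)
only_known = rs_can_correct(255, 223, undetected=0, known=32)
mixed = rs_can_correct(255, 223, undetected=10, known=12)
too_many = rs_can_correct(255, 223, undetected=17, known=0)
```

The mixed case shows that the budget of 32 parity rows can be split between the two error types, with each undetected error consuming two units of that budget.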
  • ECCs error-correcting codes
  • BCH Bose-Chaudhuri-Hocquenghem
  • LDPC low-density parity check
  • Synchronization itself is insufficient for correct decoding.
  • a missing nucleotide in the compressed strand causes a synchronization error, but even if the position of the deletion is known via synchronization, the missing nucleotide must be recovered correctly.
  • the alignment and synchronization step of the decoder may resolve all errors perfectly by utilizing the diversity of synthesized strands per sequence. If enough diversity is available, the missing information in one strand variant may be recovered correctly from other variants. In this way, alignment and consensus algorithms have a probability of success for decoding correctly.
  • the number of nucleotides for template sequences must increase. In these systems, a few errors may still occur after the alignment and consensus step of the decoder.
  • BCH codes were applied to encode and decode bits stored per DNA sequence. LDPC codes could also provide similar error-correction capabilities.
  • BCH(63,57,1); BCH(63,45,3); BCH(63,33 ⁇ 44);
  • BCH(31,21,2), BCH(63,36,5), and BCH(127,57,11) codes were applied respectively. These BCH codes are applicable for DNA storage due to their short sequence length requirements, and efficient error-correcting abilities.
  • the parameters of the BCH code (code length, number of information bits, and error-correcting capability) directly affect overall system parameters.
  • the coding scheme establishes baseline efficiencies in simulations, towards a flexible-write strategy for DNA storage. The level of efficiency for coded systems is anticipated to improve.
  • a principal element of DNA storage is the encoder's mapping from bits to template nucleotides (modulation), as well as the decoder's mapping from nucleotides of reconstructed sequences to bits (demodulation).
  • modulation maps $B$ bits to $K$ nucleotides: $b_1 b_2 b_3 \ldots b_B \rightarrow a_1 a_2 a_3 \ldots a_K$. In the ideal case, one template nucleotide stores a maximum of 2 bits. Therefore, an upper bound for every modulation scheme is the limit $B \le 2K$.
  • a demodulation step maps $K$ nucleotides to $B$ bits: $a_1 a_2 a_3 \ldots a_K \rightarrow b_1 b_2 b_3 \ldots b_B$.
  • $B = 2K$ is not achievable for several reasons.
  • the controlled process of synthesis adds each nucleotide one by one. According to a specific concentration of nucleotide triphosphates, each nucleotide is added correctly to the strand, or an error such as a missing nucleotide in the strand (deletion) may occur.
  • a current design constraint for enzymatic synthesis is to specify sequences with non-identical nucleotides (e.g., without AA, TT, CC, GG transitions). Specifying information only in sequences of non-identical nucleotides allows for facile data processing. Further work to account for polymerization extension lengths could remove such a constraint.
  • Constraints reflecting valid and invalid transitions between nucleotides may be expressed via a transition matrix $\Gamma$ (Figs. 21A-21B).
  • An upper bound for the maximum amount of bits stored per nucleotide is $\log_2 \lambda_{\max}(\Gamma)$, where $\lambda_{\max}(\Gamma)$ is the maximum eigenvalue of $\Gamma$. For enzymatic synthesis in this paper, self-transitions were forbidden, leading to an upper bound of $B \le K \log_2(3) \approx 1.58K$ (Fig. 21A).
  • minimizing the use of certain transition types, such as CA or CG would improve synthesis accuracy but reduce the amount of information bits stored per template nucleotide (Fig. 21B).
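The eigenvalue bound can be illustrated numerically. The sketch below uses a plain power iteration (an illustrative choice, not a method stated in the disclosure) on 0/1 matrices whose entry is 1 when a transition between two nucleotides is allowed. The row and column order A, C, G, T and the second matrix, which additionally forbids CA and CG, are assumptions made for illustration and are not taken from Figs. 21A-21B.

```python
import math


def max_eigenvalue(matrix, iterations: int = 200) -> float:
    """Largest eigenvalue of a nonnegative square matrix by power iteration."""
    size = len(matrix)
    vector = [1.0] * size
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size)) for i in range(size)]
        eigenvalue = max(abs(x) for x in product)   # scale factor of this step
        vector = [x / eigenvalue for x in product]  # renormalize so the largest entry is 1
    return eigenvalue


# Rows: current nucleotide; columns: next nucleotide; order A, C, G, T.
no_self_transitions = [
    [0, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
# Hypothetical stricter constraint: additionally forbid C->A and C->G.
no_ca_cg = [
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]

bound_basic = math.log2(max_eigenvalue(no_self_transitions))
bound_strict = math.log2(max_eigenvalue(no_ca_cg))
```

With only self-transitions forbidden the bound equals log2(3) bits per nucleotide; removing further transition types lowers the bound, which is the density cost described above for avoiding error-prone transitions.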
  • An important aspect of the modulation step of the encoder (Fig. 39B) is the insertion of synchronization nucleotides at regular intervals within each sequence.
  • embedded synchronization patterns provide resilience against alignment errors. The error- resilience is boosted significantly during the alignment and consensus step of the decoder (prior to the demodulation step in the pipeline).
  • Synchronization nucleotides are also utilized in the demodulation step of the decoder (Fig. 39B). As a tradeoff, the inclusion of synchronization nucleotides reduces the total space allocated for address and payload information.
  • each template nucleotide either stores 1 bit or 1 trit, or is selected for synchronization (Fig. 22A). Without the necessity for synchronization nucleotides, it would be possible to store up to 1.5 bits per template nucleotide (close to the upper bound of $\log_2(3)$ bits per nucleotide) by converting all input bits directly into trits.
  • the modulation scheme for specific sequences El, E2, E3, E4 synthesized in experiment is provided (Figs. 25A, 22B).
  • the demodulation step of the decoding pipeline attempts to reverse the steps of modulation.
  • demodulation converts a sequence of nucleotides into a mixture of bits and trits, and subsequently extracts a sequence of bits according to tables of conversion (Fig. 22B, Table 9). If errors exist within the sequence of nucleotides, the demodulation step may also output a sequence of bits containing errors. Synchronization nucleotides (Figs. 22A-22B) ensure that errors are localized within a sequence to some degree, limiting a propagation of errors.
  • the modulation scheme for simulations is nearly identical to the modulation scheme used in the "Eureka! " experiment and includes a similar synchronization pattern embedded per sequence.
  • a sequence of bits is converted to a mixture of bits and trits, and then to a sequence of nucleotides. It is noted that the intermediate mixture of bits and trits is designed to facilitate placement of information between synchronization nucleotides, while also ensuring that no self-transitions are possible.
  • the demodulation step consists of reciprocal conversions to map a sequence of nucleotides to a sequence of bits (Table 9).
  • the following table specifies the conversion of B bits per sequence to K nucleotides per template sequence for all DNA storage systems analyzed in this paper.
  • the conversion utilizes an intermediate form of information which consists of a mixture of bits and trits.
  • the demodulation step of the decoder reverses the steps of modulation.
  • the end-to-end efficiency rate of storage may be computed for all experimental and simulated systems. Specifically, starting with 12 bits of data and addresses stored per sequence, an ECC per sequence results in B bits per sequence. Then B bits per sequence are converted and modulated into K nucleotides per sequence, including synchronization nucleotides. The following table lists these efficiencies for information storage in template DNA sequences.
  • DNA storage was modeled as an input-output subsystem involving only a sequence of B bits (Fig. 39B). Based on this abstraction, the input to the DNA storage system can be represented by a sequence of B bits prior to modulation. Similarly, the output can be represented by a sequence of B bits, obtained after demodulation. The output bit sequence may contain errors.
  • Random input sequences of B bits were generated, and output sequences of B bits were obtained by simulating a subsystem within the encoding and decoding pipeline (Fig. 39B).
  • the probability of bit error, denoted by $P_{\text{bit-error}}$, was estimated by averaging over all input-output bit sequences.
  • the capacity was derived to be $B\,(1 - H_2(P_{\text{bit-error}}))$ bits. In this standard capacity formula for a bit-error memoryless channel, $H_2(\cdot)$ denotes the binary entropy function (T. M. Cover, J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012)).
  • the capacity in bits, up to a maximum of B bits per template sequence, was plotted for different levels of synthesis accuracy (Figs. 36A-36F).
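By way of illustration only, the capacity computation for a bit-error memoryless channel can be sketched as follows. The function names and the way bit sequences are represented (equal-length strings of '0' and '1') are assumptions made for this sketch and are not taken from the disclosure.

```python
import math

def binary_entropy(p: float) -> float:
    """Binary entropy H2(p) in bits; taken as 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def estimate_bit_error(sent, received) -> float:
    """Fraction of differing bits, averaged over all input-output pairs.

    Assumes each received sequence has the same length as its input.
    """
    errors = sum(a != b for s, r in zip(sent, received) for a, b in zip(s, r))
    total = sum(len(s) for s in sent)
    return errors / total

def capacity_bits(num_bits: int, p_bit_error: float) -> float:
    """Capacity B * (1 - H2(p)) of B uses of a bit-error memoryless channel."""
    return num_bits * (1.0 - binary_entropy(p_bit_error))
```

For example, `capacity_bits(33, estimate_bit_error(inputs, outputs))` would give the capacity in bits for hypothetical lists `inputs` and `outputs` of 33-bit strings.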
  • a template sequence of 38 nucleotides could store 10 more bits of data and addresses, an increase from 23 to 33 bits (Fig. 36A).
  • 27 and 70 more bits of data and addresses could be stored per template sequence of 74 and 152 nucleotides, respectively at the same level of synthesis accuracy (Figs. 36C and 36E).
  • the codec was also tested with a combination of missing nucleotides, substitutions, and insertion errors (Figs. 36B, 36D, and 36F).
  • Enzymatic synthesis produces populations of diverse strand variants from each DNA sequence.
  • the presence of diversity in DNA strands enables a larger set of strategies for synthesis, storage, and sequencing.
  • Encoding DNA sequences with synchronization patterns (i.e., scaffolding)
  • the term scaffolding is used to denote specially designed synchronization patterns in DNA sequences. This section describes algorithms for the alignment of diverse DNA strands by scaffolding and consensus.
  • the $i$-th synthesized compressed strand comprises a random number of nucleotides, and its random length is represented by the random variable $L_i$.
  • One particular realization of the $i$-th synthesized strand is denoted by the following vector:
  • the length $l_i$ is a realization of the random variable $L_i$. Given a set of synthesized strands, a decoder must estimate correctly which original sequence was intended for storage. This estimation is computed based on the probabilistic framework of the Markov model (Fig. 23A). Such a framework is common to and adapted from the framework of synchronization codes used in traditional communication systems (24).
  • Optimal alignment of diverse strands
  • the method for aligning diverse strands is based on maximum a posteriori (MAP) estimation of each nucleotide.
  • the notation $\{\cdot\}$ indicates a set of events occurring simultaneously. Realizations of random variables are denoted by lower-case symbols in the above formula. Associated probabilities are computed based on a Markov chain model which characterizes how synthesized DNA strands (outputs) are produced from an input sequence (Fig. 23A).
  • the optimal alignment is computed efficiently via dynamic programming recursions (explained in subsequent sections) if the number of strands is a small constant, and if the length of DNA sequences is short. While sequence lengths are short in DNA storage systems, the number of synthesized strands per sequence may be large. Therefore, it is critical to design approximations to the above exact optimization. For future algorithmic designs, it is noted that a superior alignment may be estimated for all input nucleotides $O_1, O_2, \ldots, O_K$ jointly. However, individual probability estimates computed per nucleotide allow for the direct application of consensus rules and error-correction after alignment.
  • the above product rule may be derived from Bayes' theorem directly, and is related to a simple Bayes classifier.
  • the given consensus optimization is computed efficiently via dynamic programming recursions, and remains tractable even for an increasing number of strand variants. Its computational complexity scales linearly in the number of strands.
  • the key difference between the above consensus product rule and the optimal solution of alignment is that the inner probability only involves a single strand, as opposed to all strands jointly. As the number of strands increases, the inner probability may be computed for each strand separately and efficiently, after which a product rule is applied.
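A minimal sketch of a per-strand product rule of this kind is given below. It assumes a uniform prior over nucleotides and assumes that the per-strand posterior probabilities for each template position have already been computed (for example, by the dynamic programming recursions); the data layout and names are hypothetical and are not the disclosure's notation.

```python
import math

NUCLEOTIDES = "ACGT"

def consensus_product_rule(per_strand_posteriors):
    """Combine per-strand posteriors with a product rule (uniform prior assumed).

    per_strand_posteriors[s][k] is a dict mapping each nucleotide x to an
    estimate of P(template nucleotide k = x | strand s). Strands are treated
    as independent, so the work grows linearly with the number of strands.
    """
    num_positions = len(per_strand_posteriors[0])
    consensus = []
    for k in range(num_positions):
        # Sum log-probabilities instead of multiplying to avoid underflow.
        log_scores = {x: 0.0 for x in NUCLEOTIDES}
        for strand in per_strand_posteriors:
            for x in NUCLEOTIDES:
                log_scores[x] += math.log(max(strand[k][x], 1e-12))
        consensus.append(max(NUCLEOTIDES, key=lambda x: log_scores[x]))
    return "".join(consensus)
```

Because each strand contributes one term per position, adding strands only adds terms to the sum, which is consistent with the linear scaling noted above.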
  • Dynamic programming is designed to utilize pre-existing computations in a recursive manner.
  • a two-dimensional table of probabilities is populated in the "forward" direction.
  • a two-dimensional table of probabilities is populated in the "backward” direction.
  • the following table summarizes the recursive computations required. The sum of the probabilities in each column of the table yields the forward and backward probabilities, respectively.
  • MAP estimation by scaffolding is provided (Figs. 23A-23B and Figs. 24A-24E).
  • Decoding by alignment is possible because of the synchronization pattern embedded as a scaffold in the template sequence.
  • the synchronization nucleotides provide strong cues for the correct placement of other nucleotides.
  • the ⁇ / ⁇ probabilities include and ⁇ ) as well as e C » *
  • groupwise alignment from three strands may be computed:
  • Another algorithm for alignment, termed majority voting alignment, consists of greedy consensus (T. Batu, S. Kannan, S. Khanna, A. McGregor, Reconstructing Strings From Random Traces, in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, Louisiana, USA, January 11-14, 2004, pp. 910-918). It was found that such an algorithm was not sufficient to correct a large number of errors such as missing nucleotides, given only 10 filtered strands (Figs. 38A-38D). However, majority voting alignment may be combined with codes such as repetition coding to correct a larger number of errors. A full analysis of a coded form of majority voting alignment is an interesting direction to explore for future algorithmic designs.
  • Considerations of increased diversity for consensus
  • Enzymatic synthesis of a template sequence produces raw strands (strands$_R$) with variable extension length per nucleotide. From each strand$_R$, transitions can be extracted to form compressed strands (strands$_C$). Each strand may be of variable length. For subsequent analyses in this section, the distribution of strand lengths was modeled, and the number of diverse strand variants of each length was computed. Edit distances between synthesized strand variants and the original template sequence were also provided, along with detailed error analyses.
  • Synthesis errors resulting in missing nucleotides (deletions) or insertions directly affect the length of a strand$_C$, unlike conventional errors such as substituted (mismatched) nucleotides.
  • a mathematical model was constructed for nucleotide errors occurring in synthesized strands$_C$.
  • the model is a Markov model (Fig. 23A) with a state space indicating different types of nucleotide errors such as missing nucleotides (deletions), substituted nucleotides, and insertions.
  • each strand$_C$ variant is synthesized independently and according to identical error statistics, as specified in the Markov model (e.g., $P_{\text{sub}}$, $P_{\text{ins}}$). The error process results in several unique synthesized strands$_C$.
  • These strands$_C$ can be aligned to reconstruct the original sequence. While reconstruction is possible through alignment and probabilistic consensus, often the exact determination of error events in strands is ambiguous. For example, a random insertion followed by a deletion of an intended nucleotide is indistinguishable from a substitution error (Fig. 23A).
  • There exist K nucleotides in the template sequence.
  • the length of a synthesized strand$_C$ is also a random variable, which was denoted here by $L$.
  • the length $L$ has a probability mass function $P_L(l)$. Assuming each write is independent of previous and future writes, the generating function for length $L$ is given by,
  • Binomial distribution (Special case)
  • Size selection processes, performed in silico or in vitro to keep only longer synthesized strand$_C$ variants, decrease the effective number of missing nucleotides to be resolved.
  • size-selection of strand$_C$ variants led to a reduction in the effective probability of missing nucleotides.
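The binomial special case and the effect of size selection can be illustrated with the following sketch. It assumes a deletion-only model in which each of the K template nucleotides is missing independently with a single probability `p_missing`; that parameter name, the threshold `min_length`, and the numerical values in the usage note are hypothetical and are not values reported in the disclosure.

```python
from math import comb

def length_pmf(K: int, p_missing: float):
    """P(L = l) for l = 0..K under independent deletions (binomial case)."""
    keep = 1.0 - p_missing
    return [comb(K, l) * keep**l * p_missing**(K - l) for l in range(K + 1)]

def effective_missing_probability(K: int, p_missing: float, min_length: int) -> float:
    """Average fraction of missing nucleotides among strands with L >= min_length."""
    pmf = length_pmf(K, p_missing)
    kept_mass = sum(pmf[min_length:])
    if kept_mass == 0.0:
        raise ValueError("no strands survive the size selection")
    mean_length = sum(l * pmf[l] for l in range(min_length, K + 1)) / kept_mass
    return 1.0 - mean_length / K
```

With, say, K = 38 and p_missing = 0.2, keeping only strands of length 34 or more yields an effective missing probability below 0.2, which mirrors the qualitative effect of size selection described above.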
  • Enzymatic synthesis not only produces strands$_C$ of different lengths, but also produces diverse strands$_C$.
  • Each strand$_C$ may contain errors such as missing nucleotides in different positions relative to its corresponding template sequence.
  • the theoretical diversity was compared with the experimentally observed diversity produced by enzymatic synthesis.
  • This upper bound is equivalent to the total number of strands of length $l$ obtained after $(K - l)$ deletion errors, and is independent of the template sequence itself.
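As an illustrative check only, and assuming that the count referred to is the number of ways to choose which $(K - l)$ positions are deleted (the binomial coefficient), the following sketch compares that count with the number of distinct variants actually obtainable from a given template. The function names and the example templates are hypothetical.

```python
from itertools import combinations
from math import comb

def deletion_upper_bound(K: int, length: int) -> int:
    """Ways to delete (K - length) of K positions; does not depend on the template."""
    return comb(K, length)

def distinct_variants(template: str, length: int) -> set:
    """All distinct strands of the given length reachable from the template by deletions."""
    positions = range(len(template))
    return {"".join(template[i] for i in kept) for kept in combinations(positions, length)}
```

For a template such as "ACGT" every choice of deletions gives a different strand, whereas a repetitive template yields fewer distinct strands than the bound.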
  • Proposition Define C ca _[ » JCjos the storage capacity.
  • the first inequality states that the number of nucleotides per sequence must increase, and not be held constant, to store enough bits.
  • the third inequality states that the number of nucleotides per sequence must increase at least by $\log_2 M$ in order to increase storage capacity. This bound indirectly implies that storing an address per sequence, which requires $O(\log_2 M)$ bits of storage per sequence, is a minimal requirement for reassembly.
  • the prototype comprises two main parts: a Mantis liquid handler, which has a single robotic arm that can be programmed to dispense one of six reagents at a time, and custom jigs, which were either laser cut (Epilog Legend 36EXT) or machined (gift from Formulatrix) to hold the glass slide acting as a solid support substrate for the DNA (Fig. 40A).
  • Initiator immobilization and surface preparation
  • a 5' amine-modified initiator oligo (5Aml2-fSBS3-ctgag) and a 3' amine-modified blocking oligo (10T-3Am) were covalently attached onto an aldehydesilane-coated microscope slide (Schott Nexterion Slide AL).
  • the blocking oligo was included to prevent unwanted interactions, such as adsorption, between the initiator or enzymes and the surface. To do this, an oligo mixture was created containing 2 µM 5Aml2-fSBS3-ctgag and 8 µM 10T-3Am in 3X SSC (1X SSC is 150 mM NaCl and 15 mM sodium citrate) and 1.5 M Betaine.
  • the oligo mixture was dispensed as 0.1 µL droplets onto the slide using a Mantis liquid handler (Figs. 40B-40C). Following the dispense, the slide was incubated at room temperature for 30 minutes in a parafilm-sealed Petri dish with Kimwipes saturated with 4X SSC. Then, the slide was transferred to a 100°C hotplate and dried for 30 minutes.
  • the synthesis procedure depends on precise and specific localization of nucleotide triphosphates and enzymatic mixes to initiator spots, which are denoted as features.
  • Once these droplets are dispensed, however, they are prone to spread unevenly and uncontrollably across the glass surface and may contaminate neighboring features.
  • To constrain the droplets, virtual "wells" were created for each feature by increasing the hydrophobicity in the areas between features. Dispensed droplets should then stay localized on each feature. 0.3 µL droplets containing 3X SSC and 1.5 M Betaine were first dispensed on top of the features using a Mantis liquid handler, and the slide was then dried for 30 minutes on a 100°C hotplate.
  • the slide is dipped in Sigmacote (Sigma), which produces a neutral hydrophobic film over the areas of the glass which do not contain features, dried under a fume hood for 5 minutes, then dried for 5 minutes on a 100°C hotplate. Afterwards, the slide is washed twice with 0.2% SDS and three times with distilled water (Invitrogen UltraPure). The slide was then stringently washed by placing it in a boiling solution of 0.2X SSC for 15 minutes, then in room temperature distilled water (Invitrogen UltraPure). Lastly, to reduce Schiff bases and unreacted aldehydes, the slide was incubated for 10 minutes in a sodium borohydride reducing solution.
  • the solution was prepared by dissolving 0.12 g of NaBH4 (Sigma) in 30 mL phosphate buffered saline (PBS, Invitrogen), then adding 10 mL of 100% ethanol. Afterwards, the slide was washed once with 0.2% SDS and three times with distilled water (Invitrogen UltraPure). The prepared slide is then kept in an ice-cold ethanol bath until use.
  • Each synthesis cycle was composed of the following six steps: (i) the slide is placed on a custom jig for the Mantis liquid handler; (ii) a 0.5 µL dispense of enzymatic reaction mix, comprised of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with ⁇ / ⁇ TdT (Enzymatics) and 0.25 mU/µL apyrase (NEB); (iii) a 0.1 µL dispense of a nucleotide triphosphate at the following 6X concentrations in 10% PEG 8000 + 0.05% Triton X-100: 60 µM dATP, 75 µM Br-dCTP, 18 µM dGTP, and () ⁇ !
  • the synthesized strands were then released from the slide surface by cleaving the uracils located on the 5' end of the initiators with USER enzymes.
  • the cleavage reaction mixture was composed of 0.18 units of UDG (Enzymatics) per µL, 0.18 units of Endonuclease VIII (Enzymatics) per µL, and 0.5 µM ttSBS9 in USER TE-T buffer (40 mM Tris-HCl pH 8.0, 1 mM EDTA, 0.01% Tween-20). The cleavage mixture was dispensed as 2 µL droplets with the Mantis liquid handler onto each of the features.
  • a sequencing library for each feature was generated next. Using cycle-limited real-time PCR, 5 µL of each feature was first amplified with the primers tSBS3 and ttSBS9, then with NEBNext Dual Indexing Primers for 15 cycles. Barcoded strands were then combined and sequenced single end using Illumina MiSeq v3 50.
  • each DNA sequence to be synthesized occupies a physical spot, also denoted as a feature, on a planar surface.
  • All DNA sequences are arranged as a 2D array to allow spatial addressing (x and y Cartesian coordinates). All DNA sequences are synthesized in parallel per cycle, that is, all features receive their first nucleotide during the first cycle, they then all receive their second nucleotide during the next cycle, and so on.
  • Each cycle consists of a series of reactions. Reagents for each reaction may be dispensed directly to each feature by non-contact inkjet dispense or to all features by first sealing the array surface to form a flow cell and then flushing the reagent through.
  • the reagent to be dispensed by inkjet is denoted as droplet whereas the reagent to be flushed is denoted as flowcell.
  • the total cost of reagents for a cycle of each synthesis process can be computed as follows:
  • $\text{Cycle\_cost}_{chem} = (\$_{fc} \times V_f) + (n \times \$_{dc} \times V_d)$
  • n is the total number of features
  • is a constant representing the cost of droplet reagents in enzymatic synthesis
  • V is the droplet volume in cubic centimeters
  • $\$_{fc}$ represents the cost of flowcell reagents in chemical synthesis
  • $\$_{de}$ represents the cost of droplet reagents in enzymatic synthesis.
  • Flowcell area ($A$) can be expressed as a function of number of features ($n$) and density ($D$) of features:
  • Cycle_cast 9ng ($ fsS X c X. n ⁇ D) -f- (n. X $ ⁇ X 3 ⁇ 4 X rf 3 )
  • the number of features and feature density from the Agilent SurePrint G3 system was then utilized as a physical basis for projecting reagent costs for synthesis.
  • phosphoramidite reagent cost per cycle is 0.626 USD whereas the enzymatic reagent cost per cycle is 0.055 USD 61.3) or 0.0044 USD (assuming 4,38), a ~11-fold and ~140-fold drop in cost, respectively.
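The arithmetic behind the quoted fold reductions can be checked with the short sketch below. The generic `cycle_reagent_cost` function reflects only the two-term structure described above (a flowcell term plus a per-feature droplet term); its parameter names and units are assumptions for illustration, and only the three per-cycle costs in USD are taken from the text.

```python
def cycle_reagent_cost(flowcell_price_per_cm3: float, flowcell_volume_cm3: float,
                       n_features: int, droplet_price_per_cm3: float,
                       droplet_volume_cm3: float) -> float:
    """Reagent cost of one cycle: flowcell reagents plus droplet reagents for n features."""
    flowcell_term = flowcell_price_per_cm3 * flowcell_volume_cm3
    droplet_term = n_features * droplet_price_per_cm3 * droplet_volume_cm3
    return flowcell_term + droplet_term

def fold_reduction(reference_cost: float, new_cost: float) -> float:
    """How many times cheaper new_cost is than reference_cost."""
    return reference_cost / new_cost

PHOSPHORAMIDITE_USD_PER_CYCLE = 0.626      # value quoted in the text
ENZYMATIC_USD_PER_CYCLE = (0.055, 0.0044)  # the two values quoted in the text

folds = [fold_reduction(PHOSPHORAMIDITE_USD_PER_CYCLE, cost)
         for cost in ENZYMATIC_USD_PER_CYCLE]
```

Here `folds` evaluates to roughly 11.4 and 142, in line with the approximately 11-fold and 140-fold reductions stated above.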
  • a cost-effective strategy would be to increase the number of synthesized features, $n$, for a given surface area (increasing the feature density, $D$, as a result) per cycle and to minimize the total number of cycles, thereby limiting flowcell reagent cost. This approach assumes that features are maximally packed, end-to-end, in a given surface area.
  • the flowcell area ($A$) can be alternatively expressed as a function of the number of features ($n$) and feature size diameter ($d$):
  • Efficiency rate of storage: For simplicity, the average efficiency rate of storage for both enzymatic and phosphoramidite synthesis is set to be equivalent, storing an average of 1 bit per template nucleotide. The rate for each approach may differ depending on factors such as synthesis accuracy and the required addition of error-correction codes per template sequence to ensure accurate information recovery. Altering the efficiency rate of storage for each process will change costs linearly, and the resulting difference between enzymatic and phosphoramidite approaches would likely be within an order of magnitude. Improvements to enzymatic synthesis will increase the efficiency rate of storage to be competitive with that of phosphoramidite synthesis. Such improvements will also influence the number of diversely synthesized strands needed for template reconstruction and inform the minimum required feature size.
  • Feature density: For the reagent cost per megabyte projections, features are maximally packed with no spacing between them. Practically, features are likely to be separated by a gap, usually a fraction of the feature size, to accommodate potential positioning errors when droplets are dispensed. The number of features will then decrease in inverse proportion to the square of the gap size (equation 11 to be modified accordingly). As this parameter is the same for calculating reagent costs for both phosphoramidite and enzymatic synthesis, altering the number of features may change absolute costs for each approach but relative comparisons between approaches will remain unchanged.
  • Feature size: Reaching the projected costs depends on overcoming significant engineering challenges associated with miniaturizing feature sizes.
  • Current inkjet printheads dispense 1-10 picoliter droplets, resulting in feature sizes of 15-38 microns (equation 4 and (FUJIFILM Dimatix collaborates with Agilent in developing Inkjet technology for advanced life sciences applications | Press Center
  • To reach the projected cost per megabyte equivalent to magnetic tape, phosphoramidite features must be ~40 nm, which requires dispensing a 0.016 attoliter droplet, whereas enzymatic features must be ~350-800 nm, which requires dispensing a droplet of 11-134 attoliters.
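The relationship between feature diameter and droplet volume quoted above can be approximated with the sketch below. It assumes that a droplet is a hemisphere whose diameter equals the feature diameter, so that the volume is $\pi d^3 / 12$; this geometric model is an assumption made here because it roughly reproduces the quoted figures, and it is not stated in the text.

```python
import math

NM3_PER_ATTOLITER = 1e6  # 1 attoliter = 1e-18 L = 1e6 cubic nanometers

def hemispherical_droplet_volume_al(diameter_nm: float) -> float:
    """Volume in attoliters of a hemisphere with the given diameter (assumed model)."""
    volume_nm3 = (math.pi / 12.0) * diameter_nm ** 3
    return volume_nm3 / NM3_PER_ATTOLITER
```

Under this assumption, diameters of 40 nm, 350 nm and 800 nm correspond to roughly 0.017, 11 and 134 attoliters, respectively, which is close to the values quoted above.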
  • Equipment amortization is another important, but often neglected, cost consideration.
  • Capital equipment costs are likely to increase significantly as DNA synthesis is scaled to achieve target costs.
  • specialty dispensers will be required.
  • the time required for a dispenser to find the correct feature to receive a droplet reagent, the dispenser seek time, becomes an important consideration.
  • positioning systems likely with nanometer-scale resolution will be required, which may be expensive or prone to breakdown. While these are all important factors, there is insufficient information to estimate relevant parameters. Accordingly, it was assumed for ease that all seek times could be instantaneous and thus equipment amortization for enzymatic and phosphoramidite would be primarily dependent on their respective cycle time.
  • a conservative estimate of enzymatic cycle time is ~4-fold shorter than phosphoramidite chemistry (Table 6), which could result in a shortened amortization schedule, further reducing total synthesis costs.
  • $V_f$ and $V_d$ represent flowcell and droplet volume, respectively.
  • the two highlighted values on the bottom of the Phosphoramidite Price section are and c .
  • Table 7. Parameters of DNA Storage Systems

Abstract

The disclosure provides a method of decoding a nucleotide sequence, the nucleotide sequence encoding a value corresponding to a format of information. The method includes determining the nucleotide sequence, identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and assigning a predetermined value to the identified transition or boundary or edge to create the value encoded in the nucleotide sequence corresponding to the format of information.

Description

METHODS OF ENCODING AND HIGH-THROUGHPUT DECODING OF
INFORMATION STORED IN DNA
RELATED APPLICATION DATA
This application claims priority to U.S. Provisional Application No. 62/575,017 filed on October 20, 2017, which is hereby incorporated herein by reference in its entirety for all purposes.
STATEMENT OF GOVERNMENT INTERESTS
This invention was made with government support under Grant No. R01-MH103910- 02 from the National Institutes of Health and Grant No. DE-FG02-02ER63445 from the US Department of Energy. The government has certain rights in the invention.
FIELD
The present invention relates in general to methods of using nucleotide transitions to encode information into a nucleotide sequence and high-throughput decoding of information stored in the nucleotide sequence.
BACKGROUND
DNA is a compelling data storage medium given its superior density, stability, energy-efficiency, and longevity compared to currently used electronic media (C. Bancroft, T. Bowler, B. Bloom, C. T. Clelland, Long-term storage of information in DNA. Science. 293, 1763-1765 (2001), V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, W. L. Hughes, Nucleic acid memory. Nat. Mater. 15, 366-370 (2016)). Recent studies have demonstrated that any digital data can be written in DNA, stored, and accurately read (G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337, 1628 (2012), N. Goldman et al., Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature. 494, 77-80 (2013), M. Blawat et al., Forward Error Correction for DNA Data Storage. Procedia Comput. Sci. 80, 1011-1022 (2016), Y. Erlich, D. Zielinski, DNA Fountain enables a robust and efficient storage architecture. Science. 355, 950-954 (2017), S. M. H. T. Yazdi, Y. Yuan, J. Ma, H. Zhao, O. Milenkovic, A Rewritable, Random-Access DNA-Based Storage System. Sci. Rep. 5, 14138 (2015), R. N. Grass, R. Heckel, M. Puddu, D. Paunescu, W. J. Stark, Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed Engl. 54, 2552-2555 (2015), J. Bornholt et al., in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '16 (ACM Press, New York, New York, USA, 2016), pp. 637-649). However, the adoption of DNA information storage remains limited due to its reliance on phosphoramidite chemistry, the only method currently available for de novo DNA synthesis.
This chemistry, designed for synthesizing DNA with single-base accuracy for biological applications, comprises several lengthy reactions that require expensive reagents and is thus incompatible with the speed and costs suitable for large-scale DNA information storage. Additionally, after more than three decades of development, only insignificant further improvements to synthesis time or cost have been reported or projected for chemical DNA synthesis (R. Carlson, Time for New DNA Synthesis and Sequencing Cost Curves - SynBioBeta. SynBioBeta (2014), D. A. Lashkari, S. P. Hunicke-Smith, R. M. Norgren, R. W. Davis, T. Brennan, An automated multiplex oligonucleotide synthesizer: development of high-throughput, low-cost DNA synthesis. Proc. Natl. Acad. Sci. U. S. A. 92, 7912-7915 (1995)). There thus remains a need for the development of faster, more accurate and cheaper enzymatic DNA synthesis methods than the existing chemical synthesis methods for large-scale information storage.
According to one aspect, the present disclosure provides a method of decoding a nucleotide sequence, the nucleotide sequence encoding a value corresponding to a format of information. In one embodiment, the method includes determining the nucleotide sequence, identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and assigning a predetermined value to the identified transition or boundary or edge to create the value encoded in the nucleotide sequence corresponding to the format of information. In another embodiment, the nucleotide sequence encodes a series of values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of values encoded in the nucleotide sequence corresponding to the format of information. In some embodiments, the value corresponding to the format of information can be obtained from analog, digital, optical, visible or non-visible wavelengths, chemical, or physical input sources. In certain embodiments, the value is a digital value and the series of values are digital values. In some embodiments, the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more. In other embodiments, the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more. In some embodiments, the format of information is selected from the group consisting of text, image, video or audio format, sensor data, and combinations thereof.
In certain embodiments, the different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides. In other embodiments, the different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine. In one embodiment, the nucleotide sequence includes at least one nucleotide homopolymer. In another embodiment, the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence. In yet another embodiment, the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence and wherein the transition between one nucleotide homopolymer to a different or nonidentical nucleotide homopolymer is a single transition or boundary or edge. In one embodiment, the series of digital values comprises two different digital values. In another embodiment, the series of digital values comprises three different digital values. In yet another embodiment, the series of digital values comprises more than three different digital values. In still another embodiment, each digital value in the series of digital values represents two, three or more different digital values. In one embodiment, each nucleotide transition or boundary or edge is assigned a predetermined digital value. In another embodiment, the step of determining the nucleotide sequence is carried out by sequencing methods including nanopore sequencing, sequencing-by-synthesis, sequencing-by-ligation, and sequencing-by-hybridization. In one embodiment, the step of determining the nucleotide sequence is carried out by nucleotides modified with reversible terminators. In another embodiment, the step of determining the nucleotide sequence is carried out by detection of pyrophosphate or hydrogen ions generated during DNA polymerization of a complementary nucleotide strand.
In one embodiment, the step of determining the nucleotide sequence is carried out by ligation of fluorescently modified single-stranded nucleotides with complementarity to the nucleotide sequence to be sequenced.
In one embodiment, the series of digital values includes a corresponding barcode. In another embodiment, the method further includes decoding a plurality of nucleotide sequences, each member of the plurality encoding for an identical value corresponding to the format of information, wherein the nucleotide sequence is determined for each member of the plurality, and identifying a transition or boundary or edge between different nucleotides of each member of the plurality and assigning a predetermined value to each identified transition or boundary or edge to create the identical value corresponding to the format of information. In yet another embodiment, each member of the plurality of the nucleotide sequence encodes a series of identical values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of each member of the plurality of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of identical values encoded in each member of the plurality of the nucleotide sequence corresponding to the format of information.
In one embodiment, the nucleotide sequence is attached to a substrate. In another embodiment, each member of the plurality of nucleotide sequence is attached to a substrate. In one embodiment, the series of digital values is a bit or trit stream and the nucleotide sequence corresponds to a bit or trit sequence within the bit or trit stream. In another embodiment, the series of digital values is a bit or trit stream and the bit or trit stream comprises a plurality of bit or trit sequences each having a corresponding barcode to indicate position within the bit or trit stream and with the plurality of bit or trit sequences having a corresponding plurality of nucleotide sequences, wherein each member of the plurality of nucleotide sequences is sequenced, and identifying a plurality of transitions or boundaries or edges between different nucleotides of each member of the plurality and assigning a predetermined bit or trit value to each transition or boundary or edge of the plurality of transitions or boundaries or edges to create the bit or trit sequences corresponding to each member of the plurality.
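By way of a non-limiting illustration, the transition-based decoding described in this aspect can be sketched as follows. The sketch collapses each homopolymer run into a single nucleotide and assigns a trit to each transition. The particular transition-to-trit table (the index of the new nucleotide among the three nucleotides that differ from the previous one, in the order A, C, G, T) and the convention that the first nucleotide serves only as a reference are assumptions chosen for this example and are not the predetermined values of the disclosure.

```python
from itertools import groupby

ALPHABET = "ACGT"

def collapse_homopolymers(raw_strand: str) -> str:
    """Reduce every run of identical nucleotides to one nucleotide."""
    return "".join(nt for nt, _run in groupby(raw_strand))

def decode_transitions(raw_strand: str) -> list:
    """Assign one trit (0, 1 or 2) to each transition between different nucleotides."""
    compressed = collapse_homopolymers(raw_strand)
    trits = []
    for prev, cur in zip(compressed, compressed[1:]):
        # Hypothetical table: rank of cur among the nucleotides other than prev.
        others = [nt for nt in ALPHABET if nt != prev]
        trits.append(others.index(cur))
    return trits
```

For example, the raw strand "AAACCGGGGTTA" collapses to "ACGTA", and its four transitions decode to the trits 0, 1, 2, 0 under the assumed table, regardless of how long each homopolymer run is.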
According to another aspect, the present disclosure provides a method of decoding a nucleotide sequence encoding for a series of digital values corresponding to a format of information. In one embodiment, the method includes determining the nucleotide sequence to identify nucleotide homopolymers and for each homopolymer assigning one or more of the nucleotides based on a predetermined predicted homopolymer length of the nucleotide produced using enzymatic or chemical synthesis, and assigning a particular digital value for each of the one or more nucleotides. In another embodiment, the predicted homopolymer length is determined from empirical observation. In yet another embodiment, the predicted homopolymer length is a median, a mean, or a mode. In certain embodiments, the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof. In other embodiments, the nucleotides comprise natural nucleotides or nonnatural nucleotides. In one embodiment, the nucleotides comprise adenine, cytosine, guanine, and thymine.
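One possible reading of this homopolymer-length approach is sketched below: each observed run is divided by a predicted homopolymer length for that nucleotide and rounded to estimate how many template nucleotides it represents. The predicted lengths in `expected_run_length` are placeholder values for illustration (in practice they would come from empirical observation, such as a median, mean, or mode), and the rounding rule is an assumption.

```python
from itertools import groupby

def call_nucleotides_from_runs(raw_strand: str, expected_run_length: dict) -> str:
    """Estimate the template nucleotides represented by each homopolymer run.

    expected_run_length maps a nucleotide to its predicted homopolymer length
    per template nucleotide (hypothetical values supplied by the caller).
    """
    called = []
    for nt, run in groupby(raw_strand):
        observed = len(list(run))
        # At least one nucleotide is called for every observed run.
        copies = max(1, round(observed / expected_run_length[nt]))
        called.append(nt * copies)
    return "".join(called)
```

With a predicted length of 3 for every nucleotide, the strand "AAAAAACCCGGGGGGG" would be called as "AACGG"; a digital value can then be assigned to each called nucleotide.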
According to one aspect, the present disclosure provides a method of sequencing and decoding a plurality of nucleotide sequences representing a format of information wherein each nucleotide sequence encodes a portion of the format of information and wherein each portion of the format of information has more than two corresponding nucleotide sequences. In one embodiment, the method includes determining the sequences and decoded series of digital values for the sequences within a first portion of the plurality of nucleotide sequences, translating the series of digital values into the portions of the format of information, and sequencing and decoding in series additional portions into series of digital values and translating the series of digital values into the portions of the format of information until the entire format of information is achieved.
According to one aspect, the present disclosure provides a method of encoding a series of digital values corresponding to a format of information into a nucleotide sequence. In one embodiment, the method includes for each digital value, assigning a corresponding nucleotide to different or nonidentical nucleotide transition to generate the nucleotide sequence, synthesizing the nucleotide sequence, and optionally storing the nucleotide sequence. In some embodiments, the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more. In other embodiments, the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more. In some embodiments, the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof. In one embodiment, the nucleotides or different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides. In another embodiment, the nucleotides or different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine.
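The transition-based encoding described above can be illustrated with a minimal Python sketch. The mapping used here (each trit selects one of the three nucleotides that differ from the previous nucleotide, taken in alphabetical order) is a hypothetical stand-in chosen for illustration and is not asserted to be the mapping of Fig. 8C; the function names and the default starting nucleotide are likewise assumptions.

```python
NUCLEOTIDES = "ACGT"

def encode_trits(trits, start="A"):
    """Map each trit (0, 1, or 2) to a transition to a different nucleotide.

    Hypothetical mapping: the trit indexes into the three nucleotides that
    differ from the previous one, listed in alphabetical order.
    """
    previous = start  # assumed last nucleotide of the initiator
    sequence = []
    for trit in trits:
        choices = [n for n in NUCLEOTIDES if n != previous]
        previous = choices[trit]
        sequence.append(previous)
    return "".join(sequence)

def decode_transitions(sequence, start="A"):
    """Invert encode_trits: recover one trit per nucleotide transition."""
    previous = start
    trits = []
    for nucleotide in sequence:
        choices = [n for n in NUCLEOTIDES if n != previous]
        trits.append(choices.index(nucleotide))
        previous = nucleotide
    return trits
```

Because every trit selects a nucleotide different from the preceding one, the resulting sequence never contains two identical adjacent nucleotides, so each digital value corresponds to exactly one transition.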
According to one aspect, the present disclosure provides a method for high-throughput decoding of a format of information encoded in a plurality of nucleotide sequences. In one embodiment, the method includes providing a plurality of nucleotide sequences, wherein the plurality of nucleotide sequences represents a packet of information and the packet comprises at least one unique identifier; sequencing at least one of the plurality of nucleotide sequences using a selective sequencer; storing the sequence and its unique identifier; and preventing, using the selective sequencer, redundant sequencing of the same nucleotide sequence. In one embodiment, the step of preventing comprises using the unique identifier to prevent sequencing of additional nucleotide sequences with the same identifier. In certain embodiments, the selective sequencer is a nanopore sequencer or a sequencer compatible with sequencing-by-synthesis, sequencing-by-ligation and sequencing-by-hybridization methods. In one embodiment, the sequence is stored in computer memory. In another embodiment, the sequence is decoded into digital values. In one embodiment, the unique identifier is a synthetic sequence. In other embodiments, the unique identifier is located at the 3' end, the 5' end of the nucleotide sequence, or is interspersed within the nucleotide sequence. In one embodiment, the plurality of nucleotide sequences comprises a plurality of unique identifiers. In another embodiment, the method further includes sequencing a predetermined number of nucleotide sequences; assembling the packet of information; and analyzing the information to determine if the information is correctly decoded. In yet another embodiment, the method further includes permitting sequencing of any nucleotide sequences that were not correctly decoded. In one embodiment, the step of analyzing is performed using a decoding algorithm.
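The bookkeeping behind preventing redundant sequencing can be sketched in Python as follows. This is a software simulation under stated assumptions, not a sequencer interface: the function name `selective_sequence` is hypothetical, reads are modeled as (identifier, sequence) pairs, and the identifier is assumed to be available before the rest of the strand is read.

```python
def selective_sequence(reads):
    """Store the first read seen for each unique identifier and skip the rest.

    `reads` is an iterable of (identifier, sequence) pairs. Skipping a read
    stands in for a selective sequencer rejecting a redundant strand.
    """
    stored = {}
    skipped = 0
    for identifier, sequence in reads:
        if identifier in stored:
            skipped += 1  # identifier already stored: do not sequence again
            continue
        stored[identifier] = sequence
    return stored, skipped
```

To permit re-sequencing of a strand that was not correctly decoded, its identifier could simply be removed from `stored` before further reads are processed.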
According to one aspect, the present disclosure provides a method of encoding information using nucleotides. In one embodiment, the method includes converting a format of information into a sequence of binary ASCII bits, converting the sequence of binary ASCII bits into a sequence of ternary ASCII bits, converting the sequence of ternary ASCII bits into a corresponding oligonucleotide sequence such that one bit represents a transition between non-identical nucleotides, and synthesizing the corresponding oligonucleotide sequence by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleotide of the initiator oligonucleotide, and wherein the apyrase degrades excessive nucleotide triphosphates to inactive diphosphates and monophosphates, and (b) repeating step (a) until the corresponding oligonucleotide sequence is formed.
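The conversion of a character to binary and then to ternary digits can be sketched in Python as follows. The helper names and the fixed widths (7 binary digits and 5 ternary digits per ASCII character) are assumptions made for illustration; any address or index prefix is omitted.

```python
def to_base(value, base, width):
    """Return `value` as a list of `width` digits in `base`, most significant first."""
    digits = []
    for _ in range(width):
        digits.append(value % base)
        value //= base
    if value:
        raise ValueError("value does not fit in the requested width")
    return digits[::-1]

def char_to_bits(char, width=7):
    """ASCII code of `char` as binary digits (7 bits cover 0..127)."""
    return to_base(ord(char), 2, width)

def char_to_trits(char, width=5):
    """ASCII code of `char` as ternary digits (5 trits cover 0..242)."""
    return to_base(ord(char), 3, width)
```

For example, 'h' has ASCII decimal value 104, which is 1101000 in base 2 and 10212 in base 3 (81 + 0 + 18 + 3 + 2 = 104).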
According to another aspect, the present disclosure provides a method of decoding a format of information from a synthesized oligonucleotide sequence encoding bit sequences of the format of information. In one embodiment, the method includes amplifying the oligonucleotide sequence, sequencing the amplified oligonucleotide sequence, converting the oligonucleotide sequence to bit sequences wherein each bit represents a transition between non-identical nucleotides, and converting the bit sequences to the format of information. In another embodiment, the oligonucleotide sequence is ligated to a universal adaptor before amplification.
According to still another aspect, the present disclosure provides a method of storing information using nucleotides. In one embodiment, the method includes converting a format of information into a sequence of binary ASCII bits, converting the sequence of binary ASCII bits into a sequence of ternary ASCII bits, converting the sequence of ternary ASCII bits into a corresponding oligonucleotide sequence such that one bit represents a transition between non-identical nucleotides, synthesizing the corresponding oligonucleotide sequence by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleotide of the initiator oligonucleotide, and wherein the apyrase degrades excessive nucleotide triphosphates to inactive diphosphates and monophosphates, and (b) repeating step (a) until the corresponding oligonucleotide sequence is formed, and storing the synthesized corresponding oligonucleotide sequence.
In one embodiment, the nucleotide triphosphate comprises dATP, dTTP, dCTP, dGTP, and dUTP. In another embodiment, synthesis activity is modulated by the ratio of the amount of TdT : the amount of apyrase. In one embodiment, divalent cations comprise magnesium and cobalt. In another embodiment, the reaction mixture further comprises additives comprising glycerol, sucrose, PEG8000, betaine, DMSA, Triton-X100 and Tween20. In one embodiment, the 3' terminal nucleotide of the initiator oligonucleotide is preferably A, G or T. In another embodiment, a polyC tail is added to the end of the corresponding oligonucleotide sequence. In one embodiment, a washing step is included between steps (a) and (b). In another embodiment, an index is included in the oligonucleotide sequence to specify strand order. In one embodiment, the nucleotide sequence is synthesized by a template-independent DNA polymerase. In another embodiment, the template-independent DNA polymerase is terminal deoxynucleotidyl transferase (TdT). In another embodiment, the nucleotide sequence is synthesized by a mixture of a template-independent DNA polymerase and an apyrase. In one embodiment, the information is stored using a codec model. In another embodiment, the codec model is capable of correcting errors accumulated from synthesis, storage and sequencing. In one embodiment, the sequencing is streaming nanopore sequencing.
Further features and advantages of certain embodiments of the present invention will become more fully apparent in the following description of embodiments and drawings thereof, and from the claims.
BRIEF DESCRIPTION OF THE DRAWINGS
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. The foregoing and other features and advantages of the present embodiments will be more fully understood from the following detailed description of illustrative embodiments taken in conjunction with the accompanying drawings in which:
Fig. 1 depicts a schematic comparison of the number of steps required for a single coupling in enzymatic DNA synthesis vs phosphoramidite chemistry.
Figs. 2A - 2C depict results for optimizing and tuning the TdT:apyrase ratio. Fig. 2A depicts initiator extension with dATP, dCTP, dGTP or dTTP by four different TdT to apyrase ratios. TdT concentration is constant at 1 U/μL; apyrase concentration varies and is marked above each lane. mU is milliunits. Gels are 15% TBE-urea. "L" is ssDNA size marker. Figs. 2B & 2C depict extension of an initiator with various concentrations of dCTP (Fig. 2B) and dGTP (Fig. 2C) by 1:1000, 1:2000, 1:4000, and 1:8000 apyrase:TdT. Apyrase:TdT ratio, as well as dNTP concentrations, are marked above each lane. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional cytosines (Fig. 2B) or guanines (Fig. 2C).
Figs. 3A - 3C depict effects of cobalt on TdT:apyrase performance. Fig. 3A depicts an initiator extension with each dNTP by various ratios of TdT to apyrase in the presence of magnesium and the presence or absence of supplemental cobalt. TdT concentration is constant at 1 U/μL; apyrase concentration, which varies, as well as presence or absence of cobalt are marked above each lane. When present, cobalt is at 250 μM. Gels are 15% TBE-urea. "L" is ssDNA size marker. Fig. 3B depicts an initiator extension with 300 μM dATP in the presence of magnesium and increasing amounts of supplemental cobalt. Cobalt concentrations are marked above each lane. Gel is 15% TBE-urea. "L" is ssDNA size marker. Fig. 3C depicts an initiator extension with each dNTP by TdT:apyrase in magnesium-only or cobalt-only reactions. dNTP concentration is marked above each lane. Gel is 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional nucleotides of the corresponding base, that is, cytosines for the gel with cytosine extension.
Figs. 4A - 4C depict buffer and additives optimization for TdT:apyrase. Fig. 4A depicts an initiator extension with dATP by TdT:apyrase with increasing concentration of Enzymatics Green Buffer. Final buffer concentration is marked above each lane. Gels are 15% TBE-urea. "L" is ssDNA size marker. Fig. 4B depicts an initiator extension with a 500 μM mixture of all dNTPs by TdT:apyrase in the presence of various additives in different concentrations. Each lane is labelled with a number; the additive and its concentration in that lane are listed below the gels. Gels are 10% TBE-urea. "L" is an RNA size marker. Fig. 4C depicts an initiator extension with various dCTP concentrations by TdT:apyrase in the optimized buffer and the standard buffer. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator, as well as initiator synthesized with 1, 2, 3, 4, or 5 additional cytosines.
Fig. 5 depicts optimizing the polymerase to initiator ratio: initiator extension with dATP by TdT:apyrase with increasing concentration of TdT. Values above each lane mark the concentration of TdT in units per μL. Apyrase concentration is constant at 1 mU/μL. Gel is 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 27 bases long.
Fig. 6 depicts TdT: apyrase performance and nucleotide concentration optimization for all sixteen possible combinations of 3' base of the initiator and the incoming nucleotide triphosphate (4 by 4). Each combination is evaluated on five lanes. The concentration of the relevant nucleotide is shown in μΜ on top of each lane. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 27 bases long.
Fig. 7 depicts multiple consecutive rounds of extension using the TdT:apyrase reagent. Two different series of transitions are shown. The nucleotide that is added is marked on top of each lane. All samples that are shown on each gel were aliquots of the same reaction that were sampled after the addition of each nucleotide. Gels are 15% TBE-urea. "L" is ssDNA size marker and includes the unextended initiator which is 24 bases long.
Figs. 8A - 8C depict schematics for an enzymatic synthesis platform for DNA information storage. Fig. 8A shows a schematic depiction of the synthesis reaction consisting of an oligonucleotide initiator, terminal deoxynucleotidyl transferase (TdT) and apyrase (AP). TdT catalyzes the addition of nucleotides to the 3' end of the initiator, and apyrase degrades nucleotide triphosphates to terminate polymerization. Subsequent nucleotide triphosphates are added for further DNA synthesis. All synthesized strands share the same order of transitions between different nucleotides. Fig. 8B depicts a schematic conversion between DNA and information. Synthesized DNA polymers are processed in silico by extracting transitions, which are then mapped to trits and bits. Fig. 8C depicts the conversion between nucleotide transitions and trits used in this study.
Figs. 9A - 9D depict encoding "hello world!" in DNA using enzymatic synthesis. Fig. 9A depicts an overview of the encoding scheme. Each character is represented by its own DNA strand containing a header index. To encode each character, its respective ASCII binary representation is converted to ternary, then to nucleotide transitions according to the mapping in Fig. 8C. DNA is synthesized using the enzymatic strategy disclosed herein, then sequenced as a pool using Illumina or Oxford Nanopore platforms. Fig. 9B depicts strand fidelity of each strand by Illumina and Oxford sequencing platforms. Fraction of reads containing perfect transitions out of top 3 filtered reads (filled circles), all filtered reads (open circles), top 3 of all reads (filled triangles), and of all reads (open triangles). Reads were filtered based on transition length and a terminal dCTP which was synthesized for efficient ligation to a universal adapter (Materials and Methods). Fig. 9C depicts streams of nanopore sequencing data. Each read is represented as a light gray dot. Reads passing the correct number of transitions (dark gray) and those with correct transitions (black) are marked. For each strand, the vertical line marks the time where the correct data can be decoded with a 99.9% confidence from the collected sequences. Fig. 9D depicts data reconstruction using streaming nanopore sequencing compared to batch sequencing-by-synthesis (SBS). For each platform, the point of time at which the entire message can be decoded is marked by a box and an arrow.
Fig. 10 depicts profiling accuracy of each "hello world!" strand at every position. Illumina sequencing output was subjected to run-length encoding. The black line indicates the percentage of reads that contained a nucleotide. The bars indicate percentage of all reads that had a deletion, mismatch, or insertion at each position. As the frequencies of deletions and insertions are small, their bars are not visible in most positions.
Fig. 11 depicts the length distribution for each of the twelve synthesized strands. Lengths of all reads are denoted by the black line. Lengths of perfect reads are denoted by the gray shading. As perfect reads are longer, on average, size selection will increase the yield of correctly synthesized strands.
Figs. 12A - 12B depict the evaluation of 5-Bromo-dCTP and natural dCTP for TdT:apyrase. 5-Bromo-dCTP as a substitute for natural dCTP is evaluated. "L" is ssDNA size marker and includes the initiator oligonucleotide which is 27 bases long and ends in three cytosines. Fig. 12A depicts that the extension lengths were evaluated over indicated concentrations of natural dCTP. Fig. 12B depicts that the extension lengths were evaluated over indicated concentrations of 5-Bromo-dCTP (5Br-dCTP).
Figs. 13A - 13C depict an enzymatic synthesis strategy for storing information in DNA. Fig. 13A depicts a schematic of a series of enzymatic synthesis reactions consisting of an oligonucleotide initiator (N, gray), terminal deoxynucleotidyl transferase (TdT) and apyrase (AP). The initiator is tethered to a solid support. In each cycle, TdT catalyzes the addition of a given nucleotide triphosphate to the 3' end of all initiators while apyrase degrades the added substrate to limit net polymerization. A wash can be performed at the end of each cycle to remove reaction byproducts or to facilitate downstream processes. Fig. 13B depicts the DNA strands synthesized for each of eight consecutive synthesis cycles, as shown on a 15% TBE-urea gel. The initiators were not tethered to a solid support and no wash was performed between cycles. The first lane is a single-stranded DNA size marker which includes the 24-nucleotide-long initiator oligonucleotide. Fig. 13C depicts a schema for interconversion of DNA and information. Raw strands (strands^R) represent enzymatically-synthesized DNA. Compressed strands (strands^C) represent sequences of non-identical nucleotides. Transitions between nucleotides, starting with the last nucleotide of the initiator (as an example N = 'a', gray), are mapped from the compressed strand to digital data in trits. If strand^C is equivalent to the template sequence, all desired transitions are present and the information stored in DNA is retrieved.
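The conversion from a raw strand to a compressed strand, and the check against a template sequence, can be sketched in Python as follows. The function names are assumptions made for illustration, and the sketch ignores the initiator and any sequencing adapter.

```python
from itertools import groupby

def compress(raw_strand):
    """Collapse each homopolymer run to a single nucleotide (raw -> compressed)."""
    return "".join(nucleotide for nucleotide, _ in groupby(raw_strand))

def extension_lengths(raw_strand):
    """Run-length encode the raw strand as (nucleotide, run length) pairs."""
    return [(nucleotide, len(list(run))) for nucleotide, run in groupby(raw_strand)]

def is_perfect(raw_strand, template):
    """A raw strand is 'perfect' when its compressed form equals the template."""
    return compress(raw_strand) == template
```

For example, the raw strand AAACCGTTTT compresses to ACGT, so it counts as perfect against the template ACGT regardless of how many nucleotides were added in each extension.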
Figs. 14A - 14H depict the demonstration of information storage in DNA using enzymatic synthesis. Fig. 14A depicts that the message "hello world!" was encoded in twelve template sequences, H01-H12, each representing one character. Transitions between nucleotides start with the last base of the initiator, which is labeled 'g'. A header index (shaded gray) denotes strand order. Only results from the first five transition sequences are shown (see Fig. 15). To encode each character, its respective ASCII decimal value, prefixed with an address, is represented in base 2 (binary) or in base 3 (ternary) (see Table 1) and mapped to transitions (see Fig. 13C), resulting in template sequences with nucleotides to be synthesized (capitalized). Fig. 14B depicts the extension lengths for each base from Fig. 14A. Only perfect strands^R, those whose strand^C is equivalent to a template sequence, are considered. Synthesis was performed with initiators tethered to beads and sequencing was performed on the Illumina platform. Fig. 14C depicts the distribution of extension lengths for each nucleotide transition, combined across all positions from all perfect strands. Fig. 14D depicts the stepwise increases in strand^R length with an increasing strand^C length for all synthesized strands of H01-H12. Fig. 14E depicts the distribution of all strand^R lengths. Distributions are derived via kernel density estimation for all synthesized strands ('all', gray shading) and a subpopulation of strands that contain all desired transitions ('perfect', dotted line). Fig. 14F depicts the bulk error analysis for all synthesized strands of H01-H12. All strands^C were aligned, by Needleman-Wunsch, to their respective template sequences, and the number of mismatches, insertions, and missing nucleotides were tabulated. Fig. 14G depicts the information retrieval with in silico filtering. Fraction of perfect strands is shown before (triangles) or after filtering (circles). Fraction of perfect strands is shown for all sequences (white) or only the top 3 most abundant sequences (black). Fig. 14H depicts the information retrieval by different sequencing platforms. Streaming nanopore sequencing (Oxford) was compared to batch sequencing-by-synthesis (Illumina). Each dot indicates the fraction of the sequencing run at which each strand is robustly retrieved (100% correct with 99.99% probability). Arrow denotes the fraction of the sequencing run at which all data is robustly retrieved using each platform.
Fig. 15 depicts the extension lengths for perfect strands of H01-H12. Extension lengths for each nucleotide from perfect strands are displayed as a letter-value plot for each template sequence.
Fig. 16 depicts the raw lengths for all and perfect strands of H01-H12. All synthesized strands^R of H01-H12 were sequenced with Illumina. Length distributions for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands are shown. Distributions are derived via kernel density estimation. The number of all strands to perfect strands for each template sequence are as follows: H01 {all: 399363, perfect: 42337}; H02 {all: 431770, perfect: 62243}; H03 {all: 611804, perfect: 89302}; H04 {all: 903181, perfect: 200154}; H05 {all: 766896, perfect: 115345}; H06 {all: 635182, perfect: 126849}; H07 {all: 632859, perfect: 169767}; H08 {all: 707648, perfect: 113567}; H09 {all: 1008825, perfect: 207182}; H10 {all: 1176628, perfect: 406172}; H11 {all: 544045, perfect: 105730}; H12 {all: 512585, perfect: 68233}.
Fig. 17 depicts the synthesis error analysis for all strands of H01-H12. All synthesized strands^R were sequenced with Illumina and transitions of non-identical nucleotides were extracted to form strands^C. Each of these strands^C is aligned, by Needleman-Wunsch, to its respective template sequence. For each alignment, the number of mismatches, insertions, and missing nucleotides are tabulated.
Figs. 18A - 18B depict the nanopore sequencing and decoding of H01-H12. Fig. 18A depicts nanopore sequencing (Oxford) of synthesized raw strands. For each raw strand, the sequence of non-identical nucleotides is extracted to form compressed strands (strands^C). Fraction of perfect strands^C is plotted out of the set of all strands^C (filled triangles) or out of the set of the top 3 most abundant strands^C (open triangles). Strands^C can be filtered based on the design of the template sequence (Methods). Perfect strands^C appear at a higher fraction out of the set of strands passing this filter (open circles) or out of the set of the top 3 most abundant filtered strands^C (filled circles). Fig. 18B depicts the sequencing stream of raw strands from a nanopore array (Oxford). Reads which pass the data retrieval filter, that is, the expected number of strand^C nucleotides with a terminal 'C' (filled black), those with only the expected number of strand^C nucleotides (filled dark gray), and the remainder of reads (light gray) are plotted according to their time stamp. For each strand, the time corresponding to correct data retrieval with a 99.9% probability from the collected sequences up to that point is marked (vertical red line).
Figs. 19A - 19E depict the coded strand architecture for sequence reconstruction. Fig. 19A depicts a DNA information storage channel. Data is converted to template sequences, synthesized (strand^R), and can be stored in vitro. Retrieval starts with sequencing, then transitions of non-identical nucleotides are extracted in silico to form strands^C. Data retrieval occurs when the template sequence and reconstructed sequence are equivalent. Errors which occur in the synthesis and sequencing steps can be modeled as a communications channel. Fig. 19B depicts that the coded strand architecture, 'scaffold', enables data retrieval from strands^C with missing nucleotides when compared to an 'unguided' reconstruction. Synchronization nucleotides (dark gray boxes) localize errors to yield a single reconstructed sequence. Fig. 19C depicts that a 16-base transition sequence, E0, is synthesized and sequenced with Illumina. Examples of diverse strands^C produced by synthesis of E0 are shown. Strands are aligned, by Needleman-Wunsch, to the template. Ambiguous alignments can exist depending on the location and number of missing nucleotides within a strand^C. Fig. 19D depicts the error analysis for purified strands of E0. Synthesized strands were purified in silico, by filtering for strands^R between 32-48 bases in length, and aligned by Needleman-Wunsch to the E0 template. For each alignment, the number of mismatches, insertions, and missing nucleotides were tabulated. Fig. 19E depicts evaluating the diversity of synthesized strands. The number of sequencing reads for each length of strand^C was tabulated. Diversity was evaluated as the number of unique variants at each length of strand^C and the Levenshtein edit distance was computed with respect to the E0 template. The set of 802 purified strands contains 2 perfect strands.
Figs. 20A - 20C depict the synthesis error analyses and diversity of all synthesized strands of E0. All synthesized strands^R of E0 were sequenced with Illumina and transitions of non-identical nucleotides were extracted to form strands^C. Fig. 20A depicts the length distribution for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands. Distributions are derived via kernel density estimation. The number of all strands to perfect strands for the template sequence is as follows: E0 {all: 79192, perfect: 3}. For each raw strand, a sequence of non-identical nucleotides was extracted to form strand^C, which is then aligned, by Needleman-Wunsch, to its respective template sequence. Fig. 20B depicts that for each alignment, the number of mismatches, insertions, and missing nucleotides from strand^C are tabulated. Fig. 20C depicts that the number of sequencing reads at each length (number of nucleotides of strand^C) is tabulated. Diversity is evaluated as the number of unique variants at each strand^C length and the Levenshtein edit distance is computed between each strand^C and the E0 template sequence. Strands^C were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants.
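The diversity evaluation described above can be sketched in Python as follows. The function names are assumptions made for illustration; the edit distance is the standard dynamic-programming Levenshtein distance, and the read-count threshold of 3 follows the text.

```python
from collections import Counter

def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]

def diversity_by_length(compressed_strands, min_reads=3):
    """Count unique variants at each length, keeping variants seen min_reads times or more."""
    read_counts = Counter(compressed_strands)
    kept = [strand for strand, count in read_counts.items() if count >= min_reads]
    return dict(Counter(len(strand) for strand in kept))
```

Calling `levenshtein(variant, template)` for each retained variant gives the edit distance of that variant from the template sequence.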
Figs. 21A - 21B depict the constraints for valid transitions between nucleotides. As physical processes, both chemical synthesis and enzymatic synthesis have constraints for valid transitions between nucleotides. A transition matrix with no self-transitions (Fig. 21A) and a transition matrix excluding specific transitions (Fig. 21B) are depicted. Based on whether certain transitions are permitted, there exists a fundamental limit for the maximum number of bits per nucleotide that is possible to store. This limit is equal to $\log_2 \lambda_{\max}(T)$, where $T$ is the transition matrix indicating valid transitions to write. The notation $\lambda_{\max}(T)$ indicates the maximum eigenvalue of $T$.
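As a numerical illustration, the following Python sketch computes the storage limit for a transition matrix, assuming the limit is the base-2 logarithm of the maximum eigenvalue of the matrix (the standard capacity of a constrained channel). The function names are assumptions, and power iteration is used here only to keep the sketch free of external libraries; it is suitable for the small non-negative matrices considered.

```python
import math

def max_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a non-negative square matrix by power iteration."""
    size = len(matrix)
    vector = [1.0] * size
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(size))
                   for i in range(size)]
        eigenvalue = max(product)
        if eigenvalue == 0:
            return 0.0
        vector = [value / eigenvalue for value in product]
    return eigenvalue

def capacity_bits_per_nucleotide(matrix):
    """Assumed limit: log2 of the maximum eigenvalue of the transition matrix."""
    return math.log2(max_eigenvalue(matrix))

# Transition matrix with no self-transitions: 1 where a transition is allowed.
NO_SELF_TRANSITIONS = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
```

For the matrix with no self-transitions the maximum eigenvalue is 3, giving log2(3), roughly 1.585 bits per nucleotide, which matches the intuition that each transition stores one trit. An unconstrained matrix of all ones has maximum eigenvalue 4 and gives 2 bits per nucleotide.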
Figs. 22A - 22B depict the placement and modulation of information into template sequences. Fig. 22A depicts the placement of information within a template sequence for both experimental and simulated storage systems. For experimental systems, template sequences contained 8 or 16 nucleotides each. For simulated systems, template sequences contained 38, 74, or 152 nucleotides each. Each nucleotide in a template sequence either stores 1 trit (blue), 1 bit (red), or is allocated for synchronization (orange). Fig. 22B depicts a modulation scheme to map 16 bits to a sequence of 16 nucleotides. As an intermediate step, 16 bits are converted to a mixture of 8 trits and 4 bits using map M1 (Table 9). Subsequently, given the prior placement of synchronization nucleotides, the 8 trits are converted to nucleotides (light blue arrows) using map M2 (Table 9). For the first 5 trits, the placement of nucleotides begins with the first synchronization nucleotide 'C', and occurs in right-to-left order. This initial ad-hoc placement ensures non-identical transitions between nucleotides, compared to a left-to-right order starting from an initiator nucleotide, which may conflict when transitioning to a synchronization nucleotide. The remaining 4 bits are converted to nucleotides (red arrows) via map M3 (Table 9) with the exception of the final bit, which is converted with an ad-hoc mapping (black arrow). Demodulation from nucleotides to bits reverses the above-mentioned steps using maps M1, M2, and M3 (Table 9). Demodulation assumes the existence of correctly placed synchronization nucleotides.
Figs. 23A - 23B depict the Markov model for the production of DNA strands. Fig. 23A depicts that a Markov model provides a statistical framework for the production of DNA strands^C created from a desired template sequence. At the k-th state, the Markov model specifies the process for writing the k-th nucleotide in the template sequence. An example is provided for the template sequence (AGCT). The process for writing the third nucleotide at position k = 3 could lead to several strand^C variants, each with a different probability of occurrence. The Markov model contains states which include a deletion error (missing strand^C nucleotide) with probability $P_{del}$ and an insertion error with probability $P_{ins}$. The probability of synthesis $P_{syn} = 1 - P_{ins} - P_{del}$ denotes the probability for the event of either a correct write, or a write error (mismatch or substituted strand^C nucleotide). Fig. 23B depicts that in the event of synthesis of a strand^C nucleotide, either a correct write occurs with probability $1 - P_{sub}$, or a write error (mismatch or substituted strand^C nucleotide) occurs with total probability $P_{sub}$. A specific substitution error occurs with a probability given by a function of (x, y) [equation image imgf000022_0001], which mathematically represents the probability for substitutions of different strand^C nucleotides.
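A Markov model of this kind can be simulated with a short Python sketch. Several details are assumptions made for illustration rather than taken from the disclosure: an insertion is modeled as one extra random nucleotide written while the model stays at the same template position, a substitution picks uniformly among the three other nucleotides, and the function and parameter names are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def write_strand(template, p_del, p_ins, p_sub, rng=None):
    """Simulate writing `template` with deletion, insertion, and substitution errors."""
    if not (0 <= p_del and 0 <= p_ins < 1 and p_del + p_ins <= 1 and 0 <= p_sub <= 1):
        raise ValueError("invalid probabilities")
    if rng is None:
        rng = random.Random()
    strand = []
    k = 0  # index of the template nucleotide currently being written
    while k < len(template):
        draw = rng.random()
        if draw < p_del:
            k += 1  # deletion: nothing is written for this position
        elif draw < p_del + p_ins:
            strand.append(rng.choice(NUCLEOTIDES))  # insertion (assumed uniform)
        else:
            intended = template[k]  # synthesis, with probability 1 - p_ins - p_del
            if rng.random() < p_sub:
                others = [n for n in NUCLEOTIDES if n != intended]
                strand.append(rng.choice(others))  # substitution (assumed uniform)
            else:
                strand.append(intended)  # correct write
            k += 1
    return "".join(strand)
```

Drawing many strands from `write_strand("AGCT", ...)` with small error probabilities yields a population of variants in which the exact template is the most common outcome and shorter strands reflect deletions.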
Figs. 24A - 24E depict reconstruction of a template sequence by MAP estimation. A template sequence may be successfully reconstructed from multiple DNA strands^C. Even based on a single received strand^C, MAP estimation achieves the correct localization of errors given a scaffold sequence containing synchronization nucleotides. An example of a template DNA sequence, associated scaffold sequence, and mathematical representation is given in (Fig. 24A). The sequence is synthesized, resulting in one DNA strand^C containing two missing nucleotides. The goal of MAP estimation is to reconstruct the template DNA sequence given only the received strand^C and scaffold. An alpha table and beta table are computed in (Fig. 24B) and (Fig. 24C) respectively, given only strand^C and scaffold. The entries of the alpha and beta tables represent alpha forward probabilities and beta backward probabilities, and are computed incrementally and efficiently based on dynamic programming recursions. These alpha and beta probabilities are necessary for the MAP estimation of each nucleotide in the template sequence as illustrated in (Fig. 24D) and (Fig. 24E). Specifically, an example of decoding the fourth nucleotide of the template sequence is provided in (Fig. 24D). This decoding involves determining the following probabilities: $\mathbb{P}(\mathtt{ATCGCT} \mid \mathtt{**CA*A**})$, $\mathbb{P}(\mathtt{ATCGCT} \mid \mathtt{**CT*A**})$, $\mathbb{P}(\mathtt{ATCGCT} \mid \mathtt{**CC*A**})$, and $\mathbb{P}(\mathtt{ATCGCT} \mid \mathtt{**CG*A**})$, each representing the fact that either an A, T, C, or G is possible for the fourth nucleotide respectively. The decomposition of the probability $\mathbb{P}(\mathtt{ATCGCT} \mid \mathtt{**CG*A**})$ into different cases is given in (Fig. 24E). The result of MAP estimation applied for all nucleotides reveals that a nearly correct reconstruction of the template sequence is possible even with one received DNA strand^C, and that errors may be localized to their proper positions within the sequence.
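The role of the forward (alpha) table in this kind of estimation can be illustrated with a simplified Python sketch. It is not the decoder of the disclosure: it models deletions and substitutions only (no insertions), uses only a forward table and re-evaluates it for each candidate instead of combining alpha and beta tables, treats each unknown template position ('*') as uniformly distributed over the four nucleotides, and uses hypothetical names and default probabilities.

```python
NUCLEOTIDES = "ACGT"

def emit(template_nt, read_nt, p_sub):
    """Probability that writing template_nt yields read_nt.

    '*' marks an unknown template nucleotide, marginalized under an assumed
    uniform prior, which gives 1/4 for any read nucleotide.
    """
    if template_nt == "*":
        return 0.25
    return 1.0 - p_sub if template_nt == read_nt else p_sub / 3.0

def likelihood(read, template, p_del=0.2, p_sub=0.05):
    """Forward table: alpha[i][j] is the probability that the first i template
    positions produced the first j nucleotides of the read."""
    n, m = len(template), len(read)
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            total = alpha[i - 1][j] * p_del  # template position i was deleted
            if j > 0:  # template position i wrote read nucleotide j
                total += (alpha[i - 1][j - 1] * (1.0 - p_del)
                          * emit(template[i - 1], read[j - 1], p_sub))
            alpha[i][j] = total
    return alpha[n][m]

def posterior_at(read, scaffold, position, p_del=0.2, p_sub=0.05):
    """Posterior over the nucleotide at `position` (0-based), uniform prior."""
    scores = {}
    for nt in NUCLEOTIDES:
        hypothesis = scaffold[:position] + nt + scaffold[position + 1:]
        scores[nt] = likelihood(read, hypothesis, p_del, p_sub)
    total = sum(scores.values())
    if total == 0:
        raise ValueError("read cannot be explained by this scaffold")
    return {nt: score / total for nt, score in scores.items()}
```

With the example above, `posterior_at("ATCGCT", "**C**A**", 3)` scores the four hypotheses for the fourth nucleotide, and the MAP estimate is the nucleotide with the largest posterior value.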
Figs. 25A - 25C depict the coded strand architecture for robust information storage in imperfectly synthesized DNA strands. Fig. 25A depicts that the message "Eureka!" was encoded and partitioned into four template sequences, E1-E4. Each sequence stores a 2-bit address and 14 bits of data, and these bits are mapped to a template sequence of 16 nucleotides, which includes four synchronization nucleotides (dark gray). Synthesis was performed with initiators tethered to beads, and sequencing was performed on the Illumina platform. Fig. 25B depicts retrieving information from E1-E4. Synthesized strands^R were sequenced using the Illumina sequencing-by-synthesis (SBS) platform and purified in silico based on raw length of 32-48 nucleotides (Methods). The decoding accuracy for each sequence is defined as the probability of 100% correct data retrieval for a given number of reads, estimated over 500 decoding trials. Each trial is based on a randomly drawn set of purified strand^C variants. A 90% decoding accuracy (gray band) is considered sufficient for robust data retrieval, and the accuracy could be further reinforced by other codec modules.
Fig. 25C depicts the decoding of E3. A set of 10 DNA strands^C is decoded as two sets of five strands^C. The decoder uses MAP estimation and a scaffold to determine the probability for each of the four nucleotides at every position. The decoded sequence is a probabilistic consensus of the reconstructed sequences from MAP estimation and successfully retrieves the data stored in E3.
Fig. 26 depicts the raw lengths for all and perfect strands for E1-E4. All synthesized strands of E1-E4 were sequenced with Illumina. Length distributions for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands are shown. Distributions are derived via kernel density estimation. The numbers of all strands and perfect strands for each template sequence are as follows: E1 {all: 119677, perfect: 21}; E2 {all: 106983, perfect: 3}; E3 {all: 106793, perfect: 3}; E4 {all: 146710, perfect: 19}.
Figs. 27A - 27B depict the synthesis error analysis for all strands and purified strands of E1-E4. All synthesized strands^R were sequenced with Illumina and transitions extracted to form strands^C. Each of these strands^C is aligned, by Needleman-Wunsch, to its respective template sequence. For each alignment, the fraction of strands with the indicated number of mismatches, insertions, and missing nucleotides is tabulated. The set of all strands is evaluated in (Fig. 27A), and the set of purified strands, obtained by filtering the length of the corresponding strands^R to between 32-48 bases assuming an extension length of 3 to 4 bases per template nucleotide, is evaluated in (Fig. 27B).
Figs. 28A - 28B depict the lengths, diversity, and edit distance for all and purified strands for E1-E4. All synthesized strands^R of E1-E4 were sequenced with Illumina and transitions extracted to form strands^C. Strands^C were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants. The number of sequencing reads at each length (number of strand^C nucleotides) is tabulated. Diversity is evaluated as the number of unique variants at each length, and the Levenshtein edit distance of each variant is computed with respect to its template sequence. These measurements are presented for all synthesized strands^C (Fig. 28A) or for a set of purified strands^C obtained by filtering the length of the corresponding strands^R to between 32-48 bases, assuming an extension length of 3 to 4 bases per template nucleotide (Fig. 28B).
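As a non-limiting illustration of the two operations named above, the following Python sketch collapses a raw strand into its run-length compressed form (extracting transitions) and computes the Levenshtein edit distance between a compressed strand and a template. The function names and the example sequences are hypothetical and are not taken from the figures.

```python
from itertools import groupby


def compress(raw_strand):
    """Collapse each homopolymer run to one nucleotide, keeping only transitions."""
    return "".join(nucleotide for nucleotide, _ in groupby(raw_strand))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match or substitute
            ))
        previous = current
    return previous[-1]


compressed = compress("AAACCGTTT")          # "ACGT"
distance = levenshtein(compressed, "AGT")   # one missing nucleotide
```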
Fig. 29 depicts the diversity of compressed synthesized strands for E0. Strands^C obtained for template sequence E0. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E0 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
Fig. 30 depicts the diversity of compressed synthesized strands for E1. Strands^C obtained for sequence E1. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E1 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
Fig. 31 depicts the diversity of compressed synthesized strands for E2. Strands^C obtained for sequence E2. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E2 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
Fig. 32 depicts the diversity of compressed synthesized strands for E3. Strands^C obtained for sequence E3. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E3 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
Fig. 33 depicts the diversity of compressed synthesized strands for E4. Strands^C obtained for sequence E4. Different strand variants are ranked in the vertical axis in order of the number of reads per variant. The strands are arranged on the horizontal axis in order of increasing length. In comparison to the E4 template sequence, most diverse compressed strands are missing nucleotides, although some strands may have insertions or mismatches (substitutions).
Figs. 34A - 34H depict the decoding curves for E1-E4 template sequences for "Eureka!". Results for the successful reconstruction of sequences E1-E4 from the in silico size-selected set of DNA strands^C. All decoding curves illustrate the probability of correct decoding of a sequence vs. the number of purified reads of synthesized DNA strands. For each datapoint on a curve, the probability of correct decoding is based on 500 decoding trials, each of which involves sampling a set of purified DNA strands according to the target number of total reads. In each decoding trial, the sampled set of DNA strands is filtered further based on the number of reads per strand (between 1 and 5 reads per strand). The 10 strands with the longest length are selected for reconstruction via MAP decoding and consensus. Decoding curves are presented for sequences E1-E4 in (Fig. 34A), (Fig. 34C), (Fig. 34E), and (Fig. 34G) respectively when applying the different filters based on reads per strand. The best decoding results from the filters are compiled for each datapoint to produce the "Best MAP Decoding" curve in (Fig. 34B), (Fig. 34D), (Fig. 34F), and (Fig. 34H). This curve is compared to baseline decoding with the two-step filter used for H01-H12, which outputs the longest DNA strand that also has the highest number of reads among strands of equal length. Taken together, these results show that decoding accuracy improves substantially when applying MAP decoding and consensus with 10 filtered strands compared to baseline decoding with one filtered strand.
Figs. 35A - 35C depict a roadmap for scaling DNA storage systems. Fig. 35A depicts the efficiency of storage for experimental and simulated systems. Experimental systems (black) include storing 12 bits in 8-nucleotide template sequences, and 16 bits in 16-nucleotide template sequences. Simulated maximum storage systems (white circles) include gigabyte scale, which stores 36 bits in 74-transition sequences, and petabyte scale, which stores 57 bits in 152-transition sequences. The number of bits stored per sequence depends on the amount of error-correction coding (ECC) that is applied. Reducing ECC increases the efficiency of storage. The upper bound theoretical limit represents a maximum efficiency of storage of ~1.58 bits per transition between non-identical nucleotides. The lower bound theoretical limit represents the minimum number of bits per template sequence that must be stored for addressing only. Fig. 35B depicts that flexible-write storage is enabled by a codec which harnesses diversely synthesized strands. The decoding pipeline supports robust data retrieval from synthesized strands with a significant percentage of errors. Inset: With ten strand^C variants, each with ~30% missing nucleotides, the correct decoded sequence can be reconstructed for both gigabyte- and petabyte-scale maximum storage capacities. Fig. 35C depicts a system architecture for storing information in enzymatically-synthesized DNA. A bitstream is partitioned into rows, each augmented with an address to delineate its order for reassembly. An ECC such as a Bose-Chaudhuri-Hocquenghem (BCH) code can be applied to each row, or an ECC such as a Reed-Solomon (RS) code can be applied across multiple rows, to protect data from errors. Modulation consists of mapping sequences of bits to template sequences, which include synchronization nucleotides. Enzymatic synthesis then produces multiple diverse strands^C per template sequence.
The resulting strands^C are used for sequence reconstruction based on MAP estimation and probabilistic consensus. Subsequently, the reconstructed sequence is demodulated into bits. Error-correction is applied to ensure data retrieval.
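As a non-limiting illustration of the storage efficiencies described for Fig. 35A, the following Python sketch computes the upper bound per transition and the bits stored per template nucleotide for the four systems named above. The upper bound follows from each transition selecting one of the three nucleotides that differ from the preceding nucleotide, giving log2(3) bits. The dictionary keys are labels chosen here for illustration; the bit and nucleotide counts are those stated in the figure descriptions.

```python
import math

# Each transition chooses one of three non-identical nucleotides.
UPPER_BOUND_BITS_PER_TRANSITION = math.log2(3)  # about 1.585

# (bits stored, template length in nucleotides/transitions)
SYSTEMS = {
    "experimental, 8 nt": (12, 8),
    "experimental, 16 nt": (16, 16),
    "simulated gigabyte scale": (36, 74),
    "simulated petabyte scale": (57, 152),
}


def bits_per_nucleotide(bits, nucleotides):
    """Storage efficiency of one template sequence."""
    return bits / nucleotides


efficiency = {name: bits_per_nucleotide(*spec) for name, spec in SYSTEMS.items()}
```

The gap between each efficiency and the upper bound reflects the addressing, synchronization nucleotides, and error-correction redundancy carried by each template sequence.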
Figs. 36A - 36F depict the estimated capacity in bits per template sequence with increased synthesis accuracy for simulated DNA storage systems. Tradeoffs between estimated capacity (bits stored per sequence) vs. synthesis accuracy. For template sequences with 38 nucleotides, (Fig. 36A) estimated capacity vs. synthesis accuracy measured in terms of the probability of deletions only (missing nucleotides) or (Fig. 36B) including additional 5% substitution and 2% insertion errors. For template sequences with 74 nucleotides, (Fig. 36C) estimated capacity vs. synthesis accuracy measured in terms of the probability of deletions only (missing nucleotides) or (Fig. 36D) including additional 5% substitution and 2% insertion errors. For sequences with 152 nucleotides, (Fig. 36E) estimated capacity vs. synthesis accuracy measured in terms of the probability of deletions only (missing nucleotides) or (Fig. 36F) including additional 5% substitution and 2% insertion errors. The estimated capacity decreases smoothly as synthesis accuracy decreases. The tradeoffs are non-linear. If more compressed strand variants are utilized for decoding, the estimated capacity increases.
Figs. 37A - 37F depict the waterfall decoding curves for simulated DNA storage systems. Simulation results for successfully decoding and retrieving information from multiple DNA strands synthesized per sequence. Decoding results are visualized as "waterfall curves", representing the probability of correct retrieval for varying levels of errors tolerated per strand. The boundary of error-tolerance for all displayed systems is between 25-30% per strand^C, including missing nucleotides (deletions), mismatches (substitutions), and insertion errors. This error tolerance is obtained for decoding with up to 10 diverse strands^C per sequence. (Fig. 37A) Decoding 23 bits of information stored in template sequences of 38 nucleotides, based on multiple strands^C containing only missing nucleotides and (Fig. 37B) with the inclusion of mismatches (substitutions) and insertion errors. (Fig. 37C) Decoding 36 bits of information stored in template sequences of 74 nucleotides, based on multiple strands^C containing only missing nucleotides and (Fig. 37D) with the inclusion of mismatches (substitutions) and insertion errors. (Fig. 37E) Decoding 57 bits of information stored in template sequences of 152 nucleotides, based on multiple strands^C containing only missing nucleotides and (Fig. 37F) with the inclusion of mismatches (substitutions) and insertion errors.
Figs. 38A - 38D depict the majority alignment of DNA strands per sequence. Simulation results for decoding sequences using the majority alignment algorithm. Template sequences have (Fig. 38A) 16, (Fig. 38B) 24, (Fig. 38C) 74, and (Fig. 38D) 152 nucleotides respectively. Each template sequence is randomly created per decoding trial. A total of 1000 decoding trials were simulated per datapoint. The production of DNA strands from a template sequence is simulated according to a Markov model with probability of deletion per nucleotide. Sequences are decoded from either 10, 100, or 1000 diverse strands^C. Majority alignment achieves an increase in decoding accuracy given more strands^C. However, the decoding accuracy reaches a theoretical limit. The error-tolerance saturates at approximately
Figure imgf000030_0001
Figs. 39A - 39B depict the system architecture of a codec for storing information in DNA. Fig. 39A depicts a high-level block diagram of a DNA storage system. Data is represented as bits of information which are encoded into a set of DNA sequences. De novo synthesis (e.g., enzymatic synthesis) of each sequence results in the creation of diverse DNA strands which can be stored at high volumetric density. For random-access retrieval of data, a subset of the DNA strands may be PCR-amplified and then sequenced (e.g., using Illumina or nanopore sequencing technologies). DNA sequencing results in several reads. All reads are clustered, filtered, processed in silico, and provided to a decoder for reconstruction. The decoder applies several steps to reconstruct the original DNA sequences, and to decode the original bits of information. Fig. 39B depicts a detailed block diagram of a codec for robust storage of digital information in DNA. The encoder first partitions payload data into rows of bits. Each row is prefixed with an address (turquoise) to delineate its order. To recover missing rows of data, an error-correction code (ECC) may be applied per block of rows, resulting in redundant rows of information (purple). Additionally, an ECC may be applied per row/sequence of data, resulting in redundant bits per row (light green). Each row of bits is modulated into a DNA sequence of nucleotides (blue) containing interspersed synchronization nucleotides (orange). Synthesis of each sequence results in diverse compressed strands which may contain nucleotide errors (red). The decoder fully or partially reconstructs DNA sequences using synchronization alignment and consensus algorithms. After demodulation of DNA sequences to rows/sequences of bits, the decoder may apply error-correction decoding per row/sequence to correct remaining bit errors (red). The decoder then orders all rows according to their addresses.
If any rows are missing, additional error-correction may be applied across rows using a block ECC. The final step of the decoder is to extract the original payload data from the ordered rows of bits. Overall, the encoding and decoding pipelines ensure the robust storage of data in DNA sequences.
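As a non-limiting illustration of the row partitioning and address-based reordering described for Fig. 39B, the following Python sketch splits a payload into addressed rows of bits and reassembles it from rows received in arbitrary order. The 2-bit address and 14 data bits per row follow the description of Fig. 25A; the 8-bit ASCII representation of the payload, the zero padding, and the function names are assumptions made for illustration. The ECC and modulation steps are omitted.

```python
import random


def partition(payload, data_bits=14, address_bits=2):
    """Split payload bytes into rows of bits, each prefixed with its address."""
    bits = "".join(f"{byte:08b}" for byte in payload)
    bits += "0" * (-len(bits) % data_bits)  # pad the final row (assumption)
    rows = [bits[i:i + data_bits] for i in range(0, len(bits), data_bits)]
    if len(rows) > 2 ** address_bits:
        raise ValueError("payload needs more rows than the address can index")
    return [f"{index:0{address_bits}b}" + row for index, row in enumerate(rows)]


def reassemble(rows, n_bytes, address_bits=2):
    """Order rows by address, strip the addresses, and recover the payload bytes."""
    ordered = sorted(rows, key=lambda row: int(row[:address_bits], 2))
    bits = "".join(row[address_bits:] for row in ordered)[: 8 * n_bytes]
    return int(bits, 2).to_bytes(n_bytes, "big")


rows = partition(b"Eureka!")   # four rows of 16 bits each
shuffled = rows[:]
random.shuffle(shuffled)       # rows may be retrieved in any order
recovered = reassemble(shuffled, n_bytes=7)
```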
Figs. 40A - 40E depict an array-format enzymatic synthesis platform. A prototype for enzymatic synthesis of DNA strands in a 2D array format. Fig. 40A depicts that the prototype is comprised of two main parts: a Mantis liquid handler, which has a single robotic arm that can be programmed to dispense one of six reagents at a time, and custom jigs, which were either laser cut (Epilog Legend 36EXT) or machined (gift from Formulatrix) to hold the glass slide acting as a solid support substrate for the DNA. In the reagent banks, there are four nucleotide triphosphates and one enzymatic mix. For attaching initiators and detaching synthesized strands from the slide, other reagents can be dispensed by changing the dispense head. Fig. 40B depicts that the enzymatic mix is dispensed according to programmed coordinates on the treated slide, resulting in a 2D grid of features. Fig. 40C depicts that the Mantis places the enzymatic mix, according to programmed coordinates, in serial to all features on the slide. Fig. 40D depicts that for each synthesis cycle, there are four dispense cycles, one for each of the four nucleotide triphosphates used. The specific nucleotide triphosphate is dispensed only to the desired features (bold). Fig. 40E depicts that the Mantis has a single dispenser and places the nucleotide triphosphate, according to programmed coordinates, in serial to the desired features on the slide.
Fig. 41 depicts the raw lengths for all and perfect raw strands for S01-S03. Length distribution for the set of all (gray with shading) and perfectly (dashed line) synthesized and sequenced raw strands. Distributions are derived via kernel density estimation. As perfect reads are longer, on average, size selection will increase the yield of perfectly synthesized strands. The numbers of all strands and perfect strands for each template sequence are as follows: S01 rep 1 {all: 192989, perfect: 1}, S01 rep 2 {all: 220921, perfect: 684}, S01 rep 3 {all: 153002, perfect: 286}, S02 rep 1 {all: 277897, perfect: 3545}, S02 rep 2 {all: 385615, perfect: 4889}, S02 rep 3 {all: 176680, perfect: 248}, S03 rep 1 {all: 185327, perfect: 464}, S03 rep 2 {all: 169000, perfect: 273}, S03 rep 3 {all: 209018, perfect: 898}. The S01 rep 1 distribution for perfect strands is not visible due to the low number of perfect strands.
Figs. 42A - 42B depict the synthesis error analysis for all and purified strands for S01-S03. All synthesized strands^R were sequenced with Illumina and transitions extracted to form strands^C. Each of these strands is aligned, by Needleman-Wunsch, to its respective template sequence. For each alignment, the fraction of strands^C with the indicated number of mismatches, insertions, and missing nucleotides is tabulated. The set of all strands is evaluated in (Fig. 42A), and the set of purified strands, obtained by filtering the length of the corresponding strands^R to between 39-52 bases assuming an extension length of 3 to 4 bases per template nucleotide, is evaluated in (Fig. 42B).
Figs. 43A - 43B depict the lengths, diversity, and edit distance for all and purified strands for S01-S03. All synthesized strands^R of S01-S03 were sequenced with Illumina and transitions extracted. Run-length compressed strands (strands^C) were filtered for read counts of at least 3 to remove aberrantly synthesized or sequenced variants. The number of sequencing reads at each length (number of strand^C nucleotides) is tabulated. Diversity is evaluated as the number of unique variants at each length, and the Levenshtein edit distance of each variant is computed with respect to its template sequence. These measurements are presented for all synthesized strands^C (Fig. 43A) or for a set of purified strands^C obtained by filtering the length of the corresponding strands^R to between 39-52 bases, assuming an extension length of 3 to 4 bases per template nucleotide (Fig. 43B).
Figs. 44A - 44B depict the reagent cost projections for phosphoramidite chemistry and enzymatic synthesis. Fig. 44A depicts the reagent cost per cycle projections for phosphoramidite chemistry (gray line) and enzymatic synthesis (black lines) based on the estimated number of features (n = 1,000,000) and density of 71,000 features per square cm (D = 71,000) of the Agilent SurePrint G3 system. Commercial pricing of enzymes ($ = 61.3, solid black line) and bulk pricing with enzyme recycling ($ = 4.38, dashed black line) are indicated. The minimum feature size is 2.37 nm, which corresponds to the diameter of double-stranded DNA, and the maximum feature size is 37.5 microns, which corresponds to having no gap between features for the given density. Current feature sizes are estimated to be 15-38 microns based on dispense volumes between 1-10 picoliters. Fig. 44B depicts the reagent cost per megabyte projections for phosphoramidite chemistry (gray line) and enzymatic synthesis (black lines) based on maximally packing features into a given 14 cm^2 surface area (A = 14). Commercial pricing of enzymes ($ = 61.3, solid black line) and bulk pricing with enzyme recycling ($ = 4.38, dashed black line) are indicated. The minimum feature size is 2.37 nm, which corresponds to the diameter of double-stranded DNA. The price per megabyte for 1 million features with current feature sizes of 15 (gray circle) or 38 microns (gray diamond) is indicated.
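As a non-limiting illustration of the geometry stated above, the following Python sketch reproduces the maximum feature size and the surface area from the feature count and density. It assumes square features tiled on a square grid with no gap between features; that layout and the variable names are assumptions made for illustration.

```python
import math

FEATURES = 1_000_000        # n, estimated number of features
DENSITY_PER_CM2 = 71_000    # D, features per square cm
MICRONS_PER_CM = 10_000

# With no gap between features, each feature occupies 1/D square cm,
# so its side length is 1/sqrt(D) cm.
max_feature_size_um = MICRONS_PER_CM / math.sqrt(DENSITY_PER_CM2)

# Area needed to hold n features at density D.
surface_area_cm2 = FEATURES / DENSITY_PER_CM2
```

Under these assumptions the side length is approximately 37.5 microns and the area is approximately 14 square cm, consistent with the values given for Figs. 44A and 44B.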
DETAILED DESCRIPTION
Embodiments of the present disclosure are directed to methods of decoding a nucleotide sequence. The nucleotide sequence encodes one or more values, or a series of values, corresponding to a format of information. Each value or value point within the nucleotide sequence is represented as a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence. The steps of decoding include determining the nucleotide sequence, identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and assigning a predetermined value to the identified transition or boundary or edge to create the value that was originally encoded in the nucleotide sequence corresponding to the format of information. The step of determining the nucleotide sequence includes sequencing according to methods known to one skilled in the art. In one embodiment, sequencing includes nanopore sequencing. When multiple values or a series of values are encoded in the nucleotide sequence, the values are represented by a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of the nucleotide sequence, which can be identified. Each identified transition or boundary or edge is assigned a predetermined value to create the series of values encoded in the nucleotide sequence corresponding to the format of information. The value corresponding to the format of information can be obtained from many input sources, including but not limited to analog, digital, optical, visible or non-visible wavelengths, chemical, or physical input sources. The disclosure contemplates digital values. Digital values can include multiple digits according to a specific need.
For example, the digital values include two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more digits to accommodate a certain need or application. In some embodiments, the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more digits. In certain embodiments, the series of digital values comprises two, three or more different digital values. Each digital value of the series of digital values represents one of the two, three or more different digital values. The disclosure contemplates natural nucleotides or nonnatural nucleotides for information encoding, storage and decoding. The nucleotides can be RNA or DNA. For example, the nucleotides can include adenine, cytosine, guanine, thymine and uridine. Any format of information can be converted into corresponding values and encoded in the nucleotide sequence. For example, a format of information includes but is not limited to text, image, video or audio format, sensor data, and combinations thereof.
The present disclosure contemplates the use of nucleotide transitions for information encoding and decoding. The transition can be from a certain nucleotide to another different or nonidentical nucleotide. The transition can also be from a certain nucleotide or nucleotide homopolymer to another different or nonidentical nucleotide or nucleotide homopolymer. In certain embodiments, the transition from one nucleotide homopolymer to a different or nonidentical nucleotide homopolymer is a single transition or boundary or edge. In one embodiment, each nucleotide transition or boundary or edge is assigned a predetermined digital value. In one embodiment, the series of digital values includes a corresponding barcode. The disclosed method further contemplates decoding a plurality of nucleotide sequences. Each member of the plurality encodes an identical value or series of identical values corresponding to the format of information. The nucleotide sequence or a plurality of nucleotide sequences can be attached to a substrate or solid support.
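As a non-limiting illustration of transition-based decoding, the following Python sketch reads a sequenced strand, identifies each transition between nonidentical nucleotides, and assigns a ternary digit to it. The assignment rule used here is hypothetical: for a given preceding nucleotide, the three other nucleotides are taken in the order A, C, G, T and numbered 0, 1, 2. The actual predetermined values may follow a different conversion scheme.

```python
NUCLEOTIDES = "ACGT"


def transition_digit(previous, current):
    """Hypothetical rule: index of `current` among the nucleotides differing from `previous`."""
    others = [n for n in NUCLEOTIDES if n != previous]
    return others.index(current)


def decode_transitions(read):
    """Return one ternary digit per transition; homopolymer length is ignored."""
    digits = []
    for previous, current in zip(read, read[1:]):
        if previous != current:
            digits.append(transition_digit(previous, current))
    return digits


# A homopolymer run of any length counts as a single nucleotide, so both
# reads below contain the same three transitions.
assert decode_transitions("AAACCCGGT") == decode_transitions("ACGT")
```

Because only transitions carry information under this rule, variation in homopolymer length does not change the decoded digits.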
Embodiments of the present disclosure are directed to a method of decoding a nucleotide sequence encoding a series of digital values corresponding to a format of information. The nucleotide sequence can be determined by sequencing methods known to one skilled in the art to identify nucleotide homopolymers. Each homopolymer is assigned one or more of the nucleotides based on a predetermined predicted homopolymer length of the nucleotide produced using enzymatic synthesis, and a particular digital value is assigned for each of the one or more nucleotides. The predicted homopolymer length can be determined from empirical observation. The predicted homopolymer length is a median, a mean, or a mode based on data collected from empirical observation.
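As a non-limiting illustration of assigning nucleotides to homopolymers by predicted length, the following Python sketch divides each observed homopolymer run by a predicted per-nucleotide extension length and rounds to the nearest whole number, with a minimum of one nucleotide per run. The rounding rule, the function name, and the predicted length of 3 bases for every nucleotide are assumptions made for illustration; in practice the predicted length would be a median, mean, or mode obtained from empirical observation.

```python
from itertools import groupby


def assign_nucleotides(read, predicted_length):
    """Estimate how many template nucleotides each homopolymer run represents."""
    template = []
    for nucleotide, run in groupby(read):
        run_length = len(list(run))
        copies = max(1, round(run_length / predicted_length[nucleotide]))
        template.append(nucleotide * copies)
    return "".join(template)


# Hypothetical predicted homopolymer lengths (bases added per template nucleotide).
PREDICTED = {"A": 3, "C": 3, "G": 3, "T": 3}

estimate = assign_nucleotides("AAACCCCCCGGT", PREDICTED)
```

In this example the run of six C nucleotides is assigned two template nucleotides, while every other run is assigned one.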
Embodiments of the present disclosure are directed to a method of sequencing and decoding a plurality of nucleotide sequences representing a format of information, wherein each nucleotide sequence encodes a portion of the format of information and wherein each portion of the format of information has more than two corresponding nucleotide sequences. The nucleotide sequences are determined, and the series of digital values for the sequences within a first portion of the plurality of nucleotide sequences are decoded and translated into the portion of the format of information. The sequencing and decoding are continued in series for additional portions into series of digital values, and the series of digital values are translated into the portions of the format of information until the entire format of information is achieved.
Embodiments of the present disclosure further provide a method of encoding a series of digital values corresponding to a format of information into a nucleotide sequence. In one embodiment, the method includes, for each digital value, assigning a corresponding transition from a nucleotide to a different or nonidentical nucleotide to generate the nucleotide sequence, synthesizing the nucleotide sequence, and optionally storing the nucleotide sequence. In some embodiments, the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more digits. In other embodiments, the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more digits.
Embodiments of the present disclosure also provide a method for high-throughput decoding of a format of information encoded in a plurality of nucleotide sequences or a plurality of DNA strands. In one embodiment, the plurality of nucleotide sequences or DNA strands are separated (packetized) into many packets. In one embodiment, each packet includes a plurality of DNA strands. In one embodiment, each packet includes a plurality of identical DNA strands. In certain embodiments, each nucleotide sequence or DNA strand can include a unique identifier (such as a barcode sequence) corresponding to the specific packet of information. In an exemplary embodiment, at least one of the plurality of nucleotide sequences is sequenced using a selective sequencer. In one embodiment, each packet includes a plurality of identical nucleotide sequences (each as an independent DNA strand); thus, sequencing one strand in that packet is sufficient since the remaining strands are considered redundant. In another embodiment, each packet includes a plurality of nearly identical nucleotide sequences (each as an independent DNA strand), which differ due to encoding errors. In this case, an algorithm is designed to sample a predetermined number of nucleotide sequences with redundant identifiers, which leads to decoding of the format of information. For each packet, the algorithm dictates sequencing and decoding more than one strand with a specific identifier until a certain confidence of correctness is reached, without requiring sequencing of all the strands with the same/redundant identifier. The sequence with its unique identifier is stored. In this manner, redundant sequencing of the same nucleotide sequence is prevented using the selective sequencer. The selective sequencer is a sequencing platform that can prevent or halt redundant sequencing of the nucleotide sequences based on the unique identifier that is associated with the nucleotide sequence.
In one embodiment, the selective sequencer is a nanopore sequencer that includes the selective functionality. Embodiments of the disclosure relate to optimizing packet information management to improve data accuracy and increase the content loading speed, which can drive faster internet connections for many types of utilities including cellphones. In one embodiment, the information stored in DNA is packetized (separated) into units of DNA strands. In another embodiment, each packet can contain multiple copies of representative DNA strands. In decoding or retrieving the stored information, it would be more efficient to sequence one or a few representative DNA strands for each packet. The initial results and simulations shown in Fig. 9D indicated that sequencing time and cost can be reduced by at least 2 fold, which would be a dramatic benefit when scaled to very large datasets. Embodiments of the disclosure are directed to the use of the selective sequencer to optimize packet information management. According to one embodiment, the selective sequencer has a first feature which can generate DNA sequences on the fly. This is an improvement over the current state of the art sequencer (Illumina being an exemplary case), which must fully sequence the DNA strand that was deposited on the sequencer before the sequence data can be used for further decoding, retrieval or recovery. In contrast, the Oxford Nanopore sequencer allows each DNA strand to be sequenced and decoded independently. This asynchronous sequencing allows processing and decoding each packet on the fly. According to another embodiment, the selective sequencer has a second feature such that after a packet is sequenced and decoded, the sequencer moves on to sequence only the strands of the remaining unsequenced packets. In one embodiment, the sequencer is able to physically prevent further redundant sequencing of copies of DNA strands of the decoded packets.
In one embodiment, a unique identifier such as a barcode, or header index is included in the DNA strands which signals the sequencer whether the strand has been decoded so that the sequencer can make a decision of whether to block continued sequencing.
Oxford Nanopore's nanopore sequencing platform has the first feature, and there has been a proof-of-concept demonstration for the second feature for sequencing genomes (DNA strands of biological origin, not of synthetic origin). This platform performs the second feature by physically kicking the DNA strand out of the pore after reading just a fraction of the DNA strand. Currently, nanopore sequencing is artificially slowed down to obtain high accuracy reads because it is highly error-prone. Embodiments of the disclosure are thus directed to interspersing the unique identifier throughout the DNA strand to improve accuracy of sequencing using nanopore sequencing. Theoretically, the sequencing rate of nanopore sequencing can increase more than 20 fold, and at this rate, the error-rate will likely be even higher.
The sequence information can be stored in a suitable medium including computer memory. The stored sequence information can be further decoded into digital values. Any unique identifier can be used including a synthetic sequence or barcode sequence. The synthetic sequence or barcode sequence is located at the 3' end, the 5' end of the nucleotide sequence, or is interspersed within the nucleotide sequence. A plurality of nucleotide sequences can be labeled with a plurality of unique identifiers. The method can further include sequencing a predetermined number of nucleotide sequences; assembling the packet of information; and analyzing the assembled information to determine if the information is correctly decoded. In one embodiment, the method further includes permitting sequencing of any nucleotide sequences that were not correctly decoded. The assembled information can be analyzed using a decoding algorithm.
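The read, assemble and verify steps above can be sketched as follows. This is a minimal illustration that assumes equal-length reads, a column-wise majority vote as the assembly step, and a CRC-32 checksum as the correctness check. None of these choices is specified by the disclosure, and the function names are hypothetical.

```python
from collections import Counter
import zlib


def consensus(reads):
    """Column-wise majority vote over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))


def decode_packet(reads, expected_crc, n_required=3):
    """Assemble a packet from a predetermined number of reads and verify it."""
    if len(reads) < n_required:
        return None, "need_more_reads"
    payload = consensus(reads[:n_required])
    if zlib.crc32(payload.encode("ascii")) == expected_crc:
        return payload, "decoded"
    # Verification failed: this packet remains eligible for further sequencing.
    return None, "resequence"


truth = "ACGTACGT"
crc = zlib.crc32(truth.encode("ascii"))
good = decode_packet(["ACGTACGT", "ACGAACGT", "ACGTACGT"], crc)
bad = decode_packet(["ACGAACGT", "ACGAACGT", "ACGTACGT"], crc)
```

In the first call a single erroneous read is outvoted. In the second call the error is in the majority, the checksum fails, and the packet is flagged for additional sequencing.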
The present disclosure contemplates encoding a format of information into nucleotide sequences. According to certain embodiments, a format of information is first converted to a binary sequence, such as zeros "0s" and ones "1s", and then to a ternary sequence, such as zeros "0s", ones "1s", and twos "2s", although any number can be used. Each digit of the ternary sequence corresponds to a transition of different or non-identical nucleotides according to a conversion scheme. In this manner, the ternary sequence is further converted to a corresponding oligonucleotide sequence. Figs. 8B-8C and Fig. 9A provide an exemplary embodiment of such a conversion scheme. The oligonucleotide sequence is synthesized and contains the encoded format of information. Synthesis can be carried out according to methods known to one skilled in the art. Embodiments of the disclosure are directed to enzymatic synthesis of oligonucleotides. In one embodiment, a template-independent DNA polymerase, such as a terminal deoxynucleotidyl transferase (TdT), is used. According to one embodiment, an initiator oligonucleotide (a primer/an initiator) immobilized to a solid support is sequentially contacted by a reaction mixture that comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations. The TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleotide of the initiator oligonucleotide, and the apyrase degrades excessive nucleotide triphosphates to inactive diphosphates and monophosphates. In addition to apyrase, any enzymatic, chemical or physical methods or reagents can be used to control the length of the nucleotide extension/polymerization. In each contact, one or more desired/selected nucleotides is added to the extending oligonucleotide chain until the corresponding oligonucleotide sequence is formed.
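A minimal Python sketch of a binary-to-ternary-to-nucleotide conversion is given below for illustration. The actual conversion scheme is the one shown in the figures. The rule used here is an assumption: each ternary digit selects one of the three nucleotides, taken in the fixed order A, C, G, T, that differ from the previous nucleotide. The starting nucleotide "A" (for example, the 3' terminal base of the initiator) and the function names are also hypothetical.

```python
def text_to_trits(text):
    """ASCII text -> binary string -> integer -> list of base-3 digits (most significant first)."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("ascii"))
    value = int(bits, 2)
    trits = []
    while value:
        value, remainder = divmod(value, 3)
        trits.append(remainder)
    return trits[::-1] or [0]


def trits_to_dna(trits, start="A"):
    """Map each ternary digit to a transition to a nucleotide different from the previous one."""
    previous, out = start, []
    for trit in trits:
        # Assumed rule: the three candidates are the nucleotides other than the previous one.
        choices = [n for n in "ACGT" if n != previous]
        previous = choices[trit]
        out.append(previous)
    return "".join(out)


trits = text_to_trits("Hi")
dna = trits_to_dna(trits)
```

By construction no two adjacent nucleotides in `dna` are identical, so every digit is carried by a transition rather than by an absolute position in the strand.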
In some embodiments, the nucleotide triphosphate includes dATP, dTTP, dCTP, dGTP, and dUTP. When apyrase is used, the synthesis activity is modulated by the ratio of the amount of TdT to the amount of apyrase. Further, divalent cations comprising magnesium and cobalt, and additives comprising glycerol, sucrose, PEG8000, betaine, DMSA, Triton-X100 and Tween20 can also modulate the enzymatic reaction. Since each bit represents a transition between different or non-identical nucleotides, the information can be accurately encoded into oligonucleotide sequences independent of the length of each nucleotide extension/polymerization. The disclosure provides that during each round of nucleotide extension/polymerization, one type of selected nucleotide triphosphate is added. In one embodiment, the excessive nucleotide triphosphate is inactivated by apyrase. This inactivation allows for multiple rounds of nucleotide polymerization that each adds a different nucleotide to the initiator or growing polynucleotide chain.
Embodiments of the present disclosure are directed to a method of decoding a format of information from a synthesized oligonucleotide sequence encoding bit sequences of the format of information. In one embodiment, the synthesized oligonucleotide sequence containing the encoded information can be amplified. The amplified oligonucleotide sequence is sequenced and the sequence can be converted to bit sequences according to the encoding scheme wherein each bit represents a transition between different or non-identical nucleotides. The bit sequences can be converted back to the format of information. In certain embodiments, the oligonucleotide sequence is ligated to a universal adaptor before amplification.
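For illustration, decoding by transitions can be sketched in Python as follows. The sketch assumes a hypothetical rule in which a transition from the previous nucleotide to a new nucleotide encodes the index (0, 1 or 2) of the new nucleotide among the three nucleotides, in the order A, C, G, T, that differ from the previous one. It also assumes a starting nucleotide of "A" and a read that covers only the synthesized portion. Homopolymer runs are collapsed first, so the length of each extension does not affect the result.

```python
from itertools import groupby


def collapse_runs(seq):
    """Collapse each homopolymer run to a single nucleotide, e.g. 'TTGGG' -> 'TG'."""
    return "".join(base for base, _ in groupby(seq))


def dna_to_trits(seq, start="A"):
    """Recover ternary digits from nucleotide transitions (assumed rule, see lead-in)."""
    previous, trits = start, []
    for base in collapse_runs(seq):
        if base == previous:
            continue  # treated as a continuation of the starting run, not a transition
        choices = [n for n in "ACGT" if n != previous]
        trits.append(choices.index(base))
        previous = base
    return trits


def trits_to_text(trits):
    """Base-3 digits -> integer -> bytes -> ASCII text."""
    value = 0
    for trit in trits:
        value = value * 3 + trit
    n_bytes = max(1, (value.bit_length() + 7) // 8)
    return value.to_bytes(n_bytes, "big").decode("ascii")


read = "TTGGGCGAATCTTTA"  # a read with variable-length homopolymer runs
decoded = trits_to_text(dna_to_trits(read))
```

A limitation of this toy integer conversion is that leading zero bytes of the original data would be lost; a practical scheme would carry an explicit length or padding.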
Embodiments of the present disclosure are directed to a method of storing information using nucleotides. In some embodiments, a format of information is first converted into a sequence of binary ASCII bits, then converted into a ternary sequence, which is further converted into a corresponding oligonucleotide sequence such that one bit of the ternary sequence represents a transition between different or non-identical nucleotides. In one embodiment, the corresponding oligonucleotide sequence is synthesized by the following steps: (a) providing a reaction mixture to an initiator oligonucleotide immobilized to a solid support wherein the reaction mixture comprises an amount of terminal deoxynucleotide transferase (TdT), an amount of apyrase, one or more selected nucleotide triphosphates, and divalent cations, wherein the TdT adds one or more of the selected nucleotide triphosphates to the 3' terminal nucleotide of the initiator oligonucleotide, and wherein the apyrase degrades excessive nucleotide triphosphates to inactive diphosphates and monophosphates, and (b) repeating step (a) until the corresponding oligonucleotide sequence is formed, and storing the synthesized corresponding oligonucleotide sequence.
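Steps (a) and (b) can be sketched as a simple simulation. The sketch assumes that every cycle supplies one selected nucleotide triphosphate and that TdT adds between one and `max_run` copies before the apyrase depletes it. The uniform random extension length is an illustrative assumption and is not a measured distribution.

```python
import random
from itertools import groupby


def simulate_synthesis(target, max_run=4, seed=0):
    """Simulate repeated step (a): one selected dNTP per cycle, variable extension length."""
    rng = random.Random(seed)
    strand = []
    for nucleotide in target:  # step (b): repeat until the target is formed
        n_added = rng.randint(1, max_run)  # assumption: at least one addition per cycle
        strand.extend(nucleotide * n_added)
    return "".join(strand)


def collapse_runs(seq):
    """Collapse each homopolymer run to a single nucleotide."""
    return "".join(base for base, _ in groupby(seq))


target = "TGCGATCTA"  # transitions only; adjacent nucleotides differ
synthesized = simulate_synthesis(target)
```

Whatever the individual extension lengths, `collapse_runs(synthesized)` returns `target`, which is why information carried by transitions is independent of the length of each extension.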
In certain embodiments, the initiator oligonucleotides are immobilized on beads and pre-mixed with reagents that include TdT, apyrase and reaction buffer. The initiator oligonucleotides can also be immobilized on the surface of a solid support such as beads or on the surface of a fluidic channel. Certain embodiments of the disclosure are directed to an initiator that is attached by a cleavable moiety. This mixture is sequentially contacted with one type of the desired nucleotide triphosphates (dNTPs). The ratio of the amount of TdT to the amount of apyrase in the reaction reagents modulates the enzymatic synthesis. In some embodiments, the desired or selected nucleotide is a natural nucleotide or any nucleotide analog known to one skilled in the art.
The present disclosure provides that a different condition can be used for each of the different dNTP types. This is important as the kinetics of the enzyme may be different for different dNTPs. Thus, to obtain optimal results, different conditions, such as type and concentration of divalent ions, may need to be used for different dNTPs. The reaction reagent can include a buffer comprising a monovalent salt, a divalent salt, a buffering agent, and a reducing agent at a suitable pH and temperature. The selected concentration of reaction reagents is determined by the selected nucleotide triphosphate present in the reaction reagent. In some embodiments, a washing step is included between each round of enzymatic synthesis.
The present disclosure provides methods of enzymatic oligonucleotide synthesis which enable rapid and high-accuracy synthesis of custom DNA sequences by the template-independent DNA polymerase terminal deoxynucleotidyl transferase (TdT). The methods according to the present disclosure can be used for synthesis of cheaper, more accurate and longer custom DNA sequences for various biochemical, biomedical, or biosynthetic applications. Furthermore, given the potential for high-speed DNA synthesis, the methods according to the present disclosure can facilitate the use of DNA as an information storage medium. In this case, a solid-phase synthesis device can be used to record digital information in DNA molecules.
The method according to the disclosure further comprises releasing the polynucleotide after the desired sequence of nucleotides has been added to the 3' end of the polynucleotide. The method according to the disclosure further comprises releasing the polynucleotide using an enzyme, a chemical, light, heat or other suitable method or reagent. The method according to the disclosure further comprises releasing the polynucleotide, collecting the polynucleotide, amplifying the polynucleotide and sequencing the polynucleotide.
The present disclosure contemplates the use of a nucleotide triphosphate inactivating enzyme. In one embodiment, the nucleotide triphosphate inactivating enzyme is an apyrase. In one embodiment, the nucleotide triphosphate inactivating enzyme is a nucleotide triphosphate degrading enzyme that degrades nucleotide triphosphates at a rate slower than the rate of addition of nucleotides by the error prone or template independent DNA polymerase. In certain embodiments, the nucleotide triphosphate inactivating enzyme is a nucleotide triphosphate degrading enzyme present at a concentration that degrades nucleotide triphosphates at a rate slower than the rate of addition of nucleotides by the present concentration of the error prone or template independent DNA polymerase. In some embodiments, the nucleotide triphosphate inactivating enzyme comprises ATP diphosphohydrolase, dNTP pyrophosphatases, dNTPases, and phosphatases.
In one embodiment, the concentration of nucleotide triphosphate inactivating enzyme is modulated to control addition of one or more nucleotides. In one embodiment, the nucleotide triphosphate inactivating enzyme renders free nucleotide triphosphates inactive. In one embodiment, the nucleotide inactivating enzyme renders free nucleotide triphosphates inactive by degradation. In another embodiment, the nucleotide inactivating enzyme renders free nucleotide triphosphates inactive by polymerizing them with each other. In certain embodiments, the reaction conditions present a competing reaction between addition of free nucleotide triphosphates to the initiator sequence and degradation of free nucleotide triphosphates.
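The competition between addition and degradation can be illustrated with a deliberately simplified first-order model. It is assumed here that the free nucleotide triphosphate decays as d[N]/dt = -k_deg [N] and that each strand is extended at a rate k_add [N], with consumption by the polymerase neglected. Under those assumptions the expected number of additions per strand is k_add [N]0 / k_deg. The rate constants are hypothetical and in arbitrary units.

```python
import math


def expected_additions(k_add, k_deg, n0):
    """Closed form of the integral of k_add * n0 * exp(-k_deg * t) from 0 to infinity."""
    return k_add * n0 / k_deg


def integrate_additions(k_add, k_deg, n0, t_end=50.0, dt=1e-3):
    """Forward-Euler integration of the same model, as a numerical check."""
    n, added = n0, 0.0
    for _ in range(int(t_end / dt)):
        added += k_add * n * dt  # extension by the polymerase
        n -= k_deg * n * dt      # degradation by the inactivating enzyme
    return added


closed_form = expected_additions(k_add=2.0, k_deg=1.0, n0=1.5)
numeric = integrate_additions(k_add=2.0, k_deg=1.0, n0=1.5)
doubled_degradation = expected_additions(k_add=2.0, k_deg=2.0, n0=1.5)
```

In this model, doubling the degradation rate (for example, by doubling the amount of inactivating enzyme) halves the expected extension, which is consistent with the ratio of polymerase to inactivating enzyme modulating the synthesis.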
Polymerases, including without limitation error-prone or template-dependent polymerases, modified or otherwise, can be used to create nucleotide polymers having a random or known or desired sequence of nucleotides. Template-independent polymerases, whether modified or otherwise, can be used to create the nucleic acids de novo. Ordinary nucleotides are used, such as A, T/U, C or G. Nucleotides may be used which lack chain terminating moieties. A template independent polymerase may be used to make the nucleic acid sequence. Such template independent polymerase may be error-prone which may lead to the addition of more than one nucleotide resulting in a homopolymer.
According to some embodiments, oligonucleotide sequences or polynucleotide sequences are synthesized using an error prone polymerase, such as template independent error prone polymerase, and common or natural nucleic acids, which may be unmodified. Initiator sequences or primers are attached to a substrate, such as a silicon dioxide substrate, at various locations whether known, such as in an addressable array, or random. Reagents including at least a selected nucleotide, a template independent polymerase and other reagents required for enzymatic activity of the polymerase are applied at one or more locations of the substrate where the initiator sequences are located and under conditions where the polymerase adds one or more than one or a plurality of the nucleotide to the initiator sequence to extend the initiator sequence. The nucleotides ("dNTPs") may be applied or flow in periodic applications. Nucleotides with blocking groups or reversible terminators can be used with the dNTPs under reaction conditions that are sufficient to limit or reduce the probability of enzymatic addition of the dNTP to one dNTP, i.e. one dNTP is added using the selected reaction conditions taking into consideration the reaction kinetics.
The present disclosure provides methods of enzymatic nucleotide synthesis that can be used with a flow cell or other channel. For example, a microfluidic channel or microfluidic channels having an input and an output can be used to deliver reaction fluids including reagents, such as a polymerase, a nucleotide and other appropriate reagents and washes to particular locations on a substrate within the flow cell, such as within a microfluidic channel. One of skill will recognize that reaction conditions will be based on dimensions of the substrate reaction region, reagents, concentrations, reaction temperature, and the structures used to create and deliver the reagents and washes. According to certain aspects, pH and other reactants and reaction conditions can be optimized for the use of TdT to add a dNTP to an existing nucleotide or oligonucleotide in a template independent manner. For example, Ashley et al., Virology 77, 367-375 (1977), hereby incorporated by reference in its entirety, identifies certain reagents and reaction conditions for dNTP addition, such as initiator size, divalent cation and pH. TdT was reported to be active over a wide pH range with an optimal pH of 6.85. Methods of providing or delivering dNTP, rNTP or rNDP are useful in making nucleic acids. Release of a lipase or other membrane-lytic enzyme from pH-sensitive viral particles inside dNTP filled-liposomes is described in J Clin Microbiol, May 1988; 26(5): 804-807. Photo-caged rNTPs or dNTPs from which NTPs can be released, typically nitrobenzyl derivatives sensitive to 350 nm light, are commercially available from Life Technologies. Rhodopsin or bacterio-opsin triggered signal transduction resulting in vesicular or other secretion of nucleotides is known in the art. With these methods for delivering dNTPs, the nucleotides should be removed or sequestered between the first primer-polymerase encountered and any downstream.
Terms and symbols of nucleic acid chemistry, biochemistry, genetics, and molecular biology used herein follow those of standard treatises and texts in the field, e.g., Kornberg and Baker, DNA Replication, Second Edition (W.H. Freeman, New York, 1992); Lehninger, Biochemistry, Second Edition (Worth Publishers, New York, 1975); Strachan and Read, Human Molecular Genetics, Second Edition (Wiley-Liss, New York, 1999); Eckstein, editor, Oligonucleotides and Analogs: A Practical Approach (Oxford University Press, New York, 1991); Gait, editor, Oligonucleotide Synthesis: A Practical Approach (IRL Press, Oxford, 1984); and the like.
Nucleic Acids and Nucleotides
As used herein, the terms "nucleic acid molecule," "nucleic acid sequence," "nucleic acid fragment" and "oligomer" are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides that may have various lengths, including either deoxyribonucleotides or ribonucleotides, or analogs thereof. The term "nucleotide" refers to a nucleoside having one or more phosphate groups joined in ester linkages to the sugar moiety. Exemplary nucleotides include nucleoside monophosphates, diphosphates and triphosphates.
In general, the terms "nucleic acid molecule," "nucleic acid sequence," "nucleic acid fragment," "oligonucleotide" and "polynucleotide" are used interchangeably and are intended to include, but are not limited to, a polymeric form of nucleotides that may have various lengths, either deoxyribonucleotides (DNA) or ribonucleotides (RNA), or analogs thereof. An oligonucleotide is typically composed of a specific sequence of four nucleotide bases: adenine (A); cytosine (C); guanine (G); and thymine (T) (uracil (U) for thymine (T) when the polynucleotide is RNA). According to certain aspects, deoxynucleotides (dNTPs, such as dATP, dCTP, dGTP, dTTP) may be used. According to certain aspects, ribonucleotide triphosphates (rNTPs) may be used. According to certain aspects, ribonucleotide diphosphates (rNDPs) may be used.
The term "oligonucleotide sequence" is the alphabetical representation of a polynucleotide molecule; alternatively, the term may be applied to the polynucleotide molecule itself. This alphabetical representation can be input into databases in a computer having a central processing unit and used for bioinformatics applications such as functional genomics and homology searching. Oligonucleotides may optionally include one or more non-standard nucleotide(s), nucleotide analog(s) and/or modified nucleotides. The present disclosure contemplates any deoxyribonucleotide or ribonucleotide and chemical variants thereof, such as methylated, hydroxymethylated or glycosylated forms of the bases, and the like. According to certain aspects, natural nucleotides are used in the methods of making the nucleic acids. Natural nucleotides lack chain terminating moieties. According to certain aspects, nucleotides with blocking groups or reversible terminators can be used in certain embodiments. Nucleotides with blocking groups or reversible terminators are known to those of skill in the art.
The terms "nucleotide analog," "altered nucleotide" and "modified nucleotide" refer to a non-standard nucleotide, including non-naturally occurring ribonucleotides or deoxyribonucleotides. In certain exemplary embodiments, nucleotide analogs are modified at any position so as to alter certain chemical properties of the nucleotide yet retain the ability of the nucleotide analog to perform its intended function. Examples of positions of the nucleotide which may be derivatized include the 5 position, e.g., 5-(2-amino)propyl uridine, 5-bromo uridine, 5-propyne uridine, 5-propenyl uridine, etc.; the 6 position, e.g., 6-(2-amino)propyl uridine; the 8-position for adenosine and/or guanosines, e.g., 8-bromo guanosine, 8-chloro guanosine, 8-fluoroguanosine, etc. Nucleotide analogs also include deaza nucleotides, e.g., 7-deaza-adenosine; O- and N-modified (e.g., alkylated, e.g., N6-methyl adenosine, or as otherwise known in the art) nucleotides; and other heterocyclically modified nucleotide analogs such as those described in Herdewijn, Antisense Nucleic Acid Drug Dev., 2000 Aug. 10(4):297-310.
Nucleotide analogs may also comprise modifications to the sugar portion of the nucleotides. For example, the 2' OH-group may be replaced by a group selected from H, OR, R, F, Cl, Br, I, SH, SR, NH2, NHR, NR2, COOR, or OR, wherein R is substituted or unsubstituted C1-C6 alkyl, alkenyl, alkynyl, aryl, etc. Other possible modifications include those described in U.S. Pat. Nos. 5,858,988, and 6,291,438.
The phosphate group of the nucleotide may also be modified, e.g., by substituting one or more of the oxygens of the phosphate group with sulfur (e.g., phosphorothioates), or by making other substitutions which allow the nucleotide to perform its intended function such as described in, for example, Eckstein, Antisense Nucleic Acid Drug Dev. 2000 Apr. 10(2):117-21, Rusckowski et al. Antisense Nucleic Acid Drug Dev. 2000 Oct. 10(5):333-45, Stein, Antisense Nucleic Acid Drug Dev. 2001 Oct. 11(5):317-25, Vorobjev et al. Antisense Nucleic Acid Drug Dev. 2001 Apr. 11(2):77-85, and U.S. Pat. No. 5,684,143. Certain of the above-referenced modifications (e.g., phosphate group modifications) decrease the rate of hydrolysis of, for example, polynucleotides comprising said analogs in vivo or in vitro.
Examples of modified nucleotides include, but are not limited to, diaminopurine, S2T, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5'-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid (v), 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl) uracil, (acp3)w, 2,6-diaminopurine and the like. Nucleic acid molecules may also be modified at the base moiety (e.g., at one or more atoms that typically are available to form a hydrogen bond with a complementary nucleotide and/or at one or more atoms that are not typically capable of forming a hydrogen bond with a complementary nucleotide), sugar moiety or phosphate backbone. Nucleic acid molecules may also contain amine-modified groups, such as aminoallyl-dUTP (aa-dUTP) and aminohexhylacrylamide-dCTP (aha-dCTP) to allow covalent attachment of amine reactive moieties, such as N-hydroxysuccinimide esters (NHS).
A nucleic acid used in the invention can also include native or non-native bases. In this regard a native deoxyribonucleic acid can have one or more bases selected from the group consisting of adenine, thymine, cytosine or guanine and a ribonucleic acid can have one or more bases selected from the group consisting of uracil, adenine, cytosine or guanine.
Exemplary non-native bases that can be included in a nucleic acid, whether having a native backbone or analog structure, include, without limitation, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, 5-methylcytosine, 5-hydroxymethyl cytosine, 2-aminoadenine, 6-methyl adenine, 6-methyl guanine, 2-propyl guanine, 2-propyl adenine, 2-thiouracil, 2-thiothymine, 2-thiocytosine, 15-halouracil, 15-halocytosine, 5-propynyl uracil, 5-propynyl cytosine, 6-azo uracil, 6-azo cytosine, 6-azo thymine, 5-uracil, 4-thiouracil, 8-halo adenine or guanine, 8-amino adenine or guanine, 8-thiol adenine or guanine, 8-thioalkyl adenine or guanine, 8-hydroxyl adenine or guanine, 5-halo substituted uracil or cytosine, 7-methylguanine, 7-methyladenine, 8-azaguanine, 8-azaadenine, 7-deazaguanine, 7-deazaadenine, 3-deazaguanine, 3-deazaadenine or the like.
In certain embodiments, unique barcode sequences may be attached to each nucleic acid, i.e., DNA or RNA strands. Adapters and/or primers or other reagents known to those of skill in the art may then be used as desired to sequence or amplify the nucleic acid with the unique barcode sequence.
Polymerases
According to an alternate embodiment of the present invention, polymerases are used to build nucleic acid molecules, such as for representing information which is referred to herein as being recorded in the nucleic acid sequence or the nucleic acid is referred to herein as being storage media. Polymerases are enzymes that produce a nucleic acid sequence, for example, using DNA or RNA as a template. Polymerases that produce RNA polymers are known as RNA polymerases, while polymerases that produce DNA polymers are known as DNA polymerases. Polymerases that incorporate errors are known in the art and are referred to herein as "error-prone polymerases". Template independent polymerases may be error prone polymerases. Using an error-prone polymerase allows the incorporation of specific bases at precise locations of the DNA molecule. Error-prone polymerases will either accept a non-standard base, such as a reversible chain terminating base, or will incorporate a different nucleotide, such as a natural or unmodified nucleotide that is selectively given to it as it tries to copy a template. Template-independent polymerases such as terminal deoxynucleotidyl transferase (TdT), also known as DNA nucleotidylexotransferase (DNTT) or terminal transferase, create nucleic acid strands by catalyzing the addition of nucleotides to the 3' terminus of a DNA molecule without a template. The preferred substrate of TdT is a 3'-overhang, but it can also add nucleotides to blunt or recessed 3' ends. Cobalt is a cofactor; however, the enzyme catalyzes reaction upon Mg and Mn administration in vitro. Nucleic acid initiators may be 4 or 5 nucleotides or longer and may be single stranded or double stranded. Double stranded initiators may have a 3' overhang or they may be blunt ended or they may have a 3' recessed end.
TdT, like all DNA polymerases, also requires divalent metal ions for catalysis. However, TdT is unique in its ability to use a variety of divalent cations such as Co2+, Mn2+, Zn2+ and Mg2+. In general, the extension rate of the primer p(dA)n (where n is the chain length from 4 through 50) with dATP in the presence of divalent metal ions is ranked in the following order: Mg2+ > Zn2+ > Co2+ > Mn2+. In addition, each metal ion has different effects on the kinetics of nucleotide incorporation. For example, Mg2+ facilitates the preferential utilization of dGTP and dATP whereas Co2+ increases the catalytic polymerization efficiency of the pyrimidines, dCTP and dTTP. Zn2+ behaves as a unique positive effector for TdT since reaction rates with Mg2+ are stimulated by the addition of micromolar quantities of Zn2+. This enhancement may reflect the ability of Zn2+ to induce conformational changes in TdT that yield higher catalytic efficiencies. Polymerization rates are lower in the presence of Mn2+ compared to Mg2+, suggesting that Mn2+ does not support the reaction as efficiently as Mg2+. Further description of TdT is provided in Biochim Biophys Acta., May 2010; 1804(5): 1151-1166, hereby incorporated by reference in its entirety. In addition, one may replace Mg2+, Zn2+, Co2+, or Mn2+ in the nucleotide pulse with other cations designed to modulate nucleotide attachment. For example, if the nucleotide pulse replaces Mg++ with other cation(s), such as Na+, K+, Rb+, Be++, Ca++, or Sr++, then the nucleotide can bind but not incorporate, thereby regulating whether the nucleotide will incorporate or not. Then a pulse of (optional) pre-wash without nucleotide or Mg++ can be provided or then Mg++ buffer without nucleotide can be provided.
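The cation preferences stated above can be captured in a small lookup, sketched here in Python. The mapping follows the text (Mg2+ for dGTP and dATP, Co2+ for dCTP and dTTP); no concentrations are given because none are stated. The names `PREFERRED_CATION`, `BINDING_ONLY_CATIONS`, `pulse_plan` and `can_incorporate` are hypothetical.

```python
# Preferred divalent cation per dNTP, as stated in the text above.
PREFERRED_CATION = {"dATP": "Mg2+", "dGTP": "Mg2+", "dCTP": "Co2+", "dTTP": "Co2+"}

# Cations under which the nucleotide can bind but is not incorporated.
BINDING_ONLY_CATIONS = {"Na+", "K+", "Rb+", "Be2+", "Ca2+", "Sr2+"}


def pulse_plan(sequence):
    """Return one (dNTP, cation) pair per synthesis cycle for a target sequence."""
    plan = []
    for base in sequence:
        dntp = "d" + base + "TP"
        plan.append((dntp, PREFERRED_CATION[dntp]))
    return plan


def can_incorporate(cation):
    """True if the cation supports incorporation rather than binding only."""
    return cation not in BINDING_ONLY_CATIONS


plan = pulse_plan("GATC")
```

Keeping the reaction condition as data for each nucleotide makes it straightforward to vary the condition from cycle to cycle, as contemplated for different dNTP types.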
By controlling the primer/initiator, the nucleotide substrate, or the polymerase or apyrase, the incorporation of specific nucleic acids into the polymer can be regulated. Thus, these polymerases are capable of incorporating nucleotides independent of the template sequence and are therefore beneficial for creating nucleic acid sequences de novo. The combination of an error-prone polymerase and a primer sequence serves as a writing mechanism for imparting information into a nucleic acid sequence.
By controlling the primer/initiator, the nucleotide substrate, or the template independent polymerase, the addition of a nucleotide to an initiator sequence or an existing nucleotide or oligonucleotide can be regulated to produce an oligonucleotide by extension. Thus, these polymerases are capable of incorporating nucleotides without a template sequence and are therefore beneficial for creating nucleic acid sequences de novo.
Sequencing
According to certain aspects, polymers such as nucleotide sequences, including DNA strands identified herein may be sequenced by passing the strand through nanopores or nanogaps or nanochannels to determine the individual nucleic acid/nucleotide.
"Nanopore" means a hole or passage having a nanometer scale width. Exemplary nanopores include a hole or passage through a membrane formed by a multimeric protein ring. Typically, the passage is 0.2-25 nm wide. Nanopores, as used herein, may include transmembrane structures that may permit the passage of molecules through a membrane. Examples of nanopores include α-hemolysin (Staphylococcus aureus) and MspA (Mycobacterium smegmatis). Other examples of nanopores may be found in the art describing nanopore sequencing or described in the art as pore-forming toxins, such as the β-PFTs Panton-Valentine leukocidin S, aerolysin, and Clostridial Epsilon-toxin, the α-PFTs cytolysin A, the binary PFT anthrax toxin, or others such as pneumolysin or gramicidin. Nanopores have become technologically and economically significant with the advent of nanopore sequencing technology. Methods for nanopore sequencing are known in the art, for example, as described in US 5,795,782, which is incorporated by reference. Briefly, nanopore detection involves a nanopore-perforated membrane immersed in a voltage-conducting fluid, such as an ionic solution including, for example, KCl, NaCl, NiCl2, LiCl or other ion forming inorganic compounds known to those of skill in the art. A voltage is applied across the membrane, and an electric current results from the conduction of ions through the nanopore. When the nanopore interacts with polymers, such as DNA or other non-DNA polymers, flow through the nanopore is modulated in a monomer-specific manner, resulting in a change in the current that permits identification of the monomer(s). Nanopores within the scope of the present disclosure include solid state nonprotein nanopores known to those of skill in the art and DNA origami nanopores known to those of skill in the art.
Such nanopores provide a nanopore width larger than known protein nanopores which allow the passage of larger molecules for detection while still being sensitive enough to detect a change in ionic current when the complex passes through the nanopore.
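As a toy illustration of monomer identification from a current change, the following Python sketch assigns each measured current to the nucleotide whose reference level is nearest. The reference levels are hypothetical placeholders and are not measured values. Practical basecallers model the combined signal of several neighboring nucleotides and are considerably more elaborate.

```python
# Hypothetical mean blockade currents in picoamperes; placeholders, not measured data.
LEVELS_PA = {"A": 50.0, "C": 40.0, "G": 60.0, "T": 30.0}


def call_base(current_pa):
    """Return the nucleotide whose reference level is closest to the measured current."""
    return min(LEVELS_PA, key=lambda base: abs(LEVELS_PA[base] - current_pa))


def call_sequence(trace):
    """Call one nucleotide per current measurement (assumes one measurement per monomer)."""
    return "".join(call_base(value) for value in trace)


called = call_sequence([49.2, 41.1, 58.7, 31.0])
```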
"Nanopore sequencing" means a method of determining the components of a polymer based upon interaction of the polymer with the nanopore. Nanopore sequencing may be achieved by measuring a change in the conductance of ions through a nanopore that occurs when the size of the opening is altered by interaction with the polymer. In addition to a nanopore, the present disclosure envisions the use of a nanogap which is known in the art as being a gap between two electrodes where the gap is about a few nanometers in width such as between about 0.2 nm to about 25 nm or between about 2 and about 5 nm. The gap mimics the opening in a nanopore and allows polymers to pass through the gap and between the electrodes. Aspects of the present disclosure also envision use of a nanochannel, wherein electrodes are placed adjacent to the nanochannel through which the polymer passes. It is to be understood that one of skill will readily envision different embodiments of molecule or moiety identification and sequencing based on movement of a molecule or moiety through an electric field and creating a distortion of the electric field representative of the structure passing through the electric field.
Methods described herein are capable of generating large amounts of data (billions of bits). Accordingly, high throughput methods of sequencing these nucleic acid molecules, such as that disclosed in Mitra (1999) Nucleic Acids Res. 27(24):e34; pp.1-6, are useful. In preferred embodiments, high throughput methods are used with PCR amplicons or other nucleic acid molecules having lengths of less than 100 bp. In other preferred embodiments, PCR amplicons of 100 bp, 110 bp, 120 bp, 130 bp, 140 bp, 150 bp, 160 bp, 170 bp, 180 bp, 190 bp, 200 bp, 250 bp, 300 bp, 350 bp, 400 bp, 450 bp, 500 bp, 550 bp, 600 bp, 650 bp, 700 bp, 750 bp, 800 bp, 850 bp, 900 bp, 950 bp, 1000 bp or more may be used.
Sequencing methods useful in the present disclosure include sequencing-by-ligation, sequencing-by-synthesis, and sequencing-by-hybridization known to one skilled in the art. See Shendure et al., Accurate multiplex polony sequencing of an evolved bacterial genome, Science, vol. 309, p. 1728-32. 2005; Drmanac et al., Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays, Science, vol. 327, p. 78-81. 2009; McKernan et al., Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding, Genome Res., vol. 19, p. 1527-41. 2009; Rodrigue et al., Unlocking short read sequencing for metagenomics, PLoS One, vol. 28, e11840. 2010; Rothberg et al., An integrated semiconductor device enabling non-optical genome sequencing, Nature, vol. 475, p. 348-352. 2011; Margulies et al., Genome sequencing in microfabricated high-density picolitre reactors, Nature, vol. 437, p. 376-380. 2005; Rasko et al., Origins of the E. coli strain causing an outbreak of hemolytic-uremic syndrome in Germany, N. Engl. J. Med., Epub. 2011; Hutter et al., Labeled nucleoside triphosphates with reversibly terminating aminoalkoxyl groups, Nucleos. Nucleot. Nucl., vol. 92, p. 879-895. 2010; Seo et al., Four-color DNA sequencing by synthesis on a chip using photocleavable fluorescent nucleotides, Proc. Natl. Acad. Sci. USA., vol. 102, p. 5926-5931 (2005); Olejnik et al., Photocleavable biotin derivatives: a versatile approach for the isolation of biomolecules, Proc. Natl. Acad. Sci. U.S.A., vol. 92, p. 7590-7594. 1995; US 5,750,34; US 2009/0062129 and US 2009/0191553.
Sequencing primers according to the present disclosure are those that are capable of binding to a known binding region of the target polynucleotide and facilitating ligation of an oligonucleotide probe of the present disclosure. Sequencing primers may be designed with the aid of a computer program such as, for example, DNAWorks or Gene2Oligo. The binding region can vary in length but it should be long enough to hybridize the sequencing primer. Target polynucleotides may have multiple different binding regions, thereby allowing different sections of the target polynucleotide to be sequenced. Sequencing primers are selected to form highly stable duplexes so that they remain hybridized during successive cycles of ligation. Sequencing primers can be selected such that ligation can proceed in either the 5' to 3' direction or the 3' to 5' direction or both. Sequencing primers may contain modified nucleotides or bonds to enhance their hybridization efficiency, improve their stability, or prevent extension from one terminus or the other.
According to one aspect, single stranded DNA templates (ssDNA) are prepared by PCR amplification to be used with sequencing primers. Alternatively, single stranded template is attached to beads or nanoparticles in an emulsion and amplified through ePCR.
Supports and Attachment
In certain exemplary embodiments, one or more oligonucleotide sequences described herein are immobilized on a support (e.g., a solid and/or semi-solid support). In certain aspects, an oligonucleotide sequence can be attached to a support using one or more of the phosphoramidite linkers described herein. Suitable supports include, but are not limited to, slides, beads, chips, particles, strands, gels, sheets, tubing, spheres, containers, capillaries, pads, slices, films, plates and the like. In various embodiments, a solid support may be biological, nonbiological, organic, inorganic, or any combination thereof. Supports of the present invention can be any shape, size, or geometry as desired. Supports may be made from glass (silicon dioxide), metal, ceramic, polymer or other materials known to those of skill in the art. Supports may be a solid, semi-solid, elastomer or gel.
In certain exemplary embodiments, a support is a microarray. Oligonucleotides immobilized on microarrays include nucleic acids that are generated in or from an assay reaction. Typically, the oligonucleotides or polynucleotides on microarrays are single stranded and are covalently attached to the solid phase support, usually by a 5'-end or a 3'-end. In certain exemplary embodiments, probes are immobilized via one or more cleavable linkers.
As used herein, the term "attach" refers to both covalent interactions and noncovalent interactions. A covalent interaction is a chemical linkage between two atoms or radicals formed by the sharing of a pair of electrons (i.e., a single bond), two pairs of electrons (i.e., a double bond) or three pairs of electrons (i.e., a triple bond). Covalent interactions are also known in the art as electron pair interactions or electron pair bonds. Noncovalent interactions include, but are not limited to, van der Waals interactions, hydrogen bonds, weak chemical bonds (i.e., via short-range noncovalent forces), hydrophobic interactions, ionic bonds and the like. A review of noncovalent interactions can be found in Alberts et al., in Molecular Biology of the Cell, 3d edition, Garland Publishing, 1994.
According to certain aspects, affixing or immobilizing nucleic acid molecules to the substrate is performed using a covalent linker that is selected from the group that includes oxidized 3-methyl uridine, an acrylyl group and hexaethylene glycol. In addition to the attachment of linker sequences to the molecules of the pool for use in directional attachment to the support, a restriction site or regulatory element (such as a promoter element, cap site or translational termination signal) is, if desired, joined with the members of the pool. Nucleic acids that have been synthesized on the surface of a support may be removed, such as by a cleavable linker or linkers known to those of skill in the art. Linkers can be designed with chemically reactive segments which are optionally cleavable with agents such as enzymes, light, heat, pH buffers, and redox reagents. Such linkers can be employed to pre-fabricate an in situ solid-phase inactive reservoir of a different solution-phase primer for each discrete feature. Upon linker cleavage, the primer would be released into solution for PCR, perhaps by using the heat from the thermocycling process as the trigger.
It is also contemplated that affixing of nucleic acid molecules to the support is performed via hybridization of the members of the pool to nucleic acid molecules that are covalently bound to the support.
Reagent Delivery Systems
According to certain aspects, reagents and washes are delivered such that the reactants are present at a desired location for a desired period of time to, for example, covalently attach a dNTP to an initiator sequence or an existing nucleotide attached at the desired location. A selected nucleotide reagent liquid is pulsed or flowed or deposited at the reaction site where reaction takes place and then may be optionally followed by delivery of a buffer or wash that does not include the nucleotide. Suitable delivery systems include fluidics systems, microfluidics systems, syringe systems, ink jet systems, pipette systems and other fluid delivery systems known to those of skill in the art. Various flow cell embodiments, flow channel embodiments or microfluidic channel embodiments are envisioned which can deliver separate reagents or a mixture of reagents or washes, using pumps or electrodes or other methods known to those of skill in the art of moving fluids through channels or microfluidic channels, through one or more channels to a reaction region or vessel where the surface of the substrate is positioned so that the reagents can contact the desired location where a nucleotide is to be added.
According to another embodiment, a microfluidic device is provided with one or more reservoirs which include one or more reagents which are then transferred via microchannels to a reaction zone where the reagents are mixed and the reaction occurs. Such microfluidic devices and the methods of moving fluid reagents through such microfluidic devices are known to those of skill in the art.
Immobilized nucleic acid molecules may, if desired, be produced using a device (e.g., any commercially-available inkjet printer, which may be used in substantially unmodified form) which sprays a focused burst of reagent-containing solution onto a support (see Castellino (1997) Genome Res. 7:943-976, incorporated herein in its entirety by reference). Such a method is currently in practice at Incyte Pharmaceuticals and Rosetta Biosystems, Inc., the latter of which employs "minimally modified Epson inkjet cartridges" (Epson America, Inc.; Torrance, CA). The method of inkjet deposition depends upon the piezoelectric effect, whereby a narrow tube containing a liquid of interest (in this case, oligonucleotide synthesis reagents) is encircled by an adapter. An electric charge sent across the adapter causes the adapter to expand at a different rate than the tube, and forces a small drop of liquid reagents from the tube onto a coated slide or other support.
Reagents can be deposited onto a discrete region of the support, such that each region forms a feature of the array. The feature is capable of generating an anion toroidal vortex as described herein. The desired nucleic acid sequence can be synthesized drop-by-drop at each position, as is true for other methods known in the art. If the angle of dispersion of reagents is narrow, it is possible to create an array comprising many features. Alternatively, if the spraying device is more broadly focused, such that it disperses nucleic acid synthesis reagents in a wider angle, as much as an entire support is covered each time, and an array is produced in which each member has the same sequence (i.e., the array has only a single feature).
The following examples are set forth as being representative of the present disclosure. These examples are not to be construed as limiting the scope of the present disclosure as these and other equivalent embodiments will be apparent in view of the present disclosure, figures and accompanying claims.
EXAMPLES
Example I. Enzymatic DNA Synthesis for Information Encoding and Decoding
This example describes an embodiment in which nucleotide transitions are used to encode information in DNA oligonucleotide sequences synthesized by DNA polymerase catalysis. The encoded DNA sequence can be stored or decoded. Such enzymatic nucleotide synthesis can catalyze the linkage of naturally occurring deoxynucleotide triphosphates (dNTPs) rapidly, in a single step, and under non-toxic, biocompatible conditions, in contrast to chemical methods (Fig. 1).
In one embodiment, the methods used terminal deoxynucleotidyl transferase (TdT), a unique template-independent DNA polymerase which rampantly and indiscriminately adds dNTP substrates to the 3' termini of DNA strands (F. J. Bollum, Thermal conversion of nonpriming deoxyribonucleic acid to primer. J. Biol. Chem. 234, 2733-2734 (1959), F. J. Bollum, Oligodeoxyribonucleotide-primed reactions catalyzed by calf thymus polymerase. J. Biol. Chem. 237, 1945-1949 (1962), L. M. Chang, F. J. Bollum, Molecular biology of terminal transferase. CRC Crit. Rev. Biochem. 21, 27-52 (1986), E. A. Motea, A. J. Berdis, Terminal deoxynucleotidyl transferase: The story of a misguided DNA polymerase. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics. 1804, 1151-1166 (2010), K. I. Kato, J. M. Goncalves, G. E. Houts, F. J. Bollum, Deoxynucleotide-polymerizing enzymes of calf thymus gland. II. Properties of the terminal deoxynucleotidyltransferase. J. Biol. Chem. 242, 2780-2789 (1967)). As such, TdT has been used only in reactions where one nucleotide is added to indeterminate lengths (D. A. Jackson, R. H. Symons, P. Berg, Biochemical Method for Inserting New Genetic Information into DNA of Simian Virus 40: Circular SV40 DNA Molecules Containing Lambda Phage Genes and the Galactose Operon of Escherichia coli. Proceedings of the National Academy of Sciences. 69, 2904-2909 (1972), P. E. Lobban, A. D. Kaiser, Enzymatic end-to-end joining of DNA molecules. J. Mol. Biol. 78, 453-471 (1973), G. Deng, R. Wu, An improved procedure for utilizing terminal transferase to add homopolymers to the 3' termini of DNA. Nucleic Acids Res. 9, 4173-4188 (1981)). To control this polymerization activity, apyrase was used in the reaction with TdT. Apyrase is known to competitively degrade dNTPs into their TdT-inactive diphosphate and monophosphate precursors (M. Ronaghi, M. Uhlen, P. Nyren, A sequencing method based on real-time pyrophosphate. Science. 281, 363, 365 (1998), A. Ahmadian, B. Gharizadeh, D. O'Meara, J. Odeberg, J. Lundeberg, Genotyping by apyrase-mediated allele-specific extension. Nucleic Acids Res. 29, E121 (2001), E. Hultin, M. Kaller, A. Ahmadian, J. Lundeberg, Competitive enzymatic reaction to control allele-specific extensions. Nucleic Acids Res. 33, e48 (2005), F. Suri et al., Screening of common CYP1B1 mutations in Iranian POAG patients using a microarray-based PrASE protocol. Mol. Vis. 14, 2349-2356 (2008)). A mixture containing a tuned ratio of these two enzymes was created and optimized such that nucleotides are added by TdT before being degraded by apyrase (Figs. 2A-2C, Figs. 3A-3C, Figs. 4A-4C and Fig. 5). The lowest nucleotide concentrations required for maximum coupling efficiency were further determined (Fig. 6), such that adding nucleotide substrates in series would result in stepwise increases in DNA length (Fig. 7).
The present disclosure provides an enzymatic synthesis strategy that is rapid and simple, requiring few components to produce DNA with a given information content (Fig. 8A). Embodiments of the disclosure include a reaction mixture of short oligonucleotide initiators, TdT, and apyrase. The initiators are immobilized on solid supports, such as beads or a surface, to allow removal of reaction byproducts and facilitate downstream processing and amplification. Upon the addition of a nucleotide substrate, TdT extends the initiators until the substrate is degraded by apyrase, allowing immediate addition of subsequent nucleotide substrates. Adding a series of nucleotides results in a population of DNA strands, all extended by the same order of nucleotides. While extension lengths may vary across strands, the same information content is stored in the whole population as transitions between different or non-identical nucleotides (Fig. 8B). In an exemplary embodiment, trits were used to maximize information capacity, given three possible transitions for each nucleotide (Fig. 8C). Trits are the ternary equivalent of bits; one trit is log(3)/log(2) (about 1.58496) bits of information.
In one embodiment, the message "hello world!" was encoded and synthesized (Fig. 9A). To encode each character, its binary ASCII representation was first converted to ternary and then to nucleotide transitions (Table 1). Each character was then synthesized as its own DNA strand preceded by a header index to specify strand order. Following synthesis, these strands were ligated to a universal adapter, PCR amplified, and stored as a single pool without additional purification (Materials and Methods). These 12 DNA strands, each with 8 trits, carry the 144 bits of data (Table 1).
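The character-to-transition conversion described above can be sketched in Python. The actual trit-to-nucleotide mapping is defined by Fig. 8C and Table 1, which are not reproduced in this text, so the rule used below (each trit selects one of the three nucleotides that differ from the previous one, taken in A/C/G/T order) is an assumed stand-in. The 5-trit width, the starting nucleotide "G", and all function names are likewise hypothetical, and the header index is omitted.

```python
NUCLEOTIDES = "ACGT"


def to_trits(value, width=5):
    """Convert a non-negative integer to a fixed-width list of base-3 digits."""
    if not 0 <= value < 3 ** width:
        raise ValueError("value does not fit in the requested number of trits")
    trits = []
    for _ in range(width):
        trits.append(value % 3)
        value //= 3
    return trits[::-1]  # most significant trit first


def trits_to_transitions(trits, start="G"):
    """Map each trit to a nucleotide that differs from the previous one.

    Assumed rule (NOT the mapping of Fig. 8C): trit t picks the t-th of the
    three nucleotides, in A/C/G/T order, that are not the previous nucleotide.
    """
    previous = start
    out = []
    for t in trits:
        choices = [n for n in NUCLEOTIDES if n != previous]
        previous = choices[t]
        out.append(previous)
    return "".join(out)


def encode_char(ch, width=5, start="G"):
    """Encode one ASCII character as a string of nucleotide transitions."""
    return trits_to_transitions(to_trits(ord(ch), width), start)


if __name__ == "__main__":
    for ch in "hello world!":
        print(repr(ch), encode_char(ch))
```

Because every trit forces a change of nucleotide, the output never contains two identical adjacent nucleotides, which is the property that lets variable-length homopolymer extensions carry the same information.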
In one embodiment for decoding the message, the pool of DNA strands was sequenced using both Illumina and Oxford Nanopore platforms, and nucleotide transitions were extracted from each read by performing run-length encoding, a lossless data compression algorithm ubiquitously used in modern communications. In the more accurate Illumina dataset, it was found that the correct transition was the most abundant species, comprising 88.6%, on average, of sequences filtered for the expected number of transitions and 19%, on average, of all sequences (Fig. 9B). The remainder of the reads largely contained deletions and, to a smaller extent, mismatches and insertions (Figs. 10 and 11). The same pool was next sequenced with Oxford Nanopore MinION and a similar result was observed (Fig. 9B): the correct transition was the most abundant species, comprising, on average, 49.9% of filtered sequences and 5.1% of all sequences. These decreases are likely due to errors currently inherent to state-of-the-art nanopore sequencing (M. Jain et al., Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv (2017), p. 128835). Fast readout of information stored using the enzymatic DNA synthesis strategy may be accomplished with real-time sequencing. While Illumina sequences all DNA strands in parallel and reports the outcome in batch, arrays of nanopores from Oxford Nanopore asynchronously sequence DNA strands and produce a stream of data (Fig. 9C). This stream allows real-time data reconstruction, which makes it possible to minimize the time and cost of sequencing. In one embodiment, the nanopore sequencing time required for decoding the strands with greater than 99.9% confidence was determined by simulation (Methods). These simulations indicate that full data reconstruction required only half of the total sequencing amount initially used (Fig. 9D, Table 2).
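The read-processing step described above (collapse each read by run-length encoding, filter for the expected number of transitions, then take the most abundant species) can be illustrated with a short Python sketch. The function names and the toy reads are hypothetical, and `expected_length` here counts the nucleotides in the collapsed read; header handling, adapter trimming and the confidence simulations are not modeled.

```python
from collections import Counter
from itertools import groupby


def extract_transitions(read):
    """Run-length encode a read and keep only the order of the distinct runs."""
    return "".join(base for base, _run in groupby(read))


def consensus_transitions(reads, expected_length):
    """Return the most abundant collapsed read of the expected length.

    Reads whose collapsed form has a different length are discarded, which
    mirrors filtering for the expected number of transitions. Returns None
    when no read passes the filter.
    """
    collapsed = (extract_transitions(r) for r in reads)
    counts = Counter(t for t in collapsed if len(t) == expected_length)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    toy_reads = ["AACCTT", "ACCCT", "AGGT", "ACT", "AT"]
    print(consensus_transitions(toy_reads, expected_length=3))
```

Homopolymer length is discarded on purpose: strands extended by different numbers of the same nucleotide collapse to the same transition string.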
In another embodiment, additional improvements to read speed could be gained by selective sequencing, a strategy where nanopores reject decoded strands in favor of missing strands by detecting the index header (M. Loose, S. Malla, M. Stout, Real-time selective sequencing using nanopore technology. Nat. Methods. 13, 751-754 (2016)). More broadly, DNA translocation rates through nanopores, currently slowed for accurate single-base sequencing, may be increased since nucleotide transitions are, in principle, easier to detect (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005), M. Vega, P. Granell, C. Lasorsa, B. Lerner, M. Perez, Automated and inexpensive method to manufacture solid-state nanopores and micropores in robust silicon wafers. J. Phys. Conf. Ser. 687, 012029 (2016), B. McNally et al., Optical recognition of converted DNA nucleotides for single-molecule DNA sequencing using nanopore arrays. Nano Lett. 10, 2237-2244 (2010), R. dela Torre, J. Larkin, A. Singer, A. Meller, Fabrication and characterization of solid-state nanopore arrays for high-throughput DNA sequencing. Nanotechnology. 23, 385308 (2012)). These alternative design parameters facilitate the development of less precise sequencing technologies that are faster, more affordable, and specifically designed for DNA information storage.
The present disclosure contemplates improvements and design optimizations of the nucleotide encoding and decoding methods described herein. The current implementation of the methods results in an approximately 25-fold decrease in information density compared to the maximum possible for DNA, which is more than a thousand fold better than electronic storage systems (V. Zhirnov, R. M. Zadegan, G. S. Sandhu, G. M. Church, W. L. Hughes, Nucleic acid memory. Nat. Mater. 15, 366-370 (2016), G. M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science. 337, 1628 (2012), Y. Erlich, D. Zielinski, DNA Fountain enables a robust and efficient storage architecture. Science. 355, 950-954 (2017)). This reduction is due to three factors: forbidding transitions between identical nucleotides reduces the possible states from four to three (~1.3-fold loss, Fig. 8C), an average of four nucleotide extensions per transition (~4-fold loss, Table 1), and ~20% yield of perfectly synthesized strands (~5-fold loss, Fig. 9B). The disclosure contemplates addressing these limitations. Process engineering of reactions may improve coupling efficiencies, which in turn can enable efficient utilization of TdT's ability to add 300 to 20,000 nucleotides per strand (Fig. 4B, (L. M. Chang, F. J. Bollum, Molecular biology of terminal transferase. CRC Crit. Rev. Biochem. 21, 27-52 (1986))). Furthermore, coding systems that are tailored to these biochemical processes may enable the use of all transitions, by considering extension lengths, and provide highly efficient data recovery, saving on synthesis and sequencing costs even with imperfectly synthesized DNA strands. On the other hand, the length of nucleotide extensions per transition may be considered a design optimization and tuned according to application demands, trading density for read-out by specialized nanopore sequencing (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005), M. Vega, P. Granell, C. Lasorsa, B. Lerner, M. Perez, Automated and inexpensive method to manufacture solid-state nanopores and micropores in robust silicon wafers. J. Phys. Conf. Ser. 687, 012029 (2016), B. McNally et al., Optical recognition of converted DNA nucleotides for single-molecule DNA sequencing using nanopore arrays. Nano Lett. 10, 2237-2244 (2010), R. dela Torre, J. Larkin, A. Singer, A. Meller, Fabrication and characterization of solid-state nanopore arrays for high-throughput DNA sequencing. Nanotechnology. 23, 385308 (2012)).
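As a rough consistency check on the approximately 25-fold figure, the three contributions can be multiplied together. The sketch below assumes the factors are independent and multiplicative, and it derives the state-restriction factor from information per symbol (2 bits for a free choice among four nucleotides versus log2(3) bits for a choice among three transitions); both are assumptions of this illustration rather than a calculation stated in the text, and the function name is hypothetical.

```python
import math


def density_loss(extensions_per_transition=4.0, perfect_yield=0.20):
    """Estimate the fold-loss in information density from three factors."""
    # Restricting each step to three allowed transitions instead of four
    # nucleotides lowers the information per symbol from 2 bits to log2(3).
    loss_states = math.log2(4) / math.log2(3)
    # Several nucleotides are spent, on average, on each transition.
    loss_extension = extensions_per_transition
    # Only a fraction of strands are perfectly synthesized.
    loss_yield = 1.0 / perfect_yield
    return loss_states * loss_extension * loss_yield


if __name__ == "__main__":
    print(f"estimated loss: {density_loss():.1f}-fold")
```

Under these assumptions the product comes out close to 25, in line with the overall estimate given above.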
While this work illustrates DNA information storage in vitro, the disclosed biochemical approach provides a foundation for the development of biocompatible de novo molecular recording systems in vivo (S. L. Shipman, J. Nivala, J. D. Macklis, G. M. Church, CRISPR-Cas encoding of a digital movie into the genomes of a population of living bacteria. Nature. 547, 345-349 (2017), B. M. Zamft et al., Measuring cation dependent DNA polymerase fidelity landscapes by deep sequencing. PLoS One. 7, e43876 (2012)). Importantly, accurate information storage in DNA that does not require single-base accuracy was demonstrated, and alternative processes could yield potentially dramatic benefits to synthesis and sequencing cost and speed. With further enhancements through improved enzymatics and tailored coding systems, the quality of this enzymatic strategy can be improved and can inform the design of a complete read and write storage system to advance practical information storage in DNA.
Example II. Materials and Methods
TdT:apyrase system
TdT to apyrase ratio optimization
To obtain a ratio of TdT polymerization activity to apyrase degradation activity that would allow for net positive extension of the initiator, initiator extension by TdT was assessed in the presence of a wide range of apyrase concentrations with every dNTP substrate (Fig. 2A).
Each reaction was carried out in 20 μL total volume. All reaction components but the dNTP were assembled in 18 μL while the dNTP was prepared in 2 μL of water. The 18 μL mix was composed such that upon mixing with the 2 μL dNTP solution, the following initial composition would be obtained: 200 μM dNTP, 1X Enzymatics Green Buffer, 0.05 μM f-P5-SBS3 initiator oligo, 1 U/μL TdT, and 4, 2, 1, 0.5, or 0.25 milliunits (mU) of apyrase per microliter. To initiate the reaction, the 18 μL mixture was added to a tube containing the 2 μL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
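Because the dNTP and the remaining components are prepared in separate volumes and only reach their stated concentrations after mixing, each partial mix must be more concentrated than the final reaction. A minimal Python sketch of that dilution arithmetic follows; the helper name is hypothetical and the values are taken from the composition above.

```python
def stock_concentration(final_conc, part_volume, total_volume):
    """Concentration needed in a partial mix to reach final_conc after mixing.

    part_volume is the volume of the partial mix carrying the component and
    total_volume is the final reaction volume (same units for both).
    """
    if part_volume <= 0 or part_volume > total_volume:
        raise ValueError("part_volume must be positive and no larger than total_volume")
    return final_conc * total_volume / part_volume


if __name__ == "__main__":
    # 200 uM dNTP final, prepared in 2 uL and mixed into a 20 uL reaction.
    print(stock_concentration(200, 2, 20), "uM dNTP in the 2 uL aliquot")
    # 0.05 uM initiator final, carried in the 18 uL mix.
    print(stock_concentration(0.05, 18, 20), "uM initiator in the 18 uL mix")
```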
The results show that the TdT:apyrase mixture behaves as expected: increasing levels of apyrase activity lead to shorter extensions, while decreasing levels lead to longer extensions by TdT. These results also indicate that 0.25 to 1 mU/μL of apyrase allows for some extension with all nucleotides. As the exact level of activity for each nucleotide should be tunable based on its own concentration, [Figure imgf000067_0001] or 1 mU/μL of apyrase was used for all ensuing experiments.
In fact, this tunability of the extension reaction was tested by altering both apyrase concentrations and nucleotide concentrations (Figs. 2B and 2C). Extension experiments were carried out with TdT:apyrase mix of varying apyrase concentrations with both dCTP (Fig. 2B) and dGTP (Fig. 2C), representing pyrimidines and purines, respectively. Each reaction was carried out in 20 μL total volume. All reaction components but the dNTP were assembled in 16 μL while the dNTP was prepared in 4 μL of water. The 16 μL mix was composed such that upon mixing with the 4 μL dNTP solution, the following initial composition would be obtained: 1X Enzymatics Green Buffer, 0.1 μM 150617_LT2 initiator (AGATCAATTAATACGATACCTGCG) (36), 1 U/μL TdT, and 0.125, 0.25, 0.5, or 1 mU/μL apyrase. The starting final concentration of substrate was varied at 5, 10, 20, 40, or 80 μM for dCTP or at 1.25, 2.5, 5, 10, or 20 μM for dGTP. To initiate the reaction, the 16 μL mixture was added to a tube containing the 4 μL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
The results show that not only are extension levels tunable based on the dNTP and apyrase concentrations, but apyrase and dNTP concentrations also have a linear inverse effect, at least in the ranges tested. That is, doubling the apyrase concentration will have a similar impact on extension profiles as halving the dNTP concentration, and vice versa. In other words, samples with a similar [dCTP]/[apyrase] ratio extend similarly. For example, one can see that 20 μM dCTP with 1 U/μL apyrase leads to the same level of extension as 10 μM dCTP with 0.5 U/μL apyrase, 5 μM dCTP with 0.25 U/μL apyrase, and 2.5 μM dCTP with 0.125 U/μL apyrase.
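The four conditions named in this example share a single [dCTP]/[apyrase] ratio, which a few lines of Python make explicit. The function name is hypothetical, and the apyrase values are used with the units exactly as stated in the example.

```python
def substrate_to_apyrase_ratio(dntp_uM, apyrase_per_uL):
    """Ratio of substrate concentration to apyrase activity for one condition."""
    if apyrase_per_uL <= 0:
        raise ValueError("apyrase activity must be positive")
    return dntp_uM / apyrase_per_uL


# (dCTP in uM, apyrase activity per uL) pairs quoted in the example.
CONDITIONS = [(20, 1), (10, 0.5), (5, 0.25), (2.5, 0.125)]

if __name__ == "__main__":
    for dctp, apyrase in CONDITIONS:
        print(dctp, apyrase, substrate_to_apyrase_ratio(dctp, apyrase))
```

Each pair gives the same ratio, which is the sense in which halving the substrate and halving the apyrase leave the extension profile unchanged.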
Optimizing the reaction conditions for TdT:apyrase
Divalent cation
The effect of divalent cations on performance of TdT has been extensively studied (T. P. Chirpich, The effect of different buffers on terminal deoxynucleotidyl transferase activity. Biochim. Biophys. Acta. 518, 535-538 (1978), M. R. Deibel Jr, M. S. Coleman, Biochemical properties of purified human terminal deoxynucleotidyltransferase. J. Biol. Chem. 255, 4206-4212 (1980), L. M. Chang, F. J. Bollum, Multiple roles of divalent cation in the terminal deoxynucleotidyltransferase reaction. J. Biol. Chem. 265, 17436-17440 (1990)). Commercially available preparations of TdT often come with one of two different buffer condition recommendations. In one embodiment, the buffer system as disclosed is based on magnesium as the divalent cation with the option of supplementing cobalt. In another embodiment, the buffer system as disclosed is based on cobalt as the sole divalent cation. The performance of the TdT:apyrase system was evaluated in all three conditions, namely, magnesium as the only divalent cation, magnesium supplemented with cobalt, and cobalt as the only divalent cation. For that, two experiments were carried out, comparing the magnesium-with-cobalt and cobalt-only conditions separately with the magnesium-only condition (Figs. 3A-3C).
In comparing the magnesium-only condition with the magnesium-with-cobalt condition (Fig. 3A), each reaction was carried out in 20 μL total volume. All reaction components but the dNTP were assembled in 18 μL while the dNTP was prepared in 2 μL of water. The 18 μL mix was composed such that upon mixing with the 2 μL dNTP solution, the following initial composition would be obtained: 200 μM dNTP, 1X Enzymatics Green Buffer, 0.05 μM f-P5-SBS3 initiator oligo, 250 μM cobalt chloride (if present), 1 U/μL TdT, and 4, 2, 1, 0.5, or 0.25 milliunits (mU) of apyrase per microliter. To initiate the reaction, the 18 μL mixture was added to a tube containing the 2 μL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
While cobalt's presence affected different nucleotides differently, in general, it seemed to increase the size heterogeneity of the extended product. Importantly, in almost all reactions with cobalt, a substantial fraction of the initiator molecules remained unextended even with very low apyrase levels at which all initiator molecules are extended in the cobalt-free reaction equivalent. This observation perhaps hints at a transient structural or conformational state in the initiator or the enzyme, caused by the presence of cobalt alongside magnesium, which prevents extension.
This supplemental cobalt effect was further evaluated by comparing reactions with various levels of added cobalt chloride (Fig. 3B). In this experiment, each reaction was carried out in 20 μL total volume. All reaction components but the dNTP and cobalt were assembled in 14 μL while the dNTP and desired amount of cobalt were prepared in 6 μL volume. The 14 μL mix was prepared as a master mix for all reactions and composed such that upon mixing with the 6 μL dNTP and cobalt solution, the following initial composition would be obtained: 300 μM dATP, 0.05 μM f-P5-SBS3 initiator oligo, 1X Enzymatics Green Buffer, 1 U/μL TdT, 1 mU/μL apyrase and 50, 100, 150, 200, 250, or 300 μM cobalt chloride. To initiate the reaction, the 14 μL mixture was added to a tube containing the 6 μL dATP and cobalt mixture and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
The results confirm that supplementary cobalt chloride results in a lack of extension in a fraction of initiators with the TdT:apyrase system. The fraction of unextended products grows with increasing cobalt chloride. This effect can be detrimental to the synthesis scheme as "deletion" mutants would persist even at high average homopolymer extension lengths. Consequently, it was decided not to use supplementary cobalt chloride in the reaction.
Next, a series of experiments was performed to compare magnesium-only conditions with cobalt-only conditions. Extension experiments were carried out with the TdT:apyrase mix in either the magnesium-based Enzymatics Green Buffer or the cobalt-based Promega TdT Buffer with each of the four nucleotides (Fig. 3C). Each reaction was carried out in 20 μL total volume. All reaction components but the dNTP were assembled in 16 μL while the dNTP was prepared in 4 μL of water. The 16 μL mix was composed such that upon mixing with the 4 μL dNTP solution, the following initial composition would be obtained: 1X Enzymatics Green Buffer (composition of 10X Green Buffer (B0120) from Enzymatics according to the manufacturer: 200 mM Tris-Acetate, 500 mM Potassium Acetate, 100 mM Magnesium Acetate, pH 7.9 @ 25°C) or 1X Promega TdT buffer (composition of Terminal Transferase 5X Buffer (M189A) from Promega according to the manufacturer: 500 mM cacodylate buffer (pH 6.8), 5 mM CoCl2 and 0.5 mM DTT), 0.1 μM 150617_LT2 initiator, 1 U/μL TdT, and 1 mU/μL apyrase. The starting final concentration of dNTPs was varied at 25, 50, 100, 200, or 400 μM for dCTP, dATP, and dTTP, or at 12.5, 25, 50, 100, or 200 μM for dGTP. To initiate the reaction, the 16 μL mixture was added to a tube containing the 4 μL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
Side-by-side comparison of extension dynamics with the TdT:apyrase mix for each nucleotide makes a few patterns clear. First, extension with pyrimidines (dCTP and dTTP) is stimulated by cobalt as the divalent cation while extension with purines (dATP and dGTP) is hampered. This observation is consistent with previous reports about TdT behavior (L. M. Chang, F. J. Bollum, Multiple roles of divalent cation in the terminal deoxynucleotidyltransferase reaction. J. Biol. Chem. 265, 17436-17440 (1990), K. I. Kato, J. M. Goncalves, G. E. Houts, F. J. Bollum, Deoxynucleotide-polymerizing enzymes of calf thymus gland. II. Properties of the terminal deoxynucleotidyltransferase. J. Biol. Chem. 242, 2780-2789 (1967)). Second, the distribution of extension lengths is wider in almost all samples with cobalt as the divalent cation. As this outcome is not favorable for the DNA synthesis scheme, a magnesium-based buffer was chosen for the experiments.
Buffer and salt concentrations
Given previous results on the effect of various monovalent ions and buffering agents on TdT activity (T. P. Chirpich, The effect of different buffers on terminal deoxynucleotidyl transferase activity. Biochim. Biophys. Acta. 518, 535-538 (1978), K. I. Kato, J. M. Goncalves, G. E. Houts, F. J. Bollum, Deoxynucleotide-polymerizing enzymes of calf thymus gland. II. Properties of the terminal deoxynucleotidyltransferase. J. Biol. Chem. 242, 2780-2789 (1967), F. Grosse, A. Manns, Terminal deoxyribonucleotidyl transferase (EC 2.7.7.31). Methods Mol. Biol. 16, 95-105 (1993)), experiments were conducted to determine the optimal conditions for the TdT:apyrase mix in the buffer space around the Enzymatics Green Buffer (Fig. 4A).
Each reaction was carried out in 20 μL total volume. All reaction components but the dNTP and buffer were assembled in 14 μL while the dNTP and desired amount of buffer were prepared in 6 μL volume. The 14 μL mix was prepared as a master mix for all reactions and composed such that upon mixing with the 6 μL dNTP and buffer solution, the following initial composition would be obtained: 300 μM dATP, 0.05 μM f-P5-SBS3 initiator oligo, 1 U/μL TdT, 1 mU/μL apyrase and 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0, 1.2, or 1.4X Enzymatics Green Buffer. To initiate the reaction, the 14 μL mixture was added to a tube containing the 6 μL dNTP mix and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
The greatest extension length, together with the lowest variability in product size distribution, was observed with 0.6-0.7X Enzymatics Green Buffer. Based on these results, it was determined that 0.7X Enzymatics Green Buffer, composed of 14 mM Tris-Acetate, 35 mM Potassium Acetate, 7 mM Magnesium Acetate, pH 7.9, is a more favorable condition for using TdT:apyrase.
Additives
The effects of various additives that are commonly used to enhance the performance of DNA polymerases or other DNA-binding enzymes were also explored, seeking to obtain more consistent performance from TdT between various nucleotides and more uniform extension of the initiator (Fig. 4B). The additives tested were glycerol, sucrose, PEG 8000, betaine, DMSO, Triton X-100, and Tween 20.
Each reaction was composed of 0.1μM 150617_LT2, 0.7X Enzymatics Green Buffer, 125μM of each dNTP, 1U/μL TdT, and the desired amount of the additive. The additives were glycerol at 27% (v/v), sucrose at 20 and 40% (w/v), PEG 8000 at 5 and 10% (w/v), betaine at 0.5 and 1M, DMSO at 5, 10, 20, and 30% (v/v), Triton X-100 at 0.01, 0.1, 0.5, and 1.0% (v/v), and Tween 20 at 0.01, 0.1, 0.5, and 1.0% (v/v). The reactions were carried out at room temperature for 20 minutes and then mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 10% Novex TBE-Urea gel.
The results show that glycerol, sucrose, and betaine have a detrimental effect, at least at the assayed concentrations, that Triton X-100 and Tween 20 slightly improve the results, and that PEG8000 and DMSO can have a positive effect on extension by TdT, leading to longer extensions with a more homogeneous size distribution. Further experiments showed that no additional benefit can be obtained by including both PEG8000 and DMSO in the reaction, nor does the lighter PEG3350 outperform PEG8000 (data not shown). Consequently, 0.1% (v/v) Triton X-100 and 10% (w/v) PEG8000 were chosen for use in later experiments.
To compare the performance of the TdT:apyrase mix in optimized buffer conditions with that in the standard conditions, side-by-side extension experiments were run using dCTP in the optimized buffer conditions as well as the standard condition (Fig. 4C). Each reaction was carried out in 20μL total volume. All reaction components but the dNTP were assembled in 16μL, while the dNTP was prepared in 4μL of water. The 16μL mix was composed such that, upon mixing with the 4μL dNTP solution, the following initial composition would be obtained: 0.7X Enzymatics Green Buffer or 0.7X Enzymatics Green Buffer with 10% PEG8000 and 0.1% Triton X-100, 0.1μM 150617_LT2 initiator, 1U/μL TdT, and 1mU/μL apyrase. The starting final concentrations of dCTP were 25, 50, 100, 200, or 400μM. To initiate the reaction, the 16μL mixture was added to a tube containing the 4μL dNTP sample and mixed immediately by pipetting. After mixing, each reaction was incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
The results clearly indicate enhanced extension capacity by TdT:apyrase in the optimized buffer conditions. This enhancement is largely due to PEG8000 increasing the effective concentrations of some reagents in the reaction.
Polymerase to initiator ratio
Consistent and reproducible extension of the initiator upon addition of various nucleotides in the presence of apyrase demands that TdT be at saturating concentrations relative to the initiator. Subsaturating levels of TdT can result in high extension variability, or in extension of less than the maximum possible fraction of initiators upon the addition of dNTPs. With the final composition of the reaction having taken shape, the levels of TdT that would be saturating relative to the commonly used initiator concentrations were examined.
For that, a series of otherwise identical reactions with increasing TdT levels was arranged (Fig. 5). Each reaction was carried out in 20μL total volume and assembled as two 10μL halves, with the first half containing dATP and TdT, and the second half containing apyrase and initiator. The two halves were composed such that, upon mixing, the following initial composition would be obtained: 1X Custom TdT Buffer (2X Custom Synthesis Buffer: 28 mM Tris-Acetate, 70 mM Potassium Acetate, 14 mM Magnesium Acetate, 0.2% Triton X-100, 20% (w/v) PEG 8000, pH 7.9) (based on above experiments), 0.1μM 150617_LT2+3C initiator, 50μM dATP, 1mU/μL apyrase, and 0.075, 0.15, 0.3, 0.6, 1.2, 2.4, 4.8, 9.6, or 19.3U/μL TdT. To initiate the reaction, the halves were mixed quickly by pipetting. The reaction was then incubated at room temperature for at least two minutes, after which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
These results show that, with 0.1μM initiator oligo, concentrations of TdT above 0.6U/μL are saturating; thus, 1U/μL TdT continued to be used in the reactions, so long as the initiator concentration does not surpass 0.1μM.
The experiments and optimizations above provide the following reaction conditions:
1X Custom Synthesis Buffer, 0.1μM (or less) initiator oligo, 1U/μL TdT (or more), and 1mU/μL apyrase.
Optimizing extension length for each nucleotide transition type
Thus far, the concentrations of enzymes, buffers, salts, and other additives in the polymerization reaction have been examined and optimized. Next, the substrates were examined: initiator and nucleotide triphosphate (dNTP).
The effect of the incoming dNTP monomer on TdT polymerization rate has been extensively studied; it was observed that different nucleotides have different binding and kinetic parameters (M. R. Deibel Jr, M. S. Coleman, Biochemical properties of purified human terminal deoxynucleotidyltransferase. J. Biol. Chem. 255, 4206-4212 (1980), K. I. Kato, J. M. Goncalves, G. E. Houts, F. J. Bollum, Deoxynucleotide-polymerizing enzymes of calf thymus gland. II. Properties of the terminal deoxynucleotidyltransferase. J. Biol. Chem. 242, 2780-2789 (1967), F. Grosse, A. Manns, Terminal deoxyribonucleotidyl transferase (EC 2.7.7.31). Methods Mol. Biol. 16, 95-105 (1993), B. Yang, K. N. Gathy, M. S. Coleman, Mutational analysis of residues in the nucleotide binding domain of human terminal deoxynucleotidyl transferase. J. Biol. Chem. 269, 11859-11868 (1994), M. J. Modak, Biochemistry of terminal deoxynucleotidyltransferase: mechanism of inhibition by adenosine 5'-triphosphate. Biochemistry, 17, 3116-3120 (1978)). Consequently, a detailed study of the dNTP monomer effect in the system was warranted. Furthermore, unlike with most template-dependent polymerases, the nucleotide composition of the initiator at the 3' end is also important (K. I. Kato, J. M. Goncalves, G. E. Houts, F. J. Bollum, Deoxynucleotide-polymerizing enzymes of calf thymus gland. II. Properties of the terminal deoxynucleotidyltransferase. J. Biol. Chem. 242, 2780-2789 (1967), F. Grosse, A. Manns, Terminal deoxyribonucleotidyl transferase (EC 2.7.7.31). Methods Mol. Biol. 16, 95-105 (1993)). This is because TdT operates in a distributive manner (M. R. Deibel Jr, M. S. Coleman, Biochemical properties of purified human terminal deoxynucleotidyltransferase. J. Biol. Chem. 255, 4206-4212 (1980), E. A. Motea, A. J. Berdis, Terminal deoxynucleotidyl transferase: the story of a misguided DNA polymerase. Biochim. Biophys. Acta. 1804, 1151-1166 (2010)); it does not remain bound to the nascent oligonucleotide and is not processive. It also shows no preference in binding order to the initiator or the dNTP monomer. While structural studies of TdT have not identified specific polar interactions between the bases of the initiator and the enzyme (M. Delarue et al., Crystal structures of a template-independent DNA polymerase: murine terminal deoxynucleotidyltransferase. EMBO J. 21, 427-439 (2002)), non-specific hydrophobic interactions do play a role. Specifically, Gouge and colleagues (J. Gouge, S. Rosario, F. Romain, P. Beguin, M. Delarue, Structures of intermediates along the catalytic cycle of terminal deoxynucleotidyltransferase: dynamical aspects of the two-metal ion mechanism. J. Mol. Biol. 425, 4334-4352 (2013)) observed that Leucine 398 of TdT inserts itself between the last two bases of the initiator in the pre- and post-catalytic states of the enzyme, forming hydrophobic interactions with the aromatic rings of the bases and disrupting their stacking in the process. The importance of this interaction was underscored by the significant deterioration of enzyme activity upon mutating Leucine 398 to Alanine (J. Gouge, S. Rosario, F. Romain, P. Beguin, M. Delarue, Structures of intermediates along the catalytic cycle of terminal deoxynucleotidyltransferase: dynamical aspects of the two-metal ion mechanism. J. Mol. Biol. 425, 4334-4352 (2013)). As this hydrophobic interaction will differ based on the nature of the last two nucleotides of the initiator, one would expect both the binding and kinetic parameters of the polymerization reaction to be affected by the identity of the last two nucleotides.
Experiments were therefore set up to simultaneously evaluate the effect of the incoming dNTP monomer and of the 3' end of the initiator on the performance of the TdT:apyrase reagent (Fig. 6). Extension experiments were performed using the TdT:apyrase mixture on four different initiators, each ending in a 3-nucleotide stretch of As, Cs, Gs, or Ts, using each of the four different dNTP monomers. The initiators were LT2+3A, LT2+3C, LT2+3G, and LT2+3T. Each reaction was carried out in 20μL total volume. All reaction components but the dNTP were assembled in 18μL, while the dNTP was prepared in 2μL of water. The 18μL mix was composed such that, upon mixing with the 2μL dNTP solution, the following initial composition would be obtained: 1X Custom Synthesis Buffer, 0.1μM initiator oligo, 1U/μL TdT, and 0.25mU/μL apyrase. The initial final concentration of dNTPs was varied at 2, 4, 8, 16, or 32μM for dCTP, dATP, and dTTP, or at 1, 2, 4, 8, or 16μM for dGTP. To initiate the reaction, the 18μL mixture was added to a tube containing the 2μL dNTP sample and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
These experiments show the importance of the initiator sequence in the efficiency of the TdT:apyrase mix in the initial extension of the initiator. The best results are obtained when the 3' nucleotides of the initiator are purines (As or Gs). Such initiators appear to be a good substrate for TdT and are extended completely in the TdT:apyrase reaction. The worst results are obtained with initiators ending in cytosine, which appear to be a poor substrate for the enzyme. This results in a less desirable outcome when initiators ending in Cs are being extended by anything but dCTP. In such circumstances, an initiator with an added monomer (A, T, or G) becomes a more favorable substrate than the unextended initiators, which end in Cs. As such, their subsequent extension in the same round far outpaces the extension of unextended initiators, resulting in a large size heterogeneity in extension, which is undesirable for the encoding schemes. These results suggest that the order preference for TdT in the 3' end of the initiator is GGG > AAA » TTT » CCC. These results were also used to select the nucleotide concentrations that result in a minimum amount of non-zero polymerization for each of the twelve possible transitions (Table S4).
Synthesizing a template sequence using optimized TdT:apyrase
Having optimized the reaction conditions for the TdT:apyrase reagent and determined the concentration required for non-zero polymerization for each incoming nucleotide based on the identity of the terminal nucleotide of the initiator, the performance of the optimizations to synthesize a template sequence was evaluated (Fig. 7).
Template sequences were synthesized using the TdT:apyrase mixture by cyclic addition of nucleotide triphosphates to the reaction. In one experiment, the template sequence GATGTAGA was synthesized (Fig. 7, left) and in another, the template sequence CGCACTCG was synthesized (Fig. 7, right). Each reaction was carried out in 100μL total volume and was mixed with 2μL of dNTP at 50X the desired final concentration. The 100μL mix consisted of: 1X Custom Synthesis Buffer, 0.1μM initiator oligo, 1U/μL TdT, and 0.25mU/μL apyrase. The initial final concentration of dNTP was 40μM for dATP, 200μM for dCTP, 20μM for dGTP, and 160μM for dTTP. To initiate the reaction, the 100μL mixture was added to a tube containing 2μL of the desired dNTP sample and mixed immediately by pipetting. After 1 minute of incubation at room temperature, a 2μL sample of the mix was taken to be run on a gel. The remaining 100μL was added to another tube containing 2μL of the next nucleotide, mixed and incubated as before, followed by collection of another 2μL sample for PAGE analysis. These steps were repeated for 8 cycles without washing, thereby extending the initiator with 8 different dNTPs while using the same enzymatic mix. Afterwards, each of the 2μL samples that were taken was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
Example III. Enzymatic DNA Synthesis for Digital Information Storage
This example describes a de novo enzymatic DNA synthesis strategy designed from the bottom up for information storage. In an exemplary embodiment, a template-independent DNA polymerase was harnessed for controlled synthesis of sequences with user-defined information. In one embodiment, retrieval of 144 bits, including addressing, from perfectly synthesized DNA strands using batch Illumina and real-time Oxford Nanopore sequencing was demonstrated. In another embodiment, a codec was then developed for data retrieval from populations of diverse but imperfectly synthesized DNA strands, each with ~30% error tolerance. With this codec, a kilobyte-scale design was experimentally validated which stores 1 bit per nucleotide. Simulations of the codec supported reliable and robust storage of information for large-scale systems.
In one embodiment, a de novo DNA synthesis strategy and a digital codec designed specifically for information storage are provided. Whereas DNA for biological functionality requires single-base precision and accuracy, these demands can be relaxed for DNA for digital information. For synthesis, a template-independent DNA polymerase was used, a protein evolved to rapidly catalyze the linkage of naturally occurring nucleotide triphosphates (dNTPs) under non-toxic biocompatible conditions. Information was encoded in transitions between non-identical nucleotides, rather than in single nucleotides. It was demonstrated that enzymatic synthesis and tailored computational tools provide robust information storage, as assessed using batch (Illumina) and real-time (Oxford Nanopore) sequencing. The presently disclosed enzymatic synthesis strategy is cheaper than phosphoramidite chemistry and may reduce reagent costs by orders of magnitude, facilitating the adoption of DNA as a storage medium.
Enzymatic DNA synthesis
According to one embodiment, the enzyme terminal deoxynucleotidyl transferase (TdT) is used. TdT is a template-independent DNA polymerase which rampantly and indiscriminately adds dNTPs to the 3' termini of DNA. As such, TdT is largely used in reactions where one nucleotide triphosphate is added to indeterminate lengths. In one embodiment, apyrase is leveraged, which degrades nucleotide triphosphates into their TdT-inactive diphosphate and monophosphate precursors. By competing with TdT for nucleotide triphosphates, apyrase effectively limits DNA polymerization. A mixture was thus created and optimized containing a tuned ratio of these two enzymes such that a nucleotide triphosphate is added at least once to each strand by TdT before being degraded by apyrase (Figs. 2A-2C and Fig. 5). The lowest nucleotide triphosphate concentrations required were determined such that adding a series of nucleotides results in stepwise increases in the length of synthesized DNA (Figs. 6-7).
The enzymatic synthesis strategy requires few components to rapidly polymerize DNA (Figs. 13A-13B). In certain embodiments, the core of the reaction contemplates a mixture of TdT, apyrase, and short oligonucleotide initiators. Upon addition of a nucleotide triphosphate, TdT extends the initiators until all added substrate is degraded by apyrase. The number of polymerized nucleotides is defined as the 'extension length'. Subsequent nucleotide triphosphates are added to continue the synthesis process. While the extension length for each added nucleotide triphosphate may vary, the resulting population of synthesized strands all share the same number and sequence of nucleotide transitions (Fig. 13B).
In certain embodiments, information was encoded as transitions between non-identical nucleotides (Fig. 13C). Given three possible transitions for each nucleotide, trits were used to maximize information capacity (trits are the ternary equivalent of bits; one trit is log(3)/log(2), about 1.58496, bits of information). To convert information to DNA, information in trits was mapped to a template sequence of non-identical nucleotides, starting with the last nucleotide of the initiator. Enzymatic DNA synthesis of each template sequence produced 'raw strands', or strandsR, which can be physically stored. To retrieve information stored in DNA, strandsR are sequenced and transitions between non-identical nucleotides extracted, resulting in 'compressed strands', or strandsC. If a strandC is equivalent to the template sequence, the strand (compressed or raw) is considered 'perfect' and the information is retrieved by mapping the sequence of non-identical nucleotides back to trits.
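The conversion between trits, template sequences, raw strands, and compressed strands described above can be illustrated with a minimal Python sketch. The actual mapping of Fig. 13C is not reproduced in this text, so the rule used here (each trit selects, in alphabetical order, one of the three nucleotides that differ from the previous nucleotide) is an assumption, as are the function names and the example raw strand, which is assumed to have had its initiator trimmed.

```python
from itertools import groupby

NUCLEOTIDES = "ACGT"

def trits_to_template(trits, initiator_end):
    """Map trits to a template of non-identical nucleotides.

    ASSUMPTION: trit t picks the t-th nucleotide (alphabetical order) among the
    three that differ from the previous nucleotide; Fig. 13C may use another rule.
    """
    template, prev = [], initiator_end
    for t in trits:
        choices = [n for n in NUCLEOTIDES if n != prev]  # three possible transitions
        prev = choices[t]
        template.append(prev)
    return "".join(template)

def compress(raw_strand):
    """Collapse each run of identical nucleotides to one (raw -> compressed)."""
    return "".join(base for base, _ in groupby(raw_strand))

def template_to_trits(template, initiator_end):
    """Invert trits_to_template under the same assumed mapping."""
    trits, prev = [], initiator_end
    for n in template:
        choices = [c for c in NUCLEOTIDES if c != prev]
        trits.append(choices.index(n))
        prev = n
    return trits

trits = [0, 1, 2, 0]
template = trits_to_template(trits, initiator_end="G")  # "AGTA"
raw = "AAGGGTTA"  # hypothetical raw strand with variable extension lengths
is_perfect = compress(raw) == template
recovered = template_to_trits(compress(raw), "G")
```

Because only the transitions carry information, the raw strand decodes to the original trits regardless of how many copies of each nucleotide were polymerized.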
In one embodiment, "hello world!," a message containing 96 bits of ASCII data (Fig. 1 A), was encoded and synthesized. This message was split into twelve individual 8-bit characters, and each character's bit representation was prefixed with a 4-bit address to denote its order. These 144 total bits of information, including addressing, were then expressed in trits and mapped according to nucleotide transitions (Fig. 13C), resulting in twelve eight-nucleotide template sequences (Table 1). All twelve template sequences (H01-H12) were synthesized in parallel on bead-conjugated initiators, with washing performed every two cycles. Following the last synthesis cycle, all strandsR were ligated to a universal adapter, PCR amplified, and stored as a single pool (Methods).
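The bit budget of this message can be checked with a short Python sketch. It is illustrative only: the zero-based address numbering, the address-before-data bit order, and the base-3 conversion are assumptions, and the actual template sequences are those of Table 1. The sketch shows why eight trits, and hence eight template nucleotides, suffice for each 12-bit word: 3^7 = 2187 is smaller than 2^12 = 4096, while 3^8 = 6561 is larger.

```python
def bits_to_trits(bits, n_trits):
    """Convert a bit string to a fixed-length list of base-3 digits (trits)."""
    value = int(bits, 2)
    trits = []
    for _ in range(n_trits):
        value, remainder = divmod(value, 3)
        trits.append(remainder)
    if value != 0:
        raise ValueError("bit string does not fit in the requested number of trits")
    return trits[::-1]  # most significant trit first

message = "hello world!"

# ASSUMPTION: 4-bit address (character index, starting at 0) followed by 8-bit ASCII.
words = [format(i, "04b") + format(ord(ch), "08b") for i, ch in enumerate(message)]

total_bits = sum(len(w) for w in words)         # 12 words x 12 bits
trit_words = [bits_to_trits(w, 8) for w in words]
```

Each entry of `trit_words` has eight trits, which matches the eight-nucleotide template length stated above.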
Data retrieval and error analyses
In one embodiment, Illumina sequencing was used to read out the synthesized strandsR and to assess the information stored in the corresponding strandsC (Methods).
Analysis of the sequencing data began with perfect strands. It was found that the extension length for each nucleotide varied based on the type of transition (Fig. 14B, Fig. 15, Table 5). As a result, perfectly synthesized strands for each template sequence may be of variable raw length. Additionally, when extension lengths were compiled for each nucleotide across strands and positions based on type of transition, it was observed that these lengths were qualitatively consistent between bead-conjugated (Fig. 14C) and freely diffusing initiators (Fig. 6). For example, the median extension length of C was among the lowest following A, T, or G. Conversely, the median extension lengths for A, T, and G were among the highest when following C. Considering all synthesized strands, stepwise increases in the median raw lengths were found with an increasing number of non-identical nucleotides (compressed strand length), indicating controlled polymerization for the population of strands over multiple cycles (Fig. 14D). However, compared to a median length of 30 nucleotides for all perfect strandsR, the median length for all synthesized strandsR was 26 bases, suggesting that not every strand polymerized the added nucleotide triphosphate in each cycle (Fig. 14E, Fig. 16).
To identify the types and magnitude of synthesis error in the system, all synthesized strandsC were aligned to their respective template sequences, and the numbers of missing, mismatched, and inserted nucleotides were tabulated (Fig. 14F, Fig. 17). While multiple alignments exist for several strandsC, which makes the exact position of errors ambiguous, the type and magnitude of error for each strandC as a whole can be distinguished. This analysis indicates that 9.5% of strandsC contained 1 or more mismatches, 10.7% contained 1 or more insertions, and 66.1% contained 1 or more missing nucleotides. Thus, the dominant type of error is missing nucleotides in strandsC, which corresponds to strands that did not polymerize an added nucleotide triphosphate during a synthesis cycle.
Importantly, information could be retrieved from the pool of synthesized DNA strandsC by applying a simple two-step in silico filter. As each template sequence is designed with a specific architecture (Methods), synthesized strandsC were first filtered for length and the presence of a terminal 'C'. By using this filter, the fraction of perfect strands for all template sequences (H01-H12) increased from an average of ~19% to an average of ~89% (Fig. 14G). The most abundantly synthesized strand variant in this subset was then selected to retrieve data.
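The two-step filter lends itself to a brief Python sketch. This is not the inventors' pipeline: it assumes the filter is applied to compressed strands, that the expected length equals the template length, and that reads have already been trimmed of initiator and adapter and grouped by template. The reads and the four-nucleotide template below are hypothetical.

```python
from collections import Counter
from itertools import groupby

def compress(raw_strand):
    """Collapse runs of identical nucleotides (raw strand -> compressed strand)."""
    return "".join(base for base, _ in groupby(raw_strand))

def two_step_filter(raw_reads, expected_length, terminal="C"):
    """Step 1: keep compressed strands of the expected length ending in `terminal`.
    Step 2: return the most abundant surviving variant (None if nothing survives)."""
    compressed = [compress(read) for read in raw_reads]
    kept = [s for s in compressed
            if len(s) == expected_length and s.endswith(terminal)]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]

# Hypothetical reads for a hypothetical 4-nucleotide template "GATC".
reads = ["GGAATTTC", "GAATCC", "GGTTC", "GATTC", "GACC", "GGCTTC"]
decoded = two_step_filter(reads, expected_length=4)
```

Reads with missing nucleotides ("GGTTC", "GACC") are removed by the length check, and the single mismatched variant ("GGCTTC") is outvoted in the abundance step.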
Finally, it was shown that quick access to information stored in DNA may be accomplished with real-time sequencing. While the Illumina platform sequences all DNA strands in parallel and reports the outcome in batch, the Oxford Nanopore platform offers asynchronous sequencing by translocation of DNA strands through independent nanopores, and streams the outcome. As a result, sequencing can be terminated as soon as data is retrieved and remaining reagents provisioned for later use.
To demonstrate the advantage of real-time information retrieval, DNA strandsR synthesized for H01-H12 were first sequenced using an entire MinION flowcell (Oxford Nanopore), and it was observed that the most abundant species, an average of 49.9% of filtered strandsC, were perfectly synthesized (Fig. 18A). This is largely consistent with results from Illumina sequencing, with the slight decrease likely due to errors currently inherent to state-of-the-art nanopore sequencing (M. Jain et al., Nanopore sequencing and assembly of a human genome with ultra-long reads. bioRxiv (2017), p. 128835). With these experimental results, simulations were further performed to determine the fraction of sequencing resources required for robust data retrieval from each of the twelve template sequences H01-H12 with at least 99.9% probability. Repeated trials were simulated which, at a given fraction of the total sequencing run, randomized the translocation time of each DNA strandR through the nanopore and assessed whether data could be retrieved (Methods). The simulations indicated that only half of the total sequencing resources are needed to robustly retrieve data from DNA using Oxford Nanopore compared to Illumina (Fig. 14H, Fig. 18B, Table 2).
More broadly, nanopore sequencing can enable faster and more efficient information retrieval from strands synthesized with the enzymatic strategy. Currently, DNA translocation rates are slowed through nanopores for accurate single-base sequencing. This rate may be increased since it is, in principle, easier to detect transitions between non-identical nucleotides, each with extension lengths greater than one (D. Fologea, J. Uplinger, B. Thomas, D. S. McNabb, J. Li, Slowing DNA translocation in a solid-state nanopore. Nano Lett. 5, 1734-1737 (2005); M. Vega, P. Granell, C. Lasorsa, B. Lerner, M. Perez, Automated and inexpensive method to manufacture solid-state nanopores and micropores in robust silicon wafers. J. Phys. Conf. Ser. 687, 012029 (2016); B. McNally et al., Optical recognition of converted DNA nucleotides for single-molecule DNA sequencing using nanopore arrays. Nano Lett. 10, 2237-2244 (2010); R. dela Torre, J. Larkin, A. Singer, A. Meller, Fabrication and characterization of solid-state nanopore arrays for high-throughput DNA sequencing. Nanotechnology. 23, 385308 (2012)). Furthermore, through selective sequencing, nanopores could reject strands corresponding to already recovered sequences in favor of strands for remaining template sequences (M. Loose, S. Malla, M. Stout, Real-time selective sequencing using nanopore technology. Nat. Methods, 13, 751-754 (2016)). Such an approach may be accomplished by detection of each strand's address. These alternative design parameters likely will inspire the development of sequencing technologies that are faster, more affordable, and specifically designed for DNA information storage.
Coded strand architecture
It has been established that data can be stored in enzymatically synthesized DNA and retrieved by in silico filtering for perfectly synthesized DNA strands. However, perfect strandsC may not be required for data retrieval; imperfectly synthesized strandsC may be used to reconstruct template sequences if nucleotide errors occur in different locations. It was thus sought to develop a codec for robust data retrieval which leverages the diversity of imperfectly synthesized strandsC for template sequence reconstruction. The core of the codec relies on three elements: (i) a coded strand architecture which includes synchronization nucleotides to facilitate error localization, (ii) sufficiently diverse strandsC produced by synthesis, and (iii) sequence reconstruction from strandsC with statistical inference based on mathematical models of synthesis. The presently disclosed codec models information storage in DNA as a communications channel to enable correction of errors accumulated from synthesis, storage, and sequencing (Fig. 19A).
A key feature of the presently disclosed codec is the addition of synchronization nucleotides which are interspersed between information-encoding nucleotides (Fig. 19B). These nucleotides act as a scaffold to aid reconstruction of a template sequence from imperfectly synthesized DNA strands that may contain errors as a result of missing, mismatched, and inserted nucleotides. As an example, consider a template sequence of 8 nucleotides (CTCGTGCT) and two synthesized DNA strandsC (CTCTGC and TCGTCT), each with two missing nucleotides. Without a scaffold, data cannot be retrieved since three equally valid reconstruction sequences are possible. In contrast, a scaffold constrains the number of possible sequences to one, allowing data retrieval from otherwise unusable DNA strandsC. Accordingly, the codec includes a module for encoding information in template sequences which incorporates synchronization nucleotides.
To reconstruct missing nucleotides from strandsC by scaffolding, the population of synthesized DNA strands for a desired sequence must be sufficiently diverse. That is, if the same nucleotide is missing systematically across all strands, then it cannot be retrieved without additional forms of error correction. The diversity generated from the synthesis process was thus analyzed by synthesizing a longer 16-nucleotide template sequence (called E0), which contains 12 unique transitions between nucleotides to mitigate ambiguous alignments (Fig. 19C). In silico size selection was performed for strandsR ranging from 32 to 48 bases in length, assuming that each of the 16 template nucleotides was synthesized with an extension length of two to three bases (Fig. 20A). This purified set was analyzed by aligning the corresponding strandsC to the E0 template, and it was observed that missing nucleotides were predominant, in line with the previous analyses, but could occur in different positions (Fig. 19C, Fig. 19D, Fig. 20B).
Next, diversity was assessed by analyzing the lengths, number of variants, and Levenshtein edit distances of strandsC from the purified set (Fig. 19D, Fig. 20C). It was observed that the median strandC length was 12 nucleotides and the maximal number of variants occurred at this length. The Levenshtein edit distance was also calculated (V. I. Levenshtein, in Soviet physics doklady (1966), vol. 10, pp. 707-710), which summarizes the number of single-nucleotide edits required to repair a strandC to the desired E0 sequence. The median edit distance for these variants was four, indicating that synchronization nucleotides could be placed approximately every three or four nucleotides to recall missing strand nucleotides from diversely synthesized strands. It was thus set out to reconstruct a template sequence from a population of diverse but imperfect strandsC using statistical inference and mathematical models.
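For readers unfamiliar with the metric, the Levenshtein edit distance can be computed with the standard dynamic-programming recurrence, sketched below in Python. Because the E0 sequence is not reproduced in this text, the sketch uses the eight-nucleotide example template and the two imperfect strands given earlier in this section; the function name is illustrative.

```python
from statistics import median

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]

template = "CTCGTGCT"
strands = ["CTCTGC", "TCGTCT"]   # each is missing two template nucleotides
distances = [levenshtein(s, template) for s in strands]
median_distance = median(distances)
```

Both example strands are two edits away from the template, corresponding to their two missing nucleotides.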
For efficient reconstruction, a statistical framework known as maximum a posteriori (MAP) estimation was adapted (J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning (Springer series in statistics New York, 2001), vol. 1). To utilize this framework, a Markov model was built to describe the synthesis of a strandC with error probabilities for mismatches, insertions, and missing nucleotides, derived from the analyses of the purified set of E0 strandsC (Fig. 23A). These state probabilities can be used to score all possible reconstruction solutions consistent with a scaffold, considering mismatches and insertions in addition to missing nucleotides (Fig. 24). The calculations provide a probability of occurrence for each nucleotide at each position. Ultimately, a consensus can be obtained to indicate the most probable nucleotide per position, which ideally yields a reconstructed sequence that is equivalent to the template sequence.
Finally, it was sought to experimentally verify the codec design by encoding and synthesizing the message "Eureka!" as four template sequences, E1-E4 (Figs. 25A-25E). Each template sequence contained a 2-bit address to delineate its order, and 14 bits of data. These 16 bits are encoded in a template sequence of 16 nucleotides, which includes four synchronization nucleotides, resulting in 1 bit stored per nucleotide (Fig. 22B). Sequences E1-E4 carry a total of 64 bits of information including addressing, and were synthesized in parallel on beads with a wash every cycle. Following the last synthesis cycle, strands were ligated to a universal adapter, PCR amplified, and stored as a single pool.
The stored message "Eureka!" was reconstructed by using only implicit error correction provided by the diversity generated from enzymatic synthesis. In silico size selection was performed for all strandsR of length 32-48 nucleotides (Fig. 26). This set of 4521 purified strandsR contained 3 perfect strands (Fig. 28B). It was then attempted to reconstruct template sequences by MAP estimation with scaffolding and probabilistic consensus. It was found that 10 strand variants, each with an error tolerance of ~30% as a result of missing an average of 4 or 5 out of 16 nucleotides, could accurately reconstruct a template sequence (Figs. 34A-34H). The number of sequencing reads required for a 90% probability of data retrieval was then assessed. It was found that all four template sequences were robustly reconstructed with 200, 150, 500, and 100 sequencing reads for E1-E4 respectively, with a median of 175 reads (Fig. 25B). Sequence E3 required the most sequencing reads for reconstruction, as its synthesized strands contained one extra edit on average in comparison to synthesized strands for the other template sequences (Figs. 27A-27B and Figs. 28A-28B). It was also found that MAP estimation was a more robust decoding algorithm than the previous two-step filter for H01-H12, requiring fewer reads for data retrieval (Figs. 34A-34H). These results show that the codec can accurately reconstruct data without requiring perfectly synthesized DNA strands.
Scalable Codec for DNA Information Storage
The experimental results demonstrate that byte- and kilobyte-scale storage systems can be achieved if a sufficient number of strands are synthesized (Fig. 35A). In one embodiment, the "hello world!" experiment stored 12 bits per template sequence. This is sufficient for a 256-byte maximum storage system where 11 bits are used for addressing 2,048 total template sequences, each with 1 bit of data. In contrast, the "Eureka!" experiment stored 16 bits per template sequence. This allows for a 4-kilobyte maximum storage system, where 15 bits are used for addressing 32,768 total template sequences, each with 1 bit of data (Table 7).
The scalability of the DNA storage codec was next assessed for gigabyte- and petabyte-scale storage through simulation, assuming that the requisite number of DNA strands for each could be produced. Increased storage capacity requires more nucleotides per template sequence for additional address space, synchronization nucleotides, and data. In one embodiment, 36 bits were stored, including data and address, in a 74-nucleotide template sequence and, similarly, 57 bits in a 152-nucleotide template sequence to simulate gigabyte- and petabyte-scale systems, respectively (Fig. 35A). For simulations, randomly generated data were partitioned and mapped to template sequences, and synthesized strands were generated in silico using the Markov model for a wide range of synthesis accuracies (Methods). An additional error-correction code (ECC) was provided to each template sequence to ensure accurate reconstruction. It was found that data could be accurately retrieved from the simulated synthesized strands with the decoding pipeline (Figs. 37A-37F). These efficiency rates, calculated as bits stored per template nucleotide, may be considered competitive with those demonstrated in prior systems, considering that the bottleneck for attaining large-scale storage capacities is the massive parallelization of affordable synthesis reactions. The efficiency rates can be increased towards the theoretical maximum of ~1.58 bits for transitions between non-identical nucleotides, given improvements to synthesis accuracy (Figs. 36A-36F).
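As a rough check of the efficiency rates discussed above, the following sketch computes bits stored per template nucleotide for the designs named in the text and compares each with the ceiling of log2(3) bits per transition. The dictionary labels and variable names are illustrative only and do not come from the disclosure.

```python
import math

# (bits stored incl. address, template nucleotides), as quoted in the text.
designs = {
    "Eureka! (kilobyte-scale)": (16, 16),
    "gigabyte-scale simulation": (36, 74),
    "petabyte-scale simulation": (57, 152),
}

# Each transition must move to one of the 3 other nucleotides,
# so a transition can carry at most log2(3) bits.
max_rate = math.log2(3)

rates = {name: bits / nts for name, (bits, nts) in designs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.3f} bits/nt ({rate / max_rate:.0%} of {max_rate:.2f})")
```

The gap between these rates and the ceiling reflects the nucleotides spent on synchronization and error correction rather than on data.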
To assess robustness of the digital codec, repeated reconstruction trials were performed with many sets of strands synthesized in silico, and the probability of data retrieval was measured (Methods). The simulations indicate that if at least 10 unique strand variants per sequence are available, then each variant could tolerate on average ~30% missing strand nucleotides (Fig. 35B). Qualitatively similar results were found when simulated strands also included mismatch and insertion rates exceeding those observed experimentally, further illustrating the robustness of the codec (Figs. 37A-37F). Notably, the experimental and simulated results indicate that data retrieval does not require perfectly synthesized strands.
The codec is able to resolve several types of errors, including missing nucleotides in synthesized strands, which would otherwise drastically reduce information storage capacities (M. C. Davey, D. J. C. Mackay, Reliable communication over channels with insertions, deletions, and substitutions. IEEE Trans. Inf. Theory. 47, 687-698 (2001); M. Mitzenmacher, A survey of results for deletion channels and related synchronization channels. Probab. Surv. 6, 1-33 (2009)). The comprehensive codec architecture consists of encoding and decoding frameworks to extract information from diversely synthesized DNA strands (Fig. 35C, Figs. 39A-39B). The encoder consists of several core components: (i) partitioning of data into ordered rows of bits; (ii) prefixing of rows with addresses; (iii) error correction per row of bits via an error-correction code (ECC) per template sequence (e.g., Bose-Chaudhuri-Hocquenghem code), and error correction per block of rows via a block ECC (e.g., Reed-Solomon or Fountain code); and (iv) modulation to map rows of bits to template sequences. All template sequences are subsequently synthesized enzymatically, resulting in a population of diverse DNA strands. Strands are read out by sequencing and corresponding strands are input to a decoder. The crucial first step of the decoding pipeline is MAP estimation aided by scaffolding, followed by probabilistic consensus. Multiple subsets of strands can be used for sequence reconstruction. Each reconstructed sequence need not be identical to the template sequence. After demodulation of the reconstructed sequence, the resulting bit sequence can be corrected by bit-level ECCs in the decoding pipeline to reinforce error-free data retrieval. Overall, the design harnesses the diversity of enzymatically-synthesized DNA strands and supports a flexible-write approach to provide a functional and robust storage system.
Continued Improvements
The results show that information can be stored accurately in imperfectly synthesized DNA strands. The present disclosure further contemplates additional improvements and design optimizations. Assuming only size-selected strands are stored for the kilobyte-scale design, such an implementation likely incurs a 6-fold loss in volumetric density of information. This reduction is the product of two factors: an extension length of up to three bases per transition (~3-fold loss, Fig. 26) and an efficiency rate of storage of 1 bit per template nucleotide (~2-fold loss). This density loss is mild considering DNA's thousand-fold advantage over the projected fundamental density limit of flash drives, and may be addressable. The efficiency rate of storage may be increased as synthesis accuracy improves. Improved accuracy will also enable provisioning of TdT's ability to add ~500 (Fig. 25B) to thousands of nucleotides per strand to increase the number of strand nucleotides for increased storage capacities. On the other hand, extension lengths per template nucleotide may be considered a design optimization and tuned according to application demands, trading density for read-out speed and cost by specialized nanopore sequencing (S. M. H. T. Yazdi, S. Hossein Tabatabaei, R. Gabrys, O. Milenkovic, Portable and Error-Free DNA-Based Data Storage. Sci. Rep. 7 (2017), doi:10.1038/s41598-017-05188-1).
Currently, DNA for information storage is synthesized in a high-density array format with proprietary machines. The presently disclosed bead-based process was thus translated to a 2D array-based platform (Figs. 40A-40E). In one embodiment, this prototype produced perfectly synthesized strands for each of the three 13-nucleotide template sequences tested herein. Analyses of the synthesized strands indicate similar error and diversity profiles to those observed using the bead-based process, indicating that the codec could be used to store information in DNA synthesized with this platform (Fig. 41, Figs. 42A-42B and Figs. 43A-43B). Synthesis accuracy can be further improved by additional process engineering, e.g., more stringent washing per cycle that reduces carryover of nucleotide triphosphates from previous cycles to further diminish the rate of substituted strand nucleotides. Optimization of reaction conditions to improve mixing, or the use of more processive, rather than distributive, TdT mutants may reduce the rate of missing strand nucleotides (M. A. Jensen, R. W. Davis, Template-Independent Enzymatic Oligonucleotide Synthesis (TiEOS): Its History, Prospects, and Challenges. Biochemistry. 57, 1821-1832 (2018); E, W. J, S. J. E, MODIFIED TEMPLATE-INDEPENDENT ENZYMES FOR POLYDEOXYNUCLEOTIDE SYNTHESIS. World Patent (2016); J. W. Efcavitch, Methods and apparatus for synthesizing nucleic acid. US Patent 9,771,613 (2017)). Automation and parallelization to increase DNA production for large-scale data storage can be improved by additional efforts.
The enzymatic DNA synthesis strategy disclosed herein is advantageous in speed and cost relative to phosphoramidite chemistry. In one embodiment, assuming implementation on the same microarray instrument, reagent costs were compared for both processes as a function of feature size (reagent volume) (Figs. 44A-44B, Table 6). The analyses indicate that the enzymatic synthesis strategy could already be cheaper as a drop-in replacement for phosphoramidite chemistry when using existing automation which synthesizes DNA strands in 15-30-micron features (Figs. 44A-44B). Further miniaturization, together with reductions in enzyme cost through recycling, provides a potential roadmap for overall reduction in reagent costs by several orders of magnitude (Figs. 44A-44B). In addition, the increased rate of enzymatic catalysis over chemical coupling and a lack of blocking moieties may shorten the synthesis cycle times compared to phosphoramidite chemistry, reducing write time and equipment amortization time (Table 6).
Aspects of the present disclosure are directed to an enzymatic synthesis strategy and tailored coding architecture for robust information storage in DNA. This storage solution is an alternative to prior studies which utilized phosphoramidite chemistry to produce DNA for information storage. This approach offers potentially dramatic benefits to the cost and speed of synthesis and sequencing without requiring single-base accuracy. Additionally, this approach may alleviate biosecurity concerns associated with widespread DNA synthesis of genetic information, as genes are unlikely to be produced with this strategy. While this work illustrates DNA information storage in vitro, it could provide a foundation for development of de novo molecular recording systems in vivo (B. M. Zamft et al., Measuring cation dependent DNA polymerase fidelity landscapes by deep sequencing. PLoS One. 7, e43876 (2012); G. Church, A. Marblestone, R. Kalhor, in The Future of the Brain, G. Marcus, J. Freeman, Eds. (Princeton University Press, 41 William Street Princeton, New Jersey 08540 USA, 2014); S. L. Shipman, J. Nivala, J. D. Macklis, G. M. Church, Molecular recordings by directed CRISPR spacer acquisition. Science. 353, aaf1175 (2016); A. H. Marblestone et al., Rosetta Brains: A Strategy for Molecularly-Annotated Connectomics (2014)). Further technological achievements, through industrial-grade automation, refined enzymatic reactions, and advanced coding systems, will improve the efficiency and accessibility of this platform and inform the design of a complete read and write system to advance large-scale DNA information storage.
Example IV. Additional Materials and Methods
The "hello world!" experiment
Encoding via nucleotide transitions
According to one embodiment, the phrase "hello world!" was converted to decimal ASCII and then to ternary as shown in Table 1.
Initiator immobilization on carboxyl beads
The initiator oligo (5Am12-fSBS3-acgtactgag: /5AmMC12/TTTTTTTTTT UCTACACTCTTTCCCTACACGACGCTCTTCCGATCTACGTACTGAG) was immobilized on 5.28 micron carboxyl polystyrene beads (Spherotech CP-50-10) using carbodiimide conjugation. To do so, 5mg beads were washed twice in 100mM MES buffer pH 5.2 and resuspended in 100μL of the same buffer. The oligo, 5Am12-fSBS3-acgtactgag, was resuspended at 100μM in water. A 1.25M batch of EDC was prepared by dissolving 120mg EDC (Sigma E1769, from -20°C storage) in 500μL of 100mM MES pH 5.2. 40μL of the 1.25M EDC batch was mixed with 30μL (3nmole) of the 5Am12-fSBS3-acgtactgag oligo and 30μL of 100mM MES pH 5.2, added to the beads, and mixed by vortexing for 10 seconds. The suspension was rotated at room temperature overnight. After incubation overnight, the beads were washed three times with 1mL buffer containing 250mM Tris pH 8 and 0.01% Tween 20, each time rotating at RT for 30 min. The beads were then resuspended in 500μL Tris-EDTA buffer with 0.01% Tween 20 and stored at 4°C until use.
Synthesis
According to one embodiment, for each character of "hello world!" the ASCII decimal (data) was converted to base 2 (for binary, 8 bits) or to base 3 (for ternary, 5 trits). Similarly, the addresses were converted from a decimal value to base 2 (for binary, 4 bits) or base 3 (for ternary, 3 trits). Addresses were concatenated to data to form a resulting string of 12 bits or 8 trits. A custom Python script was used to map trits to template sequences H01-H12 shown in Table 1.
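The conversion described above can be sketched as follows. This is not the custom script referred to in the text: the helper `to_base`, the zero-based addresses, and the address-before-data ordering are assumptions made for illustration, and the mapping of trits to the nucleotide sequences of Table 1 is not reproduced.

```python
def to_base(value, base, width):
    """Return `value` as a fixed-width digit string in the given base."""
    digits = []
    for _ in range(width):
        value, digit = divmod(value, base)
        digits.append(str(digit))
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))

message = "hello world!"

rows = []
for address, char in enumerate(message):    # zero-based addressing is an assumption
    code = ord(char)                         # ASCII decimal value of the character
    binary = to_base(address, 2, 4) + to_base(code, 2, 8)     # 4 + 8 = 12 bits
    ternary = to_base(address, 3, 3) + to_base(code, 3, 5)    # 3 + 5 = 8 trits
    rows.append((char, code, binary, ternary))

for row in rows:
    print(*row)
```

Five trits cover values up to 242 and three trits cover addresses up to 26, so both widths are sufficient for 7-bit ASCII and twelve rows.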
Nucleotide triphosphates (Invitrogen) were prepared at the following concentrations: 8mM dATP, 4mM dCTP, 4mM dGTP, and 16mM dTTP. For each template sequence (Table 1), the required dNTP volumes corresponding to each transition type were dispensed (Table 3) in a 96-well PCR plate (VWR) using a Mantis liquid handler (Formulatrix), which has a minimum dispense volume of 0.2μL. Once the dNTPs were loaded, 30μg of initiator-conjugated polystyrene beads for each of the twelve template sequences were suspended in an enzymatic reaction mix, composed of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with 1U/μL TdT (Enzymatics) and 1mU/μL apyrase (NEB). For each synthesis cycle, beads suspended in the enzymatic reaction mix were exposed to dNTPs by transfer to the subsequent well with a multichannel pipettor. Reactions were incubated at room temperature for one minute. Every 2 cycles, the beads were collected by centrifugation (3 minutes at 1310g), washed with 1x Custom Synthesis Buffer without PEG, collected by centrifugation, and resuspended in fresh enzymatic mix. Following addition of the last dNTP for each sequence, a poly-deoxycytidylate (poly-C) tail was synthesized by addition of 1μL of 1.6mM dCTP (32μM final) to the enzymatic mix to enable efficient ligation. Afterwards, beads were collected by centrifugation, then washed with 10mM Tris-HCl pH8.0 with 0.1% Triton X-100, and resuspended in 10μL of the same buffer.
Ligation of universal 3' adapter and PCR amplification.
A universal adapter was ligated to the 3' end of the synthesized strands using a hybridization-based strategy as previously described (C. K. Kwok, Y. Ding, M. E. Sherlock, S. M. Assmann, P. C. Bevilacqua, A hybridization-based approach for quantitative and low-bias single-stranded DNA ligation. Anal. Biochem. 435, 181-186 (2013)). The 5P-rSBS9-GGG adaptor (/5Phos/AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC T/ideoxyU/CCGATCT GGG/3SpC3/) forms a hairpin with a 5' polyG overhang which hybridizes to single-stranded DNA strands ending in polyC. The beads carrying synthesized DNA with a polyC tail were resuspended in a reaction composed of 1μM 5P-rSBS9-GGG adaptor, 1X NEB T4 DNA Ligase Buffer, 20% PEG 8K, 500mM Betaine, and 6 units of T4 DNA Ligase (Enzymatics). The ligation mixture was incubated at 16°C overnight. After ligation, beads were washed twice with 100μL of 10mM Tris-HCl pH8.0 with 0.1% Triton X-100 and resuspended in 100μL of 10mM Tris-HCl pH8 with 0.01% Tween 20.
5μL of each sample was amplified with the following primers in a 10μL reaction for 15 cycles.
tSBS3 CTACACTCTTTCCCTACACGAC
tSBS9 GTGACTGGAGTTCAGACGTG
Each sample was column purified after amplification.
Illumina MiSeq sequencing
To add the complete Illumina sequencing adapters, amplified strands were diluted and used as a template for a PCR reaction with NEBNext Dual Indexing Primers. Each strand received a different index by real-time cycle-limited PCR for 15 cycles. Barcoded strands were then combined and sequenced single-end using Illumina MiSeq v2 150 Micro. Sequencing was done in one direction, starting from the forward primer in each sample, for 150 bases.
Oxford Nanopore sequencing
For Oxford Nanopore sequencing, 1μL of each Illumina-barcoded strand was diluted 100-fold in Tris-HCl pH 8.0 with 0.01% Tween-20 and amplified with nested primers, comprising a barcoding primer pair, LWB 01-12 from SQK-LWB001 (Oxford Nanopore), and 50nM of primers PR2-P5 and 3580F-P7 for 10 cycles. 5μL of each strand was then pooled (60μL total) and cleaned with 90μL of Agencourt Ampure XP beads according to the manufacturer's protocol. 1μL of the pooled library was diluted with 9μL of 10mM Tris-HCl pH8 with 50mM NaCl (10μL total). 1μL of Rapid 1D Sequencing Adapter (Oxford Nanopore) was added, flicked, and incubated for 5 minutes at room temperature. 11μL of this presequencing mix (PSM) was combined with 30.5μL of Running Buffer with Fuel Mix (RBF), 26.5μL of library loading beads (LLB), and 7μL of water, added to a R9.4 flow cell, and run with SQK-LWB001 settings for 48 hours of sequencing.
PR2-P5
TTTCTGTTGGTGCTGATATTGC (Forward)
AATGATACGGCGACCACCGA (Reverse)
3580F-P7
ACTTGCCTGTCGCTCTATCTTC (Forward)
CAAGCAGAAGACGGCATACGA (Reverse)
Analysis of synthesized strands and data retrieval
For the Illumina dataset, sequences from demultiplexed reads were first trimmed with cutadapt 1.9.1 (M. Martin, Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 17, 10-12 (2011)), with an error tolerance up to 10%, to remove the 5' initiator oligo sequence and the 3' universal oligo sequence. Only reads containing both sequences for trimming were retained for further analysis. Custom Perl or Python scripts were used to process these trimmed reads. Sequences of non-identical nucleotides were extracted from each read by run-length compression (Run-length encoding - Rosetta Code, available at https://www.rosettacode.org/wiki/Run-length_encoding) and the occurrence of each unique sequence was tabulated. To determine the type of synthesis errors, each strand was aligned to its respective sequence using the Needleman-Wunsch algorithm (S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970)) implemented as pairwise2 in Biopython 1.70, with match, mismatch, gap-open and gap-extension scoring set as 2, -3, -5 and -5 respectively. For each alignment, the number of mismatches, insertions and missing nucleotides were tabulated. For data retrieval, a two-step filter was used. The first step is to filter compressed strands for the designed number of nucleotides, which contain a terminal 'C' used for ligation. Eight of the twelve template sequences, specifically H01, H02, H04, H08, H09, H10, H11, and H12, have 9 nucleotides to be synthesized. In contrast, four of the twelve template sequences, specifically H03, H05, H06, and H07, contain only 8 nucleotides to be synthesized. The second step is to select the most frequently synthesized compressed strand variant.
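The run-length compression and two-step filter described above might be sketched as below. This is an illustration rather than the custom scripts used in the study: the function names, the toy reads, and the convention that `designed_length` counts the terminal 'C' are all assumptions.

```python
from collections import Counter
from itertools import groupby

def compress(read):
    """Run-length compress a read: keep one base per run of identical bases."""
    return "".join(base for base, _ in groupby(read))

def two_step_filter(reads, designed_length):
    """Return the most frequent compressed strand of the designed length ending in 'C'."""
    counts = Counter(compress(r) for r in reads)
    # Step 1: keep compressed strands of the designed length with a terminal 'C'.
    kept = {s: n for s, n in counts.items()
            if len(s) == designed_length and s.endswith("C")}
    if not kept:
        return None
    # Step 2: select the most frequently observed variant.
    return max(kept, key=kept.get)

# Hypothetical trimmed reads for a 5-nucleotide compressed design.
reads = ["AAGGTTTACC", "AGGTAC", "AGTTAACCC", "ACCTTAC", "AGAC", "AGTAT"]
print(two_step_filter(reads, designed_length=5))
```

In this toy input, three reads compress to the same strand, one is a substituted variant of the right length, and two are rejected by the first step for length or for lacking the terminal 'C'.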
For the Oxford Nanopore dataset, base calling and demultiplexing were performed with Albacore 1.2.6 using the configuration file for 1D reads at 450bp per second using the R9.4 chemistry to match the flowcell FLO-MIN106 and kit SQK-LWB001. Demultiplexing was further verified with Porechop 0.2.2 (R. Wick, Porechop) with default settings for quality control. The resulting reads were trimmed with cutadapt as described above, except with an empirically determined increased error tolerance to compensate for the higher error rate observed for nanopore sequencing. Strands can be sequenced in either orientation. Accordingly, it was found that an error tolerance of 25% resulted in no more than 50% of strands being trimmed. Only reads which presented both sequences for trimming were retained for further analysis. Reads in the opposite orientation were not processed. Data retrieval for each sequence was performed as above for Illumina sequencing with a two-step filter. Real-time data reconstruction with nanopore sequencing reads was simulated by applying the two-step data retrieval filter to a subsampled number of shuffled sequencing reads obtained up to a given time point. The 48-hour sequencing run was split into 2-hour increments. For each increment, the timestamps for all reads obtained during the entire sequencing run were shuffled, and the number of reads corresponding to the total elapsed sequencing time up to the given increment was randomly sampled. The probability of correct retrieval was assessed by performing 10,000 decoding trials for each increment, and each time interval was expressed as a fraction of total sequencing time.
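The subsampling procedure above can be approximated by a Monte Carlo estimate such as the one sketched here. It is a simplification under stated assumptions: the pool is hypothetical, the strands are taken to be already compressed and length-filtered (so only the second filter step is applied), and read timestamps are replaced by a plain read count per trial.

```python
import random
from collections import Counter

def most_frequent(strands):
    """Second filter step: the most frequently observed compressed strand."""
    return Counter(strands).most_common(1)[0][0]

def retrieval_probability(strands, template, n_reads, trials=10_000, seed=0):
    """Fraction of trials in which a random subsample decodes to the template."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(strands, n_reads)   # sampling without replacement
        hits += most_frequent(sample) == template
    return hits / trials

# Toy pool of compressed, length-filtered strands (hypothetical data).
template = "AGTAC"
pool = ["AGTAC"] * 60 + ["AGTGC"] * 25 + ["ACTAC"] * 15

for n_reads in (1, 5, 20, 100):
    p = retrieval_probability(pool, template, n_reads, trials=1_000)
    print(f"{n_reads:>3} reads: P(correct) = {p:.2f}")
```

Increasing `n_reads` plays the role of elapsed sequencing time: more reads per trial give the correct variant a better chance of being the most frequent one.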
"Eureka!" experiment
Codec implementation
Methods of encoding and decoding, including sequence reconstruction from synthesized strands, were implemented according to specified and listed mathematical equations described herein. Encoding and decoding pipelines were implemented partly using the C++ programming language, compiled via a g++ compiler on an Ubuntu Linux operating system, and partly via specialized MATLAB (Mathworks) functions.
Encoding and decoding of "Eureka!"
The message "Eureka!", consisting of 7 ASCII characters, equivalent to 56 bits of payload data, was encoded into 4 template sequences E1-E4, each containing 16 nucleotides. The encoding steps consisted of data partitioning, addressing, and modulation of bit sequences to nucleotide sequences with no repeated bases (i.e., self-transitions). Modulation included the placement of synchronization nucleotides within DNA sequences as described herein. In addition to E1-E4, sequence E0 was specified and designed for purposes of error analysis. After enzymatic synthesis of E0 and E1-E4, decoding consisted of sequence reconstruction from run-length compressed DNA strands via MAP estimation and consensus. Reconstructed E1-E4 DNA sequences were demodulated into bit sequences, and data were extracted by ordering according to addresses.
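The partitioning and addressing steps, and the final reordering by address, can be illustrated with the short sketch below. It assumes 8-bit ASCII with the most significant bit first, zero-based 2-bit addresses, and an address prefix; modulation to nucleotides with synchronization positions is not reproduced here.

```python
message = "Eureka!"
bits = "".join(f"{ord(c):08b}" for c in message)    # 7 characters -> 56 bits

DATA_BITS, ADDRESS_BITS = 14, 2                     # per template sequence, from the text

# Encoding side: split into 14-bit payloads and prefix each with its address.
rows = []
for address in range(len(bits) // DATA_BITS):
    payload = bits[address * DATA_BITS:(address + 1) * DATA_BITS]
    rows.append(f"{address:0{ADDRESS_BITS}b}" + payload)

# Decoding side: order rows by address, strip addresses, regroup into bytes.
ordered = sorted(reversed(rows), key=lambda r: int(r[:ADDRESS_BITS], 2))
recovered_bits = "".join(r[ADDRESS_BITS:] for r in ordered)
recovered = "".join(chr(int(recovered_bits[i:i + 8], 2))
                    for i in range(0, len(recovered_bits), 8))

print(rows)
print(recovered)
```

The rows are deliberately reversed before sorting to show that the addresses, not the input order, determine reassembly.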
Initiator immobilization on magnetic beads
The initiator oligonucleotide Bio-U-LT2 was conjugated to streptavidin beads (Invitrogen) according to manufacturer instructions at 20% binding capacity, and Biotin-14-dCTP was used to bind remaining free streptavidin. Blank beads, which have free streptavidin bound by Biotin-14-dCTP, were also prepared. Prior to use, the initiator-conjugated beads were diluted 10-fold with blank beads and washed with 1x Custom Synthesis Buffer without PEG.
Synthesis, sequencing, analysis, and decoding
Synthesis of E0-E4 was performed similarly as described above. However, Bromo-dCTP was used instead of dCTP (Figs. 12A-12B) and concentrations of each dNTP regardless of transition type were fixed. The final concentrations of dNTPs for each cycle were as follows: 10μM dATP, 15μM Bromo-dCTP, 5μM dGTP, and 15μM dTTP. As above, a series of dNTPs were dispensed for each nucleotide of the template sequence in a 96-well PCR plate. Once the dNTPs were loaded, 10μg of initiator-conjugated magnetic beads were suspended in the enzymatic reaction mix, composed of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with 1U/μL TdT (Enzymatics) and 0.25mU/μL apyrase (NEB). For each synthesis cycle, beads that were suspended in the enzymatic reaction mix were exposed to dNTPs by transfer to the subsequent well with a single or multichannel pipettor. At every cycle, each reaction was pulse vortexed and incubated for 30 seconds at room temperature. Beads were collected by magnet, washed in 1x Custom Synthesis Buffer without PEG, and resuspended with fresh enzymatic mix. The reaction mixture was then transferred to the next well containing the next nucleotide substrate. Following the last cycle, each sample was prepared for Illumina sequencing as described above. Complete Illumina sequencing adapters were added by real-time cycle-limited PCR for 12 cycles. Barcoded strands were then combined and sequenced as single-end 175bp reads using Illumina MiSeq v2 Nano. Sequences were trimmed as before to remove the 5' initiator oligo sequence (Bio-U-LT2) and the 3' universal oligo sequence (5P-rSBS9-GGG). Only reads which presented both sequences for trimming were retained for further analysis. A sequence of non-identical nucleotides for each raw strand was extracted as above.
Purified strands were obtained by selecting strands with raw lengths between 32-48 bases, corresponding to average extension lengths of 2 to 3 per template nucleotide. Purified strands were used for analysis of synthesis errors with Needleman-Wunsch and for sequence reconstruction of E1-E4 with the decoding pipeline.
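The size-selection window follows from the template length: 16 template nucleotides at 2 to 3 added bases each give raw lengths of 32 to 48. A minimal sketch of the in silico selection is shown below; the function name and the toy strands are hypothetical.

```python
TEMPLATE_NUCLEOTIDES = 16          # length of each template sequence E1-E4
MIN_RAW, MAX_RAW = 32, 48          # raw-length window quoted in the text

def size_select(raw_strands, lo=MIN_RAW, hi=MAX_RAW):
    """Keep raw (uncompressed) strands whose length lies within [lo, hi]."""
    return [s for s in raw_strands if lo <= len(s) <= hi]

# Hypothetical raw strands with 1, 2, 3 and 4 bases added per template nucleotide.
raw = ["AC" * 8, "AACC" * 8, "AAACCC" * 8, "AAAACCCC" * 8]
purified = size_select(raw)

for s in purified:
    # raw length and mean extension length per template nucleotide
    print(len(s), len(s) / TEMPLATE_NUCLEOTIDES)
```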
Following purification, DNA strands synthesized for each template sequence E1-E4 were randomly sampled from data according to a target number of reads, and then subjected to a two-step filter. A filter was first applied to include those DNA strands with read counts of either 1, 2, 3, 4, or 5, depending on the target number of reads, to exclude aberrant DNA strands, which could arise from combinations of synthesis and sequencing errors. A second filter was applied to rank DNA strands according to compressed strand lengths. A total of 10 top-ranked DNA strands were selected from all purified and filtered strands. These 10 strands were used to reconstruct each template sequence using MAP estimation and consensus implemented according to equations explained herein. The probability of correct retrieval of each template sequence E1-E4 was assessed by performing 500 decoding trials for each target number of reads. Each trial consisted of a random sampling of purified reads.
Simulated large-scale storage systems
Encoding of random data into DNA sequences
For simulated DNA storage systems, $\Omega$ bits of data and addresses per template sequence were randomly generated. A Bose-Chaudhuri-Hocquenghem (BCH) code was applied for error correction to convert the $\Omega$ random bits to $B$ bits per template sequence. The BCH code was computed via MATLAB's built-in standard BCH encoder. The $B$ bits were mapped and modulated into $K$ nucleotides per template DNA sequence according to designed modulation schemes as explained herein. The following DNA storage systems of increasing storage capacity were simulated, represented by codec parameters: $(K, B, \Omega) = (38, 33, 23)$, $(K, B, \Omega) = (74, 63, 36)$, and $(K, B, \Omega) = (152, 128, 57)$.
Simulating enzymatic synthesis of DNA strands
In order to simulate enzymatic synthesis, a Markov model was used to produce compressed strands. For these, the $K$ nucleotides per template DNA sequence were subject to errors including missing nucleotides (deletions), insertions, or substitutions. Synthesis accuracy was varied by simulating a range of error probabilities, primarily for missing nucleotides in compressed strands. Each type of error was assumed to occur independently per nucleotide within a strand. Each strand was produced independently according to the same error statistics.
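A much simplified stand-in for such a simulation is sketched below. It is not the Markov model of the disclosure: it is a per-nucleotide error channel in which deletion and substitution are treated as mutually exclusive outcomes at each position, the error probabilities are arbitrary illustrative values, and the output is run-length compressed so that it contains no repeated bases.

```python
import random
from itertools import groupby

BASES = "ACGT"

def simulate_strand(template, p_del, p_ins, p_sub, rng):
    """Pass a template through an independent per-nucleotide error channel."""
    out = []
    for base in template:
        if rng.random() < p_ins:                 # spurious base before this position
            out.append(rng.choice(BASES))
        r = rng.random()
        if r < p_del:                            # missing nucleotide
            continue
        if r < p_del + p_sub:                    # substituted nucleotide
            out.append(rng.choice([b for b in BASES if b != base]))
        else:                                    # correctly synthesized nucleotide
            out.append(base)
    # Compressed strands contain no runs of identical bases.
    return "".join(b for b, _ in groupby(out))

rng = random.Random(0)
template = "ACGTACGTACGTACGT"                    # hypothetical 16-nucleotide template
strands = [simulate_strand(template, 0.30, 0.01, 0.01, rng) for _ in range(10)]
for s in strands:
    print(s)
```

Ten strands are drawn here to mirror the ten variants per template sequence used in the decoding trials.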
Data retrieval using decoding pipeline
The robustness of the codec was next assessed by performing 500 decoding trials for varying levels of synthesis accuracies. For each decoding trial, a template sequence was randomly generated and ten compressed strands were synthesized by simulation with the Markov model. These compressed strands were used towards reconstruction of the template sequence via MAP estimation and probabilistic consensus. Each reconstructed sequence of $K$ nucleotides was demodulated into $B$ bits, and decoded with a MATLAB BCH decoder (Mathworks) to yield $\Omega$ bits. The probability of correct data retrieval for a specific level of synthesis accuracy was computed as the fraction of successful decoding trials. Results for data retrieval were benchmarked on a multi-core server.
Example V. Use of The Nucleotide Analogue, 5-Bromo-dCTP
This example describes the evaluation of the use of 5-Bromo-dCTP (5Br-dCTP) as an alternative to natural dCTP in the synthesis reactions. In one embodiment, the extension lengths, length distributions, and fraction of extended products were evaluated when dCTP or 5Br-dCTP was added, using the TdT:apyrase system, to the initiator LT2+3C, which ends in three cytosines.
Each reaction was carried out in 20μL total volume. Reaction components, not including the dNTP, were assembled in 18μL while nucleotide triphosphates were prepared in 2μL of water. The 18μL mix was composed such that upon mixing with a 2μL nucleotide triphosphate solution, the following initial composition would be obtained: 1X Custom Synthesis Buffer, 0.1μM LT2+3C initiator, 1U/μL TdT, and 0.25mU/μL apyrase. The initial concentration of the dNTP in the assembled reaction was varied at 2, 4, 8, 16, or 32μM. To initiate the reaction, the 18μL mixture was added to a tube containing the 2μL dNTP sample and mixed immediately by pipetting. The reaction was then incubated at room temperature for at least two minutes, at which point it was mixed with an equal volume of Novex TBE-Urea Sample Buffer (2X) and resolved on a 15% Novex TBE-Urea gel.
The results show that the use of 5Br-dCTP for synthesis has advantages over the use of natural dCTP (Figs. 12A-12B). In particular, extensions occurred at a lower concentration of 5Br-dCTP compared to natural dCTP, 2μM and 8μM respectively.
Example VI. Codec for Information Storage with Enzymatically-Synthesized DNA
Encoder and decoder architecture
According to one aspect of the present disclosure, a modular design for encoding and decoding digital information in DNA is presented (Fig. 35C, Figs. 39A-39B). While single monolithic architectures can be more efficient, modular designs allow for the optimization of encoding and decoding blocks separately. Such a distributed approach simplifies the design space considerably. Within individual blocks, error-correcting codes borrowed from traditional communication systems (e.g., Reed-Solomon, Fountain, BCH, LDPC) may be applied to handle multiple types of errors. However, several factors distinguish DNA storage systems from traditional systems such as wireless communication. Information is stored in short sequences of DNA, and must be reassembled by a decoder. Alignment errors (e.g., missing or inserted nucleotides) due to inaccurate DNA synthesis or sequencing are more difficult to correct compared to substitutions or erasures common in communication systems.
As part of a complete system for DNA storage (Fig. 35C, Figs. 39A-39B), encoding and decoding frameworks are presented, together defined as a codec, for storing and extracting information from populations of diverse DNA strands. An important part of the encoding strategy is the placement of synchronization patterns which are regularly interspersed throughout data, allowing a decoder to compute accurate alignments from diverse synthesized strands. Synchronization patterns are inserted in the modulation step of the encoding pipeline, which translates rows of bits into DNA sequences which adhere to modulation constraints (Fig. 39B, Figs. 21A-21B). In one embodiment, the codec is inclusive of core components such as Reed-Solomon or Fountain codes utilized in prior DNA storage systems (Y. Erlich, D. Zielinski, DNA Fountain enables a robust and efficient storage architecture. Science. 355, 950-954 (2017); M. Blawat et al., Forward Error Correction for DNA Data Storage. Procedia Comput. Sci. 80, 1011-1022 (2016); L. Organick et al., Random access in large-scale DNA data storage. Nat. Biotechnol. 36, 242-248 (2018)). The encoder first partitions data into ordered rows of bits, prefixing an address to each row to delineate its order in reassembly. Error correction is incorporated within each row of bits, or block of rows, to protect against synthesis errors, missing sequences, or low sequencing coverage. The encoder outputs a book of template sequences, which are written by enzymatic synthesis to DNA strands. The resulting strands can then be stored. The stored strands are read by high-throughput DNA sequencing, and transitions are extracted to form a sequence of non-identical nucleotides, which is then fed into a decoder.
A crucial first step of the decoder is to harness information latent in diverse DNA strands by MAP estimation and probabilistic consensus (Fig. 13B). The decoder is designed to function with minimal sequencing reads. Existing approaches for strand alignments (S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443-453 (1970); T. F. Smith, M. S. Waterman, Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 (1981); C. Notredame, D. G. Higgins, J. Heringa, T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302, 205-217 (2000); I. M. Wallace, O. O'Sullivan, D. G. Higgins, C. Notredame, M-Coffee: Combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res. 34, 1692-1699 (2006); D. J. Russell, Multiple Sequence Alignment Methods (Humana Press, 2016)), or greedy consensus by majority voting (T. Batu, S. Kannan, S. Khanna, A. McGregor, Reconstructing Strings From Random Traces, in Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, Louisiana, USA, January 11-14, 2004, pp. 910-918), are specialized for genomic data, and therefore do not compute optimal alignments for coded, synthetic DNA strands. Similarly, codes based on Viterbi-style recursions (M. C. Davey, D. J. C. Mackay, Reliable communication over channels with insertions, deletions, and substitutions. IEEE Trans. Inf. Theory. 47, 687-698 (2001); M. Mitzenmacher, A survey of results for deletion channels and related synchronization channels. Probab. Surv. 6, 1-33 (2009)) do not adequately exploit the diversity of strands produced by enzymatic synthesis. By contrast, the multidisciplinary approach combines codes, multiple strand alignment, and probabilistic consensus (J. Kittler, Combining classifiers: A theoretical framework. Pattern Anal. Appl. 1, 18-27 (1998)).
Alignment by MAP estimation and consensus creates a complete or partial reconstruction of a sequence from multiple strands. Subsequent to alignment, each sequence is demodulated into bits. The final steps of the decoding pipeline are composed of decoding modules for error-correction codes which ensure error-free reconstruction of the retrieved data (Fig. 39B).
In the following sections, each part of the encoder and decoder architecture (Fig. 39B) is explained in greater detail. A list of parameters and design specifications for all experimental and simulated DNA storage systems is provided in Table 7 and Table 8.
Addressing of rows/sequences
The encoder (Fig. 39B) first partitions data into ordered rows of bits. Each row of bits is eventually stored in one template sequence of DNA. In subsequent paragraphs, this correspondence is maintained between rows of bits and template sequences of DNA. Each row is prefixed with a unique address to delineate its order in reassembly. Let $\Omega$ denote the total number of bits stored per row, including both payload data and addresses. Let $\mu < \Omega$ indicate the number of address bits. With $\mu$ bits, it is possible to address a total of $2^{\mu}$ template sequences in which each template sequence stores $(\Omega - \mu)$ bits of payload data. The storage capacity is equal to the number of DNA sequences multiplied by the number of bits of payload data stored per template sequence. The storage capacity is maximized by maximizing the total number of DNA template sequences. The following equations specify the storage capacity, and the maximum storage capacity.
Storage Capacity = $2^{\mu}(\Omega - \mu)$ bits, for $0 < \mu < \Omega$.
Maximum Storage Capacity = $2^{\Omega - 1}$ bits, achieved when $\mu = \Omega - 1$.
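As a quick numerical check of the two capacity expressions, the following minimal sketch evaluates the storage capacity for every admissible number of address bits and confirms where the maximum occurs. The function names are illustrative assumptions and are not part of the described system.

```python
def storage_capacity_bits(omega, mu):
    """Capacity 2^mu * (omega - mu) in bits, for mu address bits out of omega bits per row."""
    if not 0 < mu < omega:
        raise ValueError("requires 0 < mu < omega")
    return (2 ** mu) * (omega - mu)


def max_storage_capacity_bits(omega):
    """Exhaustively search 0 < mu < omega; ties are broken in favor of the larger mu."""
    best_mu = max(range(1, omega),
                  key=lambda mu: (storage_capacity_bits(omega, mu), mu))
    return best_mu, storage_capacity_bits(omega, best_mu)


if __name__ == "__main__":
    # Values of omega quoted for the simulated systems in this document.
    for omega in (23, 36, 57):
        mu, capacity = max_storage_capacity_bits(omega)
        print(omega, mu, capacity, capacity == 2 ** (omega - 1))
```

For 23 bits per row, the search returns 22 address bits and a capacity of 2^22 bits, which corresponds to the 0.5-megabyte maximum capacity quoted for the smallest simulated system. The same capacity is also reached with 21 address bits, since 2^21 x 2 = 2^22.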
The goal of an encoder and decoder architecture is to recover both the address and payload data correctly. If the address is irretrievable or only partially reconstructed, the order of information is lost. In this sense, it is more critical to recover the address. If the address is correct, it is possible to correct errors in the payload data using redundant information stored in other DNA sequences. However, in the analyses, both the address and payload information (a total of $\Omega$ bits per sequence) are decoded reliably with equal error protection.
Reed-Solomon and Fountain Codes (Block ECC)
Reed-Solomon (RS) codes and Fountain codes, which may be incorporated within the encoding and decoding architecture, are briefly described (Fig. 39B). However, these codes were not explicitly used in the experiments or simulations. If synthesizing thousands or millions of DNA sequences, error-correction across multiple sequences is necessary to protect against the following types of errors: 1) Missing strands for particular sequences; 2) Strands with low sequencing coverage (i.e., too few reads of DNA strands for particular sequences after PCR-amplification and sequencing); 3) Sequences with detected errors after reconstruction from multiple strands; 4) Sequences with undetected errors in either the address or payload after reconstruction from multiple strands.
In the first three cases above, the error locations within a block of reconstructed sequences are known and may be pinpointed by checking the addresses of all sequences. For example, missing sequences can be identified by their missing addresses which are not available in a block. In the fourth case, the error locations within a block of reconstructed sequences are unknown and undetected. The fourth type of error is not accommodated by most Fountain codes which are specialized only to handle erased/missing sequences. Fountain codes were originally applied in packet communication networks for recovering missing packets at a high-level abstraction layer in the communication protocol stack. While RS codes can correct undetected errors in the payload data, they assume that the address per sequence is reconstructed correctly.
Within the context of DNA storage systems, an RS code is applied in the vertical direction across multiple rows/sequences of bits (Fig. 39B). A total of $\Omega$ bits per row exist prior to RS encoding, and a total of $\Omega$ bits per row exist after RS encoding (Fig. 39B). The organization of information in the horizontal direction is unaltered. However, the RS code inserts extra rows of redundant parity bits. Each extra row of parity bits contains its own unique address which is utilized by the RS decoder for error-correction.
In slightly more detail, consider an RS($n_{rs}$, $k_{rs}$) code, which has a minimum Hamming distance of $(n_{rs} - k_{rs} + 1)$. For this code, $k_{rs}$ rows store address and payload information bits, while $(n_{rs} - k_{rs})$ additional rows store RS parity bits. The RS code is able to correct up to $E$ sequences with known error locations within a block of sequences, and $U$ sequences containing undetected errors, where $2U + E \le (n_{rs} - k_{rs})$. The undetected errors cost twice as much in terms of added redundancy required. An RS($n_{rs}$, $k_{rs}$) code operates over symbols from a Galois field such that in usual instantiations $n_{rs} = 2^m - 1$ for $m > 1$ a positive integer. For example, setting $m = 8$, the well-known RS(255,223) code is specified, which corrects up to 16 sequences with undetected errors within a block of sequences, or corrects up to 32 sequences with known error locations within a block. In summary, the RS code may be applied to a block of $n_{rs}$ sequences of bits as a layer of protection for both detected and undetected errors, with the assumption that the address for sequences is known (Fig. 39B).
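The trade-off between undetected errors and known-location errors can be illustrated with a small budget check. This is only a sketch of the inequality 2U + E <= n_rs - k_rs stated above; it does not implement RS encoding or decoding, and the function names are assumptions made for illustration.

```python
def rs_parity_rows(n_rs, k_rs):
    """Number of parity rows added by an RS(n_rs, k_rs) code."""
    if not 0 < k_rs < n_rs:
        raise ValueError("requires 0 < k_rs < n_rs")
    return n_rs - k_rs


def rs_can_correct(n_rs, k_rs, undetected, known_location):
    """Check the budget 2*U + E <= n_rs - k_rs.

    undetected:     U, sequences with errors at unknown locations
    known_location: E, sequences that are missing or flagged as erroneous
    """
    if undetected < 0 or known_location < 0:
        raise ValueError("error counts must be non-negative")
    return 2 * undetected + known_location <= rs_parity_rows(n_rs, k_rs)


if __name__ == "__main__":
    # RS(255,223): 32 parity rows.
    print(rs_can_correct(255, 223, undetected=16, known_location=0))
    print(rs_can_correct(255, 223, undetected=0, known_location=32))
    print(rs_can_correct(255, 223, undetected=10, known_location=12))
    print(rs_can_correct(255, 223, undetected=17, known_location=0))
```

Mixed cases follow the same budget: for RS(255,223), ten undetected errors leave room for twelve missing or flagged sequences.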
Error-Correcting Codes per sequence
Traditional error-correcting codes (ECCs) such as Bose-Chaudhuri-Hocquenghem (BCH) codes, or low-density parity check (LDPC) codes rely on the pre-established synchronization of information. In many traditional engineered systems (e.g., wireless systems), synchronization is either assumed or resolved through various strategies. In the context of DNA storage, and within the encoder and decoder architecture of this paper (Fig. 39B), synchronization is gained based on alignment to a scaffold by MAP estimation, and consensus of multiple DNA strands per sequence. Thus, synchronization is assumed to hold prior to error-correction in the decoding pipeline.
Synchronization itself is insufficient for correct decoding. For example, a missing nucleotide in the compressed strand (deletion) causes a synchronization error, but even if the position of the deletion is known via synchronization, the missing nucleotide must be recovered correctly. The alignment and synchronization step of the decoder (Fig. 39B) may resolve all errors perfectly by utilizing the diversity of synthesized strands per sequence. If enough diversity is available, the missing information in one strand variant may be recovered correctly from other variants. In this way, alignment and consensus algorithms have a probability of success for decoding correctly. However, when considering scalable storage systems, the number of nucleotides for template sequences must increase. In these systems, a few errors may still occur after the alignment and consensus step of the decoder.
To correct errors explicitly after synchronization, BCH codes were applied to encode and decode bits stored per DNA sequence. LDPC codes could also provide similar error-correction capabilities. Primitive BCH codes are a standard class of BCH codes. Of this class, the BCH($n_{bch}$, $k_{bch}$, $t_{bch}$) code is able to correct $t_{bch}$ errors. The code takes $k_{bch}$ bits of information as input, and outputs a total of $n_{bch}$ bits, where $(n_{bch} - k_{bch})$ bits are added for redundancy. The BCH code is able to correct more errors if more redundancy is added. The following BCH($n_{bch} = 31$, $k_{bch}$, $t_{bch}$) codes are able to correct $t_{bch}$ errors:
BCH(31,26,1); BCH(31,21,2); BCH(31,16,3).
The following longer BCH($n_{bch} = 63$, $k_{bch}$, $t_{bch}$) codes are able to correct $t_{bch}$ errors:
BCH(63,57,1); BCH(63,45,3); BCH(63,39,4);
BCH(63,36,5); BCH(63,30,6).
Similarly, the following set of BCH($n_{bch} = 127$, $k_{bch}$, $t_{bch}$) codes correct up to $t_{bch}$ errors:
BCH(127,78,7); BCH(127,71,9); BCH(127,64,10);
BCH(127,57,11); BCH(127,50,13).
In simulations for 0.5-megabyte, gigabyte and petabyte maximum storage capacities, BCH(31,21,2), BCH(63,36,5), and BCH(127,57,11) codes were applied, respectively. These BCH codes are applicable for DNA storage due to their short sequence length requirements, and efficient error-correcting abilities.
The table below summarizes the use of ECCs for simulated DNA storage systems analyzed in this paper. For experimental systems, no explicit error-correction of bits was necessary since alignment and consensus were sufficient. In simulations for 0.5-megabyte, gigabyte and petabyte maximum storage capacities, sequence ECCs were able to correct for partial alignments. Each sequence ECC inserts redundant bits per row of bits (Fig. 39B).
More precisely, $\Omega$ bits/row are encoded into $B$ bits/row in the encoder, and $B$ bits/row are decoded into $\Omega$ bits/row in the decoder. To clarify the notation, parameters $\Omega$ and $B$ denote overall system parameters in the encoding and decoding pipeline (Fig. 39B). However, parameters $k_{bch}$ and $n_{bch}$ of the BCH code directly affect overall system parameters. Briefly, it is noted that the coding scheme establishes baseline efficiencies in simulations, towards a flexible-write strategy for DNA storage. The level of efficiency for coded systems is anticipated to improve.
ECC Per Sequence of Bits: Parameters
Figure imgf000112_0001
Modulation and demodulation (Bits <=> Nucleotides)
A principal element of DNA storage is the encoder's mapping from bits to template nucleotides (modulation), as well as the decoder's mapping from nucleotides of reconstructed sequences to bits (demodulation). Thus, in this section the interconversion between sequences of bits and sequences of DNA nucleotides is formalized. The modulation block of the encoder (Fig. 39B) maps $B$ bits to $K$ nucleotides: $b_1 b_2 b_3 \ldots b_B \rightarrow o_1 o_2 o_3 \ldots o_K$. In the ideal case, one template nucleotide stores a maximum of 2 bits. Therefore, an upper bound for every modulation scheme is the limit: $B \le 2K$. Similarly, in the decoder architecture (Fig. 39B), a demodulation step maps $K$ nucleotides to $B$ bits: $\hat{o}_1 \hat{o}_2 \hat{o}_3 \ldots \hat{o}_K \rightarrow \hat{b}_1 \hat{b}_2 \hat{b}_3 \ldots \hat{b}_B$. Note that the demodulation block takes nucleotides from compressed strands which may contain errors as input, and outputs a sequence of bits which may also contain errors. If no errors occur, the modulation and demodulation maps together constitute an identity map, and $\hat{b}_1 \hat{b}_2 \hat{b}_3 \ldots \hat{b}_B = b_1 b_2 b_3 \ldots b_B$. If errors occur in either sequences of bits or sequences of compressed nucleotides, separate decoding steps provide error-correction capabilities (Fig. 39B).
For enzymatic synthesis, $B = 2K$ is not achievable for several reasons. The controlled process of synthesis adds each nucleotide one by one. According to a specific concentration of nucleotide triphosphates, each nucleotide is added correctly to strand^C, or an error such as a missing nucleotide in strand^C (deletion) may occur. A current design constraint for enzymatic synthesis is to specify sequences with non-identical nucleotides (e.g., without AA, TT, CC, GG transitions). Specifying information only in sequences of non-identical nucleotides allows for facile data processing. Further work to account for polymerization extension lengths could remove such a constraint.
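The cost of the non-identical-nucleotide constraint can be estimated numerically. The sketch below assumes a 4 x 4 matrix of allowed transitions (1 where one nucleotide may follow another, 0 on the diagonal for the forbidden self-transitions) and uses NumPy; the matrix layout, variable names, and function names are assumptions made for illustration. It computes the base-2 logarithm of the largest eigenvalue of that matrix, and counts the valid sequences of a given length directly.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

# Allowed-transition matrix: entry [i, j] is 1 if nucleotide j may follow nucleotide i.
# Self-transitions (AA, CC, GG, TT) are forbidden, so the diagonal is zero.
TRANSITIONS = np.ones((4, 4), dtype=np.int64) - np.eye(4, dtype=np.int64)


def bits_per_nucleotide_bound(transitions):
    """Return log2 of the largest eigenvalue of the allowed-transition matrix."""
    eigenvalues = np.linalg.eigvals(transitions.astype(float))
    return float(np.log2(np.max(eigenvalues.real)))


def count_valid_sequences(transitions, length):
    """Count sequences of the given length that use only allowed transitions."""
    paths = np.linalg.matrix_power(transitions, length - 1)
    return int(paths.sum())


if __name__ == "__main__":
    print(bits_per_nucleotide_bound(TRANSITIONS))   # log2(3), about 1.585
    print(count_valid_sequences(TRANSITIONS, 16))   # 4 * 3**15 valid 16-mers
```

For a 16-nucleotide template there are 4 x 3^15 valid sequences, i.e. fewer than 26 bits of information, compared with 32 bits if self-transitions were allowed.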
Nucleotide transition matrix
Constraints reflecting valid and invalid transitions between nucleotides may be expressed via a transition matrix $\Gamma$ (Figs. 21A-21B). An upper bound for the maximum amount of bits stored per nucleotide is $\log_2(\lambda_{\max}(\Gamma))$, where $\lambda_{\max}(\Gamma)$ is the maximum eigenvalue of $\Gamma$. For enzymatic synthesis in this paper, self-transitions were forbidden, leading to an upper bound of $B \le (\log_2(3))K$ (Fig. 21A). In alternative modulation designs, minimizing the use of certain transition types, such as CA or CG, would improve synthesis accuracy but reduce the amount of information bits stored per template nucleotide (Fig. 21B).
Synchronization nucleotides
An important aspect of the modulation step of the encoder (Fig. 39B) is the insertion of synchronization nucleotides at regular intervals within each sequence. Crucially, embedded synchronization patterns provide resilience against alignment errors. The error-resilience is boosted significantly during the alignment and consensus step of the decoder (prior to the demodulation step in the pipeline). Synchronization nucleotides are also utilized in the demodulation step of the decoder (Fig. 39B). As a tradeoff, the inclusion of synchronization nucleotides reduces the total space allocated for address and payload information.
Modulation and demodulation (Experimental)
An explicit modulation scheme for an experimental DNA storage system with parameters $(K, B) = (16, 16)$ is provided (Fig. 22, Table 9). This scheme maps $B = 16$ bits to $K = 16$ nucleotides per template sequence, while adhering to the constraint that the template sequence consist of a sequence of non-identical nucleotides (no self-transitions are allowed). Most importantly, synchronization nucleotides are embedded within each sequence. As part of the modulation, $B = 16$ bits are first converted to an intermediate form of information which is a mixture of bits with values $\{0, 1\}$ and trits with values $\{0, 1, 2\}$ (Fig. 22B, Table 9). The intermediate form of information is then converted to nucleotides using specified tables (Table 9). According to the placement of information within each DNA sequence, each template nucleotide either stores 1 bit or 1 trit, or is selected for synchronization (Fig. 22A). Without the necessity for synchronization nucleotides, it would be possible to store up to 1.5 bits per template nucleotide (close to the upper bound of $\log_2(3)$ bits per nucleotide) by converting all input bits directly into trits. The modulation scheme for specific sequences E1, E2, E3, E4 synthesized in experiment is provided (Figs. 25A, 22B).
The demodulation step of the decoding pipeline attempts to reverse the steps of modulation. With the assumption that synchronization nucleotides are provided by a prior step in the decoding pipeline, demodulation converts a sequence of nucleotides into a mixture of bits and trits, and subsequently extracts a sequence of bits according to tables of conversion (Fig. 22B, Table 9). If errors exist within the sequence of nucleotides, the demodulation step may also output a sequence of bits containing errors. Synchronization nucleotides (Figs. 22A-22B) ensure that errors are localized within a sequence to some degree, limiting a propagation of errors.
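The general idea of passing through trits to avoid self-transitions can be illustrated with a generic sketch. It is not the mapping of Table 9, which is not reproduced here, and it omits synchronization nucleotides entirely. The sketch assumes a simple rule in which each trit selects one of the three nucleotides that differ from the previous nucleotide, and in which every 3 bits are packed into 2 trits (giving the 1.5 bits per nucleotide mentioned above). All function names, the alphabet ordering, and the virtual starting nucleotide are assumptions made for illustration.

```python
NUCLEOTIDES = "ACGT"


def bits_to_trits(bits):
    """Pack every 3 bits (values 0..7) into 2 trits (values 0..8 available)."""
    if len(bits) % 3 != 0:
        raise ValueError("this sketch expects a multiple of 3 bits")
    trits = []
    for i in range(0, len(bits), 3):
        value = bits[i] * 4 + bits[i + 1] * 2 + bits[i + 2]
        trits += [value // 3, value % 3]
    return trits


def trits_to_bits(trits):
    """Inverse of bits_to_trits; the trit pair (2, 2) is unused and is rejected."""
    bits = []
    for i in range(0, len(trits), 2):
        value = trits[i] * 3 + trits[i + 1]
        if value > 7:
            raise ValueError("trit pair does not correspond to 3 bits")
        bits += [(value >> 2) & 1, (value >> 1) & 1, value & 1]
    return bits


def trits_to_nucleotides(trits, start="A"):
    """Each trit picks one of the three nucleotides differing from the previous one.

    `start` is a virtual previous nucleotide and is not part of the output.
    """
    sequence, previous = [], start
    for trit in trits:
        choices = [n for n in NUCLEOTIDES if n != previous]
        previous = choices[trit]
        sequence.append(previous)
    return "".join(sequence)


def nucleotides_to_trits(sequence, start="A"):
    """Inverse of trits_to_nucleotides for an error-free sequence."""
    trits, previous = [], start
    for nucleotide in sequence:
        choices = [n for n in NUCLEOTIDES if n != previous]
        trits.append(choices.index(nucleotide))
        previous = nucleotide
    return trits


if __name__ == "__main__":
    bits = [1, 0, 1, 0, 1, 1]
    sequence = trits_to_nucleotides(bits_to_trits(bits))
    print(sequence)
    print(trits_to_bits(nucleotides_to_trits(sequence)) == bits)
```

Because each nucleotide is chosen relative to its predecessor, a single missing nucleotide in such an unsynchronized scheme can change the interpretation of what follows, which is one motivation for embedding synchronization nucleotides.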
Modulation and demodulation (Simulation)
A modulation scheme is provided (Fig. 22A) for simulated DNA storage systems with parameters $(K, B, \Omega) = (38, 33, 23)$, $(K, B, \Omega) = (74, 63, 36)$, and $(K, B, \Omega) = (152, 128, 57)$. The modulation scheme for simulations is nearly identical to the modulation scheme used in the "Eureka!" experiment and includes a similar synchronization pattern embedded per sequence. A sequence of bits is converted to a mixture of bits and trits, and then to a sequence of nucleotides. It is noted that the intermediate mixture of bits and trits is designed to facilitate placement of information between synchronization nucleotides, while also ensuring that no self-transitions are possible. The demodulation step consists of reciprocal conversions to map a sequence of nucleotides to a sequence of bits (Table 9).
Summary of modulation and demodulation
The following table specifies the conversion of B bits per sequence to K nucleotides per template sequence for all DNA storage systems analyzed in this paper. The conversion utilizes an intermediate form of information which consists of a mixture of bits and trits. The demodulation step of the decoder reverses the steps of modulation.
Modulation and Demodulation: Design Parameters
Figure imgf000116_0001
Efficiency rates for DNA storage with a digital codec
Efficiency rates for experimental and simulated systems
The end-to-end efficiency rate of storage may be computed for all experimental and simulated systems. Specifically, starting with $\Omega$ bits of data and addresses stored per sequence, an ECC per sequence results in $B$ bits per sequence. Then $B$ bits per sequence are converted and modulated into $K$ nucleotides per sequence, including synchronization nucleotides. The following table lists these efficiencies for information storage in template DNA sequences.
Efficiency Rates for DNA Storage: Digital Codec Parameters
Figure imgf000116_0002
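The rates implied by the (K, B, Ω) parameters quoted for the simulated systems can be recomputed with a few lines of arithmetic. The sketch below assumes that the end-to-end efficiency is measured as Ω/K bits per template nucleotide, factored into an ECC rate Ω/B and a modulation rate B/K; this definition and the dictionary keys are assumptions made for illustration, and the values are derived here rather than copied from the table.

```python
# (K, B, omega) for the three simulated systems described in this document.
SIMULATED_SYSTEMS = [(38, 33, 23), (74, 63, 36), (152, 128, 57)]


def efficiency_rates(K, B, omega):
    """Return the assumed rate decomposition for one system."""
    return {
        "ecc_rate": omega / B,         # data and address bits per coded bit
        "modulation_rate": B / K,      # coded bits per template nucleotide
        "end_to_end": omega / K,       # data and address bits per template nucleotide
    }


if __name__ == "__main__":
    for K, B, omega in SIMULATED_SYSTEMS:
        rates = efficiency_rates(K, B, omega)
        print(K, B, omega, {name: round(value, 3) for name, value in rates.items()})
```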
Higher efficiency rates with increased synthesis accuracy
For theoretical DNA storage systems, it was shown by simulations that efficiency rates may be increased given increased synthesis accuracy. In particular, with increased synthesis accuracy, $\Omega$ bits of data and addresses stored per sequence may be increased up to $B$ bits per sequence. To derive trade-offs between efficiency rates and synthesis accuracy, DNA storage was modeled as an input-output subsystem involving only a sequence of $B$ bits (Fig. 39B). Based on this abstraction, the input to the DNA storage system can be represented by a sequence of $B$ bits prior to modulation. Similarly, the output can be represented by a sequence of $B$ bits, obtained after demodulation. The output bit sequence may contain errors. Random input sequences of $B$ bits were generated, and output sequences of $B$ bits were obtained by simulating a subsystem within the encoding and decoding pipeline (Fig. 39B). The probability of bit error, denoted by $P_{\text{bit-error}}$, was estimated by averaging over all input-output bit sequences. Assuming independent and identically-distributed symmetric bit errors, the capacity was derived to be $B(1 - H_2(P_{\text{bit-error}}))$ bits. In this standard capacity formula for a bit-error memoryless channel, $H_2(\cdot)$ denotes the binary entropy function (T. M. Cover, J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 2012)). The capacity in bits up to a maximum of $B$ bits per template sequence was plotted for different levels of synthesis accuracy (Figs. 36A-36F).
Based on the above analyses, it was found that for a synthesis accuracy in which strands^C contained ~15% missing nucleotides, a template sequence of 38 nucleotides could store 10 more bits of data and addresses, an increase from 23 to 33 bits (Fig. 36A). Similarly, 27 and 70 more bits of data and addresses could be stored per template sequence of 74 and 152 nucleotides, respectively, at the same level of synthesis accuracy (Figs. 36C and 36E). The codec was also tested with a combination of missing nucleotides, substitutions, and insertion errors (Figs. 36B, 36D, and 36F).
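The capacity expression for independent, symmetric bit errors, namely the number of bits per sequence multiplied by one minus the binary entropy of the bit-error probability, can be evaluated with the following minimal sketch. The function names and the example error probabilities are assumptions chosen for illustration; they are not measured values from the simulations.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits; defined as 0 at p = 0 and p = 1."""
    if p < 0.0 or p > 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def capacity_bits(bits_per_sequence, p_bit_error):
    """Capacity of a memoryless symmetric bit-error channel used bits_per_sequence times."""
    return bits_per_sequence * (1.0 - binary_entropy(p_bit_error))


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.05, 0.5):   # illustrative error probabilities only
        print(p, round(capacity_bits(33, p), 2))
```

With no bit errors the capacity equals the full number of bits per sequence, and it falls to zero when the bit-error probability reaches one half.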
Alignment of DNA strands by scaffolding and consensus
Enzymatic synthesis produces populations of diverse strand variants from each DNA sequence. The presence of diversity in DNA strands enables a larger set of strategies for synthesis, storage, and sequencing. Encoding DNA sequences with synchronization patterns (i.e., scaffolding) is one way to harness information from diversely synthesized strands. The term scaffolding is used to denote specially designed synchronization patterns in DNA sequences. This section describes algorithms for the alignment of diverse DNA strands by scaffolding and consensus.
Alignment by scaffolding and consensus is the first step of the decoding pipeline (Figs. 35C and 39B). To provide a concrete framework, consider a sequence consisting of $K$ nucleotides from a compressed strand represented by an ordered sequence of random variables $O_1 O_2 O_3 \ldots O_K$. One particular realization of the sequence is denoted by $o_1 o_2 o_3 \ldots o_K$. A decoder must decide which realization is most likely based on diverse synthesized strands produced from the original sequence. The mathematical model which was adopted to model the production of DNA strands is the Markov model (Fig. 23A). The $i$-th synthesized strand will be denoted by the following vector of random variables:
Figure imgf000118_0001
The $i$-th synthesized compressed strand is comprised of a random number of nucleotides, and its random length is represented by random variable $L_i$. One particular realization of the $i$-th synthesized strand is denoted by the following vector:
Figure imgf000118_0002
The length $l_i$ is a realization of the random variable $L_i$. Given a set of synthesized strands $\{Y_i\}$, a decoder must estimate correctly which original sequence was intended for storage. This estimation is computed based on the probabilistic framework of the Markov model (Fig. 23A). Such a framework is common to and adapted from the framework of synchronization codes used in traditional communication systems (24).
Optimal alignment of diverse strands
The method for aligning diverse strands is based on maximum a posteriori (MAP) estimation of each nucleotide. The optimal decoding for the $k$-th input nucleotide based on the set of all synthesized output strands $\{Y_i\}$ is given by the following optimization.
$\hat{o}_k = \arg\max_{o \in \{A, T, C, G\}} P(O_k = o \mid \{Y_i = y_i\})$
The notation $\{Y_i = y_i\}$ indicates a set of events occurring simultaneously. Realizations of random variables are denoted by lower-case symbols in the above formula. Associated probabilities are computed based on a Markov chain model which characterizes how synthesized DNA strands (outputs) are produced from an input sequence (Fig. 23A).
Decoding is aided by prior knowledge of the scaffold present in the input sequence (Figs. 19B and 25A). The optimal alignment is computed efficiently via dynamic programming recursions (explained in subsequent sections) if the number of strands is a small constant, and if the length of DNA sequences is short. While sequence lengths are short in DNA storage systems, the number of synthesized strands per sequence may be large. Therefore, it is critical to design approximations to the above exact optimization. For future algorithmic designs, it is noted that a superior alignment may be estimated for all input nucleotides $O_1 O_2 O_3 \ldots O_K$ jointly. However, individual probability estimates computed per nucleotide allow for the direct application of consensus rules and error-correction after alignment.
Alignment by consensus
The optimal alignment per nucleotide given a set of synthesized strands is not computationally tractable for a large number of strands. Consensus-based approaches offer advantages in terms of computational efficiency. Assuming that each nucleotide in the input DNA sequence is equally likely to have been written, and assuming that each output strand is produced independently and identically according to the error statistics of the Markov chain model (Fig. 23A), the following product rule may be applied for probabilistic consensus (26):
Figure imgf000120_0001
The above product rule may be derived from Bayes' theorem directly, and is related to a simple Bayes classifier. The given consensus optimization is computed efficiently via dynamic programming recursions, and remains tractable even for an increasing number of strand variants. Its computational complexity scales linearly in the number of strands. The key difference between the above consensus product rule and the optimal solution of alignment is that the inner probability only involves a single strand, as opposed to all strands jointly. As the number of strands increases, the inner probability may be computed for each strand separately and efficiently, after which a product rule is applied.
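A product-rule consensus of this kind can be sketched as follows, assuming that a per-strand estimate of the probability of each nucleotide at each template position has already been computed by some alignment step. The array layout, the function name, and the numerical floor used to avoid taking the logarithm of zero are assumptions made for illustration; the sketch uses NumPy and works in the log domain so that many strands can be combined without underflow.

```python
import numpy as np

ALPHABET = "ACGT"


def product_rule_consensus(per_strand_probs):
    """Combine per-strand nucleotide probabilities by a product rule.

    per_strand_probs: array of shape (R, K, 4); entry [i, k, o] is the estimate
    from strand i alone that template position k holds nucleotide ALPHABET[o].
    Returns (combined, decoded), where combined has shape (K, 4) with rows that
    sum to one and decoded is the most probable sequence as a string.
    """
    probs = np.asarray(per_strand_probs, dtype=float)
    log_product = np.log(np.clip(probs, 1e-300, 1.0)).sum(axis=0)
    log_product -= log_product.max(axis=1, keepdims=True)   # numerical stability
    combined = np.exp(log_product)
    combined /= combined.sum(axis=1, keepdims=True)
    decoded = "".join(ALPHABET[i] for i in combined.argmax(axis=1))
    return combined, decoded


if __name__ == "__main__":
    # Two strands, one template position (illustrative numbers only).
    strands = [[[0.7, 0.1, 0.1, 0.1]],
               [[0.4, 0.5, 0.05, 0.05]]]
    combined, decoded = product_rule_consensus(strands)
    print(combined.round(3), decoded)
```

The cost is one pass over the per-strand estimates, so it grows linearly with the number of strands, in line with the complexity statement above.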
It is possible to compute an exact alignment by scaffolding for a small group of strands, and then apply consensus over disjoint groups. For the "Eureka!" experiment, in which template sequences each contained 16 nucleotides, strands were first aligned optimally in two disjoint groups, each containing 5 strands. Product-wise consensus was then applied over the two groups. For the simulated DNA storage systems, optimal alignment was applied to groups containing 2 strands each, followed by product-wise consensus over 5 groups. In both experiments and simulations, a total of 10 filtered strands were utilized for decoding. It is noted that for aligning multiple strands, there are several ways to combine pairwise or groupwise alignments. The problem of optimal alignment remains computationally intractable in general, but several low-complexity relaxations to the problem are possible. Advancements to the field of bioinformatics are anticipated to continually improve the quality of alignments.
Markov chain model
The Markov chain model (Fig. 23A) provides a probabilistic framework for computing alignments and forming consensus. To simplify calculations, it is possible to "unwrap" the model. As a first step, it was assumed that pbser = 0, and the focus was placed exclusively on deletion, insertion, and substitution events. The number of insertions was limited to two. In experiments, the number of insertions at a given sequence position was rarely beyond two. A related type of error which could also be modeled is replication error, defined as a repetition of a short motif pattern in DNA. The event of nucleotide synthesis or "transmission" was first defined using terminology of a storage channel. The synthesis of a nucleotide results in either a correct write or a write error. The probability of a nucleotide synthesis (a write or write error) is defined as $1 - p_{del}$. Then the following table of probabilities indicates six events possible in an "unwrapped" and simplified Markov model.
"Unwrapped" Markov Chain Model : Probability of Events
Figure imgf000121_0001
Each event occurs with a specified probability. Two of the probabilities are modified slightly to ensure that the total sum of probabilities in this simplified model is one: $p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1$.
Forward-Backward recursions
The following section describes how to compute the probability $P(Y_i = y_i \mid O_k = o)$ efficiently using $\alpha/\beta$ forward-backward recursions (Figs. 23A-23B and Figs. 24A-24E). For the $i$-th synthesized strand, define the event
Figure imgf000122_0001
to represent that $t$ nucleotides were correctly added after $s$ synthesis steps. Using the notation $[a : b]$ to represent all indices between $a$ and $b$ including endpoints, the following probabilities were defined:
Figure imgf000122_0002
The $\alpha/\beta$ probabilities may be computed via forward-backward recursions. A uniform probability of 1/4 is used over the DNA alphabet for nucleotides. To quantify a substitution error in the calculations, a function was defined as follows for inputs
$x, y \in \{A, T, C, G\}$:
Figure imgf000122_0003
Dynamic programming is designed to utilize pre-existing computations in a recursive manner. To compute the $\alpha$ probabilities, a two-dimensional table of probabilities is populated in the "forward" direction. To compute the $\beta$ probabilities, a two-dimensional table of probabilities is populated in the "backward" direction. The following table summarizes the recursive computations required. The sum of the probabilities in each column of the table yields the $\alpha$ and $\beta$ probabilities, respectively.
Forward-Backward Recursions: Calculation of Probabilities
Figure imgf000123_0001
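The structure of a forward-backward computation can be illustrated with a deliberately simplified sketch. It is not the recursion of the table above: it assumes a channel with independent deletions and substitutions only (no insertions), independent priors per template position (a one-hot prior standing in for a known synchronization nucleotide), and no modelling of the non-identical-nucleotide constraint. The parameter names `p_del` and `p_sub`, the function names, and the table layout are assumptions made for illustration; the sketch uses NumPy.

```python
import numpy as np

ALPHABET = "ACGT"


def emission(written, observed, p_sub):
    """P(observed | written) for a nucleotide that was not deleted."""
    return 1.0 - p_sub if written == observed else p_sub / 3.0


def per_position_posteriors(priors, strand, p_del, p_sub):
    """Return a (K, 4) array of P(O_k = o | strand) for one observed strand.

    priors: (K, 4) array; row k is the prior over ALPHABET at template position k.
    strand: observed compressed strand, no longer than K (insertions are not modelled).
    """
    priors = np.asarray(priors, dtype=float)
    K, L = priors.shape[0], len(strand)
    if L > K:
        raise ValueError("this sketch does not model insertions")
    # emit[o, j] = P(strand[j] | written nucleotide ALPHABET[o])
    emit = np.array([[emission(o, y, p_sub) for y in strand] for o in ALPHABET]).reshape(4, L)
    avg = priors @ emit   # avg[k, j]: emission probability averaged over the prior at k

    # Forward table: alpha[k, j] = P(first k template positions produced strand[:j]).
    alpha = np.zeros((K + 1, L + 1))
    alpha[0, 0] = 1.0
    for k in range(1, K + 1):
        for j in range(L + 1):
            alpha[k, j] = p_del * alpha[k - 1, j]
            if j > 0:
                alpha[k, j] += (1 - p_del) * avg[k - 1, j - 1] * alpha[k - 1, j - 1]

    # Backward table: beta[k, j] = P(template positions k..K-1 produced strand[j:]).
    beta = np.zeros((K + 1, L + 1))
    beta[K, L] = 1.0
    for k in range(K - 1, -1, -1):
        for j in range(L, -1, -1):
            beta[k, j] = p_del * beta[k + 1, j]
            if j < L:
                beta[k, j] += (1 - p_del) * avg[k, j] * beta[k + 1, j + 1]

    # Combine: position k is either deleted or emits strand[j] for some j.
    posterior = np.zeros((K, 4))
    for k in range(K):
        for o in range(4):
            total = 0.0
            for j in range(L + 1):
                term = p_del * beta[k + 1, j]
                if j < L:
                    term += (1 - p_del) * emit[o, j] * beta[k + 1, j + 1]
                total += alpha[k, j] * term
            posterior[k, o] = priors[k, o] * total
        posterior[k] /= posterior[k].sum()
    return posterior


if __name__ == "__main__":
    priors = np.full((6, 4), 0.25)
    priors[0] = [1, 0, 0, 0]   # known synchronization nucleotide A
    priors[3] = [0, 0, 0, 1]   # known synchronization nucleotide T
    post = per_position_posteriors(priors, "ACTAC", p_del=0.1, p_sub=0.02)
    print(post.round(3))
```

Both tables have (K + 1) x (L + 1) entries, so filling them takes time proportional to the product of the template length and the strand length for each strand.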
Decoding based on alignment of multiple DNA strands
Once the forward-backward probabilities have been computed, it is straightforward to compute $P(Y_i = y_i \mid O_k = o)$. Here, the presentation is simplified by assuming that $p_{ins} = 0$ and only considering cases for deletions and substitutions, so that the probability is expressed using the $\alpha/\beta$ probabilities.
An example of MAP estimation by scaffolding is provided (Figs. 23A-23B and Figs. 24A-24E). Decoding by alignment is possible because of the synchronization pattern embedded as a scaffold in the template sequence. The probability $P(O_k = o)$ is exactly one at the $k$-th input position if the synchronization nucleotide is correctly placed; otherwise it is exactly zero. The synchronization nucleotides provide strong cues for the correct placement of other nucleotides. In addition to the above probability for
Figure imgf000123_0002
$P(Y_i = y_i \mid O_k = o)$, it is also possible to compute optimal pairwise and groupwise alignments. For example, for the $i$-th synthesized strand and $j$-th synthesized strand considered jointly, the $\alpha/\beta$ probabilities include $\alpha^{(i)}$ and $\beta^{(i)}$ as well as
Figure imgf000123_0003
The following probability for optimal pairwise alignment may be computed via these $\alpha/\beta$ probabilities: $P(Y_i = y_i, Y_j = y_j \mid O_k = o)$. In a similar manner, groupwise alignment from three strands may be computed:
Figure imgf000124_0001
Computational complexity of optimal vs. consensus alignment
With a slight change of notation, consider that the average length of all synthesized strands is given by $L$. To compute the $\alpha/\beta$ probabilities for each synthesized strand requires approximately $O(LK)$ time complexity. Similarly, to compute the optimal alignment from just one synthesized strand, approximately $O(LK)$ time complexity is necessary. For optimal pairwise alignment, approximately $O(L^2 K)$ time complexity is necessary. The complexity increases at least exponentially in $L$ with the exponent equal to the number of synthesized strands. By contrast, alignment by consensus incurs a computational complexity which scales linearly in $L$ with the number of synthesized strands. Therefore, it is advantageous to utilize approximations such as fast consensus methods to align multiple strands.
Alternative alignment algorithms
The use of dynamic programming is one solution for computing alignments of synthesized strands. Another algorithm for alignment, termed majority voting alignment, consists of greedy consensus (T. Batu, S. Kannan, S. Khanna, A. McGregor, Reconstructing Strings From Random Traces, in Proceedings of the Fifteenth Annual (ACM-SIAM) Symposium on Discrete Algorithms, (SODA), New Orleans, Louisiana, USA, January 11-14, 2004, pp. 910-918). It was found that such an algorithm was not sufficient to correct a large number of errors such as missing nucleotides, given only 10 filtered strands (Figs. 38A-38D). However, majority voting alignment may be combined with codes such as repetition coding to correct a larger number of errors. A full analysis of a coded form of majority voting alignment is an interesting direction to explore for future algorithmic designs.
Considerations of increased diversity for consensus
Alignment by consensus is not always beneficial. To be precise, consider the following mathematical problem for consensus: deciding a single bit of information by forming a consensus from multiple independent estimates. The multiple estimates are denoted by independent Bernoulli random
Figure imgf000125_0001
which are one with probability $p$, indicating an error, and zero with probability $1 - p$, indicating a correct estimate. If each estimate is correct more than half the time, i.e., $p < 1/2$, then the following proposition provides an upper bound for the probability of a majority error formed by majority consensus using $R$ estimates. For notational purposes, a binary divergence function is defined for parameters $a, q \in [0, 1]$:
$D(a \,\|\, q) \triangleq a \ln \frac{a}{q} + (1 - a) \ln \frac{1 - a}{1 - q}$.
Proposition:
Probability of Majority Error $\le \exp\left(-R \, D\left(\tfrac{1}{2} \,\|\, p\right)\right)$.
Proof of Proposition. Chernoff's bound states that for $R$ independent and identically-distributed Bernoulli random
Figure imgf000125_0002
where each random variable is one with probability $p$,
Figure imgf000125_0003
This above probability corresponds to the probability of majority error since the sum of Bernoulli random variables indicates that the consensus estimate is incorrect. If the sum exceeds more than half of the votes, a majority error occurs.
Interpretation. The probability of majority error decreases exponentially with the number of estimates $R$, as long as each estimate is correct more than half the time. However, if each individual estimate is not reliable, this exponential effect is not guaranteed. Consensus improves decoding accuracy as long as each estimate is reliable, reinforcing information as opposed to contributing noise.
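The proposition can be checked numerically by comparing the Chernoff-type bound with the exact binomial tail. In the sketch below, a majority error is counted whenever at least half of the estimates are wrong (ties are treated as errors, which is the conservative choice); the function names and the example error probability of 0.15 are assumptions made for illustration.

```python
import math


def binary_divergence(a, q):
    """D(a || q) = a ln(a/q) + (1 - a) ln((1 - a)/(1 - q)), for 0 < a, q < 1."""
    return a * math.log(a / q) + (1 - a) * math.log((1 - a) / (1 - q))


def majority_error_bound(num_estimates, p):
    """Upper bound exp(-R * D(1/2 || p)), valid for p < 1/2."""
    return math.exp(-num_estimates * binary_divergence(0.5, p))


def majority_error_exact(num_estimates, p):
    """P(at least half of the independent estimates are wrong)."""
    k_min = (num_estimates + 1) // 2
    return sum(
        math.comb(num_estimates, k) * p ** k * (1 - p) ** (num_estimates - k)
        for k in range(k_min, num_estimates + 1)
    )


if __name__ == "__main__":
    p = 0.15   # illustrative per-estimate error probability
    for r in (5, 10, 25):
        print(r, majority_error_exact(r, p), majority_error_bound(r, p))
```

Both quantities shrink as the number of estimates grows, and the exact tail stays below the bound in each case.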
Diversity analyses of enzymatic synthesis
Enzymatic synthesis of a template sequence produces raw strands (strands^R) with variable extension length per nucleotide. From each strand^R, transitions can be extracted to form compressed strands (strands^C). Each strand may be of variable length. For subsequent analyses in this section, the distribution of strand lengths was modeled, and the number of diverse strand variants of each length was computed. Edit distances between synthesized strand variants and the original template sequence are also provided, along with a detailed error analysis.
Example VII. Distribution of Lengths for Compressed DNA Strands
Synthesis errors resulting in missing nucleotides (deletions) or insertions directly affect the length of a strand^C, unlike conventional errors such as substituted (mismatched) nucleotides. According to one aspect of the present disclosure, a mathematical model was constructed for nucleotide errors occurring in synthesized strands^C. The model is a Markov model (Fig. 23A) with a state space indicating different types of nucleotide errors such as missing nucleotides (deletions), substituted nucleotides, and insertions. It is assumed that each strand^C variant is synthesized independently and according to identical error statistics, as specified in the Markov model (e.g., $p_{sub}$, $p_{ins}$). The error process results in several unique synthesized strands^C. These strands^C can be aligned to reconstruct the original sequence. While reconstruction is possible through alignment and probabilistic consensus, the exact determination of error events in strands^C is often ambiguous. For example, a random insertion followed by a deletion of an intended nucleotide is indistinguishable from a substitution error (Fig. 23A).
Despite inherent ambiguities in the error process, it is possible to derive the length distribution of strands^C based on the Markov model (Fig. 23A). Consider a template DNA sequence $o_1 o_2 \ldots o_K$ comprised of $K$ nucleotides. A correct write in the $k$-th position results in one correct nucleotide added. However, missing nucleotides and insertions affect the total length of strands^C produced (Fig. 23A). Let $T$ denote a discrete random variable representing the number of nucleotides added in the $k$-th position. The read-length distribution of strands^C is derived by specifying the statistics of random variable $T$. The probability mass function of $T$ is denoted by $P_T(t)$. The probability generating function is a formal power series defined as follows,
$G_T(\omega) \triangleq \sum_{t \ge 0} \omega^{t} P_T(t)$.
The following proposition expresses the generating function in closed form, from which the moments of $T$ may be derived.
Proof of Proposition. Consider a random variable $W$ which represents the number of nucleotides written starting from either the WRITE state or the WRITE ERROR state (Fig. 23A). Then $W$ is a geometric random variable with probability mass function $P_W(t)$. The generating function $G_W(\omega)$ of the geometric random variable $W$ is given by,
Figure imgf000128_0001
Based on the Markov model, the probability mass function of $T$ is defined recursively,
$P_T(0) = p_{del}$ and $P_T(t) = p_{ins}\, P_T(t - 1) + (1 - p_{del} - p_{ins})\, P_W(t)$ for $t \ge 1$.
The generating function of $T$ is derived starting from its power series representation.
$G_T(\omega) \triangleq \sum_{t \ge 0} \omega^{t} P_T(t) = p_{del} + p_{ins}\, \omega\, G_T(\omega) + (1 - p_{del} - p_{ins})\, G_W(\omega)$.
Solving for $G_T(\omega)$ in the above equation establishes the proposition. Based on the generating function, the moments of $T$ may be computed. For example, the mean $E[T] = G'_T(\omega = 1)$ depends on the first derivative of the generating function, and the variance $VAR[T] = G''_T(\omega = 1) + G'_T(\omega = 1) - \left(G'_T(\omega = 1)\right)^2$ depends additionally on the second derivative of the generating function.
Exact length distribution (Theory)
The probability mass function for random variable $T$ describes the statistics for writing one nucleotide of a template sequence to create a strand^C. There exist $K$ nucleotides in the template sequence. The length of a synthesized strand^C is also a random variable, denoted here by $L$. The length $L$ has a probability mass function $P_L(l)$. Assuming each write is independent of previous and future writes, the length $L$ is a sum of $K$ independent copies of $T$, so the generating function for length $L$ is given by,
$G_L(\omega) = \left[G_T(\omega)\right]^{K}$.
Binomial distribution (Special case)
As a special case, assume that only deletions occur in the Markov model (Fig. 23A) so that $p_{ins} = p_{sub} = p_{err} = 0$. Then, the generating function, mean, and variance of the length $L$ are given as follows.
$G_L(\omega) = \left(p_{del} + (1 - p_{del})\,\omega\right)^{K}$, $E[L] = K(1 - p_{del})$, $VAR[L] = K p_{del}(1 - p_{del})$.
As expected, such a generating function is recognized as corresponding to the well-known binomial distribution. More precisely, the binomial distribution of the length is given by,
$P_L(l) = \binom{K}{l}\, p_{del}^{K - l}\, (1 - p_{del})^{l}$ for lengths in the range $0 \le l \le K$.
Experimental distribution of length
The length distribution was observed empirically in data for all synthesized strands^C produced by enzymatic synthesis for template sequences E1-E4, each containing K = 16 nucleotides (Fig. 28A). More than 100,000 raw strands^R were sequenced for each of E1-E4 and post-processed in silico to obtain run-length compressed strands^C. From data histograms (Figs. 29-33), it was verified that mostly deletions (missing nucleotides) occur in compressed strands^C. Strands^C were further aligned to their respective template sequences E1-E4 using the Needleman-Wunsch algorithm, verifying the presence of missing nucleotides (Fig. 27A).
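For readers unfamiliar with the alignment step, a minimal Needleman-Wunsch global alignment can be sketched in Python as follows. The scoring values (match +1, mismatch -1, gap -1) and the example strand are assumptions chosen for illustration; the source does not state which scoring scheme or implementation was used.

```python
def needleman_wunsch(template: str, strand: str,
                     match: int = 1, mismatch: int = -1, gap: int = -1):
    """Globally align two sequences; return (score, aligned_template, aligned_strand)."""
    n, m = len(template), len(strand)
    # score[i][j] = best score for aligning template[:i] with strand[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if template[i - 1] == strand[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if template[i - 1] == strand[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(template[i - 1])
                bottom.append(strand[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(template[i - 1])  # nucleotide missing from the strand
            bottom.append("-")
            i -= 1
        else:
            top.append("-")              # nucleotide inserted in the strand
            bottom.append(strand[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # E1 template from the text and a hypothetical strand with two deletions.
    s, a, b = needleman_wunsch("CATATCACATCTCACT", "CATTCACATCTCAT")
    print(s)
    print(a)
    print(b)
```

Gaps in the aligned strand mark template positions with missing nucleotides, which is the error type the analysis above focuses on.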
Assuming primarily deletions in all strands^C, the empirical length distribution may be compared to a fitted binomial distribution. To fit a modified binomial distribution, two probability mass functions were defined as follows.
$P_L(l) = \binom{K}{l}\, p_{del}^{K - l}\, (1 - p_{del})^{l}$ for read-lengths in the range $0 \le l \le K$,
$P_U(l)$ uniform for read-lengths in the range $0 \le l \le 4$.
The distribution $P_U(l)$ models the non-negligible uniform probability that very short strands^C are produced. A mixture probability distribution for strands^C lengths (Fig. 28A) of the following form was fitted: $0.2\,P_U(l) + 0.8\,P_L(l)$. For the set of all synthesized strands^C, the fitted binomial parameter was $p_{del} = 0.59$ for the E1 template, $p_{del} = 0.55$ for the E2 template, $p_{del} = 0.65$ for the E3 template, and $p_{del} = 0.55$ for the E4 template. Size selection processes, performed in silico or in vitro, to keep only longer synthesized strand^C variants decrease the effective number of missing nucleotides to be resolved. Strands^C were further purified in silico, which eliminated very short strands^C, resulting in a binomial distribution of the form $P_L(l)$ in which it was determined that the average probability of deletion was $p_{del} = 0.28$ (Figs. 27B, 28B). Thus, size-selection of strand^C variants led to a reduction in the effective probability of missing nucleotides.
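The fitted length model can be written out explicitly. The sketch below is illustrative Python, not the authors' fitting code: it assumes a binomial component with deletion probability p_del, a uniform component spread equally over lengths 0 to 4, and mixture weights of 0.2 and 0.8. All function and parameter names are hypothetical.

```python
from math import comb


def binomial_length_pmf(l: int, k: int, p_del: float) -> float:
    """P_L(l): probability that l of the k template nucleotides survive deletion."""
    if not 0 <= l <= k:
        return 0.0
    return comb(k, l) * p_del ** (k - l) * (1 - p_del) ** l


def uniform_short_pmf(l: int, max_short: int = 4) -> float:
    """P_U(l): uniform over the very short lengths 0..max_short (assumed support)."""
    return 1.0 / (max_short + 1) if 0 <= l <= max_short else 0.0


def mixture_length_pmf(l: int, k: int, p_del: float, w_short: float = 0.2) -> float:
    """Mixture w_short * P_U(l) + (1 - w_short) * P_L(l)."""
    return (w_short * uniform_short_pmf(l)
            + (1 - w_short) * binomial_length_pmf(l, k, p_del))


if __name__ == "__main__":
    k, p_del = 16, 0.59  # template length and the fitted value quoted for E1
    for l in range(k + 1):
        print(l, round(mixture_length_pmf(l, k, p_del), 4))
    # Mean of the binomial component alone, K * (1 - p_del)
    print(sum(l * binomial_length_pmf(l, k, p_del) for l in range(k + 1)))
```

Lowering p_del from 0.59 to 0.28, as reported after in silico purification, shifts the binomial mean from about 6.6 to about 11.5 nucleotides for K = 16.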
Diversity of synthesized strands
Enzymatic synthesis not only produces strands^C of different lengths, but also produces diverse strands^C. Each strand^C may contain errors such as missing nucleotides in different positions relative to its corresponding template sequence. In this section, the theoretical diversity was compared with the experimentally observed diversity produced by enzymatic synthesis.
Mathematical diversity
To simplify analysis, the case of only missing nucleotides occurring in strand^C variants was considered. A loose upper bound for the number of diverse variants of length $l$ derived from a template sequence of length $K$ nucleotides is given by,
$\mathrm{Diversity}[l,\, o_1 o_2 o_3 \ldots o_K] \le \binom{K}{l}$ for $0 \le l \le K$.
This upper bound is equivalent to the total number of strands of length $l$ obtained after $(K - l)$ deletion errors, and is independent of the template sequence itself.
An accurate count of diversity must include only distinct strand variants. Thus, diversity depends on the specific template sequence $o_1 o_2 o_3 \ldots o_K$ being synthesized. For each template sequence E1-E4 and also E0, the number of subsequences possible for each length $l$ was computed. For the E0 template, a total of 36909 subsequences exist of all lengths, of which 18233 do not contain self-transitions. For the E1 template, a total of 15863 subsequences exist of all lengths, of which 1799 do not contain self-transitions. For the E2 template, a total of 29910 subsequences exist of all lengths, of which 10795 do not contain self-transitions. For the E3 template, a total of 24960 subsequences exist of all lengths, of which 6469 do not contain self-transitions. Finally, for the E4 template, a total of 23679 subsequences exist of all lengths, of which 6487 do not contain self-transitions. It was noted that the E1 template (CATATCACATCTCACT) does not contain a 'G' nucleotide, which is why its number of subsequences with no self-transitions is much lower in comparison. These computations reveal the theoretical maximum for the number of possible strand^C variants for synthesizing each template sequence (E0-E4).
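The subsequence counts can be reproduced by brute force for templates of this size. The following Python sketch is illustrative rather than the authors' code: it enumerates every non-empty subsequence of a template (at most 2^16 - 1 position subsets for K = 16) and keeps the distinct ones. It assumes that a "self-transition" means two identical nucleotides in adjacent positions of a subsequence; that reading is an interpretation, not something the text defines.

```python
from collections import Counter
from itertools import combinations
from math import comb


def distinct_subsequences(template: str) -> set[str]:
    """All distinct non-empty subsequences (brute force; fine for K = 16)."""
    k = len(template)
    found = set()
    for length in range(1, k + 1):
        for positions in combinations(range(k), length):
            found.add("".join(template[i] for i in positions))
    return found


def has_self_transition(strand: str) -> bool:
    """True if any nucleotide is immediately followed by the same nucleotide."""
    return any(x == y for x, y in zip(strand, strand[1:]))


if __name__ == "__main__":
    e1 = "CATATCACATCTCACT"  # E1 template from the text
    subs = distinct_subsequences(e1)
    print(len(subs))  # total distinct subsequences of all lengths
    print(sum(not has_self_transition(s) for s in subs))
    per_length = Counter(len(s) for s in subs)
    for l in sorted(per_length):
        # distinct count never exceeds the loose upper bound C(K, l)
        print(l, per_length[l], comb(len(e1), l))
```

For E1 the total number of distinct non-empty subsequences comes out to 15863, matching the figure quoted above, and the per-length counts stay at or below the binomial-coefficient bound.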
Empirical diversity of DNA strands
The empirical diversity of all strand^C variants was observed based on reads from more than 100,000 raw strands^R synthesized from templates E0-E4 (Figs. 28A-28B). The number of unique strand^C variants of each length was compiled for all strands^C as well as purified strands^C. As a summary, the total diversity count of strands^C of all lengths for each template sequence E0-E4 is provided in the following table.
Total Empirical Diversity of Strands for E0-E4
(Strands^C with 16 nucleotides or less)
[Table not recoverable from the source scan: four rows of diversity counts per template E0-E4, the first two rows covering all strands^C and the last two rows covering purified (size-selected) strands^C.]
From the first two rows of the above table, it was observed that the total empirical diversity counts of all strands^C produced from E0-E4 were ~2-6 fold less than the theoretical limit calculated. However, as shown in the error analyses of edit distances between strands^C and template sequences (Fig. 28A), as well as more detailed error analyses (Fig. 27A), not all strands^C have adequate lengths. The last two rows of the above table indicate diversity counts for purified (size-selected) strands^C, which contain fewer missing nucleotides and errors (Figs. 27B and 28B). Even after purification, a sufficient number of diverse strands^C exist that may be harnessed for sequence reconstruction.
Example VIII. Factors Impacting Scalable DNA Storage Systems
Four main factors affecting scalability
The scalability of DNA information storage systems is dependent on four main factors. These factors are ranked in order of importance, although they are closely interconnected. For this analysis, it is recalled that the storage capacity is given by $2^{\mu}(\Omega - \mu)$ bits, where $\mu$ is the number of address bits per DNA sequence, and $\Omega$ is the total number of address and data bits stored per DNA sequence.
Cost of synthesis
While megabytes of data have been stored in DNA, using approximately 1-10 million synthesized DNA sequences, the current cost of synthesis based on phosphoramidite chemistry is ~$3500/megabyte (Y. Erlich, D. Zielinski, DNA Fountain enables a robust and efficient storage architecture. Science. 355, 950-954 (2017)). Accordingly, storing gigabytes of data in DNA would require an exorbitant amount of money, above $1,000,000. Enzymatic DNA synthesis is projected to decrease reagent costs by several orders of magnitude as reactions are miniaturized. Such a reduction in synthesis costs will facilitate affordable storage towards the goal of large-scale storage of data in DNA.
Number of template sequences that can be synthesized
Even with low-cost synthesis, a significant challenge for large-scale storage is the massive number of DNA sequences that must be synthesized. As an example, consider a storage system with DNA sequences containing 200 nucleotides. In such a system, the maximum efficiency rate of 2 bits per nucleotide is possible. In this ideal case, 400 bits ($\Omega = 400$) may be stored per DNA sequence. If $2^{\mu} = 2^{20} \approx 1$ million sequences are synthesized, then parameters $(\mu, \Omega) = (20, 400)$, and the storage capacity is 47.5 megabytes. If $2^{\mu} = 2^{30} \approx 1$ billion template sequences are synthesized, then parameters $(\mu, \Omega) = (30, 400)$, and the storage capacity is 46.25 gigabytes. Thus, a 1000-fold increase in the number of DNA sequences synthesized yields a nearly proportional increase in storage capacity. However, synthesizing a large number of DNA sequences in massively parallel synthesis reactions remains an engineering challenge for DNA storage.
Number of nucleotides per template sequence
Increasing the number of nucleotides per template sequence will increase storage capacity. Consider two storage systems, one with sequences of 200 nucleotides, and the other with sequences of 400 nucleotides. Further, assume a theoretically maximum efficiency rate of 2 bits stored per nucleotide. For the first system, setting parameters $(\mu, \Omega) = (20, 400)$, the maximum storage capacity is 47.5 megabytes. For the second system, setting parameters $(\mu, \Omega) = (20, 800)$, the maximum storage capacity is 97.5 megabytes. Thus, increasing the number of nucleotides per sequence from 200 to 400 results in a roughly 2-fold increase, a linear scaling, of storage capacity. As enzymatic synthesis accuracy improves, through process engineering or advances in biochemistry, the number of nucleotides per DNA sequence can be increased to achieve larger storage capacities.
Efficiency rate of storage
Increasing the efficiency rate of storage will increase the storage capacity. Consider two storage systems, both with sequences containing 200 nucleotides. One system achieves a theoretically maximum efficiency rate of 2 bits per nucleotide, while the other achieves an efficiency rate of 0.5 bits per nucleotide. For the first system, the storage capacity is 47.5 megabytes (setting $(\mu, \Omega) = (20, 400)$). For the second system, which synthesizes the same number of sequences, the storage capacity is 10 megabytes (setting parameters $(\mu, \Omega) = (20, 100)$). Thus, reducing the efficiency rate by 4-fold leads to a nearly linear reduction in storage capacity. Efficiency rates of storage can be increased with improvements to enzymatic synthesis accuracy, which will reduce the error-correction overhead.
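The capacity figures quoted in the preceding subsections all follow from the expression 2^μ (Ω - μ) bits. The short Python sketch below recomputes them; the function names are illustrative, and it assumes that "megabyte" and "gigabyte" in the text mean 2^20 and 2^30 bytes, which is the convention under which the quoted values come out exactly.

```python
def storage_capacity_bits(mu: int, omega: int) -> int:
    """Capacity 2**mu * (omega - mu): mu address bits, omega total bits per sequence."""
    return (2 ** mu) * (omega - mu)


def bits_to_mebibytes(bits: int) -> float:
    """Convert bits to units of 2**20 bytes."""
    return bits / 8 / 2 ** 20


if __name__ == "__main__":
    cases = {
        "1 million sequences, 200 nt, 2 bits/nt": (20, 400),
        "1 billion sequences, 200 nt, 2 bits/nt": (30, 400),
        "1 million sequences, 400 nt, 2 bits/nt": (20, 800),
        "1 million sequences, 200 nt, 0.5 bits/nt": (20, 100),
    }
    for label, (mu, omega) in cases.items():
        print(label, bits_to_mebibytes(storage_capacity_bits(mu, omega)))
```

The address bits are subtracted from every sequence, which is why doubling the sequence length slightly more than doubles capacity (47.5 to 97.5) and why a 4-fold drop in efficiency rate costs slightly more than a 4-fold drop in capacity (47.5 to 10).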
Based on these analyses, the two immediate challenges for large-scale storage in DNA are considered to be the cost of synthesis and the massive parallelization of affordable synthesis reactions. With continued advances to enzymatic synthesis, the number of nucleotides per DNA sequence and the efficiency rate of storage will improve.
Soup of DNA: DNA storage modeled as a permutation channel
Consider the storage capacity of a DNA storage system consisting of $M$ sequences, each of length $N$ nucleotides. The following proposition expresses an upper bound for the total number of bits that can be stored. The result is obtained by modeling DNA storage as a permutation channel.
Proposition: Define $C(M, N)$ as the storage capacity. Then
$C(M, N) \le \log_2 \binom{4^{N} + M - 1}{M}$.
Proof of Proposition. A related mathematical problem is first presented. Consider a set $\{x_i\}$ for $i \in \{1, 2, \ldots, J\}$ where $x_i \ge 0$ are integers, and consider the following equation:
$x_1 + x_2 + \cdots + x_J = A$.
The variables $J$ and $A$ are also integers. It is known that the equation has exactly $\frac{(A + J - 1)!}{A!\,(J - 1)!}$ integral solutions. To analyze information storage in DNA as a permutation channel, it is noted that DNA strands lose their relative order when synthesized and mixed in solution. Information is only preserved and conveyed in the form of the type of nucleotide sequence. For a sequence of length $N$ nucleotides, there exist $4^{N}$ unique sequence types possible. Since there exist $M$ sequences, the storage capacity upper bound is computed by setting $J = 4^{N}$ and $A = M$, indicating the number of different patterns possible in DNA using $M$ sequences for storing data. Taking the $\log_2(\cdot)$ function over the total number of distinguishable patterns yields total bits stored.
Special Cases. The following special cases further illustrate the upper bound.
For $N = 1$: $\log_2 \frac{(M + 3)!}{3!\, M!} = O(\log_2 M)$.
Figure imgf000136_0001
The first inequality states that the number of nucleotides per sequence must increase, and not be held constant, to store enough bits. The third inequality states that the number of nucleotides per sequence must increase at least as $\log_2 M$ in order to increase storage capacity. This bound indirectly implies that storing an address per sequence, which requires $O(\log_2 M)$ bits of storage per sequence, is a minimal requirement for reassembly.
Example IX. 2D Array-Format Synthesis
Aspects of the present disclosure are directed to translating the bead-based process into an array-based enzymatic DNA synthesis platform. In some embodiments, the prototype is comprised of two main parts: a Mantis liquid handler, which has a single robotic arm that can be programmed to dispense one of six reagents at a time, and custom jigs, which were either laser cut (Epilog Legend 36EXT) or machined (gift from Formulatrix) to hold the glass slide acting as a solid support substrate for the DNA (Fig. 40A).
Initiator immobilization and surface preparation
In one embodiment, a 5' amine-modified initiator oligo (5Aml2-fSBS3-ctgag) and a 3' amine-modified blocking oligo (10T-3Am) were covalently attached onto an aldehydesilane-coated microscope slide (Schott Nexterion Slide AL). The blocking oligo was included to prevent unwanted interactions, such as adsorption, between the initiator or enzymes and the surface. To do this, an oligo mixture was created containing 2 µM 5Aml2-fSBS3-ctgag and 8 µM 10T-3Am in 3X SSC (1X SSC is 150 mM NaCl and 15 mM sodium citrate) and 1.5 M Betaine. The oligo mixture was dispensed as 0.1 µL droplets onto the slide using a Mantis liquid handler (Figs. 40B-40C). Following the dispense, the slide was incubated at room temperature for 30 minutes in a parafilm-sealed Petri dish with Kimwipes saturated with 4X SSC. Then, the slide was transferred to a 100°C hotplate and dried for 30 minutes.
The synthesis procedure depends on precise and specific localization of nucleotide triphosphates and enzymatic mixes to initiator spots, which are denoted as features. Once these droplets are dispensed, however, they are prone to spread unevenly and uncontrollably across the glass surface and may contaminate neighboring features. To constrain the droplets, virtual "wells" were created for each feature by increasing the hydrophobicity in the areas between features. Dispensed droplets should then stay localized on each feature. To do this, 0.3 µL droplets containing 3X SSC and 1.5 M Betaine were first dispensed on top of the features using a Mantis liquid handler, and the slide was then dried for 30 minutes on a 100°C hotplate. This creates an enlarged hydrophilic area surrounding each feature. The slide is then dipped in Sigmacote (Sigma), which produces a neutral hydrophobic film over the areas of the glass which do not contain features, dried under a fume hood for 5 minutes, then dried for 5 minutes on a 100°C hotplate. Afterwards, the slide is washed twice with 0.2% SDS and three times with distilled water (Invitrogen UltraPure). The slide was then stringently washed by placing it in a boiling solution of 0.2X SSC for 15 minutes, then in room temperature distilled water (Invitrogen UltraPure). Lastly, to reduce Schiff bases and unreacted aldehydes, the slide was incubated for 10 minutes in a sodium borohydride reducing solution. The solution was prepared by dissolving 0.12 g of NaBH4 (Sigma) in 30 mL phosphate buffered saline (PBS, Invitrogen), then adding 10 mL of 100% ethanol. Afterwards, the slide was washed once with 0.2% SDS and three times with distilled water (Invitrogen UltraPure). The prepared slide is then kept in an ice-cold ethanol bath until use.
Synthesis, processing, and sequencing
Three replicates of the following three template sequences were synthesized, each with 13 nucleotides: S01: 'ACTGATCGTAGCA', S02: 'CTGATCACGTAGC', and S03: 'TAGCTGACGTCAT'. In total, synthesis was performed on nine features.
Each synthesis cycle was composed of the following six steps: (i) the slide is placed on a custom jig for the Mantis liquid handler; (ii) a 0.5 µL dispense of enzymatic reaction mix, comprised of 1x Custom Synthesis Buffer (14 mM Tris-Acetate, 35 mM Potassium Acetate pH 7.9, 7 mM Magnesium Acetate, 0.1% Triton X-100, 10% (w/v) PEG 8000) with 1 U/µL TdT (Enzymatics) and 0.25 mU/µL apyrase (NEB); (iii) a 0.1 µL dispense of a nucleotide triphosphate at the following 6X concentrations in 10% PEG 8000 + 0.05% Triton X-100: 60 µM dATP, 75 µM Br-dCTP, 18 µM dGTP, and () µM dTTP; (iv) a 30 second static incubation at room temperature; (v) four-step washing: once with 0.5X SSC + 0.01% Tween-20, once with 0.5X SSC, and two times in distilled water (Invitrogen UltraPure); (vi) the slide is placed back in the jig for the next cycle. For each synthesis cycle, the Mantis liquid handler performs four dispense cycles, described in (iii), one per nucleotide triphosphate. In each dispense cycle, a specific nucleotide triphosphate is deposited to all features for synthesis. Washes were performed manually by transferring the slide between each of the defined solutions.
Following the last synthesis cycle, all strands from all features were ligated to a universal adapter. A thin, silicone-gasketed chamber (Grace Bio-Labs SecureSeal Hybridization Chamber) was adhered to the slide and a ligation mixture containing a universal adapter was flooded into the chamber. The ligation mixture is composed of 2.5 µM 5App-rSBS9-dd adapter, 1X T4 DNA Ligase Buffer (NEB), 25% PEG 8000 (Sigma) and 1 unit of T4 RNA Ligase per µL (Enzymatics). Following ligation for 1 hour at room temperature, the chamber was removed and the slide washed once with 0.1% SDS and three times with distilled water (Invitrogen UltraPure).
The synthesized strands were then released from the slide surface by cleaving the uracils located on the 5' end of the initiators with USER enzymes. The cleavage reaction mixture was composed of 0.18 units of UDG (Enzymatics) per µL, 0.18 units of Endonuclease VIII (Enzymatics) per µL, and 0.5 µM ttSBS9 in USER TE-T buffer (40 mM Tris-HCl pH 8.0, 1 mM EDTA, 0.01% Tween-20). The cleavage mixture was dispensed as 2 µL droplets with the Mantis liquid handler onto each of the features. The slide was then incubated for 1 hour at 37°C in a sealed chamber with a Kimwipe saturated with 0.1X SSC. Droplets containing the cleaved DNA strands were transferred by multichannel pipette to a 96-well PCR plate where each well contained 5 mM Tris-HCl pH 8.0 + 0.01% Tween-20.
A sequencing library for each feature was generated next. Using cycle-limited real-time PCR, 5 µL of each feature was first amplified with the primers tSBS3 and ttSBS9, then with NEBNext Dual Indexing Primers for 15 cycles. Barcoded strands were then combined and sequenced single end using Illumina MiSeq v3 50.
Analyses and future improvements
Sequences from demultiplexed reads were first trimmed with cutadapt 1.9.1 (M. Martin, Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 17, 10-12 (2011)), with an error tolerance up to 10%, to remove the 5' initiator oligo sequence (5Aml2-fSBS3-ctgag) and the 3' universal oligo sequence (5App-rSBS9-dd adapter). Only reads containing both sequences for trimming were retained for further analysis.
Analyses of synthesis errors and diversity were performed as described above. Perfectly synthesized strands, which had raw lengths of ~50 bases, were found for each of the three tested template sequences across all three replicates (Fig. 41). All synthesized strands were then analyzed, as well as an in silico purified set with raw lengths between 39-52 bases, assuming an extension length of 3 to 4 bases per template nucleotide (13 template nucleotides × 3 to 4 bases). The dominant mode of synthesis error was missing strand^C nucleotides, which was reduced to observed experimental rates by size selection (Figs. 42A-42B). Furthermore, it was found that synthesized strands were diverse (Figs. 43A-43B), indicating that the codec could be used to encode and retrieve data.
Subsequent iterations of this array-based synthesis platform will include both hardware and "wetware" improvements. In terms of hardware, a printhead containing multiple dispensers will improve parallelization. In addition, washing must be automated to make each cycle rapid and robust. For "wetware", further surface chemistry optimizations will mitigate potential issues with initiator and protein adsorption to the glass surface. Furthermore, process engineering of reaction conditions, such as heating, may improve mixing and denaturation of the DNA. Together, these improvements will increase synthesis speed, throughput, and accuracy.
Example X. Cost Projections and Cycle Time Estimation
Synthesizing DNA with an inkjet microarray printer
In order to compare reagent costs of enzymatic synthesis to that of chemical synthesis, the reagent costs for each were estimated, assuming that both processes can be implemented on the same device. For the device, an inkjet microarray printer was considered, conceptually similar to that manufactured by Agilent Technologies (T. R. Hughes et al., Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol. 19, 342-347 (2001); E. M. LeProust et al., Synthesis of high-quality libraries of long (150mer) oligonucleotides by a novel depurination controlled process. Nucleic Acids Res. 38, 2522-2540 (2010); A. P. Blanchard, R. J. Kaiser, L. E. Hood, High-density oligonucleotide arrays. Biosensors and Bioelectronics. 11, 687-690 (1996)). In such a microarray printer, each DNA sequence to be synthesized occupies a physical spot, also denoted as a feature, on a planar surface. Multiple sequences are arranged as a 2D array to allow spatial addressing (x and y Cartesian coordinates). All DNA sequences are synthesized in parallel per cycle, that is, all features receive their first nucleotide during the first cycle, they then all receive their second nucleotide during the next cycle, and so on. Each cycle consists of a series of reactions. Reagents for each reaction may be dispensed directly to each feature by non-contact inkjet dispense or to all features by first sealing the array surface to form a flow cell and then flushing the reagent through. The reagent to be dispensed by inkjet is denoted as droplet whereas the reagent to be flushed is denoted as flowcell.
Reagent costs
The volumetric cost of each flowcell or droplet reagent was obtained. Table 6 lists the reaction steps of standard phosphoramidite chemistry and the enzymatic biochemistry presented in this study. Reactions are tagged by type (droplet or flowcell) and retail price per milliliter as of September 2017. Total droplet volumes (Vd) and flowcell volumes (Vf) of each reagent per cycle are also detailed.
It was noted that the enzymes may be used over multiple cycles since TdT:apyrase does not get inactivated after a synthesis reaction. TdT:apyrase has been used for at least ten consecutive cycles of extension (Fig. 7) with no observable deterioration in performance. As such, it is possible for enzyme costs per cycle (Table 6) to be readily reduced by 10-fold. Furthermore, a quote was obtained for bulk pricing of TdT, reducing its price to 71% of the listed price. Factoring in both price reductions, $d,e can be reduced ~14-fold, from 61.3 USD per milliliter to 4.38 USD per milliliter.
The total cost of reagents for a cycle of each synthesis process can be computed as follows:
1. Cycle_cost_enz = ($f,e × Vf) + (n × $d,e × Vd)
2. Cycle_cost_chem = ($f,c × Vf) + (n × $d,c × Vd)
where $f,e represents the cost of flowcell reagents in enzymatic synthesis, Vf is the flowcell volume in milliliters, n is the total number of features, $d,e is a constant representing the cost of droplet reagents in enzymatic synthesis, Vd is the droplet volume in cubic centimeters, $f,c represents the cost of flowcell reagents in chemical synthesis, and $d,c represents the cost of droplet reagents in chemical synthesis.
Flowcell volume (Vf) can be expressed in relation to flowcell height (c1 = QAem, assuming a constant flowcell height (29)) and flowcell area (A):
3. Vf = c1 × A
Furthermore, droplet volume (Vd), assuming the droplet forms a half sphere on the surface, can be expressed as a function of its feature size diameter (d) and the constant c2 = π ÷ 12:
4. Vd = c2 × d^3
Flowcell area (A) can be expressed as a function of number of features (n) and density (D) of features:
5. A = n ÷ D
Reagent cost per cycle
Based on equations 3, 4, and 5, equations 1 and 2 can be reformulated as:
6. Cycle_cost_enz = ($f,e × c1 × n ÷ D) + (n × $d,e × c2 × d^3)
7. Cycle_cost_chem = ($f,c × c1 × n ÷ D) + (n × $d,c × c2 × d^3)
The number of features and feature density from the Agilent SurePrint G3 system was then utilized as a physical basis for projecting reagent costs for synthesis. For this system, the maximum number of features is approximately 1 million (n = 1,000,000) with a density of ~71,000 spots per square centimeter (D = 71,000), obtained by estimating a surface area of 14 cm^2 (A = 14) out of a microscope slide with a total surface area of 18.75 cm^2 (G4447A, Agilent). Furthermore, the feature size can be approximated to be 15-34 microns (d = 0.0015 to 0.0034 cm), based on a dispense volume of 1-10 picoliters (FUJIFILM Dimatix collaborates with Agilent in developing inkjet technology for advanced life sciences applications | Press Center | Fujifilm USA) and equation 4.
With this fixed number of features and density, the reagent cost per cycle was projected for both enzymatic and phosphoramidite synthesis as a function of miniaturizing feature sizes (Fig. 44A). With smaller feature sizes, the reagent cost per cycle for both processes drops to approximately the price of the flowcell reagent, indicating that the droplet reagent cost for all 1 million features becomes negligible. For enzymatic synthesis, this floor occurs when feature sizes are below 1-5 microns, depending on the enzyme droplet price ($d,e) considered, whereas for phosphoramidite synthesis, the floor occurs for feature sizes below ~34 microns. For all feature sizes less than 1 micron, the enzymatic cost per cycle will be >1,000-fold cheaper than the phosphoramidite cost per cycle. Considering current feature sizes of ~15-34 microns, the reagent cost per cycle for enzymatic synthesis could already be cheaper than for phosphoramidite synthesis. For example, with 15 micron features, the phosphoramidite reagent cost per cycle is 0.626 USD whereas the enzymatic reagent cost per cycle is 0.055 USD (assuming $d,e = 61.3) or 0.0044 USD (assuming $d,e = 4.38), an ~11-fold and ~140-fold drop in cost, respectively.
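The per-cycle cost model of equations 3 to 7 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' calculation: the flowcell reagent prices and the flowcell height come from Table 6 and a cited reference that are not reproduced in this section, so they are left as function arguments, and the example isolates only the droplet term by setting the flowcell price to zero. All function names are hypothetical.

```python
from math import pi

C2 = pi / 12  # equation 4: a half-sphere droplet of diameter d has volume (pi / 12) * d**3


def droplet_volume_ml(d_cm: float) -> float:
    """Droplet volume in mL (1 cm^3 = 1 mL) for a feature diameter given in cm."""
    return C2 * d_cm ** 3


def flowcell_area_cm2(n_features: int, density_per_cm2: float) -> float:
    """Equation 5: A = n / D."""
    return n_features / density_per_cm2


def cycle_cost_usd(price_flowcell: float, price_droplet: float, c1_cm: float,
                   n_features: int, density_per_cm2: float, d_cm: float) -> float:
    """Equations 6 and 7: flowcell term plus droplet term, per synthesis cycle."""
    flowcell_term = price_flowcell * c1_cm * flowcell_area_cm2(n_features, density_per_cm2)
    droplet_term = n_features * price_droplet * droplet_volume_ml(d_cm)
    return flowcell_term + droplet_term


if __name__ == "__main__":
    n, density, d = 1_000_000, 71_000, 0.0015  # values quoted for the SurePrint G3 case
    print(flowcell_area_cm2(n, density))       # about 14 cm^2
    print(droplet_volume_ml(d) * 1e9)          # droplet volume in picoliters
    for price in (61.3, 4.38):                 # enzymatic droplet price, USD per mL
        # price_flowcell = 0.0 isolates the droplet term; c1_cm is then irrelevant
        print(price, cycle_cost_usd(0.0, price, 1.0, n, density, d))
```

With 15 micron features the droplet term alone is roughly 0.054 USD at 61.3 USD per milliliter and roughly 0.004 USD at 4.38 USD per milliliter, slightly below the quoted totals of 0.055 and 0.0044 USD, which is consistent with a small remaining flowcell contribution.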
Reagent cost per megabyte
The reagent cost for synthesizing a sufficient quantity of DNA to encode a megabyte of data was projected next. The previous analysis of reagent cost per cycle indicated that as feature size is miniaturized, the flowcell reagent becomes the dominant cost rather than the droplet reagent (Fig. 44A). Therefore, a cost-effective strategy would be to increase the number of synthesized features, n, for a given surface area (increasing the feature density, D, as a result) per cycle and to minimize the total number of cycles, thereby limiting flowcell reagent cost. For this approach, it is assumed that features are maximally packed, end-to-end, in a given surface area. The flowcell area (A) can be alternatively expressed as a function of the number of features (n) and feature size diameter (d):
8. A = n × d^2
With equation 8, the cost per cycle equations 6 and 7 can be reformulated as:
9. Cycle_cost_enz = A × ($f,e × c1 + $d,e × c2 × d)
10. Cycle_cost_chem = A × ($f,c × c1 + $d,c × c2 × d)
To store a megabyte of data in DNA, the number of cycles and number of features must be determined. Since the number of cycles should be minimized to limit flowcell reagent cost, template sequences should be as short as possible, which results in data being spread across a large number of features. As a result, most of the nucleotides for each template sequence should be allocated for addressing. Assuming an average efficiency rate of storage of 1 bit per template nucleotide, 2^20 (1,048,576) sequences must be synthesized and each sequence must contain 28 template nucleotides (20 for addressing and 8 for data) to store 1 megabyte of data. Synthesizing a template sequence of 28 nucleotides requires 28 cycles' worth of reagents. As the feature size (d) is decreased, however, the number of features (n) increases for a given area (A), and can be derived from equation 8 as:
11. n = A ÷ d^2
With equations 9, 10, and 11, the reagent cost per megabyte was computed assuming a maximum number of features packed into a 14 cm^2 area (A = 14), 28 cycles, and that each megabyte requires 2^20 features (n = 1,048,576) (Fig. 44B). Similar to the previous cost per cycle analyses, these calculations show that enzymatic reagent costs per megabyte are cheaper than phosphoramidite for current feature sizes of 15-34 microns. These projections show that the reagent cost per megabyte could, in theory, be reduced by a maximum of ~1 orders of magnitude for enzymatic compared to a maximum of ~8 orders of magnitude for phosphoramidite, if the feature size was the diameter of double-stranded DNA (d = 2.7e-7). The reagent cost per megabyte can be equivalent to that of magnetic tape (2E-5 USD per megabyte, derived from 2E-2 USD per gigabyte (T. Coughlin, The Costs Of Storage. Forbes Magazine (2016))) if feature sizes (d) are ~40 nm for phosphoramidite or ~350-800 nm ($d,e = 61.3 or 4.38) for enzymatic. This would correspond to a ~7 orders of magnitude cost drop from the calculated reagent price of ~18 USD per megabyte when synthesized on a system with feature number and density parameters similar to the Agilent SurePrint G3.
Practical considerations
These models project theoretical costs, which will change depending on practical implementation. The three most important factors to consider are as follows:
Efficiency rate of storage: For ease, the average efficiency rate of storage is set to be equivalent for enzymatic and phosphoramidite synthesis, storing an average of 1 bit per template nucleotide. The rate for each approach may differ depending on factors such as synthesis accuracy and the error-correction codes that must be added to each template sequence to ensure accurate information recovery. Altering the efficiency rate of storage for each process will change costs linearly, and the resulting difference between the enzymatic and phosphoramidite approaches would likely be within an order of magnitude. Improvements to enzymatic synthesis will increase the efficiency rate of storage to be competitive with that of phosphoramidite synthesis. Such improvements will also influence the number of diversely synthesized strands needed for template reconstruction and inform the minimum required feature size.
Feature density: For the reagent cost per megabyte projections, features are maximally packed with no spacing between them. Practically, features are likely to be separated by a gap, usually a fraction of the feature size, to accommodate potential positioning errors when droplets are dispensed. The number of features will then decrease in inverse proportion to the square of the feature pitch, that is, the feature size plus the gap (equation 11 to be modified accordingly). As this parameter is the same when calculating reagent costs for both phosphoramidite and enzymatic synthesis, altering the number of features may change absolute costs for each approach, but relative comparisons between approaches will remain unchanged.
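The effect of a gap on feature count can be sketched as a modified form of equation 11. The square-grid layout, the function name, and the expression of the gap as a fraction of the feature size are assumptions made for illustration; the text does not specify the exact modification.

```python
def feature_count(area, d, gap_fraction=0.0):
    """Features on a square grid whose pitch is d plus a gap.

    gap_fraction is the gap expressed as a fraction of the feature
    size d (hypothetical parameter); zero reproduces n = A / d^2.
    """
    pitch = d * (1.0 + gap_fraction)
    return area / pitch ** 2


if __name__ == "__main__":
    packed = feature_count(14.0, 0.0015)
    spaced = feature_count(14.0, 0.0015, gap_fraction=0.5)
    # Fraction of features retained when the gap is half the feature size.
    print(spaced / packed)
```

Because the same retention factor applies to both chemistries, it scales the absolute cost of each without changing their ratio, which is the point made in the paragraph above.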
Feature size: Reaching the projected costs depends on overcoming significant engineering challenges associated with miniaturizing feature sizes. Current inkjet printheads dispense 1-10 picoliter droplets, resulting in feature sizes of 15-34 microns (equation 4; FUJIFILM Dimatix collaborates with Agilent in developing inkjet technology for advanced life sciences applications | Press Center | Fujifilm USA). To reach the projected cost per megabyte equivalent to magnetic tape, phosphoramidite features must be ~40nm, which requires dispensing a 0.016 attoliter droplet, whereas enzymatic features must be ~350-800nm, which requires dispensing a droplet of 11-134 attoliters. While it is now possible to dispense a droplet of hundreds of attoliters (Y. Zhang, B. Zhu, Y. Liu, G. Wittstock, Hydrodynamic dispensing and electrical manipulation of attolitre droplets. Nat. Commun. 7, 12424 (2016)), no sub-attoliter dispensers are known to be available. To achieve this cost, significant technology development and engineering will be required. Development of alternative systems that consume equivalent reagent quantities, perhaps requiring modifications to the enzymatic process, is warranted.
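The droplet volumes quoted above can be reproduced with a simple geometric sketch. Equation 4 is not shown in this section, so the model below is an assumption: it treats each droplet as a hemisphere whose diameter equals the feature size, which happens to match the quoted figures closely. The function name is hypothetical.

```python
import math


def droplet_volume_attoliters(feature_diameter_nm):
    """Volume of a hemispherical droplet with the given base diameter.

    Assumed model: V = (2/3) * pi * r^3, with r half the feature size.
    One attoliter equals 1e6 cubic nanometers.
    """
    radius_nm = feature_diameter_nm / 2.0
    volume_nm3 = (2.0 / 3.0) * math.pi * radius_nm ** 3
    return volume_nm3 / 1e6


if __name__ == "__main__":
    for d_nm in (40, 350, 800):
        print(d_nm, "nm ->", droplet_volume_attoliters(d_nm), "aL")
```

Under the same assumption, a 15-34 micron feature corresponds to roughly 1-10 picoliters, in line with the printhead droplet range stated in the text.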
Equipment costs
An equivalent equipment cost is assumed for enzymatic and phosphoramidite DNA synthesis. Currently, the reactions occur under ambient conditions without a need for stringent control of temperature or oxygenation. Thus, it was reasoned that the reagents could be used directly in a machine designed for phosphoramidite chemistry, such as the Agilent SurePrint G3.
Equipment amortization is another important, but often neglected, cost consideration. Capital equipment costs are likely to increase significantly as DNA synthesis is scaled to achieve target costs. To miniaturize feature sizes, specialty dispensers will be required. As the number of features increases, the time required for a dispenser to find the correct feature to receive a droplet reagent (the dispenser seek time) becomes an important consideration. Even if multiple dispensers are combined into a large printhead to reduce seek times, positioning systems, likely with nanometer-scale resolution, will be required, and these may be expensive or prone to breakdown. While these are all important factors, there is insufficient information to estimate the relevant parameters. Accordingly, it was assumed for ease that all seek times are instantaneous, so that equipment amortization for enzymatic and phosphoramidite synthesis would depend primarily on their respective cycle times. A conservative estimate of enzymatic cycle time is ~4-fold shorter than that of phosphoramidite chemistry (Table 6), which could result in a shortened amortization schedule, further reducing total synthesis costs.
Table 1. Conversion of "hello world!" to template sequences.
Figure imgf000148_0001
Figure imgf000149_0001
Table 2. Statistics from simulated real-time data reconstruction by nanopore sequencing.
Figure imgf000149_0002
Figure imgf000150_0001
Table 4. Oligonucleotides used in this study.
Figure imgf000151_0001
0T-3Am    TTTTTTTTTT/3AmMO/
App-rSBS9-dd    /5rApp/AGATCGGAAGAGCACACGTCTGAACTCCAGTCA/3ddC/
Table 5. Extension lengths for perfectly synthesized strands of "hello world!".
Figure imgf000152_0001
Table 6. Commercial reagent prices and estimated reaction times for the enzymatic synthesis vs. phosphoramidite chemistry.
Vf and Vd represent flowcell and droplet volume, respectively. Two values are highlighted at the bottom of the Phosphoramidite Price section.
Figure imgf000153_0001
Figure imgf000153_0002
Table 7. Parameters of DNA Storage Systems
Figure imgf000154_0001
Table 8. Design Specifications of DNA Storage Systems
Figure imgf000154_0002
Table 9. Modulation and demodulation: Interconversion of bits to nucleotides for the "Eureka" experiment
[Table 9: bit, trit, and nucleotide columns; contents not recoverable]
OTHER EMBODIMENTS
Other embodiments will be evident to those of skill in the art. It should be understood that the foregoing description is provided for clarity only and is merely exemplary. The spirit and scope of the present invention are not limited to the above examples, but are encompassed by the following claims. All publications and patent applications cited above are incorporated by reference in their entirety for all purposes to the same extent as if each individual publication or patent application were specifically and individually indicated to be so incorporated by reference.

Claims

1. A method of decoding a nucleotide sequence, the nucleotide sequence encoding a value corresponding to a format of information comprising:
determining the nucleotide sequence,
identifying a transition or boundary or edge between different or nonidentical nucleotides of the nucleotide sequence, and
assigning a predetermined value to the identified transition or boundary or edge to create the value encoded in the nucleotide sequence corresponding to the format of information.
2. The method of claim 1 wherein the nucleotide sequence encodes a series of values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of values encoded in the nucleotide sequence corresponding to the format of information.
3. The method of claim 1 wherein the value corresponding to the format of information can be obtained from analog, digital, optical, visible or non-visible wavelengths, chemical, or physical input sources.
4. The method of claim 1 wherein the value is a digital value.
5. The method of claim 2 wherein the series of values are digital values.
6. The method of claim 5 wherein the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more.
7. The method of claim 5 wherein the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits) or more.
8. The method of claim 1 wherein the format of information is selected from the group consisting of text, image, video or audio format, sensor data, and combinations thereof.
9. The method of claim 1 wherein the different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides.
10. The method of claim 1 wherein the different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine.
11. The method of claim 1 wherein the nucleotide sequence includes at least one nucleotide homopolymer.
12. The method of claim 1 wherein the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence.
13. The method of claim 1 wherein the nucleotide sequence includes a nucleotide homopolymer for each different or nonidentical nucleotide in the nucleotide sequence and wherein the transition between one nucleotide homopolymer to a different or nonidentical nucleotide homopolymer is a single transition or boundary or edge.
14. The method of claim 5 wherein each digital value in the series of digital values represents two different digital values.
15. The method of claim 5 wherein each digital value in the series of digital values represents three different digital values.
16. The method of claim 5 wherein each digital value in the series of digital values represents more than three different digital values.
17. The method of claim 2 wherein each nucleotide transition or boundary or edge is assigned a predetermined digital value.
18. The method of claim 1 wherein the step of determining the nucleotide sequence is carried out by sequencing methods comprising nanopore sequencing, sequencing-by-synthesis, sequencing-by-ligation, and sequencing-by-hybridization.
19. The method of claim 1 wherein the step of determining the nucleotide sequence is carried out by nucleotides modified with reversible terminators.
20. The method of claim 1 wherein the step of determining the nucleotide sequence is carried out by detection of pyrophosphate or hydrogen ions generated during DNA polymerization of a complementary nucleotide strand.
21. The method of claim 1 wherein the step of determining the nucleotide sequence is carried out by ligation of fluorescently modified single-stranded nucleotides with complementarity to the nucleotide sequence to be sequenced.
22. The method of claim 2 wherein the series of digital values includes a corresponding barcode.
23. The method of claim 1 further comprising decoding a plurality of nucleotide sequences, each member of the plurality encoding for an identical value corresponding to the format of information,
wherein the nucleotide sequence is determined for each member of the plurality, and identifying a transition or boundary or edge between different nucleotides of each member of the plurality and assigning a predetermined value to each identified transition or boundary or edge to create the identical value corresponding to the format of information.
24. The method of claim 23 wherein each member of the plurality of the nucleotide sequence encodes a series of identical values corresponding to the format of information and wherein a plurality of transitions or boundaries or edges between different or nonidentical nucleotides of each member of the plurality of the nucleotide sequence are identified and each identified transition or boundary or edge is assigned a predetermined value to create the series of identical values encoded in each member of the plurality of the nucleotide sequence corresponding to the format of information.
25. The method of claim 1 wherein the nucleotide sequence is attached to a substrate.
26. The method of claim 23 wherein each member of the plurality of nucleotide sequence is attached to a substrate.
27. The method of claim 24 wherein each member of the plurality of nucleotide sequence is attached to a substrate.
28. The method of claim 2 wherein the series of digital values is a bit or trit stream and the nucleotide sequence corresponds to a bit or trit sequence within the bit or trit stream.
29. The method of claim 2 wherein the series of digital values is a bit or trit stream and the bit or trit stream comprises a plurality of bit or trit sequences each having a corresponding barcode to indicate position within the bit or trit stream and with the plurality of bit or trit sequences having a corresponding plurality of nucleotide sequences,
wherein each member of the plurality of nucleotide sequences is sequenced, and identifying a plurality of transitions or boundaries or edges between different nucleotides of each member of the plurality and assigning a predetermined bit or trit value to each transition or boundary or edge of the plurality of transitions or boundaries or edges to create the bit or trit sequences corresponding to each member of the plurality.
30. A method of decoding a nucleotide sequence encoding for a series of digital values corresponding to a format of information comprising
determining the nucleotide sequence to identify nucleotide homopolymers and for each homopolymer assigning one or more of the nucleotides based on a predetermined predicted homopolymer length of the nucleotide produced using enzymatic or chemical synthesis, and
assigning a particular digital value for each of the one or more nucleotides.
31. The method of claim 30 wherein the predicted homopolymer length is determined from statistical inference drawn from empirical observation.
32. The method of claim 31 wherein the predicted homopolymer length is a median, a mean, a mode, a probability or a value within a confidence interval.
33. The method of claim 30 wherein the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof.
34. The method of claim 30 wherein the nucleotides comprise natural or nonnatural nucleotides.
35. The method of claim 30 wherein the nucleotides comprise adenine, cytosine, guanine, and thymine.
36. A method of sequencing and decoding a plurality of nucleotide sequences representing a format of information wherein each nucleotide sequence encodes a portion of the format of information and wherein each portion of the format of information has more than two corresponding nucleotide sequences comprising:
determining the sequences and decoded series of digital values for the sequences within a first portion of the plurality of nucleotide sequences,
translating the series of digital values into the portions of the format of information, and
sequencing and decoding in series additional portions into series of digital values and translating the series of digital values into the portions of the format of information until the entire format of information is achieved.
37. A method of encoding a series of digital values corresponding to a format of information into a nucleotide sequence comprising:
for each digital value, assigning a corresponding nucleotide to different or nonidentical nucleotide transition to generate the nucleotide sequence,
synthesizing the nucleotide sequence, and
optionally storing the nucleotide sequence.
38. The method of claim 37 wherein the digital values are two digits, three digits, four digits, five digits, six digits, seven digits, eight digits, nine digits, ten digits, eleven digits, twelve digits, thirteen digits, fourteen digits, fifteen digits, sixteen digits, or more.
39. The method of claim 37 wherein the digital values are bits (binary digits), trits (ternary digits), octet (eight digits), hexadecimal (sixteen digits), or more.
40. The method of claim 37 wherein the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof.
41. The method of claim 37 wherein the nucleotides or different or nonidentical nucleotides comprise natural nucleotides or nonnatural nucleotides.
42. The method of claim 37 wherein the nucleotides or different or nonidentical nucleotides comprise adenine, cytosine, guanine, and thymine.
43. A method for high-throughput decoding of a format of information encoded in a plurality of nucleotide sequences comprising:
providing a plurality of nucleotide sequences, the plurality of nucleotide sequences represents a packet of information, the packet comprises at least one unique identifier;
sequencing at least one of the plurality of nucleotide sequences using a selective sequencer;
storing the sequence and its unique identifier; and
preventing, using the selective sequencer, redundant sequencing of the same nucleotide sequence.
44. The method of claim 43 wherein the format of information is selected from the group consisting of text, image, video or an audio format, sensor data, and combination thereof.
45. The method of claim 43 wherein the step of preventing comprises using the unique identifier to prevent sequencing of additional nucleotide sequence with the same identifier.
46. The method of claim 43 wherein the selective sequencer comprises a nanopore sequencer, and a sequencer compatible with sequencing-by-synthesis, sequencing-by-ligation and sequencing-by-hybridization methods.
47. The method of claim 43 wherein the sequence is stored in computer memory.
48. The method of claim 43 wherein the sequence is decoded into digital values.
49. The method of claim 43 wherein the unique identifier is a synthetic sequence.
50. The method of claim 43 wherein the unique identifier is located at the 3' end, the 5' end of the nucleotide sequence, or is interspersed within the nucleotide sequence.
51. The method of claim 43 wherein the plurality of nucleotide sequences comprises a plurality of unique identifiers.
52. The method of claim 43, further comprising:
sequencing a predetermined number of nucleotide sequences;
assembling the packet of information; and
analyzing the information to determine if the information is correctly decoded.
53. The method of claim 52, further comprising:
permitting sequencing of any nucleotide sequences that were not correctly decoded.
54. The method of claim 52 wherein the step of analyzing is performed using a decoding algorithm.
55. The method of claim 37 wherein the nucleotide sequence is synthesized by a template-independent DNA polymerase.
56. The method of claim 55 wherein the template-independent DNA polymerase is terminal deoxynucleotidyl transferase (TdT).
57. The method of claim 37 wherein the nucleotide sequence is synthesized by a mixture of a template-independent DNA polymerase and an apyrase.
58. The method of claim 43 wherein the information is stored using a codec model.
59. The method of claim 58 wherein the codec model is capable of correcting errors accumulated from synthesis, storage and sequencing.
60. The method of claim 36 wherein the sequencing is streaming nanopore sequencing.
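The transition-based encoding and decoding recited in claims 1, 2, 13 and 37 can be illustrated with a minimal sketch. The trit-to-nucleotide mapping, the fixed starting nucleotide, and all function names below are hypothetical choices made for illustration; they are not the mapping of Table 9 and are not asserted to be the claimed method. The sketch shows only the general idea that information is carried by transitions between nonidentical nucleotides, so that homopolymer run lengths do not affect the decoded values.

```python
from itertools import groupby

NUCLEOTIDES = "ACGT"


def encode_trits(trits, start="A"):
    """Encode each trit as a transition to one of the three other nucleotides.

    The start nucleotide and the alphabetical ordering of the three
    candidates are assumptions of this sketch.
    """
    sequence = [start]
    for trit in trits:
        if trit not in (0, 1, 2):
            raise ValueError("trits must be 0, 1 or 2")
        candidates = [n for n in NUCLEOTIDES if n != sequence[-1]]
        sequence.append(candidates[trit])
    return "".join(sequence)


def collapse_homopolymers(read):
    """Reduce every run of identical nucleotides to a single nucleotide."""
    return "".join(base for base, _ in groupby(read))


def decode_trits(read):
    """Recover trits from the transitions between successive homopolymers."""
    collapsed = collapse_homopolymers(read)
    trits = []
    for previous, current in zip(collapsed, collapsed[1:]):
        candidates = [n for n in NUCLEOTIDES if n != previous]
        trits.append(candidates.index(current))
    return trits


if __name__ == "__main__":
    template = encode_trits([0, 1, 2, 0])
    print(template)
    # A read with variable homopolymer lengths decodes to the same trits.
    print(decode_trits("AAACCGGGGTTA"))
```

Because decoding looks only at which nucleotide follows which, a strand synthesized with uneven extension lengths yields the same series of values as a perfectly synthesized one, provided no transition is lost.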
PCT/US2018/056900 2017-10-20 2018-10-22 Methods of encoding and high-throughput decoding of information stored in dna WO2019079802A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762575017P 2017-10-20 2017-10-20
US62/575,017 2017-10-20

Publications (1)

Publication Number Publication Date
WO2019079802A1 true WO2019079802A1 (en) 2019-04-25

Family

ID=66173900

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/056900 WO2019079802A1 (en) 2017-10-20 2018-10-22 Methods of encoding and high-throughput decoding of information stored in dna

Country Status (1)

Country Link
WO (1) WO2019079802A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120142006A1 (en) * 2000-10-06 2012-06-07 The Trustees Of Columbia University In The City Of New York Massive parallel method for decoding dna and rna
US20170017436A1 (en) * 2015-07-13 2017-01-19 President And Fellows Of Harvard College Methods for Retrievable Information Storage Using Nucleic Acids

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11268091B2 (en) 2018-12-13 2022-03-08 Dna Script Sas Direct oligonucleotide synthesis on cells and biomolecules
US11773422B2 (en) 2019-08-16 2023-10-03 Microsoft Technology Licensing, Llc Regulation of polymerase using cofactor oxidation states
US11795450B2 (en) 2019-09-06 2023-10-24 Microsoft Technology Licensing, Llc Array-based enzymatic oligonucleotide synthesis
WO2021064095A1 (en) 2019-10-01 2021-04-08 Centre National De La Recherche Scientifique Biocompatible nucleic acids for digital data storage
CN111091876A (en) * 2019-12-16 2020-05-01 中国科学院深圳先进技术研究院 DNA storage method, system and electronic equipment
WO2021242446A1 (en) * 2020-05-28 2021-12-02 Microsoft Technology Licensing, Llc De novo polynucleotide synthesis with substrate-bound polymerase
US11702683B2 (en) 2020-05-28 2023-07-18 Microsoft Technology Licensing, Llc De novo polynucleotide synthesis with substrate-bound polymerase
CN112288089A (en) * 2020-09-28 2021-01-29 清华大学 Array type nucleic acid information storage method and device
CN112288089B (en) * 2020-09-28 2022-12-20 清华大学 Array type nucleic acid information storage method and device
CN113314187A (en) * 2021-05-27 2021-08-27 广州大学 Data storage method, decoding method, system, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18868230

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18868230

Country of ref document: EP

Kind code of ref document: A1