WO2018148257A1 - Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) - Google Patents

Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna)

Info

Publication number
WO2018148257A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna
digital data
error correction
information
code
Prior art date
Application number
PCT/US2018/017188
Other languages
French (fr)
Inventor
Naveen Goela
Jean C. Bolot
Original Assignee
Thomson Licensing
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Thomson Licensing
Publication of WO2018148257A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/29 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes combining two or more codes or code structures, e.g. product codes, generalised product codes, concatenated codes, inner and outer codes
    • H03M13/2906 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes combining two or more codes or code structures, e.g. product codes, generalised product codes, concatenated codes, inner and outer codes using block codes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/123 DNA computing
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G11C13/0009 RRAM elements whose operation depends upon chemical change
    • G11C13/0014 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material
    • G11C13/0019 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material comprising bio-molecules
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102 Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/13 Linear codes
    • H03M13/15 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
    • H03M13/151 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
    • H03M13/1515 Reed-Solomon codes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/13 Linear codes
    • H03M13/15 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
    • H03M13/151 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
    • H03M13/152 Bose-Chaudhuri-Hocquenghem [BCH] codes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/37 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35
    • H03M13/3761 Decoding methods or techniques, not specific to the particular type of coding provided for in groups H03M13/03 - H03M13/35 using code combining, i.e. using combining of codeword portions which may have been transmitted separately, e.g. Digital Fountain codes, Raptor codes or Luby Transform [LT] codes

Definitions

  • the present disclosure involves digital information storage in deoxyribonucleic acid
  • DNA contains the genetic program for the biological development of life.
  • DNA may be used as a compact storage medium for petabytes of information by encoding digital information into sequences of DNA nucleotides.
  • the potential benefits of DNA storage include: (1) extremely high-density storage beyond the order of terabytes within 1 gram of DNA; (2) decades of stability and durability at moderate temperatures; (3) biological replication due to polymerase chain reaction (PCR)-amplification; (4) rapid biological search and indexing mechanisms via primers; (5) biological editing and re-encoding of segments via enzymes.
  • PCR: polymerase chain reaction
  • a principal bottleneck for storing data in DNA molecules is the high cost of synthesis.
  • Enzymatic synthesis utilizes low-cost chemicals for growing short DNA oligonucleotides of lengths under 1000 base pairs. However, the synthesis may produce a large fraction of insertion, deletion, or substitution errors.
  • a challenge of using DNA for storage involves the management of errors. Several types of errors may occur including: (1) Insertion, deletion, substitution errors within oligonucleotide segments; (2) Missing DNA segments; (3) Synchronization errors across multiple segments with the same address; (4) Low coverage and amplification yields for certain DNA segments; (5) Structural error patterns introduced by synthesis arrays and sequencing machines.
  • One problem involved in designing a coded DNA storage system is to ensure the efficiency and compatibility of the different components which correct diverse errors.
  • An aspect of the present disclosure involves error-correction being introduced for DNA storage based on a DNA process that may be error prone, e.g., a noisy or lossy process that may result from using at least one of an error-prone synthesis process such as low-cost enzymatic molecular synthesis and an error-prone sequencing process such as nanopore sequencing.
  • a modular pipelined system provides for encoding and decoding digital information reliably in DNA molecules in the presence of a large fraction of differentiated errors.
  • a source of data is efficiently encoded, modulated, and stored in a set of oligonucleotide segments. Using DNA sequencing, the information stored on multiple segments is assembled and decoded reliably.
  • a classification of errors is provided for the biological processes of low-cost DNA synthesis, PCR-amplification, and DNA sequencing. Error-correction is designed to recover data corrupted by insertions, deletions, and substitutions of nucleotides. Low-cost synthesis also places constraints on the modulation of bits to nucleotides.
  • a flexible system is presented comprised of error-correction codes (e.g., Reed-Solomon, LDPC, or polar codes), synchronization codes, and constrained modulation codes. Tradeoffs are outlined for the amount of overhead envisioned for reliable storage and compared to information-theoretic bounds.
  • the encoding step comprises modulation, synchronization, addressing, and error-correction within each oligonucleotide and per block of multiple oligonucleotides.
  • the decoding step comprises reciprocal measures to assemble source data from multiple segments. Error-correction is designed to recover data corrupted by several types of errors including missing segments. The amount of overhead redundancy envisioned for reliable storage is specified and compared to information-theoretic bounds.
  • an enzymatic synthesizer intended to produce one oligonucleotide may actually produce multiple output strands having varying amounts of errors such as insertions, deletions, and substitutions.
  • Another aspect relates to an error-correction system suitable for effectively managing errors created by a lossy or error-prone DNA synthesis system, e.g., a low-cost enzymatic synthesis.
  • Traditional error-correction systems for DNA storage are based on high-fidelity molecular synthesis (e.g., using Agilent micro-array platforms). Such synthesis machines are too costly for commercial purposes, and cannot enable ubiquitous storage.
  • a lossy or error-prone DNA process in accordance with the present principles may introduce error rates one or two orders of magnitude greater than a high-fidelity synthesis machine.
  • a DNA data storage system in accordance with the present principles enables data storage in DNA using lossy or error-prone DNA synthesis such as low-cost enzymatic synthesis producing DNA oligonucleotides using low-cost chemicals.
  • the low-cost property allows eventually scaling up to terabytes of synthesized data.
  • Another aspect of the current disclosure addresses an issue that arises when low-cost enzymatic synthesis writes successive nucleotides (e.g., ATCAGTGAGCTAGC): not all transitions have a high probability of success.
  • an aspect of the present disclosure involves introducing a constraint graph for nucleotide patterns.
  • a modulation scheme implements the constraint graph.
  • the goals of modulation are to maximize the information capacity (the number of bits stored per nucleotide) and to minimize the probability of insertions and deletions caused by less reliable nucleotide transitions. Modulation in accordance with the present principles avoids less reliable transitions entirely. Alternatively, the usage of the less reliable transitions may be minimized rather than eliminated.
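The trade-off between capacity and forbidden transitions can be made concrete. For a constraint graph whose adjacency matrix is strongly connected and aperiodic, the capacity in bits per nucleotide is the base-2 logarithm of the largest eigenvalue of that matrix. The sketch below estimates it by power iteration. The example graph `HYPOTHETICAL_NO_REPEAT` (every transition allowed except writing the same nucleotide twice in a row) is an assumption made for illustration and is not taken from the disclosure; the function name is likewise hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"

def constraint_capacity(allowed, iterations=200):
    """Estimate log2 of the largest eigenvalue of a 0/1 transition matrix.

    allowed[i][j] == 1 means nucleotide j may directly follow nucleotide i.
    Plain power iteration; assumes the graph is strongly connected and
    aperiodic so that the iteration converges.
    """
    n = len(allowed)
    v = [1.0] * n
    growth = 0.0
    for _ in range(iterations):
        w = [sum(allowed[i][j] * v[j] for j in range(n)) for i in range(n)]
        growth = max(w)
        if growth == 0:
            return float("-inf")  # no sequence can be continued
        v = [x / growth for x in w]  # renormalise so the largest entry is 1
    return math.log2(growth)

# Assumption for illustration only: forbid repeating the same nucleotide.
HYPOTHETICAL_NO_REPEAT = [
    [0 if i == j else 1 for j in range(len(NUCLEOTIDES))]
    for i in range(len(NUCLEOTIDES))
]
UNCONSTRAINED = [[1] * 4 for _ in range(4)]

if __name__ == "__main__":
    print(constraint_capacity(UNCONSTRAINED))           # 2 bits per nucleotide
    print(constraint_capacity(HYPOTHETICAL_NO_REPEAT))  # log2(3) bits per nucleotide
```

Under this assumed constraint the ceiling drops from 2 bits to log2(3), roughly 1.58 bits per nucleotide, which is the kind of overhead a modulation scheme has to be weighed against.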
  • Another aspect of the present principles addresses an issue with error-prone DNA synthesis, e.g., low-cost enzymatic synthesis, in which the multiple oligonucleotide segments produced do not all have the same length.
  • An aspect of the present disclosure involves using synchronization codes to address the distribution of oligonucleotide lengths; without synchronization, the storage capacity may be significantly reduced.
  • One type of synchronization code may comprise a marker code.
  • a marker code may comprise known marker symbols which are inserted in the bit stream prior to modulation.
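As a minimal sketch of the marker idea described above, the following inserts a known marker after every `period` payload bits and checks the markers on the way back. The function names, the `period` parameter, and the example marker `[0, 1]` are assumptions for illustration. The checker only detects a misplaced marker; it does not resynchronize or compute the posterior probabilities that a full synchronization decoder would provide.

```python
def insert_markers(bits, period, marker):
    """Return a new bit list with `marker` appended after every `period` payload bits."""
    out = []
    for start in range(0, len(bits), period):
        out.extend(bits[start:start + period])
        out.extend(marker)
    return out

def strip_markers(stream, period, marker):
    """Remove markers from an error-free stream.

    Raises ValueError if a marker is not found where expected, which is the
    symptom of an insertion or deletion somewhere earlier in the stream.
    """
    if not marker:
        raise ValueError("marker must not be empty")
    m = len(marker)
    step = period + m
    payload = []
    for pos in range(0, len(stream), step):
        frame = stream[pos:pos + step]
        body, tail = frame[:-m], frame[-m:]
        if tail != list(marker):
            raise ValueError(f"marker mismatch in frame starting at {pos}")
        payload.extend(body)
    return payload

if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0]
    framed = insert_markers(data, period=4, marker=[0, 1])
    print(framed)
    print(strip_markers(framed, period=4, marker=[0, 1]) == data)
```

A single deleted bit shifts every later marker by one position, so the first frame after the deletion usually fails the check; a short marker can still match by chance, which is one reason practical schemes work with probabilities rather than hard checks.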
  • synchronization errors, e.g., insertion and deletion errors
  • once oligonucleotide segments are synchronized, the positions which have been deleted may be corrected.
  • error-correction within each oligonucleotide addresses this problem.
  • BCH codes are applied for error-correction after the oligonucleotide has been synchronized to the correct length. That is, in accordance with the present principles, first synchronize and then correct.
  • synchronization provides (posterior) probabilities for the sequence of bits stored in each oligonucleotide.
  • other codes such as LDPC (low-density parity-check codes), polar codes, short Turbo codes, or convolutional codes may be applied as well.
  • Another aspect of the present principles addresses an issue with DNA storage involving assembling the original data from multiple, short, sequenced oligonucleotides.
  • an address for each oligonucleotide is provided for assembly.
  • address bits are concatenated with data bits prior to applying error-correction for each oligonucleotide.
  • decoding after an oligonucleotide has been synchronized and error-corrected, its address is retrieved and thus its position in assembling all of the original data is known.
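The addressing step can be sketched as follows, assuming a fixed-width binary address placed in front of the data bits. The address width, the bit order, and the function names are illustrative assumptions; the per-oligonucleotide error-correction that would be applied to the combined word is omitted here.

```python
def add_address(index, data_bits, address_width):
    """Prefix `data_bits` with `index` written as `address_width` bits, MSB first."""
    if not 0 <= index < 2 ** address_width:
        raise ValueError("index does not fit in the address field")
    address = [(index >> shift) & 1 for shift in range(address_width - 1, -1, -1)]
    return address + list(data_bits)

def split_address(word, address_width):
    """Return (index, data_bits) from a word produced by add_address."""
    index = 0
    for bit in word[:address_width]:
        index = (index << 1) | bit
    return index, word[address_width:]

def assemble(words, address_width):
    """Reorder words by address and concatenate their data bits."""
    pieces = dict(split_address(word, address_width) for word in words)
    return [bit for index in sorted(pieces) for bit in pieces[index]]

if __name__ == "__main__":
    chunks = [[1, 1], [0, 1], [1, 0]]
    words = [add_address(i, chunk, address_width=3) for i, chunk in enumerate(chunks)]
    shuffled = list(reversed(words))  # sequencing returns strands in no particular order
    print(assemble(shuffled, address_width=3))
```

Because the address travels inside the protected word, a decoder in this arrangement can only trust the address after synchronization and error-correction have succeeded, which matches the ordering stated above.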
  • Another aspect of the present disclosure addresses an issue with DNA storage in which some oligonucleotides may be corrupted beyond repair and/or lost in the DNA solution (e.g., if PCR-amplification does not yield enough copies for sequencing).
  • An aspect of the present disclosure involves addressing this issue by introducing a code (e.g., Reed-Solomon code) for a block of multiple oligonucleotides. After synchronization of each oligonucleotide, error-correction of each oligonucleotide, and retrieving the address of each oligonucleotide, multiple oligonucleotides in a block are reinforced by extra redundancy in case any of them is lost or corrupted beyond repair.
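To illustrate redundancy across a block of oligonucleotides, the sketch below uses a single XOR parity word. This is a deliberately simple stand-in chosen for brevity, not the Reed-Solomon code named above: it can rebuild at most one missing word per block, whereas a Reed-Solomon code can be dimensioned for several. All names are hypothetical.

```python
def xor_words(words):
    """Bitwise XOR of equal-length bit lists."""
    result = [0] * len(words[0])
    for word in words:
        result = [a ^ b for a, b in zip(result, word)]
    return result

def add_parity_word(block):
    """Return the block followed by one parity word (XOR of all data words)."""
    return list(block) + [xor_words(block)]

def recover_block(received):
    """Rebuild the data words from a coded block in which at most one entry is None.

    `received` lists the words in address order, parity word last; a word that
    was lost or judged uncorrectable is represented by None.
    """
    missing = [i for i, word in enumerate(received) if word is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one missing word")
    words = list(received)
    if missing:
        present = [word for word in received if word is not None]
        words[missing[0]] = xor_words(present)
    return words[:-1]  # drop the parity word

if __name__ == "__main__":
    block = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
    coded = add_parity_word(block)
    coded[1] = None  # this oligonucleotide never came back from sequencing
    print(recover_block(coded) == block)
```

The example relies on the address of each word being known, since the decoder must know which position in the block is empty.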
  • a code, e.g., a Reed-Solomon code
  • Another aspect of the present disclosure involves correcting for multiple types of errors that may occur simultaneously with DNA storage based on low-cost enzymatic synthesis.
  • this problem is addressed using a modular error-correction pipeline.
  • the order of the blocks in the pipeline provides for correcting diverse types of errors. It is possible to jointly design the blocks in the pipeline to improve efficiency and a modular approach enables improving each block separately.
  • Encoding in accordance with the present principles involves the following in order: (1) Store data bits in multiple oligonucleotides with redundancy per block of oligonucleotides; (2) Add an address per oligonucleotide; (3) Add error-correction per oligonucleotide; (4) Add synchronization per oligonucleotide; (5) Transform bits to nucleotides using modulation scheme.
  • Decoding in accordance with the present principles involves the following in order: (1) Demodulate each oligonucleotide from nucleotides to bits; (2) De-synchronize each oligonucleotide; (3) Error-correction by decoding each oligonucleotide; (4) Retrieve address for each oligonucleotide; (5) If any oligonucleotide in a block is corrupted beyond repair or missing, utilize the redundancy per block of oligonucleotides to reconstruct the original source data.
  • an embodiment of a method of encoding digital data comprises adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy or error-prone DNA synthesis process including at least one of error-prone synthesis and error-prone sequencing, and decoding the digital data from the DNA oligonucleotide.
  • an embodiment of a method of encoding digital data including adding synchronization information to the digital data may include the synchronization information comprising a synchronization marker.
  • an embodiment of a method of encoding digital data may further include adding error correction information to the digital data after adding address information and before converting the digital data to codes, whereby converting includes converting the digital data including the address information and including the error correction information to the plurality of codes.
  • the error correction information may include a block error correction code and a word error correction code.
  • a block error correction code may include one of a Reed-Solomon code and a Fountain code.
  • a word error correction code may include one of a BCH code and a LDPC code.
  • an embodiment of a method of encoding digital data may comprise adding address information to the digital data; adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy or error-prone DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide.
  • DNA: deoxyribonucleic acid
  • synchronization information may comprise a synchronization marker.
  • error correction information may include a block error correction code and a word error correction code.
  • a block error correction code may include one of a Reed-Solomon code and a Fountain code.
  • a word error correction code may include one of a BCH code and a LDPC code.
  • converting digital data to a plurality of codes may comprise a mapping of the digital data to nucleotides according to a modulation map.
  • converting digital data to a plurality of codes using a modulation map may comprise the modulation map being in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
  • converting digital data to a plurality of codes using a modulation map may comprise the modulation map writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
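The modulation map stated above (A and T alternate for 0, C and G alternate for 1) can be sketched directly. The function names are hypothetical, and counting occurrences from the start of each strand is an assumption; the text does not say where the count resets.

```python
def modulate(bits):
    """Map bits to nucleotides: A/T alternate for 0, C/G alternate for 1.

    The 1st, 3rd, ... occurrence of 0 is written as A and the 2nd, 4th, ... as T;
    likewise C then G for 1. Occurrences are counted from the start of the strand.
    """
    symbols = {0: "AT", 1: "CG"}
    seen = {0: 0, 1: 0}
    strand = []
    for bit in bits:
        strand.append(symbols[bit][seen[bit] % 2])
        seen[bit] += 1
    return "".join(strand)

# Demodulation ignores the alternation: A and T both read as 0, C and G as 1.
DEMODULATION_MAP = {"A": 0, "T": 0, "C": 1, "G": 1}

def demodulate(strand):
    """Map nucleotides back to bits."""
    return [DEMODULATION_MAP[nucleotide] for nucleotide in strand]

if __name__ == "__main__":
    bits = [0, 0, 1, 0, 1, 1]
    strand = modulate(bits)
    print(strand)                      # ATCAGC
    print(demodulate(strand) == bits)  # True
```

One consequence of this map is that the same nucleotide is never written twice in a row: equal consecutive bits alternate between the two letters of their pair, and unequal bits use different pairs. The rate is one bit per nucleotide.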
  • an embodiment of a method of decoding digital data stored in DNA may comprise synchronizing a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy or error-prone DNA synthesis process based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words.
  • synchronization information may comprise a synchronization marker.
  • each of the plurality of binary words may include error correction information, and extracting the digital data may include error correction based on the error correction information.
  • error correction information may include a block error correction code and a word error correction code.
  • a block error correction code may include one of a Reed-Solomon code and a Fountain code.
  • a word error correction code may include one of a BCH code and a LDPC code.
  • a modulation map may be in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
  • a modulation map may comprise writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
  • an embodiment of a lossy DNA synthesis process for digital data storage may comprise an enzymatic molecular synthesis process.
  • an embodiment of an encoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
  • an embodiment of a decoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
  • an embodiment may comprise a non-transitory computer-readable medium storing computer-executable instructions executable to perform one or more aspects of embodiments of methods as described herein.
  • an embodiment of a method of decoding digital data stored in DNA molecules may comprise accessing a plurality of DNA molecules storing encoded digital data, wherein the encoded digital data includes a code component comprising an address, synchronization information comprising a marker symbol, and error correction information, and the DNA molecules were synthesized using a lossy or error-prone DNA synthesis process; sequencing the plurality of DNA molecules; merging and assembling the plurality of DNA molecules to form a plurality of DNA oligonucleotides; and decoding the digital data from the plurality of DNA oligonucleotides wherein the decoding includes synchronizing the plurality of DNA segments using the marker symbol in each of the plurality of segments, processing the synchronized plurality of segments to extract digital information from the synchronized plurality
  • an embodiment of an encoder comprises encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy or error-prone DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; and an archive to store the plurality of DNA segments.
  • an embodiment comprises a sequencer sequencing a plurality of DNA segments retrieved from an archive, wherein the plurality of DNA segments were synthesized using a lossy or error-prone DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder synchronizes the plurality of oligonucleotides using the synchronization information included in the plurality of oligonucleotides; demodulates the synchronized plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; and extracts the digital data from the binary words.
  • a DNA data storage system in accordance with the present principles includes an encoder producing digital words, at least one of a lossy or error-prone DNA synthesis process synthesizing DNA oligonucleotides corresponding to the digital words and a lossy or error-prone DNA sequencing process sequencing the DNA oligonucleotides, a decoder decoding the digital words from the sequenced oligonucleotides and end-to-end error correction correcting errors introduced by the error-prone DNA processing at combined error rates on the order of 25% per nucleotide.
  • an embodiment of an encoder for encoding digital data comprises at least one processor configured for adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and decoding the digital data from the DNA oligonucleotide
  • an embodiment of a decoder for decoding digital data stored in DNA comprises at least one processor configured for synchronizing a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the synchronization is based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words.
  • an embodiment of a system for storing digital information in DNA comprises: an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; an archive to store the plurality of DNA segments; a sequencer sequencing a plurality of DNA segments retrieved from the archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder: synchronizes the pluralit
  • FIG. 1 illustrates, in block diagram form, a DNA storage system in accordance with the present principles
  • Figure 2 illustrates, in state diagram form, a system for DNA synthesis
  • Figure 3 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles
  • Figure 4 illustrates, in state diagram form, an exemplary embodiment of constraints for synthesizing DNA
  • Figure 5 illustrates, in block diagram form, an exemplary embodiment of molecular-level DNA storage
  • Figure 6 illustrates, in block diagram form, an exemplary embodiment of an error-correction pipeline for DNA storage of digital data
  • Figure 7 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles.
  • Figure 8 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles.
  • An end-to-end system for DNA storage includes several components.
  • Source information, e.g., a movie file
  • Source information is encoded, modulated, synthesized, and stored in multiple DNA oligonucleotide segments.
  • the DNA segments are sequenced, demodulated, assembled, and decoded reliably.
  • Figure 1 depicts an exemplary embodiment of a complete digital data storage and recovery system.
  • the system involves a digital data management and processing portion 101 including encoding, decoding and error management of digital data and a DNA storage and recovery portion 102 including DNA synthesis, polymerase chain reaction (PCR) amplification, and DNA sequencing.
  • the system is comprised of the following components as shown in Fig.
  • source data 110 in bits
  • encoding mechanism 120 including all error-correction codes and modulation from bits to nucleotides
  • DNA synthesis 130 of multiple DNA segments; PCR-amplification 140 of DNA pools
  • DNA archival storage 150; DNA sequencing 160; merging and assembling multiple DNA segments 170; demodulation and decoding 180 of all codes for reliable recovery.
  • One challenge of designing a coded DNA storage system is to ensure the efficiency and compatibility of the different components which correct diverse errors.
  • In DNA storage, several types of errors occur including: (1) Insertion, deletion, substitution errors within oligonucleotide segments; (2) Missing DNA segments; (3) Synchronization errors across multiple segments with the same address; (4) Low coverage and amplification yields for certain DNA segments; (5) Structural error patterns introduced by synthesis arrays and sequencing machines.
  • By modeling the errors of DNA processing technologies, it is possible to define "DNA storage channels", many of which have only approximately known information-theoretic capacities, in contrast to standardized, precisely-mapped wireless communication channels. For example, the capacity of the deletion channel is known only to within upper and lower bounds for independent and identically distributed deletions.
  • errors such as deletions must be addressed to provide a reliable data storage system, especially when utilizing low-cost, next-generation DNA processes that may be lossy or noisy.
  • a DNA process that is lossy or noisy as referenced herein is intended to encompass various DNA synthesis and/or sequencing processes that may be error prone, e.g., produce errors such as deletions, insertions and substitutions as described herein. Such errors may occur as a result of using low cost DNA processing as part of a DNA data storage system.
  • a low cost system may include at least one of a lossy or noisy DNA synthesis process and a lossy or noisy DNA sequencing process.
  • a lossy or noisy DNA synthesis process may include a low-cost enzymatic synthesis process.
  • Such processes use low cost chemicals and, therefore, may advantageously reduce the cost of a DNA data storage system.
  • such low cost synthesis may introduce diverse errors at a relatively high rate.
  • nucleotide deletions may occur at error rates on the order of 25% per nucleotide.
  • high-fidelity DNA synthesis machines utilizing chemical synthesis methods may provide a combined error rate one or two orders of magnitude less, e.g., under 1%, but at a much higher cost.
  • Errors may also be introduced during sequencing of DNA oligonucleotides when reading or recovering data, e.g., by using a sequencing process such as nanopore sequencing that may tend to be lossy or noisy.
  • oligonucleotide segments are equipped with an address code which is a unique identifier.
  • Digital payload information is stored across multiple segments, protected by modern codes organized in both the "horizontal" dimension (per oligonucleotide), and the "vertical" dimension (across multiple oligonucleotides). While accuracy in retrieval is an important consideration, so also is efficiency of such codes to reduce overhead costs in DNA synthesis and sequencing.
  • a fundamental constraint of storing information in DNA molecules is the limited length of oligonucleotide segments.
  • Current DNA synthesis machines synthesize short segments nucleotide by nucleotide. Each oligonucleotide may be of variable length, depending on the number of synthesis errors incurred.
  • a DNA synthesizer produces M oligonucleotide segments of variable lengths.
  • The fixed-length input $x_m$ and variable-length output $w_m$ of a synthesis machine are denoted as follows for the $m$-th oligonucleotide, where $m \in [1, \ldots, M]$,
  • next-generation low-cost DNA synthesis exhibits insertion, deletion, substitution, and burst nucleotide errors.
  • deletion errors are prevalent.
  • the notion of memory in a deletion channel differs from memoryless channel models. A deletion occurring at any nucleotide within an oligonucleotide affects the position of all subsequent nucleotides.
  • Figure 2 illustrates a model for DNA synthesis.
  • The DNA synthesis machine aims to write the $k$-th nucleotide of an oligonucleotide, where $k \in [1, \ldots, K]$.
  • the output oligonucleotide is of variable length.
  • Figure 2 illustrates a Markov chain mathematical model which includes various types of errors. Insertion, deletion, substitution, and burst errors have associated probabilities denoted by $p_{ins}$, $p_{del}$, $p_{sub}$, and $p_{bur}$ respectively.
  • The following theorem derives the mean and variance of the length of the $m$-th synthesized oligonucleotide.
  • Theorem 1. (Variable-Length Oligonucleotide Segments):
  • Let $T$ be the random number of nucleotides written starting from state $s_k$ and ending at state $s_{k+1}$ for $k \in [1, \ldots, K]$.
  • The probability generating function of the random variable $T$ is expressed in terms of $p_{del}$, $p_{ins}$, and $p_{bur}$.
  • the proof is available in Appendix I, and may be adapted to several variations of the Markov chain model.
  • $\mathrm{VAR}[L_m] = K\,p_{del}\,(1 - p_{del})$.
  • Example 2 The distribution of the lengths of synthesized DNA segments is simulated and illustrated in Figure 3. For non-negligible insertion and deletion probabilities, a large fraction of the segments have unequal, variable lengths.
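The kind of length simulation described in Example 2 can be sketched as follows. This is a minimal illustration, not the simulation used for Figure 3: the per-position behaviour below (skip on deletion, write a random nucleotide and retry on insertion, repeat the written nucleotide while a burst continues) is an assumed reading of the Markov chain model, and the names `synthesize` and `length_statistics` are hypothetical.

```python
import random

NUCLEOTIDES = "ATCG"


def synthesize(template, p_ins=0.0, p_del=0.0, p_sub=0.0, p_bur=0.0, rng=random):
    """Simulate one noisy synthesis pass over a fixed-length template.

    Assumed per-position behaviour (a hypothetical reading of the model):
      * with probability p_del the intended nucleotide is skipped;
      * with probability p_ins a random nucleotide is written and the
        machine tries the same position again;
      * otherwise the intended nucleotide is written (replaced by a
        different one with probability p_sub) and then repeated while a
        burst continues, each repeat having probability p_bur.
    """
    out = []
    for target in template:
        while True:
            u = rng.random()
            if u < p_del:                      # deletion: nothing written
                break
            if u < p_del + p_ins:              # insertion: retry this position
                out.append(rng.choice(NUCLEOTIDES))
                continue
            written = target
            if rng.random() < p_sub:           # substitution error
                written = rng.choice([n for n in NUCLEOTIDES if n != target])
            out.append(written)
            while rng.random() < p_bur:        # burst: repeat the nucleotide
                out.append(written)
            break
    return "".join(out)


def length_statistics(K, M, seed=0, **probs):
    """Return (mean, variance) of the output lengths of M templates of length K."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(M):
        template = "".join(rng.choice(NUCLEOTIDES) for _ in range(K))
        lengths.append(len(synthesize(template, rng=rng, **probs)))
    mean = sum(lengths) / M
    var = sum((x - mean) ** 2 for x in lengths) / M
    return mean, var
```

With only deletions enabled, each output length is binomially distributed, so `length_statistics(K=100, M=2000, p_del=0.25)` should return a mean near 75 and a variance near 18.75.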
  • Conventional DNA storage systems rely on high-fidelity micro-array synthesizers which produce segments of correct lengths. For example, in known experiments such as the Harvard-Technicolor collaboration to store 22 mega-bytes of information in DNA, high-cost synthesis yielded a distribution spiked at the intended template length.
  • Low-cost enzymatic synthesis may have a non-negligible deletion probability $p_{del}$ which, in accordance with an aspect of the present disclosure, may be addressed by providing synchronization codes to correct for variable lengths.
  • The distribution of the length of synthesized oligonucleotides varies according to the insertion and deletion probabilities.
  • the limitations of biological processes for molecular-level storage constrain the set of valid nucleotide sequences.
  • Let $E$ denote a matrix of efficiencies (probabilities) with entries $E_{ij}$ for $i, j \in \{A, T, C, G\}$.
  • Element $E_{ij}$ is the probability of successfully writing a nucleotide $j$ given that a nucleotide $i$ has been written.
  • the matrix E of efficiencies may be asymmetric.
  • a constraint graph may be constructed based on nucleotide efficiencies; i.e., only transitions which have a higher probability of success for biological synthesis are included as valid transitions.
  • a constraint graph is a directed graph.
  • An irreducible constraint graph contains a path between any ordered pair of nodes.
  • Example 3 (Constraint Graph): Figure 4 depicts a constraint graph which outlines valid and invalid transitions between nucleotides.
  • the adjacency matrix of the constraint graph is
  • edge transitions (A, A), (T, T), (C, C), (G, G) are not included in the graph.
  • edge transitions (A, G), (T, G), (G, T), (C, T) are invalid transitions.
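The adjacency matrix implied by the two lists of excluded transitions can be rebuilt with a short sketch. It assumes the node order A, T, C, G for rows and columns and assumes that every transition not listed as excluded is valid. The rate computation uses the standard result that a constrained system carries at most $\log_2 \lambda_{max}$ bits per symbol, where $\lambda_{max}$ is the largest eigenvalue of the adjacency matrix. The helper names are hypothetical.

```python
import math

NODES = "ATCG"  # assumed row/column order
INVALID = {("A", "G"), ("T", "G"), ("G", "T"), ("C", "T")}


def adjacency_matrix():
    """Entry [i][j] is 1 when the transition NODES[i] -> NODES[j] is allowed."""
    return [
        [0 if a == b or (a, b) in INVALID else 1 for b in NODES]
        for a in NODES
    ]


def spectral_radius(matrix, iterations=200):
    """Largest eigenvalue of a non-negative primitive matrix by power iteration."""
    n = len(matrix)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam


A = adjacency_matrix()
bits_per_nucleotide = math.log2(spectral_radius(A))
```

Under these assumptions every row of the matrix sums to 2, so the largest eigenvalue is 2 and the constraint allows at most 1 bit per nucleotide, half of the unconstrained 2 bits.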
  • Beyond constraints on transitions between nucleotides, constraints on longer sequences of nucleotides are often necessary to ensure proper PCR-amplification and DNA sequencing. For example, the exclusion of reverse complementary subsequences may be involved, and/or the avoidance of specific nucleotide patterns reserved for primers. To encourage the correct binding of primers, a balanced distribution of nucleotides might also be enforced.
  • DNA storage may involve constraints on input sequences of nucleotides.
  • a modulation code maps sequences of bits to valid sequences of nucleotides.
  • a demodulation function maps sequences of nucleotides to sequences of bits. The demodulation function may correct nucleotide errors partially if they occur, and/or defer error-correction to other decoding units. However, in either case, the demodulation function should not drastically affect the synchronization of the sequences.
  • Modulation and Demodulation Functions: Denote a modulation function $\mathcal{M}$ and a demodulation function $\mathcal{D}$ as follows, where $k_{mod}, k'_{mod}, n_{mod}, n'_{mod} \in \mathbb{Z}^+$: $\mathcal{M}: \{0,1\}^{k_{mod}} \to \{A,T,C,G\}^{n_{mod}}$ and $\mathcal{D}: \{A,T,C,G\}^{n'_{mod}} \to \{0,1\}^{k'_{mod}}$.
  • The modulation function $\mathcal{M}$ maps a sequence of $k_{mod}$ bits to a sequence of $n_{mod}$ nucleotides. If nucleotide insertions and deletions occur, the demodulation function $\mathcal{D}$ may accept a sequence of nucleotides of length $n'_{mod}$ where $n'_{mod} \neq n_{mod}$. The demodulation function produces a sequence of bits of length $k'_{mod}$ where it may be the case that $k'_{mod} \neq k_{mod}$ if error correction is deferred.
  • The map is defined as follows.
  • Within each run of consecutive identical bits, the modulation map $\mathcal{M}$ writes an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
  • For example, $\mathcal{M}(00111010) = \text{ATCGCACA}$.
  • Nucleotide transitions (A, G), (G, T), (C, T), (T, G) are not permitted in the modulation, as indicated by the constraint graph of Figure 4.
  • The demodulation function $\mathcal{D}$ maps both A and T to a 0, and both C and G to a 1.
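The modulation example can be reproduced with a short sketch. It assumes that "odd" and "even" occurrences are counted within each run of consecutive identical bits, which is the reading that yields ATCGCACA for the input 00111010. The function names are hypothetical.

```python
def modulate(bits):
    """Map a bit string to nucleotides, alternating A/T for 0s and C/G for 1s.

    The position counter restarts at every change of bit value (assumption).
    """
    out = []
    run_length = 0
    previous = None
    for b in bits:
        run_length = run_length + 1 if b == previous else 1
        odd = run_length % 2 == 1
        if b == "0":
            out.append("A" if odd else "T")
        elif b == "1":
            out.append("C" if odd else "G")
        else:
            raise ValueError(f"not a bit: {b!r}")
        previous = b
    return "".join(out)


def demodulate(nucleotides):
    """Map A and T to 0, and C and G to 1."""
    table = {"A": "0", "T": "0", "C": "1", "G": "1"}
    return "".join(table[n] for n in nucleotides)
```

Because a run always starts with A or C, this mapping never emits a repeated nucleotide or any of the transitions (A, G), (G, T), (C, T), (T, G). A single inserted or deleted nucleotide shifts the demodulated stream by exactly one bit, so demodulation does not amplify the loss of synchronization.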
  • Variable-length segments $\{w_m\}$ for $m \in [1, 2, \ldots, M]$ are stored compactly in an ultra-dense DNA solution.
  • storing DNA in a solution implies the loss of ordering between segments.
  • The set of oligonucleotide segments $\{w_m\}$ is reordered arbitrarily via a noiseless permutation channel, yielding $\{w_{\pi(m)}\}$ for $m \in [1, 2, \ldots, M]$.
  • Figure 5 illustrates a macro-level view of DNA storage.
  • The input to the system is $M$ oligonucleotide segments $\{x_m\}$, each of fixed length $K$. Synthesis produces variable-length segments $\{w_m\}$, and storage in solution yields $\{w_{\pi(m)}\}$. Prior to sequencing, the stored segments may be PCR-amplified to aid in the read process. The final step is DNA sequencing of all detectable strands.
  • each short oligonucleotide is re-assembled from reads.
  • the modeling of nucleotide errors such as insertions, deletions, substitutions was already included mathematically for DNA synthesis, and is therefore not covered again at this stage.
  • a new type of error occurs during sequencing.
  • With a given probability, each oligonucleotide is absent from the read output due to low coverage in its DNA solution. The low coverage may be the result of a lack of PCR-amplification for a particular oligonucleotide.
  • the mathematical model for sequencing each oligonucleotide is,
  • Variable-length segments $\{y_{\pi(m)}\}$ for $m \in [1, 2, \ldots, M]$.
  • the merging of segment copies may be done easily if each segment contains a protected address as a unique identifier.
  • the merging is combined with a consensus approach based on edit distances.
  • the edit distance between two strings is the minimum number of symbol insertions, deletions, or substitutions needed to transform one of the strings to the other string.
  • the edit distance for variable-length segments is computed efficiently via dynamic programming which is essentially optimal in terms of its quadratic complexity in the length of the segments.
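The dynamic program for the edit distance can be sketched as follows. `edit_distance` is the standard Levenshtein recursion kept to two rows of the table. `consensus_read` is only one possible consensus rule (it picks the read with the smallest total distance to the others) and is an illustrative assumption, not necessarily the rule used in the disclosure.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]                         # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,              # delete ca
                current[j - 1] + 1,           # insert cb
                previous[j - 1] + (ca != cb), # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def consensus_read(reads):
    """Pick the read with the smallest total edit distance to all reads (assumed rule)."""
    return min(reads, key=lambda r: sum(edit_distance(r, other) for other in reads))
```

For reads that share an address, `consensus_read(["ATCGCACA", "ATCGCACA", "ATGCACA", "ATCGCACAA"])` returns `"ATCGCACA"`, since the two corrupted copies each lie one edit away from it.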
  • the DNA molecular channel contains several sources of error.
  • the inputs are M oligonucleotide segments each of fixed size K nucleotides.
  • the capacity for the storage channel is at most 2MK bits, if each nucleotide represents 2 bits.
  • This upper bound assumes no constraints for modulation, and no sources of error.
  • the upper bound of 2MK bits may be tightened.
  • Let $C_{ST}(M, K)$ be the storage capacity of the DNA molecular-level storage system described herein.
  • Let $A$ be the adjacency matrix of an irreducible constraint graph specifying valid nucleotide sequences.
  • the DNA storage system satisfies the following upper bounds,
  • the upper bound relaxations provided in Theorem 2 are not achievable in most cases.
  • the capacity of the deletion channel is not known exactly. Therefore, the upper bound in Eqn. (6) is not attainable.
  • The capacity $C_{DL}$ is bounded as
  • Typically for deletion channels, theoretical bounds are achieved by random codes with block length approaching infinity. Even at finite lengths, the exponential complexity of decoding is impractical. Moreover, much less is known for quaternary or z-ary deletion channels. Aside from deletions, there are multiple errors occurring in DNA storage systems. It is theoretically possible to analyze multiple errors occurring simultaneously to tighten the bounds jointly.
  • the upper bound due to the permutation channel in Eqn. (5) is important to show that both the number of oligonucleotide segments M and the length of each oligonucleotide K must grow in order to increase storage capacity adequately.
  • the upper bound is computed as follows for a few pairs (M, K) . (M + 3)!
  • The increase in storage capacity only grows at most as $O(\log_2 M)$.
  • Oligonucleotide segments must not be too short, despite biological constraints due to low-cost DNA synthesis.
  • The length $K$ must grow at least as $\log_2 M$, which is the number of bits needed to uniquely address each segment. As expected, the capacity is upper bounded by a rate of growth linear in $K$.
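The effect of losing the segment order can be illustrated numerically. The sketch below does not reproduce the bound of the theorem, whose formula is not available here. Instead it uses a standard counting argument as an assumption: if only the unordered collection of $M$ segments of length $K$ survives, at most $\binom{4^K + M - 1}{M}$ distinct collections exist, so at most $\log_2$ of that number of bits can be stored. The function names are hypothetical.

```python
import math


def unconstrained_bound_bits(M, K):
    """2 bits per nucleotide, with segment order preserved."""
    return 2 * M * K


def unordered_bound_bits(M, K):
    """log2 of the number of multisets of M segments drawn from 4**K sequences.

    Counting-argument assumption, not the theorem's stated bound.
    """
    return math.log2(math.comb(4 ** K + M - 1, M))


# For a fixed short length, adding segments barely helps.
for M in (10, 100, 1000):
    print(M, unconstrained_bound_bits(M, 1), round(unordered_bound_bits(M, 1), 2))
```

For $K = 1$ the count reduces to $\binom{M+3}{3}$, so the unordered bound grows only logarithmically in $M$, while for $M = 1$ it equals the unconstrained $2K$ bits.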
  • Figure 6 illustrates a pipeline for error-correction and processing comprising the following components: (1) Modulation and demodulation between bits and nucleotides; (2) Encoding and decoding for synchronizing sequences of bits; (3) Error-correction across each oligonucleotide; (4) Error-correction across multiple oligonucleotides.
  • the information stream is in bits.
  • the modulation map converts the stream of bits to a stream of nucleotides obeying the constraint graph described above.
  • Example 3 defined a modulation map in which 1 bit mapped to 1 nucleotide.
  • the presence of an insertion or deletion of a nucleotide results in an insertion or deletion of 2 bits in the demodulated bit stream.
  • the input to the synchronization encoding block in Figure 6 is a stream of bits for each oligonucleotide.
  • a stream of input bits may be grouped into symbols.
  • A mechanism to achieve synchronization is to insert a marker symbol in the input stream. Markers delineate boundaries to detect insertions and deletions. For every $k_{sync}$ information symbols, $(n_{sync} - k_{sync})$ marker symbols may be inserted, resulting in an overhead of $n_{sync}/k_{sync}$.
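Marker insertion can be sketched in a few lines. The placement rule below is an assumption: markers are put between consecutive groups of `k_sync` information symbols, with no marker after the last group. The function name and the marker value are hypothetical.

```python
def insert_markers(symbols, k_sync, marker, markers_per_group=1):
    """Insert `markers_per_group` marker symbols between groups of k_sync symbols."""
    out = []
    for start in range(0, len(symbols), k_sync):
        if start > 0:
            out.extend([marker] * markers_per_group)   # boundary between groups
        out.extend(symbols[start:start + k_sync])
    return out


# 64 information symbols with one marker per group of 2, 3, or 4 symbols.
info = list(range(64))
lengths = [len(insert_markers(info, k, "M")) for k in (2, 3, 4)]
```

With 64 information symbols this placement gives 95, 85, and 79 symbols for groups of 2, 3, and 4, which agrees with three of the oligonucleotide lengths reported in Example 5 if each symbol maps to one nucleotide. For groups of 1 it gives 127 rather than the reported 128, so the exact placement used in the experiments may differ.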
  • a synchronization decoder is more complex.
  • A synchronization decoder may feed the raw probabilities to an outer code, or directly estimate
  • The synchronization decoder is able to efficiently compute posterior probabilities via dynamic programming.
  • Appendix 7 details the $\alpha/\beta$ recursions involved, based on the Markov chain model discussed above.
  • error-correction may be applied per oligonucleotide.
  • a deletion error causes a synchronization error, but even if the position of the deletion is known via synchronization, correction of the deleted symbol may be needed.
  • An example of a short block code is the Bose-Chaudhuri-Hocquenghem (BCH) code.
  • Primitive BCH codes over the Galois field $\mathbb{F}_2$ are a standard class of BCH codes.
  • A $t$-error-correcting BCH($n_{bch}$, $k_{bch}$) code has parameters
  • d min is the minimum distance of the linear code.
  • $m = 6$
  • $k_{bch} = 36$
  • BCH(63,39,4), BCH(63,36,5), BCH(63,30,6), BCH(63,24,7), BCH(127,71,9), BCH(127,64,10), BCH(127,57,11), BCH(127,50,13).
  • BCH codes are applicable for DNA storage due to their short block lengths and error-correcting abilities per oligonucleotide. For a $t$-error-correcting BCH code with storage overhead $n_{bch}/k_{bch}$, the code corrects a fraction $t/n_{bch}$ of errors.
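The trade-off between overhead and correctable fraction can be tabulated for the listed codes. The sketch reads each triple as $(n_{bch}, k_{bch}, t)$ and checks it against the standard relations for primitive binary BCH codes, $n = 2^m - 1$ and $n - k \le m t$. The entry (63, 24, 7) is the standard table value for $t = 7$ at length 63. The helper names are hypothetical.

```python
CODES = [
    (63, 39, 4), (63, 36, 5), (63, 30, 6), (63, 24, 7),
    (127, 71, 9), (127, 64, 10), (127, 57, 11), (127, 50, 13),
]


def is_plausible_primitive_bch(n, k, t):
    """Check n = 2**m - 1, 0 < k < n, and the parity bound n - k <= m * t."""
    m = (n + 1).bit_length() - 1
    return 2 ** m - 1 == n and 0 < k < n and n - k <= m * t


def overhead(n, k):
    """Storage overhead ratio n/k of the code."""
    return n / k


def correctable_fraction(n, t):
    """Fraction t/n of bit errors the code is guaranteed to correct."""
    return t / n


for n, k, t in CODES:
    print(f"BCH({n},{k},{t}): overhead {overhead(n, k):.2f}, "
          f"corrects {correctable_fraction(n, t):.1%} of bits")
```

For example, BCH(127,50,13) has overhead 2.54 and is guaranteed to correct about 10% of the bits in a codeword, provided the codeword has already been restored to its correct length by synchronization.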
  • Error-correction across multiple oligonucleotide segments may be included to protect against missing segments, and segments that have an individual decoding error.
  • a basic kind of error-correction code is a Reed-Solomon (RS) code.
  • An RS($n_{rs}$, $k_{rs}$) code has a minimum Hamming distance of $n_{rs} - k_{rs} + 1$.
  • The storage overhead ratio is $n_{rs}/k_{rs}$.
  • the code is able to
  • A Reed-Solomon code operates over a Galois field $\mathbb{F}_{2^m}$ where $m$ is a positive integer.
  • An RS($n_{rs}$, $k_{rs}$) block comprises $n_{rs}$ oligonucleotide segments, of which $k_{rs}$ segments hold information and $(n_{rs} - k_{rs})$ segments hold redundant parity information.
  • each oligonucleotide may be allocated address bits in addition to information bits.
  • The address indicates not only the RS-block for each oligonucleotide, but also the specific position of the segment within the RS-block. Only correctly decoded oligonucleotide segments labeled by addresses are included for RS-block decoding.
  • Each oligonucleotide is equipped with an address since the DNA channel is partially described by a permutation channel as described herein.
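The role of the address under a permutation channel can be shown with a toy sketch. It works on bit strings only, with no error correction or modulation. The segment layout (a fixed-width binary address prefix followed by payload bits) and the function names are assumptions made for illustration.

```python
import random


def split_with_addresses(payload_bits, payload_per_segment, address_bits):
    """Cut a bit string into segments, each prefixed with its binary index."""
    count = -(-len(payload_bits) // payload_per_segment)   # ceiling division
    if count > 2 ** address_bits:
        raise ValueError("not enough address bits for this many segments")
    segments = []
    for index in range(count):
        start = index * payload_per_segment
        address = format(index, f"0{address_bits}b")
        segments.append(address + payload_bits[start:start + payload_per_segment])
    return segments


def reassemble(segments, address_bits):
    """Sort segments by their address prefix and concatenate the payloads."""
    ordered = sorted(segments, key=lambda s: int(s[:address_bits], 2))
    return "".join(s[address_bits:] for s in ordered)


data = "0110100110010110" * 4
pool = split_with_addresses(data, payload_per_segment=8, address_bits=4)
random.Random(1).shuffle(pool)          # storage in solution loses the order
restored = reassemble(pool, address_bits=4)
```

After the shuffle, `restored` equals `data`, because each segment carries enough information to be put back in place. The cost is that 4 of every 12 stored bits are spent on addressing in this toy layout.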
  • An overhead is incurred for DNA storage of $n_{addr}/k_{addr}$, where $k_{addr}$ bits per oligonucleotide are reserved for information bits, and
  • the end-to-end coding system aims to achieve zero error (perfect reliability) in order to extract data from DNA.
  • an amount of overhead is selected in order for the overall system to have a very low probability of error in reconstruction.
  • Each block of Figure 6 has an associated storage overhead.
  • The total storage overhead $\nu$ incurred by the system is given by,
  • $\nu = \frac{2\,n_{mod}}{k_{mod}} \cdot \frac{n_{sync}}{k_{sync}} \cdot \frac{n_{bch}}{k_{bch}} \cdot \frac{n_{rs}}{k_{rs}} \cdot \frac{n_{addr}}{k_{addr}}$
  • The overall storage ratio includes overhead for the addressing of each oligonucleotide, $n_{addr}/k_{addr}$,
  • The ratio for modulation and demodulation includes a factor of 2 since the conversion is from $k_{mod}$ bits to $n_{mod}$ nucleotides.
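The total overhead can be evaluated as a product of per-block ratios. The sketch assumes that each block contributes a ratio of the form $n/k$ and that the modulation ratio is $2 n_{mod}/k_{mod}$, so that unconstrained modulation at 2 bits per nucleotide contributes a factor of 1. The parameter values in the usage example are hypothetical and chosen only to exercise the function.

```python
from fractions import Fraction


def total_overhead(k_mod, n_mod, k_sync, n_sync, k_bch, n_bch,
                   k_rs, n_rs, k_addr, n_addr):
    """Product of the per-block overhead ratios (assumed form n/k per block)."""
    return (
        Fraction(2 * n_mod, k_mod)      # modulation, relative to 2 bits/nucleotide
        * Fraction(n_sync, k_sync)      # synchronization markers
        * Fraction(n_bch, k_bch)        # error correction per oligonucleotide
        * Fraction(n_rs, k_rs)          # error correction across oligonucleotides
        * Fraction(n_addr, k_addr)      # addressing
    )


# Hypothetical parameters: unconstrained modulation, one marker per two
# symbols, BCH(127,50), a rate one-half RS code, 8 payload bits out of 50.
nu = total_overhead(k_mod=2, n_mod=1, k_sync=2, n_sync=3,
                    k_bch=50, n_bch=127, k_rs=1, n_rs=2,
                    k_addr=8, n_addr=50)
```

With these illustrative values `nu` is 381/8, about 47.6, and the addressing factor of 50/8 is the largest single contributor.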
  • Example 5 (Synchronization Overhead vs. Reliability): To illustrate the effect of synchronization on decoding error within each oligonucleotide, the following experiment was simulated based on all code blocks of Figure 6, excluding error-correction across multiple oligonucleotides.
  • each oligonucleotide is able to store 50 bits comprised of both address and information bits. An extra fixed bit was added to the end of the stream to allow for an even length of 128 bits.
  • Oligonucleotide lengths were 128, 95, 85, and 79 nucleotides respectively.
  • An increasing oligo length accommodates more synchronization overhead.
  • Figure 7 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of deletions. An increasing fraction of synchronization markers increases the probability of correct decoding at the expense of extra redundancy. The results of decoding individual oligonucleotides are plotted in Figure 7 for different probabilities of deletion $p_{del}$ in the DNA channel. Other types of error were excluded in the simulation.
  • Each data-point in the curves is the result of decoding $10^4$ oligonucleotides.
  • the synchronization code with the highest overhead yields the best performance curve at the cost of synthesizing a longer oligonucleotide.
  • Each symbol was mapped to a nucleotide with unconstrained modulation yielding oligonucleotide segments of length 95 nucleotides.
  • Figure 8 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of insertions, deletions, and substitutions.
  • The plot shown in Figure 8 illustrates the decoding success probability under varying probabilities $p_{sub}$, $p_{del}$, $p_{ins}$ across the DNA channel.
  • a combined presence of diverse errors is more challenging to accommodate compared to the presence of only one type of error. Nevertheless, the storage system is robust to varying storage conditions.
  • Each data-point in the curves is the result of decoding $10^4$ oligonucleotides.
  • Deletion and insertion errors are costly to accommodate in terms of storage overhead.
  • the addressing overhead is costly if a large number of oligonucleotide segments is necessary.
  • Each oligonucleotide is of length 95 nucleotides, and stores 8 bits (1 byte) of payload information.
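  • $\nu = \frac{2\,n_{mod}}{k_{mod}} \cdot \frac{n_{sync}}{k_{sync}} \cdot \frac{n_{bch}}{k_{bch}} \cdot \frac{n_{rs}}{k_{rs}} \cdot \frac{n_{addr}}{k_{addr}}$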
  • Each oligonucleotide is able to store 16 bits (2 bytes) of information, which leaves $2^{34}$ possible addresses.
  • In addition to increasing the total storage overhead ratio $\nu$, the length of each oligonucleotide would increase two-fold to 190 nucleotides.
  • the parameters for both exemplary terabyte and gigabyte DNA storage systems were selected to achieve high reliability in the presence of diverse errors in the DNA storage channel.
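The split between address bits and payload bits fixes the addressable payload. The sketch assumes that the 50 bits stored per oligonucleotide are divided only between address and payload, which matches the $2^{34}$ addresses quoted for a 16-bit payload. The function name is hypothetical.

```python
def addressable_payload_bytes(bits_per_oligo, payload_bits):
    """Total payload reachable when every remaining bit is used as address."""
    address_bits = bits_per_oligo - payload_bits
    return (2 ** address_bits) * payload_bits // 8


one_byte_payload = addressable_payload_bytes(50, 8)    # 42 address bits
two_byte_payload = addressable_payload_bytes(50, 16)   # 34 address bits
```

Under this assumption, 8 payload bits per oligonucleotide address $2^{42}$ bytes (about 4 terabytes), while 16 payload bits address $2^{35}$ bytes (about 32 gigabytes). Spending more bits on payload lowers the addressing overhead but shrinks the reachable capacity.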
  • the exact probability of perfect reconstruction may be estimated based on decoding curves for decoding individual oligonucleotide segments, as illustrated in Example 5 and Example 6. For example, for a rate one-half Reed-Solomon code per block of oligonucleotide segments, if more than one-half of the segments are not missing and decoded correctly, the code recovers the entire block with zero error.
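The block-level estimate can be sketched with a binomial tail. The sketch assumes that each segment is independently recovered (present and correctly decoded) with probability `p_ok`, and that every other segment is treated as an erasure. An RS($n_{rs}$, $k_{rs}$) code, with minimum distance $n_{rs} - k_{rs} + 1$, then recovers the block when at least $k_{rs}$ segments survive. The function name is hypothetical.

```python
import math


def block_recovery_probability(n_rs, k_rs, p_ok):
    """P(at least k_rs of n_rs segments survive), for independent segments."""
    return sum(
        math.comb(n_rs, s) * p_ok ** s * (1.0 - p_ok) ** (n_rs - s)
        for s in range(k_rs, n_rs + 1)
    )


# Rate one-half code over 20 segments at a few per-segment success rates.
for p_ok in (0.5, 0.7, 0.9):
    print(p_ok, round(block_recovery_probability(20, 10, p_ok), 4))
```

The per-segment success probability would be read from decoding curves such as those of Figure 7 and Figure 8. The independence assumption is a simplification, since low coverage may affect neighbouring segments together.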
  • the simulation experiments described above illustrate that a much higher overhead redundancy ratio may be needed to correct for synchronization errors.
  • DNA storage continues to be an emerging, innovative, and viable technology.
  • the cost of high-throughput DNA synthesis is decreasing with the introduction of new technologies, and DNA sequencing costs have already dropped rapidly with the discovery of microfluidic nanopore devices.
  • Ideas such as biological editing of DNA, computing, search, and indexing provide focus for cutting-edge research. For example, recently it was shown that simple computing systems may be built in bacteria in the form of synthetic state machines. Next-generation biological computers could utilize DNA storage for memory.
  • Define the random variable $T$ as the number of nucleotides written starting from state $s_k$ and ending at state $s_{k+1}$ in the Markov chain.
  • Define the random variable $U$ as the number of nucleotides written starting from either the WRITE state or the WRITE ERROR state and ending at the state $s_{k+1}$.
  • $P_T(1) = p_{ins}\,p_{del} + (1 - p_{del} - p_{ins})(1 - p_{bur})$,
  • $P_T(n) = p_{ins}\,P_T(n - 1) + (1 - p_{del} - p_{ins})\,P_U(n)$, for $n > 1$.
  • the probabilities P T (n) for n > 1 are defined recursively.
  • The generating function associated with the random variable $T$ is $G_T(\omega) = \sum_{n \ge 0} P_T(n)\,\omega^n$.
  • The moments of $T$ may be generated from the function $G_T(\omega)$. For example, the mean and variance of the number of nucleotides written starting at state $s_k$ and ending at state $s_{k+1}$ are
  • other quantities of interest may be computed recursively.
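The recursion for $P_T(n)$ can be evaluated numerically. Two ingredients below are assumptions that the recursions imply but do not state: $P_T(0) = p_{del}$, and a geometric burst length, $P_U(n) = p_{bur}^{\,n-1}(1 - p_{bur})$ for $n \ge 1$. Under these assumptions the mean has the closed form used in `mean_closed_form`. The function names are hypothetical.

```python
def length_pmf(p_ins, p_del, p_bur, n_max):
    """P_T(0..n_max) from P_T(n) = p_ins*P_T(n-1) + (1-p_del-p_ins)*P_U(n)."""
    p_write = 1.0 - p_del - p_ins

    def p_u(n):
        # Assumed geometric burst length starting at one nucleotide.
        return (p_bur ** (n - 1)) * (1.0 - p_bur)

    pmf = [p_del]                                   # assumed P_T(0)
    for n in range(1, n_max + 1):
        pmf.append(p_ins * pmf[n - 1] + p_write * p_u(n))
    return pmf


def mean_closed_form(p_ins, p_del, p_bur):
    """E[T] solved from E[T] = p_ins*(1 + E[T]) + (1-p_del-p_ins)/(1-p_bur)."""
    return (p_ins + (1.0 - p_del - p_ins) / (1.0 - p_bur)) / (1.0 - p_ins)


pmf = length_pmf(p_ins=0.05, p_del=0.25, p_bur=0.1, n_max=200)
numeric_mean = sum(n * p for n, p in enumerate(pmf))
```

The truncated distribution sums to 1 within rounding, and `numeric_mean` agrees with `mean_closed_form(0.05, 0.25, 0.1)`. With $p_{ins} = p_{bur} = 0$ the mean reduces to $1 - p_{del}$ per position.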
  • The $\alpha/\beta$ probabilities are defined as follows,
  • The $\alpha/\beta$ probabilities may be computed recursively. Denote by p un ij the
  • The posterior probabilities may be computed based on the $\alpha/\beta$ recursions.
  • Tj min ⁇ 3i,/ ⁇ .
  • An estimator may choose the symbol for Tj which has the highest posterior probability, and/or provide the raw posterior probabilities to an outer code for joint decoding.


Abstract

A system to store digital information in DNA using a lossy DNA process involves encoding synchronization and addressing information with the digital information into DNA oligonucleotides and decoding the digital information from multiple oligonucleotide segments using synchronization based on the synchronization information. The encoding and decoding includes error correction coding (ECC) and decoding, respectively, using a two-dimensional forward error correction (FEC) code.

Description

APPARATUS, METHOD AND SYSTEM FOR
DIGITAL INFORMATION STORAGE IN DEOXYRIBONUCLEIC ACID (DNA)
TECHNICAL FIELD
The present disclosure involves digital information storage in deoxyribonucleic acid (DNA).
BACKGROUND
Deoxyribonucleic acid (DNA) contains the genetic program for the biological development of life. However, DNA may be used as a compact storage medium for petabytes of information by encoding digital information into sequences of DNA nucleotides. The potential benefits of DNA storage include: (1) extremely high-density storage beyond the order of terabytes within 1 gram of DNA; (2) decades of stability and durability at moderate temperatures; (3) biological replication due to polymerase chain reaction (PCR)-amplification; (4) rapid biological search and indexing mechanisms via primers; (5) biological editing and re-encoding of segments via enzymes. A principal bottleneck for storing data in DNA molecules is the high cost of synthesis. Enzymatic synthesis utilizes low-cost chemicals for growing short DNA oligonucleotides of lengths under 1000 base pairs. However, the synthesis may produce a large fraction of insertion, deletion, or substitution errors. A challenge of using DNA for storage involves the management of errors. Several types of errors may occur including: (1) Insertion, deletion, substitution errors within oligonucleotide segments; (2) Missing DNA segments; (3) Synchronization errors across multiple segments with the same address; (4) Low coverage and amplification yields for certain DNA segments; (5) Structural error patterns introduced by synthesis arrays and sequencing machines. One problem involved in designing a coded DNA storage system is to ensure the efficiency and compatibility of the different components which correct diverse errors.
SUMMARY
An aspect of the present disclosure involves error-correction being introduced for DNA storage based on a DNA process that may be error prone, e.g., a noisy or lossy process that may result from using at least one of an error-prone synthesis process such as low-cost enzymatic molecular synthesis and an error-prone sequencing process such as nanopore sequencing. A modular pipelined system provides for encoding and decoding digital information reliably in DNA molecules in the presence of a large fraction of differentiated errors. A source of data is efficiently encoded, modulated, and stored in a set of oligonucleotide segments. Using DNA sequencing, the information stored on multiple segments is assembled and decoded reliably. A classification of errors is provided for the biological processes of low-cost DNA synthesis, PCR-amplification, and DNA sequencing. Error-correction is designed to recover data corrupted by insertions, deletions, and substitutions of nucleotides. Low-cost synthesis also places constraints on the modulation of bits to nucleotides. A flexible system is presented comprised of error-correction codes (e.g., Reed-Solomon, LDPC, or polar codes), synchronization codes, and constrained modulation codes. Tradeoffs are outlined for the amount of overhead envisioned for reliable storage and compared to information-theoretic bounds.
Another aspect involves addressing the problem of a noisy DNA process by providing modular error-correction codes. In an embodiment, the encoding step comprises modulation, synchronization, addressing, and error-correction within each oligonucleotide and per block of multiple oligonucleotides. After DNA amplification and sequencing, the decoding step comprises reciprocal measures to assemble source data from multiple segments. Error-correction is designed to recover data corrupted by several types of errors including missing segments. The amount of overhead redundancy envisioned for reliable storage is specified and compared to information-theoretic bounds. To boost the reliability of decoding information in the presence of synchronization errors, for each oligonucleotide synthesized, it is also possible to utilize multiple outputs of a noisy or lossy DNA synthesis process, e.g., an error prone low cost enzymatic synthesizer. That is, an enzymatic synthesizer intended to produce one oligonucleotide may actually produce multiple output strands having varying amounts of errors such as insertions, deletions, and substitutions.
Another aspect relates to an error-correction system suitable for effectively managing errors created by a lossy or error-prone DNA synthesis system, e.g., a low-cost enzymatic synthesis. Traditional error-correction systems for DNA storage are based on high-fidelity molecular synthesis (e.g., using Agilent micro-array platforms). Such synthesis machines are too costly for commercial purposes, and cannot enable ubiquitous storage. A lossy or error-prone DNA process in accordance with the present principles may introduce error rates one or two orders of magnitude greater than a high-fidelity synthesis machine. In accordance with another aspect, a DNA data storage system in accordance with the present principles enables data storage in DNA using lossy or error-prone DNA synthesis such as low-cost enzymatic synthesis producing DNA oligonucleotides using low-cost chemicals. The low-cost property allows eventually scaling up to terabytes of synthesized data. Another aspect of the current disclosure involves addressing an issue with low-cost enzymatic synthesis that arises when writing successive nucleotides (e.g., ATCAGTGAGCTAGC...): not all transitions have a high probability of success. To avoid unreliable transitions, an aspect of the present disclosure involves introducing a constraint graph for nucleotide patterns. A modulation scheme implements the constraint graph. The purpose of modulation is to maximize the capacity of information (how many bits per nucleotide stored), and to minimize the probability of insertions/deletions (due to less reliable nucleotide transitions). Modulation in accordance with the present principles avoids less reliable transitions entirely. It is also possible to minimize the usage of the less reliable transitions.
Another aspect of the present principles addresses an issue with error-prone DNA synthesis, e.g., low-cost enzymatic synthesis, in which the multiple oligonucleotide segments produced do not all have the same length. An aspect of the present disclosure involves using synchronization codes to address the distribution of oligonucleotide lengths; otherwise the storage capacity may be significantly reduced. One type of synchronization code may comprise a marker code. A marker code may comprise known marker symbols which are inserted in the bit stream prior to modulation.
Another aspect of the present disclosure addresses an issue with low-cost enzymatic synthesis wherein synchronization errors, e.g., insertion and deletion errors, may be corrected. Once oligonucleotide segments are synchronized, the positions which have been deleted may be corrected. In accordance with the present principles, error-correction within each oligonucleotide addresses this problem. For example, BCH codes are applied for error-correction after the oligonucleotide has been synchronized to the correct length. That is, in accordance with the present principles, first synchronize and then correct. In addition, synchronization provides (posterior) probabilities for the sequence of bits stored in each oligonucleotide. As a result, other codes such as LDPC (low-density parity-check codes), polar codes, short Turbo codes, or convolutional codes may be applied as well.
Another aspect of the present principles addresses an issue with DNA storage involving assembling the original data from multiple, short, sequenced oligonucleotides. In accordance with the present principles, an address for each oligonucleotide is provided for assembly. During encoding, address bits are concatenated with data bits prior to applying error-correction for each oligonucleotide. During decoding, after an oligonucleotide has been synchronized and error-corrected, its address is retrieved and thus its position in assembling all of the original data is known. Another aspect of the present disclosure involves addressing an issue with DNA Storage involving that some oligonucleotides might be corrupted beyond repair, and/or lost in the DNA solution (e.g., if PCR-amplification does not yield enough copies for sequencing). An aspect of the present disclosure involves addressing this issue by introducing a code (e.g., Reed-Solomon code) for a block of multiple oligonucleotides. After synchronization of each oligonucleotide, error-correction of each oligonucleotide, and retrieving the address of each oligonucleotide, multiple oligonucleotides in a block are reinforced by extra redundancy in case any of them is lost or corrupted beyond repair.
Another aspect of the present disclosure involves correcting for multiple types of errors that may occur simultaneously with DNA storage based on low-cost enzymatic synthesis. In accordance with the present principles, this problem is addressed using a modular error-correction pipeline. The order of the blocks in the pipeline provides for correcting diverse types of errors. It is possible to jointly design the blocks in the pipeline to improve efficiency, and a modular approach enables improving each block separately.
Another aspect involves the approach to ordering of encoding and decoding for low-cost enzymatic synthesis. Encoding in accordance with the present principles involves the following in order: (1) Store data bits in multiple oligonucleotides with redundancy per block of oligonucleotides; (2) Add an address per oligonucleotide; (3) Add error-correction per oligonucleotide; (4) Add synchronization per oligonucleotide; (5) Transform bits to nucleotides using modulation scheme. Decoding in accordance with the present principles involves the following in order: (1) Demodulate each oligonucleotide from nucleotides to bits; (2) De-synchronize each oligonucleotide; (3) Error-correction by decoding each oligonucleotide; (4) Retrieve address for each oligonucleotide; (5) If any oligonucleotide in a block is corrupted beyond repair or missing, utilize the redundancy per block of oligonucleotides to reconstruct the original source data.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data comprises adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy or error-prone DNA synthesis process including at least one of error-prone synthesis and error-prone sequencing, and decoding the digital data from the DNA oligonucleotide. In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data including adding synchronization information to the digital data may include the synchronization information comprising a synchronization marker.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data may further include adding error correction information to the digital data after adding address information and before converting the digital data to codes, whereby converting includes converting the digital data including the address information and including the error correction information to the plurality of codes. In accordance with another aspect, the error correction information may include a block error correction code and a word error correction code. In accordance with another aspect, a block error correction code may include one of a Reed-Solomon code and a Fountain code. In accordance with another aspect, a word error correction code may include one of a BCH code and an LDPC code.
In accordance with another aspect of the present principles, an embodiment of a method of encoding digital data may comprise adding address information to the digital data; adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy or error-prone DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide. In accordance with another aspect, synchronization information may comprise a synchronization marker. In accordance with another aspect, error correction information may include a block error correction code and a word error correction code. In accordance with another aspect, a block error correction code may include one of a Reed-Solomon code and a Fountain code. In accordance with another aspect, a word error correction code may include one of a BCH code and an LDPC code. In accordance with another aspect, converting digital data to a plurality of codes may comprise a mapping of the digital data to nucleotides according to a modulation map. In accordance with another aspect, converting digital data to a plurality of codes using a modulation map may comprise the modulation map being in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
In accordance with another aspect, converting digital data to a plurality of codes using a modulation map may comprise the modulation map writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
In accordance with another aspect of the present principles, an embodiment of a method of decoding digital data stored in DNA may comprise synchronizing a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy or error-prone DNA synthesis process based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words. In accordance with another aspect, synchronization information may comprise a synchronization marker. In accordance with another aspect, each of the plurality of binary words may include error correction information, and extracting the digital data may include error correction based on the error correction information. In accordance with another aspect, error correction information may include a block error correction code and a word error correction code. In accordance with another aspect, a block error correction code may include one of a Reed-Solomon code and a Fountain code. In accordance with another aspect, a word error correction code may include one of a BCH code and an LDPC code. In accordance with another aspect, a modulation map may be in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions. In accordance with another aspect, a modulation map may comprise writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
In accordance with another aspect of the present principles, an embodiment of a lossy DNA synthesis process for digital data storage may comprise an enzymatic molecular synthesis process.
In accordance with another aspect of the present principles, an embodiment of an encoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
In accordance with another aspect of the present principles, an embodiment of a decoder may comprise a processor performing one or more aspects of embodiments of methods as described herein.
In accordance with another aspect of the present principles, an embodiment may comprise a non-transitory computer-readable medium storing computer-executable instructions executable to perform one or more aspects of embodiments of methods as described herein. In accordance with another aspect of the present principles, an embodiment of a method of decoding digital data stored in DNA molecules may comprise accessing a plurality of DNA molecules storing encoded digital data, wherein the encoded digital data includes a code component comprising an address, synchronization information comprising a marker symbol, and error correction information, and the DNA molecules were synthesized using a lossy or error-prone DNA synthesis process; sequencing the plurality of DNA molecules; merging and assembling the plurality of DNA molecules to form a plurality of DNA oligonucleotides; and decoding the digital data from the plurality of DNA oligonucleotides wherein the decoding includes synchronizing the plurality of DNA segments using the marker symbol in each of the plurality of segments, processing the synchronized plurality of segments to extract digital information from the synchronized plurality of segments, and performing error correction on the extracted digital information using the error correction component to produce the decoded digital data.
In accordance with another aspect of the present principles, an embodiment of an encoder comprises encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy or error-prone DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; and an archive to store the plurality of DNA segments.
In accordance with another aspect of the present principles, an embodiment comprises a sequencer sequencing a plurality of DNA segments retrieved from an archive, wherein the plurality of DNA segments were synthesized using a lossy or error-prone DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder synchronizes the plurality of oligonucleotides using the synchronization information included in the plurality of oligonucleotides; demodulates the synchronized plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; and extracts the digital data from the binary words.
In accordance with another aspect, a DNA data storage system in accordance with the present principles includes an encoder producing digital words, at least one of a lossy or error-prone DNA synthesis process synthesizing DNA oligonucleotides corresponding to the digital words and a lossy or error-prone DNA sequencing process sequencing the DNA oligonucleotides, a decoder decoding the digital words from the sequenced oligonucleotides and end-to-end error correction correcting errors introduced by the error-prone DNA processing at combined error rates on the order of 25% per nucleotide.
In accordance with another aspect, an embodiment of an encoder for encoding digital data comprises at least one processor configured for adding address information to the digital data; converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable: synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and decoding the digital data from the DNA oligonucleotide.
In accordance with another aspect, an embodiment of a decoder for decoding digital data stored in DNA comprises at least one processor configured for synchronizing a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the synchronization is based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and extracting the digital data from the plurality of binary words.
In accordance with another aspect, an embodiment of a system for storing digital information in DNA comprises: an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map; a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process; a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; an archive to store the plurality of DNA segments; a sequencer sequencing a plurality of DNA segments retrieved from the archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder: synchronizes the plurality of oligonucleotides based on synchronization information included in the plurality of oligonucleotides; demodulates the synchronized plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides; performs error correction on the plurality of digital words based on error correction information included in the digital words, and extracts the digital data from the error corrected digital words.
BRIEF DESCRIPTION OF THE DRAWING
The present principles can be readily understood by considering the detailed description below in conjunction with the accompanying drawings wherein:
Figure 1 illustrates, in block diagram form, a DNA storage system in accordance with the present principles;
Figure 2 illustrates, in state diagram form, a system for DNA synthesis;
Figure 3 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles;
Figure 4 illustrates, in state diagram form, an exemplary embodiment of constraints for synthesizing DNA;
Figure 5 illustrates, in block diagram form, an exemplary embodiment of molecular-level DNA storage;
Figure 6 illustrates, in block diagram form, an exemplary embodiment of an error-correction pipeline for DNA storage of digital data;
Figure 7 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles; and
Figure 8 illustrates, in graphical form, an aspect of a DNA digital data storage system in accordance with the present principles.
It should be understood that the drawings are for purposes of illustrating exemplary aspects of the present principles and are not necessarily the only possible configurations for illustrating the present principles. To facilitate understanding, throughout the various figures like reference designators refer to the same or similar features.
DETAILED DESCRIPTION
An end-to-end system for DNA storage includes several components. Source information (e.g., a movie file) is encoded, modulated, synthesized, and stored in multiple DNA oligonucleotide segments. To access and reconstruct the source data reliably, the DNA segments are sequenced, demodulated, assembled, and decoded. Figure 1 depicts an exemplary embodiment of a complete digital data storage and recovery system. The system involves a digital data management and processing portion 101 including encoding, decoding and error management of digital data and a DNA storage and recovery portion 102 including DNA synthesis, polymerase chain reaction (PCR) amplification, and DNA sequencing. The system is comprised of the following components as shown in Fig. 1: source data 110 in bits; encoding mechanism 120 including all error-correction codes and modulation from bits to nucleotides; DNA synthesis 130 of multiple DNA segments; PCR-amplification 140 of DNA pools; DNA archival storage 150; DNA sequencing 160; merging and assembling multiple DNA segments 170; demodulation and decoding 180 of all codes for reliable recovery. One challenge of designing a coded DNA storage system is to ensure the efficiency and compatibility of the different components which correct diverse errors.
In DNA storage, several types of errors occur including: (1) Insertion, deletion, substitution errors within oligonucleotide segments; (2) Missing DNA segments; (3) Synchronization errors across multiple segments with the same address; (4) Low coverage and amplification yields for certain DNA segments; (5) Structural error patterns introduced by synthesis arrays and sequencing machines. By modeling the errors of DNA processing technologies, it is possible to define "DNA storage channels", many of which have only approximately known information-theoretic capacities, in contrast to standardized, precisely-mapped wireless communication channels. For example, the capacity of the deletion channel is only known to within upper and lower bounds for independent and identically distributed deletions. However, errors such as deletions must be addressed to provide a reliable data storage system, especially when utilizing low-cost, next-generation DNA processes that may be lossy or noisy.
A DNA process that is lossy or noisy as referenced herein is intended to encompass various DNA synthesis and/or sequencing processes that may be error prone, e.g., produce errors such as deletions, insertions and substitutions as described herein. Such errors may occur as a result of using low cost DNA processing as part of a DNA data storage system. For example, a low cost system may include at least one of a lossy or noisy DNA synthesis process and a lossy or noisy DNA sequencing process. As a specific example, a lossy or noisy DNA synthesis process may include a low-cost enzymatic synthesis process. Such processes use low cost chemicals and, therefore, may advantageously reduce the cost of a DNA data storage system. However, such low cost synthesis may introduce diverse errors at a relatively high rate. For example, diverse errors including nucleotide deletions, substitutions and insertions may occur at error rates on the order of 25% per nucleotide. In comparison, high-fidelity DNA synthesis machines utilizing chemical synthesis methods may provide a combined error rate one or two orders of magnitude less, e.g., under 1%, but at a much higher cost. Errors may also be introduced during sequencing of DNA oligonucleotides when reading or recovering data, e.g., by using a sequencing process such as nanopore sequencing that may tend to be lossy or noisy.
A feasible system should provide error resilience and reliable reconstruction from imperfectly synthesized and/or sequenced DNA strands. For example, in accordance with an aspect of the present disclosure, multiple levels of hybrid error-protection may be involved. In the following analysis, oligonucleotide segments are equipped with an address code which is a unique identifier. Digital payload information is stored across multiple segments, protected by modern codes organized in both the "horizontal" dimension (per oligonucleotide), and the "vertical" dimension (across multiple oligonucleotides). While accuracy in retrieval is an important consideration, so also is efficiency of such codes to reduce overhead costs in DNA synthesis and sequencing.
TABLE 1. MODEL PARAMETERS FOR DNA SYNTHESIS
Figure imgf000013_0001
Several biological constraints distinguish the storage of digital information in DNA molecules from storage on traditional data devices: tape drives, disk drives, read-only memory (ROM) cartridges, USB memory, and flash memories. A fundamental constraint of storing information in DNA molecules is the limited length of oligonucleotide segments. Current DNA synthesis machines synthesize short segments nucleotide by nucleotide. Each oligonucleotide may be of variable length, depending on the number of synthesis errors incurred. To formalize the process, a DNA synthesizer produces M oligonucleotide segments of variable lengths. The fixed-length input $x_m$ and variable-length output $w_m$ of a synthesis machine are denoted as follows for the $m$th oligonucleotide, where $m \in [1, \ldots, M]$:
$$x_m = \{x_m(k)\}, \quad k \in [1, \ldots, K],$$
$$w_m = \{w_m(\ell)\}, \quad \ell \in [1, \ldots, L_m],$$
where $x_m(k), w_m(\ell) \in \{A, T, C, G\}$. Typical synthesis machines have input specifications of $K < 1000$ base pairs. Low-cost synthesizers may only function adequately over a smaller range of input lengths such as $100 < K < 250$ base pairs. While the fixed input length is $K$, a synthesized oligonucleotide $w_m$ is of variable length $L_m$. The parameters listed in Table 1 specify a statistical model for the synthesis of molecules which is diagrammed in Figure 2. A key element is the presence of memory in the synthesis channel due to deletion errors.
It is possible that next-generation low-cost DNA synthesis exhibits insertion, deletion, substitution, and burst nucleotide errors. For example, in the case of an exemplary enzymatic synthesis, deletion errors are prevalent. The notion of memory in a deletion channel differs from memoryless channel models. A deletion occurring at any nucleotide within an oligonucleotide affects the position of all subsequent nucleotides.
Figure 2 illustrates a model for DNA synthesis. In state $s_k$, the DNA synthesis machine aims to write the $k$th nucleotide of an oligonucleotide, where $k \in [1, \ldots, K]$. Depending on the number of insertions, deletions, and burst errors, the output oligonucleotide is of variable length. In more detail, Figure 2 illustrates a Markov chain mathematical model which includes various types of errors. Insertion, deletion, substitution, and burst errors have associated probabilities denoted by $p_{\mathrm{ins}}$, $p_{\mathrm{del}}$, $p_{\mathrm{sub}}$, and $p_{\mathrm{bur}}$, respectively. The following theorem derives the mean and variance of the length of the $m$th synthesized oligonucleotide.
Theorem 1. (Variable-Length Oligonucleotide Segments):
Consider the Markov chain of Figure 2. Let $T$ be the random number of nucleotides written starting from state $s_k$ and ending at state $s_{k+1}$ for $k \in [1, \ldots, K]$. Based on the nucleotide error probabilities $p_{\mathrm{ins}}$, $p_{\mathrm{del}}$, $p_{\mathrm{sub}}$, $p_{\mathrm{bur}}$, the probability generating function of random variable $T$ is
$$G_T(\omega) = \frac{p_{\mathrm{del}}}{1 - p_{\mathrm{ins}}\,\omega} + \frac{(1 - p_{\mathrm{del}} - p_{\mathrm{ins}})(1 - p_{\mathrm{bur}})\,\omega}{(1 - p_{\mathrm{bur}}\,\omega)(1 - p_{\mathrm{ins}}\,\omega)}.$$
For the $m$th output oligonucleotide, $m \in [1, \ldots, M]$, the expected length is $E[L_m] = K\,E_T[T]$ and the variance is $\mathrm{VAR}[L_m] = K\,\mathrm{VAR}[T]$, where
$$E_T[T] = G_T'(\omega = 1),$$
$$\mathrm{VAR}[T] = G_T''(\omega = 1) + G_T'(\omega = 1) - \left(G_T'(\omega = 1)\right)^2.$$
The proof is available in Appendix I, and may be adapted to several variations of the Markov chain model.
Example 1 (Nucleotide Deletion Errors): In a simplified model, let $p_{\mathrm{bur}} = p_{\mathrm{ins}} = 0$. Then the generating function of $T$ based on the Markov chain, and the resulting mean and variance of the length, are
$$G_T(\omega) = p_{\mathrm{del}} + (1 - p_{\mathrm{del}})\,\omega,$$
$$E[L_m] = K(1 - p_{\mathrm{del}}),$$
$$\mathrm{VAR}[L_m] = K\,p_{\mathrm{del}}\,(1 - p_{\mathrm{del}}).$$
More specifically, consider an input length of $K = 100$ nucleotides to the synthesis machine. If the deletion probability is $p_{\mathrm{del}} = 0.10$, then $E[L_m] = 90$ and $\mathrm{VAR}[L_m] = 9$. Thus, if a high probability of deletion exists, the output oligonucleotide segments will be of many different sizes. An aspect of the present disclosure involves proper decoding of the information in each variable-length segment.
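The numbers in Example 1 can be checked with a short Python sketch. It evaluates the closed-form mean and variance for the deletion-only model and compares the mean against a small Monte Carlo simulation in which each of the K nucleotides is independently deleted with probability p_del. The function names, the trial count, and the random seed are illustrative assumptions, not part of the disclosure.

```python
import random


def length_stats(K: int, p_del: float) -> tuple[float, float]:
    """Closed-form mean and variance of the output length (deletion-only model)."""
    mean_T = 1.0 - p_del            # G_T'(1) for G_T(w) = p_del + (1 - p_del) w
    var_T = p_del * (1.0 - p_del)   # G_T''(1) + G_T'(1) - G_T'(1)^2
    return K * mean_T, K * var_T


def simulate_lengths(K: int, p_del: float, trials: int, seed: int = 0) -> list[int]:
    """Simulate output lengths: each nucleotide survives with probability 1 - p_del."""
    rng = random.Random(seed)
    return [sum(1 for _ in range(K) if rng.random() >= p_del)
            for _ in range(trials)]


mean_len, var_len = length_stats(100, 0.10)      # 90 and 9, as in Example 1
lengths = simulate_lengths(100, 0.10, trials=2000)
sample_mean = sum(lengths) / len(lengths)
distinct_lengths = len(set(lengths))             # many different segment sizes
```

The simulated segments take many distinct lengths, which is the behavior that motivates the synchronization step.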
Example 2 (Distribution of Lengths): The distribution of the lengths of synthesized DNA segments is simulated and illustrated in Figure 3. For non-negligible insertion and deletion probabilities, a large fraction of the segments have unequal, variable lengths. Conventional DNA storage systems rely on high-fidelity micro-array synthesizers which produce segments of correct lengths. For example, in known experiments such as the Harvard-Technicolor collaboration to store 22 megabytes of information in DNA, high-cost synthesis yielded a distribution spiked at the intended template length. Low-cost enzymatic synthesis may have a non-negligible deletion probability $p_{\mathrm{del}}$ which, in accordance with an aspect of the present disclosure, may be addressed by providing synchronization codes to correct for variable lengths. As shown in Figure 3, the distribution of the length of synthesized oligonucleotides varies according to insertion and deletion probabilities. Even for $p_{\mathrm{del}} = p_{\mathrm{ins}} = 0.01$, the strands have variable lengths that, in accordance with an aspect of the present disclosure, may be synchronized.
There exists no qualitative difference between storing information in bits versus nucleotides. However, the limitations of biological processes for molecular-level storage constrain the set of valid nucleotide sequences. Low-cost DNA synthesis, PCR-amplification, and DNA sequencing involve storing information in valid nucleotide patterns; invalid patterns are highly prone to error. In the case of low-cost DNA synthesis, an efficiency matrix characterizes the accuracy of writing successive nucleotides.
Definition 1 (Efficiency Matrix for Nucleotide Synthesis): Let E =
Figure imgf000016_0001
denote a matrix of efficiencies (probabilities) for $i, j \in \{A, T, C, G\}$. Element $E_{ij}$ is the probability of successfully writing a nucleotide $j$ given that a nucleotide $i$ has been written.
Figure imgf000016_0002
The matrix E of efficiencies may be asymmetric. A constraint graph may be constructed based on nucleotide efficiencies; i.e., only transitions which have a higher probability of success for biological synthesis are included as valid transitions.
Definition 2 (Constraint Graph, Adjacency Matrix): A constraint graph is a directed graph. An irreducible constraint graph contains a path between any ordered pair of nodes. Let Ae denote the adjacency matrix of a constraint graph.
Example 3 (Constraint Graph): Figure 4 depicts a constraint graph which outlines valid and invalid transitions between nucleotides. The adjacency matrix of the constraint graph is
Figure imgf000016_0003
Homopolymer sequences of arbitrary lengths are invalid since edge transitions (A, A), (T, T), (C, C), (G, G) are not included in the graph. In addition, edge transitions (A, G), (T, G), (G, T), (C, T) are invalid transitions. Beyond constraints on transitions between nucleotides, constraints on longer sequences of nucleotides are often necessitated to ensure proper PCR-amplification and DNA sequencing. For example, the exclusion of reverse complementary subsequences may be involved, and/or the avoidance of specific nucleotide patterns reserved for primers. To encourage the correct binding of primers, a balanced distribution of nucleotides might also be enforced.
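As a rough illustration, the adjacency matrix of this constraint graph can be rebuilt from the invalid transitions listed above, and its largest eigenvalue estimated by power iteration. The base-2 logarithm of that eigenvalue is the standard measure of how many bits per nucleotide a transition constraint permits. The matrix below is an assumption derived from the listed invalid edges (rows and columns ordered A, T, C, G), not a copy of the matrix in the figure, and the helper name `largest_eigenvalue` is hypothetical.

```python
import math

NUCLEOTIDES = "ATCG"

# Transitions stated as invalid in the text: homopolymer steps plus four others.
INVALID = {("A", "A"), ("T", "T"), ("C", "C"), ("G", "G"),
           ("A", "G"), ("T", "G"), ("G", "T"), ("C", "T")}

# adjacency[i][j] = 1 if writing NUCLEOTIDES[j] after NUCLEOTIDES[i] is valid.
adjacency = [[0 if (i, j) in INVALID else 1 for j in NUCLEOTIDES]
             for i in NUCLEOTIDES]


def largest_eigenvalue(matrix: list[list[int]], iterations: int = 200) -> float:
    """Estimate the largest eigenvalue of a nonnegative matrix by power iteration."""
    n = len(matrix)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iterations):
        w = [sum(matrix[r][c] * v[c] for c in range(n)) for r in range(n)]
        lam = max(abs(x) for x in w)      # max-norm used as the eigenvalue estimate
        v = [x / lam for x in w]
    return lam


lam_max = largest_eigenvalue(adjacency)
bits_per_nucleotide = math.log2(lam_max)
```

Every row of this matrix has exactly two valid successors, so the estimate settles at an eigenvalue of 2, that is, one bit per nucleotide under this constraint.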
Even if there occur no nucleotide errors as described above, DNA storage may involve constraints on input sequences of nucleotides. A modulation code maps sequences of bits to valid sequences of nucleotides. A demodulation function maps sequences of nucleotides to sequences of bits. The demodulation function may correct nucleotide errors partially if they occur, and/or defer error-correction to other decoding units. However, in either case, the demodulation function should not drastically affect the synchronization of the sequences.
Definition 3 (Modulation and Demodulation Functions): Denote a modulation function $\Phi_{\mathcal{E}}$ and a demodulation function $\Phi_{\mathcal{D}}$ as follows, where $k_{mod}, k'_{mod}, n_{mod}, n'_{mod} \in \mathbb{Z}^+$:
$$\Phi_{\mathcal{E}} : \{0, 1\}^{k_{mod}} \rightarrow \{A, T, C, G\}^{n_{mod}}, \qquad (1)$$
$$\Phi_{\mathcal{D}} : \{A, T, C, G\}^{n'_{mod}} \rightarrow \{0, 1\}^{k'_{mod}}. \qquad (2)$$
The modulation function $\Phi_{\mathcal{E}}$ maps a sequence of $k_{mod}$ bits to a sequence of $n_{mod}$ nucleotides. If nucleotide insertions and deletions occur, the demodulation function $\Phi_{\mathcal{D}}$ may accept a sequence of nucleotides of length $n'_{mod}$ where $n'_{mod} \neq n_{mod}$. The demodulation function produces a sequence of bits of length $k'_{mod}$ where it may be the case that $k'_{mod} \neq k_{mod}$ if error correction is deferred.
Example 4 (Constrained Modulation and Demodulation): Consider the constraint graph of Figure 4. To construct a modulation map, define an auxiliary function
Figure imgf000017_0002
$= d_1 d_2 \ldots d_{k_{mod}}$ for $b_i, d_i \in \{0, 1\}$ and $1 \leq i \leq k_{mod}$. The auxiliary sequence is $d_1 = 1$, and for $i > 1$,
Figure imgf000017_0003
Adhering to the constraint graph, an efficient modulation map $\Phi_{\mathcal{E}}(b_1 b_2 \ldots b_{k_{mod}}) = t_1 t_2 \ldots t_{n_{mod}}$ may be constructed with $n_{mod} = k_{mod}$ based on the input sequence $b_1 b_2 \ldots b_{k_{mod}}$ and auxiliary sequence $d_1 d_2 \ldots d_{k_{mod}}$. The map is defined by
Figure imgf000018_0001
Similarly, a simple decoding and demodulation function $\Phi_{\mathcal{D}}(q_1 q_2 \ldots q_{n'_{mod}}) = r_1 r_2 \ldots r_{k'_{mod}}$ with $k'_{mod} = n'_{mod}$ is given as follows for $1 \leq j \leq k'_{mod}$:
$$r_j = \begin{cases} 0, & \text{if } q_j \in \{A, T\}; \\ 1, & \text{if } q_j \in \{C, G\}. \end{cases}$$
Regarding Example 4 and Figure 4, in words, the modulation map $\Phi_{\mathcal{E}}$ writes an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1. For example, $\Phi_{\mathcal{E}}(00111010) = \mathrm{ATCGCACA}$. Nucleotide transitions (A, G), (G, T), (C, T), (T, G) are not permitted in the modulation, as indicated by the constraint graph of Figure 4. The demodulation function $\Phi_{\mathcal{D}}$ maps both A and T to a 0, and both C and G to a 1. The mapping functions are optimal: each nucleotide may transition to two other nucleotides, which is the maximum allowed by the constraint graph. It is observed that $\Phi_{\mathcal{D}}(\Phi_{\mathcal{E}}(b_1 b_2 \ldots b_{k_{mod}})) = b_1 b_2 \ldots b_{k_{mod}}$. Furthermore, a nucleotide insertion, deletion, or substitution error after modulation results in only a single bit of error in the demodulated binary sequence.
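The modulation and demodulation maps of Example 4 can be sketched in Python as follows. The sketch assumes that "odd" and "even" occurrences are counted within each run of consecutive identical bits, restarting at every change of bit value. That reading is an interpretation chosen because it reproduces the worked example 00111010 to ATCGCACA and never emits a transition excluded by the constraint graph; the exact auxiliary-sequence recursion is not reproduced here. The function names are illustrative.

```python
def modulate(bits: str) -> str:
    """Map bits to nucleotides: 0 -> A/T, 1 -> C/G, alternating within each run."""
    out = []
    prev = None
    odd_in_run = True
    for b in bits:
        if b not in ("0", "1"):
            raise ValueError(f"not a bit: {b!r}")
        # Restart the parity at the start of a run, otherwise flip it.
        odd_in_run = True if b != prev else not odd_in_run
        if b == "0":
            out.append("A" if odd_in_run else "T")
        else:
            out.append("C" if odd_in_run else "G")
        prev = b
    return "".join(out)


def demodulate(nucleotides: str) -> str:
    """Map A and T to 0, and C and G to 1."""
    return "".join("0" if q in ("A", "T") else "1" for q in nucleotides)


encoded = modulate("00111010")
decoded = demodulate(encoded)
```

Because demodulation looks at each nucleotide independently, a single substituted, inserted, or deleted nucleotide changes only one position of the demodulated bit sequence, although an insertion or deletion still shifts the positions that follow.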
After biological synthesis of each oligonucleotide, the variable-length segments $\{w_m\}$ for $m \in [1, 2, \ldots, M]$ are stored compactly in an ultra-dense DNA solution. In the absence of specially-allocated addresses for each segment, storing DNA in a solution implies the loss of ordering between segments. More precisely, the set of oligonucleotide segments $\{w_m\}$ is reordered arbitrarily via a noiseless permutation channel, yielding $\{w_{\pi(m)}\}$ for $m \in [1, 2, \ldots, M]$, where $\pi$ is a random one-to-one permutation of the original order.
Figure 5 illustrates a macro-level view of DNA storage. The input to the system is M oligonucleotide segments $\{x_m\}$ each of fixed length K. Synthesis produces variable-length segments $\{w_m\}$, and storage in solution yields $\{w_{\pi(m)}\}$. Prior to sequencing, the stored segments may be PCR-amplified to aid in the read process. The final step is DNA sequencing of all detectable strands.
Whether DNA sequencing occurs via the process of electrophoresis through a gel, via an automated DNA sequencer, or via next-generation microfluidic and nanopore devices, the information in each short oligonucleotide is re-assembled from reads. The modeling of nucleotide errors such as insertions, deletions, substitutions was already included mathematically for DNA synthesis, and is therefore not covered again at this stage. However, a new type of error occurs during sequencing. With a probability γ, each oligonucleotide is absent from the read output due to low coverage in its DNA solution. The low coverage may be the result of a lack of PCR-amplification for a particular oligonucleotide. The mathematical model for sequencing each oligonucleotide is,
Figure imgf000019_0001
After sequencing, the final output of the storage system is variable-length segments $\{y_{\pi(m)}\}$ for $m \in [1, 2, \ldots, M]$. In describing the sequencing, the process of merging copies of oligonucleotides due to PCR-amplification is not explored in further detail. The merging of segment copies may be done easily if each segment contains a protected address as a unique identifier. The merging is combined with a consensus approach based on edit distances. The edit distance between two strings is the minimum number of symbol insertions, deletions, or substitutions needed to transform one of the strings to the other string. The edit distance for variable-length segments is computed efficiently via dynamic programming which is essentially optimal in terms of its quadratic complexity in the length of the segments.
Definition 4. (Edit Distance): Let $V, V' \in \mathbb{Z}^+$, and $\Sigma$ be a discrete alphabet; e.g., $\Sigma = \{A, T, C, G\}$. Define a sequence $t = t_1 t_2 \ldots t_V$ of length $V$, and a sequence $r = r_1 r_2 \ldots r_{V'}$ of length $V'$, where $t_i, r_j \in \Sigma$ for $i \in [1, 2, \ldots, V]$ and $j \in [1, 2, \ldots, V']$. The edit distance $D(t, r, V, V')$ may be defined recursively. In the base case, if $\min(i, j) = 0$, $D(t, r, i, j) = \max(i, j)$. Otherwise, in the recursive case,
$$D(t, r, i, j) = \min \left\{ \begin{array}{l} D(t, r, i-1, j) + 1, \\ D(t, r, i, j-1) + 1, \\ D(t, r, i-1, j-1) + \mathbb{1}\{t_i \neq r_j\} \end{array} \right\}.$$
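The recursion of Definition 4 translates directly into a dynamic program that fills a table row by row. The Python sketch below keeps only two rows at a time, so it runs in time proportional to the product of the two lengths and memory proportional to the shorter dimension of the table. The function name `edit_distance` is illustrative, and the substitution term uses the usual convention of cost 1 when the two symbols differ and 0 when they match.

```python
def edit_distance(t: str, r: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning t into r."""
    # prev[j] holds D(t, r, i - 1, j); the base case D(., ., 0, j) = j.
    prev = list(range(len(r) + 1))
    for i in range(1, len(t) + 1):
        curr = [i] + [0] * len(r)          # base case D(., ., i, 0) = i
        for j in range(1, len(r) + 1):
            mismatch = 0 if t[i - 1] == r[j - 1] else 1
            curr[j] = min(prev[j] + 1,             # delete t_i
                          curr[j - 1] + 1,         # insert r_j
                          prev[j - 1] + mismatch)  # match or substitute
        prev = curr
    return prev[len(r)]


# One deleted nucleotide separates these two reads.
d = edit_distance("ATCGCACA", "ATCGACA")
```

In a consensus step, reads sharing the same address could be compared with this distance and the read closest to all others retained; that particular use is an illustration rather than a procedure stated in the text.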
As displayed in Figure 5, the DNA molecular channel contains several sources of error. The inputs are M oligonucleotide segments each of fixed size K nucleotides. Thus, the capacity for the storage channel is at most 2MK bits, if each nucleotide represents 2 bits. This upper bound assumes no constraints for modulation, and no sources of error. When incorporating individual types of error in the system, the upper bound of 2MK bits may be tightened.
Theorem 2. (Upper Bounds on Storage Capacity): Let $C_{ST}(M, K)$ be the storage capacity of the DNA molecular-level storage system described herein. The storage system contains M input oligonucleotide segments of length K nucleotides from the alphabet $\Sigma = \{A, T, C, G\}$. Denote by $A_e$ the adjacency matrix of an irreducible constraint graph specifying valid nucleotide sequences. The DNA storage system satisfies the following upper bounds,
$$C_{ST}(M, K) \leq MK \log_2\left(\lambda_{\max}(A_e)\right), \qquad (4)$$
$$C_{ST}(M, K) \leq \log_2 \binom{M + 4^K - 1}{4^K - 1}, \qquad (5)$$
$$C_{ST}(M, K) \leq \gamma\,(1 - p_{\mathrm{del}})\,2MK, \qquad (6)$$
Figure imgf000020_0001
the z-ary generalization of the entropy function. The largest real eigenvalue of a matrix is denoted by miCi( · ). The proof is provided in Appendix II.
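As an illustration of the substitution bound, the sketch below evaluates the z-ary entropy function h(x; z) given in Appendix II and the bound obtained from MK uses of a quaternary symmetric channel. It assumes that Eqn. (7) takes the standard form MK(2 - h(p_sub; 4)); the function names are hypothetical.

```python
import math

def h(x: float, z: int) -> float:
    """z-ary entropy: -x log2 x - (1-x) log2 (1-x) + x log2 (z-1), in bits."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    total = x * math.log2(z - 1)
    for p in (x, 1.0 - x):
        if p > 0.0:  # the limit of p*log2(p) as p -> 0 is 0
            total -= p * math.log2(p)
    return total

def substitution_bound_bits(M: int, K: int, p_sub: float) -> float:
    """Assumed form of Eqn. (7): MK uses of a quaternary symmetric channel."""
    return M * K * (2.0 - h(p_sub, 4))
```

With p_sub = 0 the bound reduces to the unconstrained, error-free value of 2MK bits.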
The upper bound relaxations provided in Theorem 2 are not achievable in most cases. For example, the capacity of the deletion channel is not known exactly. Therefore, the upper bound in Eqn. (6) is not attainable. For binary deletion channels, the capacity $C_{DL}$ is bounded as
$$1 - h(p_{del}; 2) \le C_{DL} \quad \text{for } 0 < p_{del} \le \tfrac{1}{2}.$$
For $0 < p_{del} < 1$,
$$\tfrac{1}{9}(1 - p_{del}) \le C_{DL} \le (1 - p_{del}).$$
Typically for deletion channels, theoretical bounds are achieved by random codes with block length approaching infinity. Even at finite lengths, the exponential complexity of decoding is impractical. Moreover, much less is known for quaternary or z-ary deletion channels. Aside from deletions, there are multiple errors occurring in DNA storage systems. It is theoretically possible to analyze multiple errors occurring simultaneously to tighten the bounds jointly.
The upper bound due to the permutation channel in Eqn. (5) is important to show that both the number of oligonucleotide segments $M$ and the length of each oligonucleotide $K$ must grow in order to increase storage capacity adequately. The upper bound is computed as follows for a few pairs $(M, K)$.
$$C_{ST}(M, 1) \le \log_2 \frac{(M + 3)!}{3!\, M!} = \log_2 \tfrac{1}{6}(M + 3)(M + 2)(M + 1) = O(\log_2 M).$$
$$C_{ST}(M, \log_2 M) \le \log_2 \frac{(M^2 + M - 1)!}{M!\, (M^2 - 1)!} \le \log_2 \frac{(M^2 + M)^M}{M!} = O(M \log_2 M).$$
[Equation image not recovered from source: imgf000021_0001]
For a constant $K$, the storage capacity grows at most as $O(\log_2 M)$. Oligonucleotide segments must not be too short, despite biological constraints due to low-cost DNA synthesis. To increase capacity at a rate linear in $M$, the length $K$ must grow at least as $\log_2 M$, which is the number of bits to uniquely address each segment. As expected, the capacity is upper bounded by a rate of growth linear in $K$.
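The growth behavior of the permutation-channel bound of Eqn. (5) may be checked numerically. The following is a small sketch using exact integer binomial coefficients; the function name `permutation_bound_bits` is an assumption made for illustration.

```python
import math

def permutation_bound_bits(M: int, K: int) -> float:
    """Eqn. (5): log2 of the number of multisets of M segments drawn from 4**K types."""
    n_types = 4 ** K
    return math.log2(math.comb(M + n_types - 1, n_types - 1))

if __name__ == "__main__":
    for M in (2 ** 6, 2 ** 8, 2 ** 10):
        K = int(math.log2(M))
        short = permutation_bound_bits(M, 1)   # K = 1: grows like log2 M
        longer = permutation_bound_bits(M, K)  # K = log2 M: grows like M log2 M
        print(M, round(short, 1), round(longer, 1), 2 * M * K)
```

The last printed column is the unconstrained value 2MK, which the permutation bound never exceeds.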
Measured with respect to upper bounds, practical codes for DNA storage should be efficient even in the case of short block lengths due to short lengths of each oligonucleotide. A key emphasis is handling a larger fraction of diverse errors. Figure 6 illustrates a pipeline for error-correction and processing comprising the following components: (1) Modulation and demodulation between bits and nucleotides; (2) Encoding and decoding for synchronizing sequences of bits; (3) Error-correction across each oligonucleotide; (4) Error-correction across multiple oligonucleotides.
Prior to the modulation and demodulation blocks in Figure 6, the information stream is in bits. The modulation map converts the stream of bits to a stream of nucleotides obeying the constraint graph described above. Example 3 defined a modulation map in which 1 bit mapped to 1 nucleotide. To simplify the discussion, assume an optimal unconstrained modulation map in which 2 bits map to 1 nucleotide. The modulation function is defined with parameters $k_{mod} = 2$ and $n_{mod} = 1$. More precisely, $\Phi_E(01) = A$, $\Phi_E(11) = T$, $\Phi_E(00) = C$, $\Phi_E(10) = G$. The modulation map $\Phi_E$ is applied multiple times to pairs of bits in an even-length bit sequence; e.g., $\Phi_E(00100111) = CGAT$. The demodulation function is defined by $\Phi_D(A) = 01$, $\Phi_D(T) = 11$, $\Phi_D(C) = 00$, $\Phi_D(G) = 10$. During the demodulation of a nucleotide sequence to a bit sequence, the presence of an insertion or deletion of a nucleotide results in an insertion or deletion of 2 bits in the demodulated bit stream.
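The unconstrained 2-bit map above may be sketched as a pair of lookup tables. This is an illustrative sketch only; the names `ENCODE`, `DECODE`, `modulate`, and `demodulate` are assumptions, while the bit-to-nucleotide assignments are those stated in the text.

```python
# Modulation map from the text: 01 -> A, 11 -> T, 00 -> C, 10 -> G.
ENCODE = {"01": "A", "11": "T", "00": "C", "10": "G"}
DECODE = {nt: bits for bits, nt in ENCODE.items()}

def modulate(bits: str) -> str:
    """Map an even-length bit string to nucleotides, 2 bits per nucleotide."""
    if len(bits) % 2 != 0:
        raise ValueError("bit sequence must have even length")
    return "".join(ENCODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def demodulate(nucleotides: str) -> str:
    """Map nucleotides back to bits, 1 nucleotide per 2 bits."""
    return "".join(DECODE[nt] for nt in nucleotides)
```

For example, deleting the G from CGAT leaves CAT, which demodulates to 6 bits rather than 8, so every later bit pair is shifted by 2 bits.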
The input to the synchronization encoding block in Figure 6 is a stream of bits for each oligonucleotide. A stream of input bits may be grouped into symbols. For purposes of the present discussion, due to the subsequent unconstrained modulation, assume 2 bits correspond to 1 symbol. A mechanism to achieve synchronization is to insert a marker symbol in the input stream. Markers delineate boundaries to detect insertions and deletions. For every $k_{sync}$ information symbols, $(n_{sync} - k_{sync})$ marker symbols may be inserted, resulting in an overhead of $n_{sync}/k_{sync}$. Thus, an exemplary embodiment of a synchronization encoder $\Lambda_E$ inserts marker symbols.
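A marker-inserting encoder of this kind may be sketched as follows. The sketch assumes one marker after every k_sync information symbols with no marker at the beginning or end of the stream (the convention stated in Example 5); the function name and the marker value "11" are hypothetical choices, not taken from the source.

```python
from typing import List

def insert_markers(symbols: List[str], k_sync: int, marker: str = "11") -> List[str]:
    """Insert one marker symbol after every k_sync information symbols.

    No marker is placed after the final symbol, so the stream neither
    begins nor ends with a marker.
    """
    out: List[str] = []
    last = len(symbols) - 1
    for idx, sym in enumerate(symbols):
        out.append(sym)
        block_ended = (idx + 1) % k_sync == 0
        if block_ended and idx != last:
            out.append(marker)
    return out
```

Applied to 64 two-bit symbols (128 bits), k_sync = 2, 3, and 4 yield streams of 95, 85, and 79 symbols, which agrees with the lengths reported in Example 5 for the ratios 3/2, 4/3, and 5/4.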
A synchronization decoder is more complex. Denote a random input symbol stream with markers as $T = T_1 T_2 \cdots T_i \cdots T_{I-1} T_I = T_1^I$ with a particular realization $t = t_1 t_2 \cdots t_i \cdots t_{I-1} t_I = t_1^I$. For information symbols comprised of 2 bits, any symbol is assumed equally likely, $P(T_i = t_i) = \tfrac{1}{4}$. For marker symbols, $P(T_i = t_i) = 1$ if the correct marker is specified at position $1 \le i \le I$; otherwise $P(T_i = t_i) = 0$. For each oligonucleotide, the input symbol stream is modulated into nucleotides, transmitted over the DNA channel, and demodulated back into bits which are then grouped back into symbols. Denote a random output symbol stream as $S = S_1 S_2 \cdots S_j \cdots S_{J-1} S_J = S_1^J$ with a particular realization $s = s_1 s_2 \cdots s_j \cdots s_{J-1} s_J = s_1^J$. Depending on whether insertions or deletions occur, it is possible that $J \neq I$.
Even though $J \neq I$ may hold, to achieve synchronization, given $J$ received symbols, it is possible to compute $P(S_1^J = s_1^J \mid T_i = t)$ as the posterior probability of an input symbol for each $i$ within the range $1 \le i \le I$. A synchronization decoder $\Lambda_D$ may feed the raw probabilities to an outer code, or directly estimate
$$\hat{t}_i = \arg\max_t P(S_1^J = s_1^J \mid T_i = t).$$
The decoder $\Lambda_D$ is able to efficiently compute posterior probabilities via dynamic programming. Appendix III details the α/β recursions involved based on the Markov chain model discussed above.
In addition to the synchronization achieved by the synchronization blocks in Figure 6, error-correction may be applied per oligonucleotide. For example, a deletion error causes a synchronization error, but even if the position of the deletion is known via synchronization, correction of the deleted symbol may be needed. An example of a short block code is the Bose-Chaudhuri-Hocquenghem (BCH) code. For this code, $k_{bch}$ bits are information bits, and $(n_{bch} - k_{bch})$ bits are coded redundant bits. Thus, a storage overhead of $n_{bch}/k_{bch}$ is incurred.
Primitive BCH codes over the Galois field $GF(2)$ are a standard class of BCH codes. For a positive integer $m \ge 3$, a t-error-correcting BCH($n_{bch}$, $k_{bch}$) code has parameters
$$n_{bch} = 2^m - 1,$$
$$n_{bch} - k_{bch} \le mt,$$
$$d_{min} \ge 2t + 1,$$
where $d_{min}$ is the minimum distance of the linear code. As a concrete example, for $m = 6$, a BCH($n_{bch} = 63$, $k_{bch} = 36$) code exists for which $t \ge 4$.
The following t-error-correcting BCH codes exist: BCH(63,57,1), BCH(63,45,3), BCH(63,39,4), BCH(63,36,5), BCH(63,30,6). Similarly, the following BCH codes of longer lengths exist: BCH(127,78,7), BCH(127,71,9), BCH(127,64,10), BCH(127,57,11), BCH(127,50,13). These BCH codes are applicable for DNA storage due to their short block lengths and error-correcting abilities per oligonucleotide. For a t-error-correcting BCH code with storage overhead $n_{bch}/k_{bch}$, the code corrects a fraction $t/n_{bch}$ of errors.
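The trade-off between storage overhead and the fraction of correctable errors for the listed codes may be tabulated directly. The following is a small sketch; the names `BCH_CODES` and `bch_stats` are illustrative assumptions.

```python
from typing import List, Tuple

# (n_bch, k_bch, t) triples listed in the text.
BCH_CODES: List[Tuple[int, int, int]] = [
    (63, 57, 1), (63, 45, 3), (63, 39, 4), (63, 36, 5), (63, 30, 6),
    (127, 78, 7), (127, 71, 9), (127, 64, 10), (127, 57, 11), (127, 50, 13),
]

def bch_stats(n: int, k: int, t: int) -> Tuple[float, float]:
    """Return (storage overhead n/k, fraction of correctable errors t/n)."""
    return n / k, t / n

if __name__ == "__main__":
    for n, k, t in BCH_CODES:
        overhead, fraction = bch_stats(n, k, t)
        print(f"BCH({n},{k},{t}): overhead {overhead:.2f}, corrects {fraction:.1%}")
```

Each listed triple also satisfies the usual primitive-BCH relation n - k ≤ mt with m = log2(n + 1).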
Error-correction across multiple oligonucleotide segments may be included to protect against missing segments, and segments that have an individual decoding error. A basic kind of error-correction code is a Reed-Solomon (RS) code. An RS($n_{rs}$, $k_{rs}$) code has a minimum Hamming distance of $n_{rs} - k_{rs} + 1$. The storage overhead ratio is $n_{rs}/k_{rs}$. The code is able to correct up to $U$ errors with unknown locations and $E$ errors with known locations, where $2U + E \le n_{rs} - k_{rs}$. A Reed-Solomon code operates over a Galois field $GF(2^m)$ where $m$ is a positive integer. The block length is $n_{rs} = 2^m - 1$. For example, setting $m = 8$, the well-known RS($n_{rs} = 255$, $k_{rs} = 223$) code corrects up to 16 errors with unknown locations, or up to 32 errors with known locations.
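The correction condition for a Reed-Solomon code may be expressed as a simple check. This sketch only tests whether a given mix of errors and erasures is within the guaranteed capability; it does not perform decoding, and the function name is hypothetical.

```python
def rs_can_correct(n_rs: int, k_rs: int, unknown_errors: int, known_erasures: int) -> bool:
    """True if 2U + E is within the n_rs - k_rs parity symbols of an RS(n_rs, k_rs) code."""
    if min(unknown_errors, known_erasures) < 0:
        raise ValueError("error counts must be non-negative")
    return 2 * unknown_errors + known_erasures <= n_rs - k_rs
```

In the DNA storage setting, a missing segment with a known address acts as an erasure (known location), whereas a segment decoded incorrectly without detection acts as an error with unknown location.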
For the DNA storage system of Figure 6, a fraction γ of the oligonucleotide segments have low-coverage during PCR-amplification. Thus, these segments are missing for decoding and reconstruction. Furthermore, despite synchronization and error-correction within each oligonucleotide, a fraction of oligonucleotide segments may be corrupted beyond repair. For these dual purposes, an error-correction code across blocks of oligonucleotides is crucial. An RS(nrs, krs) block is comprised of nrs oligonucleotide segments of which krs segments hold information, and (nrs— krs) segments hold redundant parity information. The segments which are missing or corrupted beyond repair may be detectable because each oligonucleotide may be allocated address bits in addition to information bits. The address indicates not only the RS-block for each oligonucleotide, but also the specific position of the segment within the RS- block. Only correctly decoded oligonucleotide segments labeled by addresses are included for RS-block decoding.
Each oligonucleotide is equipped with an address since the DNA channel is partially described by a permutation channel as described herein. Thus, an overhead of $n_{addr}/k_{addr}$ is incurred for DNA storage, where $k_{addr}$ bits per oligonucleotide are reserved for information bits, and $(n_{addr} - k_{addr})$ bits are reserved for address bits per oligonucleotide.
The end-to-end coding system aims to achieve zero error (perfect reliability) in order to extract data from DNA. Depending on the number of insertions, deletions, and substitutions expected from the DNA channel, an amount of overhead is selected in order for the overall system to have a very low probability of error in reconstruction. Each block of Figure 6 has an associated storage overhead. The total storage overhead v incurred by the system is given by,
$$v \triangleq \frac{2 n_{mod}}{k_{mod}} \cdot \frac{n_{sync}}{k_{sync}} \cdot \frac{n_{bch}}{k_{bch}} \cdot \frac{n_{rs}}{k_{rs}} \cdot \frac{n_{addr}}{k_{addr}}. \tag{8}$$
The overall storage ratio includes overhead for the addressing of each oligonucleotide $n_{addr}/k_{addr}$, error-correction across multiple oligonucleotide segments $n_{rs}/k_{rs}$, error-correction within each oligonucleotide $n_{bch}/k_{bch}$, synchronization within each oligonucleotide $n_{sync}/k_{sync}$, and constrained modulation and demodulation to convert between bits and nucleotides $2n_{mod}/k_{mod}$. The overhead ratio for modulation and demodulation includes a factor of 2 since the conversion is from $k_{mod}$ bits to $n_{mod}$ nucleotides.
Example 5 (Synchronization Overhead vs. Reliability): To illustrate the effect of synchronization on decoding error within each oligonucleotide, the following experiment was simulated based on all code blocks of Figure 6, excluding error-correction across multiple oligonucleotides. For each oligonucleotide, the BCH (nbch, kbch) code was selected with parameters kbch = 50 bits, and nbch = 127 bits. Thus each oligonucleotide is able to store 50 bits comprised of both address and information bits. An extra fixed bit was added to the end of the stream to allow for an even length of 128 bits.
In addition, for the synchronization code, a varying amount of overhead was selected. After every $k_{sync}$ symbols, in which each symbol is set equivalent to 2 bits, a marker symbol was inserted. For example, for a synchronization overhead of $n_{sync}/k_{sync} = 3/2$, after every 2 symbols or 4 bits, a marker symbol is inserted. The beginning and end of the stream were without marker symbols. For synchronization ratios of $n_{sync}/k_{sync} = 2/1, 3/2, 4/3, 5/4$, the overall lengths of the symbol streams were 128, 95, 85, 79 symbols respectively. Each symbol of $k_{mod} = 2$ bits was modulated onto $n_{mod} = 1$ nucleotide, i.e., unconstrained modulation with $2n_{mod}/k_{mod} = 1$. Thus, the oligonucleotide lengths were 128, 95, 85, 79 nucleotides respectively. An increasing oligo length accommodates more synchronization overhead.
Figure 7 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of deletions. An increasing fraction of synchronization markers increases the probability of correct decoding at the expense of extra redundancy. The results of decoding individual oligonucleotides are plotted in Figure 7 for different probabilities of deletion $p_{del}$ in the DNA channel. Other types of error were excluded in the simulation.
Each data-point in the curves is the result of decoding $10^4$ oligonucleotides. The synchronization code with the highest overhead yields the best performance curve at the cost of synthesizing a longer oligonucleotide.
Example 6 (Effect of Diverse Errors on Reliability): In this experiment, a BCH($n_{bch} = 127$, $k_{bch} = 50$) code was selected with a $t = 13$ error-correction ability, so that each oligonucleotide is able to hold 50 bits of combined address and payload data. An extra fixed bit was added to each bit stream to allow for an even length of 128 bits. The synchronization code was selected with $n_{sync}/k_{sync} = 3/2$ so that the total number of symbols was 95. Each symbol was mapped to a nucleotide with unconstrained modulation, yielding oligonucleotide segments of length 95 nucleotides.
Figure 8 shows experimental results for error-correction of oligonucleotide segments in the presence of a large fraction of insertions, deletions, and substitutions. The plot shown in Figure 8 illustrates the decoding success probability under varying probabilities $p_{sub}$, $p_{del}$, $p_{ins}$ across the DNA channel. A combined presence of diverse errors is more challenging to accommodate compared to the presence of only one type of error. Nevertheless, the storage system is robust to varying storage conditions. Each data-point in the curves is the result of decoding $10^4$ oligonucleotides.
Based on the simulations provided, deletion and insertion errors are costly to accommodate in terms of storage overhead. In addition, the addressing overhead is costly if a large number of oligonucleotide segments is necessary. As a basic calculation, the following parameters would enable a terabyte storage system for the DNA storage channel. Assuming unconstrained modulation in which $2n_{mod}/k_{mod} = 1$, a BCH(127,50) error-correction code within each oligonucleotide (plus 1 bit of padding), a synchronization code of rate $n_{sync}/k_{sync} = 3/2$, a rate one-half Reed-Solomon code with $n_{rs}/k_{rs} = 2/1$, and an addressing overhead of 42 bits of address vs. 8 bits (1 byte) of payload information per oligonucleotide, the total overhead is computed as follows.
$$v = \left(\frac{2(1)}{2}\right)\left(\frac{3}{2}\right)\left(\frac{127 + 1}{50}\right)\left(\frac{2}{1}\right)\left(\frac{42}{8}\right) = 40.32.$$
In the above calculation, each oligonucleotide is of length 95 nucleotides, and stores 8 bits (1 byte) of payload information. The system is able to store potentially $2^{42} \times 1 = 4$ terabytes of information since $2^{42}$ oligonucleotide addresses are allocated. The storage overhead is costly primarily because of the address space overhead of $n_{addr}/k_{addr} = 42/8 = 5.25$, which is fundamental to the problem of assembling segments in DNA storage.
To adjust the above calculations, it is possible to write more bits per oligonucleotide. For example, if 34 bits are reserved for addresses and 16 bits for payload information per oligonucleotide, then $n_{addr}/k_{addr} = 34/16$. The storage overhead is then given by
$$v \triangleq \frac{2 n_{mod}}{k_{mod}} \cdot \frac{n_{sync}}{k_{sync}} \cdot \frac{n_{bch}}{k_{bch}} \cdot \frac{n_{rs}}{k_{rs}} \cdot \frac{n_{addr}}{k_{addr}} = \left(\frac{2(1)}{2}\right)\left(\frac{3}{2}\right)\left(\frac{127 + 1}{50}\right)\left(\frac{2}{1}\right)\left(\frac{34}{16}\right) = 16.32.$$
Each oligonucleotide is able to store 16 bits (2 bytes) of information, which allows $2^{34}$ possible addresses. In this case, the storage system is able to store only a maximum of $2^{34} \times 2 = 32$ gigabytes of information.
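The two overhead calculations above may be reproduced with exact rational arithmetic. The following is a minimal sketch; the function name `total_overhead` is hypothetical, and the addressing factor is entered exactly as used in the calculations above (address bits over payload bits, i.e., 42/8 and 34/16).

```python
from fractions import Fraction

def total_overhead(mod: Fraction, sync: Fraction, bch: Fraction,
                   rs: Fraction, addr: Fraction) -> Fraction:
    """Product of the five per-block overhead ratios."""
    return mod * sync * bch * rs * addr

# Shared factors: unconstrained modulation, 3/2 sync, BCH(127,50) + 1 pad bit, rate-1/2 RS.
common = dict(mod=Fraction(2 * 1, 2), sync=Fraction(3, 2),
              bch=Fraction(127 + 1, 50), rs=Fraction(2, 1))

terabyte_design = total_overhead(addr=Fraction(42, 8), **common)
gigabyte_design = total_overhead(addr=Fraction(34, 16), **common)

if __name__ == "__main__":
    print(float(terabyte_design))  # 42 address bits, 1 payload byte per oligo
    print(float(gigabyte_design))  # 34 address bits, 2 payload bytes per oligo
    print(2 ** 42 * 1 // 2 ** 40, "terabytes")
    print(2 ** 34 * 2 // 2 ** 30, "gigabytes")
```

Halving the address field more than halves the overhead, but it shrinks the addressable capacity by a factor of 128.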
The above designs for DNA storage assume unconstrained modulation, with oligonucleotide lengths of 95 nucleotides. If modulation is constrained under low-cost molecular synthesis, the overhead ratio for modulation could rise up to $2n_{mod}/k_{mod} = 2/1$, in which 1 bit is written for every 1 nucleotide. In addition to increasing the total storage overhead ratio $v$, the length of each oligonucleotide would increase two-fold to 190 nucleotides.
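One constrained map with 1 bit per nucleotide is the alternating map recited in the claims of this document: an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1. The sketch below illustrates that map; the function names are assumptions, and the demodulation rule (A or T gives 0, C or G gives 1) is inferred from the map rather than stated in the source.

```python
def modulate_alternating(bits: str) -> str:
    """Map each bit to one nucleotide, alternating A/T for 0s and C/G for 1s."""
    letters = {"0": ("A", "T"), "1": ("C", "G")}
    seen = {"0": 0, "1": 0}  # occurrences of each bit value so far
    out = []
    for b in bits:
        seen[b] += 1
        odd_occurrence = seen[b] % 2 == 1
        out.append(letters[b][0] if odd_occurrence else letters[b][1])
    return "".join(out)

def demodulate_alternating(nucleotides: str) -> str:
    """Inverse map (an inference): A or T reads as 0, C or G reads as 1."""
    return "".join("0" if nt in "AT" else "1" for nt in nucleotides)
```

Because consecutive equal bits always receive different letters, the output never repeats a nucleotide back to back, at the cost of one nucleotide per bit.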
The parameters for both exemplary terabyte and gigabyte DNA storage systems were selected to achieve high reliability in the presence of diverse errors in the DNA storage channel. The exact probability of perfect reconstruction may be estimated based on decoding curves for decoding individual oligonucleotide segments, as illustrated in Example 5 and Example 6. For example, for a rate one-half Reed-Solomon code per block of oligonucleotide segments, if more than one-half of the segments are not missing and decoded correctly, the code recovers the entire block with zero error. The simulation experiments described above illustrate that a much higher overhead redundancy ratio may be needed to correct for synchronization errors.
DNA storage continues to be an emerging, innovative, and viable technology. The cost of high-throughput DNA synthesis is decreasing with the introduction of new technologies, and DNA sequencing costs have already dropped rapidly with the discovery of microfluidic nanopore devices. Ideas such as biological editing of DNA, computing, search, and indexing provide focus for cutting-edge research. For example, recently it was shown that simple computing systems may be built in bacteria in the form of synthetic state machines. Next- generation biological computers could utilize DNA storage for memory.
The present description illustrates the present principles. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the present principles and are included within its spirit and scope.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the present principles and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. For example, use in the description when referring to the drawings of "top", "bottom", "left", "right" and other such terms indicating an orientation or relative relationship between areas of the Figures are illustrative only and not limiting as to the present principles.
Moreover, all statements herein reciting principles, aspects, and embodiments of the present principles, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
Reference in the specification to "one embodiment" or "an embodiment" of the present principles, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment", as well any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the present principles are not limited to those precise embodiments, and that various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present principles. All such changes and modifications are intended to be included within the scope of the present principles.
APPENDIX I: Proof of Theorem 1
Define the random variable $T$ as the number of nucleotides written starting from state $s_k$ and ending at state $s_{k+1}$ in the Markov chain. Define the random variable $U$ as the number of nucleotides written starting from either the WRITE state or the WRITE ERROR state and ending at the state $s_{k+1}$. The random variable $U$ has a geometric distribution: $P_U(0) = 0$, $P_U(1) = 1 - p_{bur}$, and $P_U(n) = (p_{bur})^{n-1}(1 - p_{bur})$ for $n \ge 1$. The generating function of $U$ is given by
$$G_U(\omega) = \sum_{i=0}^{\infty} \omega^i P_U(i) = \frac{\omega (1 - p_{bur})}{1 - \omega p_{bur}}.$$
Similarly, the probability distribution of $T$ is
$$P_T(0) = p_{del},$$
$$P_T(1) = (p_{ins})(p_{del}) + (1 - p_{del} - p_{ins})(1 - p_{bur}),$$
$$P_T(n) = (p_{ins}) P_T(n - 1) + (1 - p_{del} - p_{ins}) P_U(n), \quad \text{for } n > 1.$$
The probabilities $P_T(n)$ for $n > 1$ are defined recursively. The generating function associated to the random variable $T$ is
$$G_T(\omega) = \sum_{i=0}^{\infty} \omega^i P_T(i) = p_{del} + \sum_{i=1}^{\infty} \omega^i \big( p_{ins} P_T(i - 1) + (1 - p_{del} - p_{ins}) P_U(i) \big) = p_{del} + p_{ins}\, \omega\, G_T(\omega) + (1 - p_{del} - p_{ins})\, G_U(\omega).$$
Solving for $G_T(\omega)$ yields,
$$G_T(\omega) = \frac{p_{del}}{1 - p_{ins}\omega} + \frac{\omega (1 - p_{del} - p_{ins})(1 - p_{bur})}{(1 - p_{bur}\omega)(1 - p_{ins}\omega)}.$$
The moments of $T$ may be generated from the function $G_T(\omega)$. For example, the mean and variance of the number of nucleotides written starting at state $s_k$ and ending at state $s_{k+1}$ are
$$E[T] = G_T'(\omega = 1),$$
$$\mathrm{VAR}[T] = G_T''(\omega = 1) + G_T'(\omega = 1) - \big(G_T'(\omega = 1)\big)^2.$$
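The closed form of the generating function may be checked numerically against the recursive definition of the distribution of T. The sketch below compares the mean obtained by summing n·P_T(n) with a finite-difference estimate of the derivative of G_T at ω = 1; the function names and the parameter values are illustrative assumptions.

```python
from typing import List

def p_u(n: int, p_bur: float) -> float:
    """Geometric distribution of U: P_U(0) = 0, P_U(n) = p_bur**(n-1) * (1 - p_bur)."""
    return 0.0 if n == 0 else p_bur ** (n - 1) * (1.0 - p_bur)

def p_t_distribution(p_ins: float, p_del: float, p_bur: float, n_max: int) -> List[float]:
    """P_T(0..n_max) from the recursion P_T(n) = p_ins*P_T(n-1) + p_tx*P_U(n)."""
    p_tx = 1.0 - p_del - p_ins
    probs = [p_del]
    for n in range(1, n_max + 1):
        probs.append(p_ins * probs[n - 1] + p_tx * p_u(n, p_bur))
    return probs

def g_t(w: float, p_ins: float, p_del: float, p_bur: float) -> float:
    """Closed-form generating function G_T(w)."""
    p_tx = 1.0 - p_del - p_ins
    return (p_del / (1.0 - p_ins * w)
            + w * p_tx * (1.0 - p_bur) / ((1.0 - p_bur * w) * (1.0 - p_ins * w)))

def mean_from_g(p_ins: float, p_del: float, p_bur: float, step: float = 1e-6) -> float:
    """Central-difference estimate of G_T'(1), which equals E[T]."""
    return (g_t(1.0 + step, p_ins, p_del, p_bur)
            - g_t(1.0 - step, p_ins, p_del, p_bur)) / (2.0 * step)

probs = p_t_distribution(0.1, 0.1, 0.2, n_max=200)
mean_by_sum = sum(n * p for n, p in enumerate(probs))
mean_by_g = mean_from_g(0.1, 0.1, 0.2)
```

For these example parameters both estimates are close to 11/9, i.e., slightly more than one nucleotide is written per state transition on average.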
Higher-order moments of $T$ may also be computed from the generating function $G_T(\omega)$.
APPENDIX II: Proof of Theorem 2
Consider an upper bound due to the permutation channel. A related mathematical problem is first presented. Let the set $\{X_i\}$ for $i \in [1, 2, \ldots, J]$ contain non-negative integers. Let $Y$ be a positive integer. The following equation
$$\sum_{i=1}^{J} X_i = Y$$
has exactly $\binom{Y + J - 1}{J - 1}$ solutions. Regarding the DNA permutation channel, the permutations destroy ordering of oligonucleotide segments. Information is only transmitted in the number of different types of segments possible. In this case, there are $4^K$ unique oligonucleotide segments possible of length $K$, and $M$ total segments. This problem corresponds to letting $J = 4^K$ and $Y = M$. The total capacity in bits of the permutation channel follows directly, and is as claimed in Eqn. (5).
Consider an upper bound due to individual nucleotide deletions with probability $p_{del}$. In addition, there exists a fraction $\gamma$ of full oligonucleotide segments which are omitted. An upper bound on the deletion channel is possible by relaxing the channel to an erasure channel. Assuming an overall fraction $\gamma(1 - p_{del})$ of nucleotides are erased, the upper bound is as claimed in Eqn. (6).
Consider an upper bound due to a quaternary symmetric channel with input and output alphabet $\Sigma = \{A, T, C, G\}$. Let $X$ be a random input to the channel, and $Y$ be a random output. The conditional distribution of the quaternary channel is as follows. If a nucleotide is transmitted, then with probability $1 - p_{sub}$, the correct nucleotide is received. Otherwise, each one of the other nucleotides is received with probability $\tfrac{1}{3}(p_{sub})$. The capacity of a single use of this channel is given by
$$C = 2 - h(p_{sub}; 4),$$
where $h(x; z) = -x \log_2(x) - (1 - x)\log_2(1 - x) + x \log_2(z - 1)$ for $0 < x < 1 - \tfrac{1}{z}$ is the z-ary generalization of the entropy function. An upper bound on the overall storage capacity is equivalent to $MK$ uses of this noisy discrete memoryless quaternary channel as claimed in Eqn. (7).
APPENDIX III: α/β Recursions
To compute posterior probabilities, define the event $\mathcal{E}_{i,j}$ to represent that $j$ symbols were received after $i$ symbols were transmitted. Then the probability $P(\mathcal{E}_{i,j})$ of the event may be computed recursively,
[Equation image not recovered from source: imgf000031_0001]
The base case initialization is $P(\mathcal{E}_{0,0}) = 1$ and $P(\mathcal{E}_{0,j}) = 0$ for $j > 0$. Similarly, other quantities of interest may be computed recursively. The α/β probabilities are defined as follows,
[Equation image not recovered from source: imgf000031_0002, containing Eqns. (9) and (10)]
To compute the probabilities recursively, the Markov chain as explained above may be approximated by "unwrapping" it. Since the modulation is unconstrained with 2 bits mapping to 1 nucleotide, and a symbol is also 2 bits, a symbol insertion, deletion, or substitution error corresponds to that of a nucleotide. As a simplification, the burst error is assumed to be $p_{bur} = 0$, and a maximum of two insertions is allowed. Let the probability of symbol transmission be defined as $p_t = 1 - p_{ins} - p_{del}$. Then there exist six possibilities:
• A single deletion occurs with probability $p_1 = p_{del}$;
• A symbol is transmitted with probability $p_2 = p_t$;
• A random symbol is inserted followed by a deletion, with probability $p_3 = p_{ins}\, p_{del}$;
• A random symbol is inserted followed by a symbol transmission, with probability $p_4 = p_{ins}\, p_t$;
• Two (or more) random symbols are inserted followed by a deletion, with probability $p_5 = \frac{p_{ins}^2}{1 - p_{ins}}\, p_{del}$;
• Two (or more) random symbols are inserted followed by a symbol transmission, with probability $p_6 = \frac{p_{ins}^2}{1 - p_{ins}}\, p_t$.
The sum $p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1$. The probability of more than two insertions is relatively rare, so a maximum of two insertions is assumed. Based on the six possibilities of the DNA channel, the α/β probabilities may be computed recursively. Denote by $p_{unif}$ the probability of a uniformly random symbol. In the case of symbols of 2 bits, $p_{unif} = \tfrac{1}{4}$. Define the following function involving the symbol substitution error probability $p_{sub}$,
[Equation image not recovered from source: imgf000032_0001]
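The six event probabilities, and the resulting distribution of the number of received symbols, may be sketched as follows. The source recursion for the event probability is contained in an equation image that is not available here, so the dynamic program below is an assumption: it supposes that each transmitted symbol independently triggers exactly one of the six events, which emit 0, 1, 1, 2, 2, and 3 output symbols respectively. All names are hypothetical.

```python
from typing import List

# Output symbols emitted by events p1..p6 (deletion, transmission,
# insert+delete, insert+transmit, two inserts+delete, two inserts+transmit).
OUTPUTS = [0, 1, 1, 2, 2, 3]

def event_probabilities(p_ins: float, p_del: float) -> List[float]:
    """The six probabilities p1..p6 with p_t = 1 - p_ins - p_del and p_bur = 0."""
    p_t = 1.0 - p_ins - p_del
    two_or_more = p_ins ** 2 / (1.0 - p_ins)
    return [p_del, p_t, p_ins * p_del, p_ins * p_t,
            two_or_more * p_del, two_or_more * p_t]

def drift_distribution(n_inputs: int, p_ins: float, p_del: float) -> List[float]:
    """dist[j] = probability that j symbols were received after n_inputs were sent."""
    probs = event_probabilities(p_ins, p_del)
    dist = [1.0]  # base case: zero sent, zero received, with probability 1
    for _ in range(n_inputs):
        new = [0.0] * (len(dist) + 3)
        for j, mass in enumerate(dist):
            for p, emitted in zip(probs, OUTPUTS):
                new[j + emitted] += mass * p
        dist = new
    return dist
```

After i transmitted symbols at most 3i symbols can have been received, which is why the table gains three entries per step.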
Then the α forward probabilities defined in Eqn. (9) may be determined recursively as follows,
[Equation image not recovered from source: imgf000032_0002]
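Similarly, the β probabilities of Eqn. (10) may be determined recursively as follows,
[Leading terms of the recursion for $\beta_{i,j}$ not recovered from source image: imgf000032_0003]
$$\cdots + p_3\, \beta_{i+1,j+1}\, p_{unif} + p_4\, \beta_{i+1,j+2}\, p_{unif} \sum_t P(T_{i+1} = t)\, \theta(s_{j+2}, t) + p_5\, \beta_{i+1,j+2}\, (p_{unif})^2 + p_6\, \beta_{i+1,j+3}\, (p_{unif})^2 \sum_t P(T_{i+1} = t)\, \theta(s_{j+3}, t).$$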
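The posterior probabilities may be computed based on the α/β recursions. Let $\tau_i = \min\{3i, J\}$. The posterior probability is given by
[Equation images not recovered from source: imgf000032_0004 and imgf000033_0001. Surviving fragments of the sum include terms in $p_2\, \alpha_{i-1,j-1}\, P(T_i = t)$ and $p_3\, \alpha_{i-1,j-1}\, p_{unif}$.]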
Thus, based on the output realization of symbols produced through the DNA storage channel, it is possible to estimate the posterior probability via dynamic programming. An estimator may choose the symbol for $T_i$ which has the highest posterior probability, and/or provide the raw posterior probabilities to an outer code for joint decoding.

Claims

1. A method of encoding digital data comprising:
adding address information to the digital data;
converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable:
- synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and
- decoding the digital data from the DNA oligonucleotide.
2. An encoder for encoding digital data comprising at least one processor configured for
adding address information to the digital data;
converting the digital data including the address information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable:
- synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process, and
- decoding the digital data from the DNA oligonucleotide.
3. The method of claim 1, or the encoder of claim 2 wherein the synchronization information comprises a synchronization marker.
4. The method of claim 1 or 3 further comprising or the encoder of claim 2 or 3 configured for adding error correction information to the digital data following adding the address information and before converting, whereby converting includes converting the digital data including the address information and including the error correction information to the plurality of codes.
5. The method or encoder of claim 4 wherein the error correction information includes a block error correction code and a word error correction code.
6. The method or encoder of claim 5 wherein the block error correction code includes one of a Reed-Solomon code and a Fountain code.
7. The method or encoder of claim 5 or 6 wherein the word error correction code includes one of a BCH code and a LDPC code.
8. A method of encoding digital data comprising:
adding address information to the digital data;
adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide.
9. A device for encoding digital data comprising at least one processor configured for adding address information to the digital data;
adding error correction information to the digital data including the address information; converting the digital data including the address information and the error correction information to a plurality of codes wherein each code represents a deoxyribonucleic acid (DNA) nucleotide; and
adding synchronization information to the plurality of codes, wherein the plurality of codes including the synchronization information is configured to enable synthesis of a DNA oligonucleotide representing the plurality of codes including the synchronization information using a lossy DNA synthesis process and to enable subsequent decoding of the digital data from the DNA oligonucleotide.
10. The method of claim 8 or the device of claim 9 wherein the synchronization information comprises a synchronization marker.
11. The method of any of claims 8 or 10 or the device of claim 9 or 10 wherein the error correction information includes a block error correction code and a word error correction code.
12. The method or the device of claim 11 wherein the block error correction code includes one of a Reed-Solomon code and a Fountain code.
13. The method or the device of claim 11 or 12 wherein the word error correction code includes one of a BCH code and a LDPC code.
14. The method of any of claims 1, 3 to 8 and 10 to 11, or the device of any of claims 2 to 7 and 9 to 11 wherein converting comprises a mapping of the digital data to nucleotides according to a modulation map.
15. The method or the device of claim 14 wherein the modulation map is in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
16. The method or device of claim 14 or 15 wherein the modulation map comprises writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
17. A method of decoding digital data stored in DNA comprising:
synchronizing a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the synchronization is based on synchronization information included in each of the plurality of oligonucleotides;
demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and
extracting the digital data from the plurality of binary words.
18. A device for decoding digital data stored in DNA comprising at least one processor configured for
synchronizing a plurality of oligonucleotides selected from a pool of DNA strands synthesized using a lossy DNA synthesis process, wherein the synchronization is based on synchronization information included in each of the plurality of oligonucleotides; demodulating the plurality of oligonucleotides to produce a plurality of binary words corresponding to respective ones of the plurality of oligonucleotides, wherein the demodulating is based on a modulation map providing a mapping of nucleotides to digital bits; and
extracting the digital data from the plurality of binary words.
19. The method of claim 17 or the device of claim 18 wherein the synchronization information comprises a synchronization marker.
20. The method of claim 17 or 19 or the device of claim 18 or 19 wherein each of the plurality of binary words comprises error correction information, and extracting the digital data includes error correction based on the error correction information.
21. The method or the device of claim 20 wherein the error correction information includes a block error correction code and a word error correction code.
22. The method or the device of claim 21 wherein the block error correction code includes one of a Reed-Solomon code and a Fountain code.
23. The method or the device of claim 21 or 22 wherein the word error correction code includes one of a BCH code and a LDPC code.
24. The method of any of claims 17 and 19 to 23, or the device of any of claims 18 to 23 wherein the modulation map is in accordance with a constraint graph configuring the mapping to decrease a likelihood of unreliable transitions.
25. The method or the device of claim 24 wherein the modulation map comprises writing an A on every odd occurrence of 0, a T on every even occurrence of 0, a C on every odd occurrence of 1, and a G on every even occurrence of 1.
26. The method of any of claims 1, 3 to 8, 10 to 17, and 19 to 25 wherein the lossy synthesis process comprises an enzymatic molecular synthesis process.
27. A non-transitory computer-readable medium storing computer-executable instructions executable to perform a method according to any of claims 1, 3 to 8, 10 to 17, and 19 to 26.
28. A method of decoding digital data stored in DNA molecules comprising:
accessing a plurality of DNA molecules storing encoded digital data, wherein:
the encoded digital data includes a code component comprising an address, synchronization information, and error correction information, and
the DNA molecules were synthesized using a lossy DNA synthesis process;
sequencing the plurality of DNA molecules;
merging and assembling the plurality of DNA molecules to form a plurality of DNA oligonucleotides; and
decoding the digital data from the plurality of DNA oligonucleotides wherein the decoding includes
synchronizing the plurality of DNA segments using the synchronization information in each of the plurality of segments,
processing the synchronized segments to extract digital information from the plurality of segments, and
performing error correction on the extracted digital information using the error correction component to produce the decoded digital data.
29. The method of claim 28 wherein the lossy DNA synthesis process comprises an enzymatic molecular synthesis process.
30. Apparatus comprising:
an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map;
a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process;
a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments; and
an archive to store the plurality of DNA segments.
31. Apparatus comprising:
a sequencer sequencing a plurality of DNA segments retrieved from an archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and
a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder:
- synchronizes the plurality of oligonucleotides based on synchronization information included in the plurality of oligonucleotides;
- demodulates the synchronized plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides;
- performs error correction on the plurality of digital words based on error correction information included in the digital words, and
- extracts the digital data from the error corrected digital words.
32. A system comprising:
an encoder encoding digital data including error correction information and synchronization information into a plurality of codes each representing a respective nucleotide of a DNA oligonucleotide, wherein the encoding includes a modulation of the digital data to nucleotides of DNA in accordance with a modulation map;
a DNA synthesizer synthesizing a DNA segment based on the plurality of codes and using a lossy DNA synthesis process;
a PCR amplifier amplifying the DNA segment to form a plurality of DNA segments;
an archive to store the plurality of DNA segments;
a sequencer sequencing a plurality of DNA segments retrieved from the archive, wherein the plurality of DNA segments were synthesized using a lossy DNA synthesis process; and
a decoder decoding digital data from a plurality of DNA oligonucleotides included in the plurality of DNA segments, wherein the decoder:
- synchronizes the plurality of oligonucleotides based on synchronization information included in the plurality of oligonucleotides;
- demodulates the synchronized plurality of oligonucleotides to produce a plurality of digital words corresponding to respective ones of the plurality of oligonucleotides;
- performs error correction on the plurality of digital words based on error correction information included in the digital words, and
- extracts the digital data from the error corrected digital words.
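
The modulation map recited in claims 16 and 25 can be illustrated with a minimal Python sketch. The function names `modulate` and `demodulate` are hypothetical, and the sketch assumes that occurrences are counted from the start of each binary word, which the claims do not specify. It illustrates the mapping only and is not the claimed implementation; synchronization and error correction are omitted.

```python
# Illustrative sketch of the modulation map of claims 16 and 25.
# Assumption: odd/even occurrences are counted per binary word, starting at 1.

# For each bit value: (nucleotide for odd occurrences, nucleotide for even occurrences)
_MAP = {"0": ("A", "T"), "1": ("C", "G")}

# Demodulation is stateless: A and T both mean 0, C and G both mean 1.
_INVERSE = {"A": "0", "T": "0", "C": "1", "G": "1"}


def modulate(bits: str) -> str:
    """Map a string of '0'/'1' characters to a nucleotide string."""
    seen = {"0": 0, "1": 0}  # how many times each bit value has occurred so far
    out = []
    for b in bits:
        if b not in _MAP:
            raise ValueError(f"not a bit: {b!r}")
        # seen[b] is 0 for the 1st (odd) occurrence, 1 for the 2nd (even), ...
        out.append(_MAP[b][seen[b] % 2])
        seen[b] += 1
    return "".join(out)


def demodulate(strand: str) -> str:
    """Map a nucleotide string back to a string of '0'/'1' characters."""
    try:
        return "".join(_INVERSE[n] for n in strand)
    except KeyError as exc:
        raise ValueError(f"not a nucleotide: {exc.args[0]!r}") from None
```

Under this map, two equal consecutive bits alternate between the two nucleotides of their pair, and two different bits draw from different pairs, so the sketch never emits the same nucleotide twice in a row. For example, `modulate("0011010")` yields `ATCGACT`, and `demodulate("ATCGACT")` recovers `0011010`.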
PCT/US2018/017188 2017-02-13 2018-02-07 Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna) WO2018148257A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201762458093P 2017-02-13 2017-02-13
US62/458,093 2017-02-13
US201762565142P 2017-09-29 2017-09-29
US201762565138P 2017-09-29 2017-09-29
US62/565,142 2017-09-29
US62/565,138 2017-09-29

Publications (1)

Publication Number Publication Date
WO2018148257A1 (en) 2018-08-16

Family

ID=61244788

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/US2018/017188 WO2018148257A1 (en) 2017-02-13 2018-02-07 Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna)
PCT/US2018/017193 WO2018148260A1 (en) 2017-02-13 2018-02-07 Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna)

Country Status (1)

Country Link
WO (2) WO2018148257A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10650312B2 (en) 2016-11-16 2020-05-12 Catalog Technologies, Inc. Nucleic acid-based data storage
GB2563105B (en) 2016-11-16 2022-10-19 Catalog Tech Inc Nucleic acid-based data storage
EP3766077A4 (en) 2018-03-16 2021-12-08 Catalog Technologies, Inc. Chemical methods for nucleic acid-based data storage
KR20210029147A (en) 2018-05-16 2021-03-15 카탈로그 테크놀로지스, 인크. Compositions and methods for storing nucleic acid-based data
KR20220017409A (en) 2019-05-09 2022-02-11 카탈로그 테크놀로지스, 인크. Data structures and behaviors for searching, computing, and indexing in DNA-based data stores.
GB201907460D0 (en) * 2019-05-27 2019-07-10 Vib Vzw A method of storing information in pools of nucleic acid molecules
US10956806B2 (en) 2019-06-10 2021-03-23 International Business Machines Corporation Efficient assembly of oligonucleotides for nucleic acid based data storage
US20210074380A1 (en) * 2019-09-05 2021-03-11 Microsoft Technology Licensing, Llc Reverse concatenation of error-correcting codes in dna data storage
US11795450B2 (en) 2019-09-06 2023-10-24 Microsoft Technology Licensing, Llc Array-based enzymatic oligonucleotide synthesis
CA3157804A1 (en) 2019-10-11 2021-04-15 Catalog Technologies, Inc. Nucleic acid security and authentication
CA3183416A1 (en) 2020-05-11 2021-11-18 Catalog Technologies, Inc. Programs and functions in dna-based data storage
KR102574250B1 (en) * 2021-08-09 2023-09-06 서울대학교산학협력단 Method, program and apparatus for encoding and decoding dna data using low density parity check code

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BLAWAT MEINOLF ET AL: "Forward Error Correction for DNA Data Storage", PROCEDIA COMPUTER SCIENCE, ELSEVIER, AMSTERDAM, NL, vol. 80, 1 June 2016 (2016-06-01), pages 1011 - 1022, XP029565777, ISSN: 1877-0509, DOI: 10.1016/J.PROCS.2016.05.398 *
ERIC L ANSON ET AL: "ReAligner", PROCEEDINGS OF THE FIRST ANNUAL INTERNATIONAL CONFERENCE ON COMPUTATIONAL MOLECULAR BIOLOGY, SANTA FE, NEW YORK, ACM, US, 19 January 1997 (1997-01-19), pages 9 - 16, XP058359793, ISBN: 978-0-89791-882-4, DOI: 10.1145/267521.267524 *
GOELA NAVEEN ET AL: "Advances in DNA storage", PROC. 2017 INFORMATION THEORY AND APPLICATIONS WORKSHOP (ITA), IEEE, 12 February 2017 (2017-02-12), pages 1, XP033146241, DOI: 10.1109/ITA.2017.8023453 *
GOELA NAVEEN ET AL: "Encoding movies and data in DNA storage", PROC. 2016 INFORMATION THEORY AND APPLICATIONS WORKSHOP (ITA), IEEE, 31 January 2016 (2016-01-31), pages 1, XP033082410, DOI: 10.1109/ITA.2016.7888163 *
MEINOLF BLAWAT ET AL: "Storing Movies in DNA", FOCUS ON FUTURE DISRUPTIVE TECHNOLOGIES - INNOVATION IN MOTION SUMMER 2015, 1 January 2015 (2015-01-01), pages 38 - 42, XP055462207, Retrieved from the Internet <URL:https://www.afcinema.com/IMG/pdf/technicolor_storing_movies_in_dna.pdf> [retrieved on 20180323] *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020072197A3 (en) * 2018-10-05 2020-07-09 Microsoft Technology Licensing, Llc Enzymatic dna repair
US11781169B2 (en) 2018-10-05 2023-10-10 Microsoft Technology Licensing, Llc Enzymatic DNA repair
WO2020132935A1 (en) * 2018-12-26 2020-07-02 深圳华大生命科学研究院 Method and device for fixed-point editing of nucleotide sequence stored with data
CN111680797A (en) * 2020-05-08 2020-09-18 中国科学院计算技术研究所 DNA type printer, data storage device and method based on DNA
CN111680797B (en) * 2020-05-08 2023-06-06 中国科学院计算技术研究所 DNA type printer, DNA-based data storage device and method
CN112079893A (en) * 2020-09-23 2020-12-15 南京原码科技合伙企业(有限合伙) Method for synthesizing text required by DNA storage based on solid phase chemical synthesis method
CN112079893B (en) * 2020-09-23 2022-05-03 南京原码科技合伙企业(有限合伙) Method for synthesizing text required by DNA storage based on solid phase chemical synthesis method
CN116226049A (en) * 2022-12-19 2023-06-06 武汉大学 Method, system and equipment for storing information by using DNA based on large and small fountain codes
CN116226049B (en) * 2022-12-19 2023-11-10 武汉大学 Method, system and equipment for storing information by using DNA based on large and small fountain codes

Also Published As

Publication number Publication date
WO2018148260A1 (en) 2018-08-16

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 18710175; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 EP: PCT application non-entry in European phase
Ref document number: 18710175; Country of ref document: EP; Kind code of ref document: A1