WO2019040871A1 - Device for information encoding and, storage using artificially expanded alphabets of nucleic acids and other analogous polymers - Google Patents

Device for information encoding and, storage using artificially expanded alphabets of nucleic acids and other analogous polymers

Publication number
WO2019040871A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
nucleobases
information
natural
dna
Prior art date
Application number
PCT/US2018/047957
Other languages
French (fr)
Inventor
Julian MILLER
Heshan ILLANGKOON
Original Assignee
Miller Julian
Illangkoon Heshan
Priority date
Filing date
Publication date
Application filed by Miller Julian, Illangkoon Heshan
Publication of WO2019040871A1

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C13/00 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00
    • G11C13/0002 Digital stores characterised by the use of storage elements not covered by groups G11C11/00, G11C23/00, or G11C25/00 using resistive RAM [RRAM] elements
    • G11C13/0009 RRAM elements whose operation depends upon chemical change
    • G11C13/0014 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material
    • G11C13/0019 RRAM elements whose operation depends upon chemical change comprising cells based on organic memory material comprising bio-molecules

Definitions

  • the present invention relates to a method and approach to encoding, storage and/or transfer of information or data using artificially expanded alphabets consisting of combinations of natural, synthetic, modified, non-natural nucleic acids, and/or other analogous polymers in sequence on a DNA backbone or analogous structural polymer.
  • DNA as a storage medium.
  • the process for storing information in DNA nucleobases existing in the art is a stepped process by which ASCII characters are matched with unique combinations of the characters A, T, G, and C which represent the naturally occurring DNA nucleobases adenine, thymine, guanine, and cytosine.
  • Each nucleic acid base, modified or otherwise, and analogous backbone structures may represent a unique ASCII character, a word, a formula or other communication element (thought or phonemes).
  • the present invention solves the unmet need by providing a method for encoding and storing digital information in DNA or the sugar phosphate backbone of DNA or analogous structural mediums by use of expanded alphabet channels and derivative lexicons. Expanding the number of molecular vehicles used for storing information expands the universe of utilizable n-ary channel alphabets used to represent the expanded variable nucleobases which addresses the unmet need in addition to facilitating more efficient and effective secure transfer of digital information.
  • the present invention further addresses the optimizing of encoding data or information into lexicons representative of expanded alphabet nucleic acid or analogous polymer libraries and thus allowing for greater degrees of compression, encryption, random access, compatibility and optimization of data storage across various types of digital information including but not limited to structured data, unstructured data, text files, video files, image files and audio files.
  • the present invention further provides a method of storing information as ASCII characters, words, formulas or other communication elements (thoughts or phonemes) and combinations thereof as physical polymers in sequence on a structural medium forming codons in an X^n schema using an X-base channel alphabet.
  • the present invention also provides methods of encoding and storing datasets or subsets which can be encoded and stored by optimizing around a set of specifications and/or considerations which may include compression, encryption, compatibility, random access, searchability, channel coding, cost, standardization, security, amongst other considerations that may be present with any data storage.
  • the present invention also provides a method of encrypting encoded data, or data to be encoded, in n-ary channels representing expanded alphabets of bases and derivative lexicons using various means of substitutions, hashing, and/or scrambling or combinations thereof or other similar methodologies.
  • the present invention also provides a method of tagging and addressing information stored on a DNA backbone or analogous structural medium.
  • the use of molecular tags to label individual oligos aids in identification, decoding, navigation, random access, searching, encryption and error detection.
  • FIG. 1 provides a PNG file represented by ASCII characters.
  • FIG. 2 provides the PNG file represented in FIG. 1 wherein ASCII characters are substituted with a unique sequence of the 4 DNA bases using the values from Table 1, and substituted back into the original data file.
  • FIG. 3 provides the PNG file represented in FIG. 1 wherein ASCII characters are substituted with codons with lengths of 3 variable bases from an 8-base channel alphabet comprising an 8^3 schema using the values from Table 2, and substituted back into the original data file.
  • FIG. 4 provides the PNG File represented in FIG. 1 after compression wherein sequentially repeating character strings in the incoming data file are replaced by one or two characters and said characters are substituted into the incoming data file in place of the sequentially repeating characters.
  • Said replacement characters are in the form "BA" wherein "A" represents the sequentially repeating character and "B" represents a tally of the number of sequential repetitions of said character. Additionally, null represents "1" and "0" represents ten.
  • FIG. 5 provides the compressed PNG file represented in FIG. 4 wherein each character contained in the source data represented in FIG. 1 is substituted with an assigned sequence of letters representing a 6 base channel alphabet using the values from Table 5, and substituted back into the original data file.
  • FIG. 6 provides the compressed PNG file represented in FIG. 1 wherein each character contained in the source data is substituted with a uniquely decodable sequence of varying length derived from a Huffman encoding utilizing a channel alphabet of 8-DNA bases, each with a prime state achieved by adornment or modification of the nucleobase using the values from the codex in Table 4.
  • FIG. 7 provides a sample means of tagging oligos for random access and search using different channel alphabets on the tag regions.
  • FIG. 8 provides an excerpt of the children's story, "The Ugly Duckling" by Hans
  • FIG. 9 provides the file represented in FIG. 8 after the excerpt was encoded using codons in an 8^3 schema from Table 6 from within a channel alphabet of 8 bases (ATGCZPXY), and substituted back into the original data file.
  • FIG. 10 provides the file represented in FIG. 9 after compression by frequency analyzing the trigrams (codons) then performing an optimized Huffman compression using the same channel alphabet of ATGCZPXY.
  • FIG. 11 provides the file represented in FIG. 10 after encryption was performed by taking the already encoded then compressed excerpt and performing asymmetric substitutions using a set of uniquely decodable words created from the existing channel alphabet of ATGCZPXY by creating an expanded Huffman tree then randomizing the values to be assigned to represent each word.
  • FIG. 12 provides a structure representing, but not limiting, potential modification sites on a nucleoside heterocycle and carbohydrate.
  • FIG. 13 presents potential modification sites to a sugar phosphate backbone.
  • R1 = H.
  • R2 = H for DNA
  • R2 = OH for RNA
  • R2 = OR for an ether moiety
  • R2 = R for various linkers, decorations or fluorophores.
  • modification sites R3, R4, and R5 can host a variety of substitutions to create unique spacers.
  • the present invention provides a method for encoding, compressing, encrypting, and storing digital information in a combination of natural, modified, and/or synthesized DNA, and/or analogous backbone structures and/or mirror image enantiomeric structures through use of an expanded lexicon or channel alphabet.
  • the present invention also provides a method for storing structured or unstructured data and knowledge in the form of books, papers, text, formulas, or other communication elements in natural, modified, and/or synthesized DNA, and/or analogous backbone structures and/or enantiomers thereof.
  • range is intended to encompass not only the end point values of the range but also intermediate values of the range as explicitly being included within the range and varying by the last significant figure of the range.
  • a recited range from 1 to 4 is intended to include 1-2, 1-3, 2-4, 3-4, and 1-4.
  • base shall mean any of the multitude of physical structures used to represent and/or store data or information which include the distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, or spacers in addition to the mirror image enantiomers of said structures that can be attached at a binding or bonding point on a DNA backbone or analogous structural medium.
  • mirror image shall mean enantiomers of the structures referred to.
  • n-ary schema shall mean a schema where n is the number of differentiable bases or representative characters, symbols, letters, markings or combinations thereof that the system uses for its channel alphabet and derivative lexicons. For example, a binary system uses two digits (0,1) in its channel alphabet and the universe of all possible lexicons within it is limited to permutations of said characters.
  • lexicon shall mean a specified set of arrangements of bases or their representative letters, characters, symbols, markings, or combinations thereof derived from within a channel alphabet that are in turn used to represent symbols, characters, words, or other pieces of information, or combinations thereof from within the source information or data, or subset, or derivative thereof.
  • n-gram analysis shall mean an analysis of all combinations of adjacent symbols or characters or letters or words or other pieces of information of length n found in the source dataset or subset, or derivative thereof.
  • channel alphabet shall mean a particular n-ary set of characters representing the underlying set of distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, spacers or mirror images thereof used for encoding source data or information to be stored.
  • X base system shall mean an n-ary channel alphabet where X is equal to n.
  • a 6 base system would have a set of six (6) letters representing a set comprised of 6 distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, or spacers.
  • one six (6) base system might include the letters A,T,G,C,Z,P.
  • X^n schema shall mean an X base system raised to a power of n, where n is the uniform length of a given codon in a channel lexicon comprised of that channel's alphabet.
  • an 8^3 schema would represent codons with length of three (3) variable bases with eight (8) possible variables per position, for a total of 512 possible differentiable codons.
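  • The arithmetic behind an X^n schema can be sketched in a few lines. The function names below (codon_count, min_codon_length, all_codons) are hypothetical helpers introduced only for illustration; they are not part of the described method.

```python
from itertools import product


def codon_count(alphabet_size: int, codon_length: int) -> int:
    """Number of distinct fixed-length codons in an X^n schema."""
    return alphabet_size ** codon_length


def min_codon_length(alphabet_size: int, symbols: int) -> int:
    """Smallest uniform codon length n such that alphabet_size ** n >= symbols."""
    if alphabet_size < 2 or symbols < 1:
        raise ValueError("need at least two bases and one symbol")
    n = 1
    while alphabet_size ** n < symbols:
        n += 1
    return n


def all_codons(alphabet: str, codon_length: int) -> list[str]:
    """Enumerate every codon of the given length over the channel alphabet."""
    return ["".join(p) for p in product(alphabet, repeat=codon_length)]


# An 8^3 schema yields 512 codons; a 4^4 schema yields 256.
print(codon_count(8, 3), codon_count(4, 4))
# 256 symbols need 4 bases per codon with 4 letters, but only 3 with 8 letters.
print(min_codon_length(4, 256), min_codon_length(8, 256))
```

  • Under these assumptions, moving from a 4-base to an 8-base channel alphabet shortens each codon for a 256-symbol source from four bases to three.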
  • cognid shall mean a coding comprised from the letters within a specified channel alphabet that allows for the creation of a uniquely decodable lexicon consisting solely of words of a uniform length.
  • the present inventive method enables storing data in natural, modified, or synthesized DNA or analogous backbone structures or their mirror images by substituting the symbols, letters, characters, words, phrases, formulas, abstractions, and/or other knowledge elements (phonemes or thoughts), and/or combinations thereof from within the incoming data or information with characters from within one or more channel alphabets representing the expanded universe of natural, non-natural, modified, or synthesized DNA or analogous backbone structures or mirror images upon which the data is then physically stored.
  • Said stored information can be organized by arranging sequences of physically stored information into larger structural matrixes such as dendrimers. Said arrangement may be accomplished by inserting structural or binding polymers into the sequence. Further organization can be accomplished by introducing histones or other analogous peptides into the storage medium.
  • each ASCII character is replaced by characters corresponding to one of the 256 codons comprised of the potential combinations of the 4 naturally occurring DNA nucleobases.
  • the first letter of the PNG file contained in FIG. 1, "M", is replaced with the characters from Table 1 that represent the unique combination of four natural nucleotides correlating to the ASCII character "M". In this embodiment those four letters would be "TACT".
  • ASCII characters and the method of conversion to a 4^4 schema constrained by the channel alphabet of four nucleobases. As Table 1 illustrates, each ASCII character is paired with one of the 256 four-letter codons comprised of the combinations of the letters representing natural nucleotides A, G, T, C.
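  • A minimal sketch of a 4^4 substitution follows. Table 1 is not reproduced in this text, so the mapping here is an assumption: each byte is written as four base-4 digits with the digit order A=0, T=1, G=2, C=3. That ordering happens to reproduce the "M" to "TACT" example above, but it should not be read as the actual contents of Table 1.

```python
ALPHABET = "ATGC"  # assumed digit order: A=0, T=1, G=2, C=3 (not taken from Table 1)


def byte_to_codon(value: int, alphabet: str = ALPHABET, length: int = 4) -> str:
    """Write a value as a fixed-length codon of base-len(alphabet) digits."""
    base = len(alphabet)
    if not 0 <= value < base ** length:
        raise ValueError("value does not fit in this schema")
    letters = []
    for _ in range(length):
        value, digit = divmod(value, base)
        letters.append(alphabet[digit])
    return "".join(reversed(letters))  # most significant digit first


def codon_to_byte(codon: str, alphabet: str = ALPHABET) -> int:
    """Invert byte_to_codon."""
    value = 0
    for letter in codon:
        value = value * len(alphabet) + alphabet.index(letter)
    return value


def encode(data: bytes) -> str:
    return "".join(byte_to_codon(b) for b in data)


def decode(sequence: str, length: int = 4) -> bytes:
    if len(sequence) % length:
        raise ValueError("sequence length is not a multiple of the codon length")
    return bytes(
        codon_to_byte(sequence[i:i + length])
        for i in range(0, len(sequence), length)
    )


print(byte_to_codon(ord("M")))  # TACT
```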
  • Binary data streams must be converted to ASCII prior to conversion to four letter codons. The process of converting from binary to ASCII to four (4) letter codons that can later be used to synthesize the represented nucleobases presents size, scope and methodology issues.
  • the present invention accomplishes its methodology from first receiving and/or analyzing data or information that is desired to be stored into a sequence of DNA or analogous backbone structure.
  • the source data or information to be stored may be letters, numbers, alpha numeric characters, binary, decimal, hexadecimal, words, phrases, abstractions, and/or other means of communicating data or information known in the art, and/or combinations thereof.
  • the data or information to be stored may be pre-processed or translated into an alternative form, or may be stored directly.
  • the present method then includes the step of encoding the data or information by translating the incoming data or information into one or more optimized channel alphabets representing a combination of distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, spacers or mirror images of said structures for storage, synthesis, and/or transfer.
  • Encoding may include substituting symbols, characters, words, phrases, n-grams, abstractions and/or combinations thereof from the data or information to be stored for the purpose of compressing or encrypting the data.
  • the present method optionally includes the additional step of synthesizing DNA or analogous backbone structure based on the translated data or information using a combination of nucleobases (natural or synthetic), non-natural/artificial nucleobases, modified nucleobases, other analogous polymers, or spacers or their mirror images.
  • There are many DNA synthesis methods known in the art, and nothing herein is intended to limit the methods of DNA synthesis to be used in the inventive processes described herein. It should be appreciated that data may be sent off for synthesis using any of the methods known in the art or it may be stored without synthesizing. In some embodiments, DNA synthesis may be accomplished by DNA replication, polymerase chain reaction (PCR), gene synthesis, oligonucleotide synthesis, base pair synthesis, peptide substitution,
  • the present method includes the additional step of tagging and addressing DNA by means of chemical modification, or physical modification, or any other means of distinguishing one nucleobase or DNA strand from another for aid in identifying, decoding, navigating, randomly accessing, organizing, searching, filtering, encrypting and detecting errors within stored or synthesized encoded, compressed and encrypted data and information or any combination thereof.
  • the present invention stores information on synthetic DNA backbones or analogous structural mediums by substituting incoming data or information with letters, symbols, characters, and/or markers correlating to distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, spacers or their mirror images.
  • It provides a method for encoding the data or information by processing and/or analyzing said data or information then dividing into one or more sets of symbols, characters, words, phrases, n-grams, and/or abstractions, and/or combinations thereof within said data or information then translating those divisions and/or subdivisions into one or more lexicons, each derived from one or more channel alphabets selected to optimally represent these divisions with a combination of distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, or spacers, or mirror images thereof to facilitate for storage once synthesized.
  • Encoding of the source data or information is performed by the creation of one or more n-ary channel alphabets representing distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, other polymers, or spacers to be used for storing data and information from which the lexicons for encoding are generated.
  • the encoding process is intended to substitute each unique symbol, character, word, phrase, formula, combinations thereof or other similar knowledge from within said data or information with one or more unique letters or combination of letters that comprise the encoding lexicon from an n-ary channel alphabet of "n" letters expanded beyond the four base system representing or comprised of natural nucleic acids.
  • These expanded channel alphabets include characters representing Artificially Expanded Genetic Information System (AEGIS) synthetic DNA, artificial or non-terrestrial nucleobases, modified natural, modified synthetic, modified AEGIS synthetic, other non-natural DNA nucleobases that have been further modified to be made differentiable from the corresponding unmodified nucleobase, other polymers, or spacers or mirror images of any of said structures.
  • the expanded channel alphabets allow for lexicons of greater fidelity, accessibility, security, capacity, and/or density by using nucleobases that may include, but are not limited to, combinations of natural nucleobases, modified natural nucleobases, artificial nucleobases, AEGIS bases covering the 2, 3 and 4-hydrogen bond electron pair donor and acceptor patterns, size pairing bases (with or without hydrogen bonding), spacers where a nucleobase is omitted but the backbone remains, spacers where a nucleobase is omitted and only a minimal carbohydrate or backbone mimic remains, modifications or decorations of both carbohydrates and carbohydrate mimics.
  • Natural and non-natural nucleobases may be modified with, but not limited to, pyrimidines decorated with various ligands at the 4 or 5 positions or other positions as necessary, purines decorated with various ligands at the 7 or 8 positions or other positions as necessary, or combinations or derivatives thereof. Examples of potential modification sites are shown in FIG. 12.
  • Modified, natural, and artificial nucleobases and other polymers may include, but are not limited to: any of the natural nucleotides and nucleosides of adenine (A), guanine (G), thymine (T), cytosine (C); uracil (U), any of the AEGIS nucleotides of 4-amino-1-methylpyrimidin-2(1H)-one (S), 6-amino-1,9-dihydro-2H-purin-2-one (B), 6-amino-3-nitropyridin-2(1H)-one (V), 4-aminoimidazo[1,2-a][1,3,5]triazin-2(8H)-one (J), 2,4-diaminopyrimidine (K), 5-aza-7-deaza xanthosine (X), 6-amino-5-nitro-2(1H)-pyridone (Z), 2-amino-imidazo[1,2-a
  • spacers or polymers are polyethers (for non-aqueous applications), locked nucleic acids (LNA), threose nucleic acids (TNA), Peptide nucleic acids (PNA), Ribonucleic acids (RNA) or combinations or derivatives or mirror images or modifications thereof. Examples of potential modifications to a phosphodiester backbone are demonstrated in FIG. 13.
  • FIG. 13 further provides potential modifications to a sugar phosphate backbone.
  • R1 = H
  • R2 = H for DNA
  • R2 = OH for RNA
  • R2 = OR for an ether moiety
  • R2 = R for various linkers, decorations or fluorophores.
  • modification sites R3, R4, and R5 can host a variety of substitutions to create unique spacers.
  • each of the 256 ASCII characters is assigned a codon in an 8^3 schema, using a subset of 256 codons from the lexicon of 512 possible codons of this uniform length formed from the channel alphabet of A, T, G, C, Z, P, X, Y, as depicted in Table 2.
  • FIG. 3 represents an encoding of the PNG file represented in FIG. 1 using said 8^3 schema depicted in Table 2.
  • the remaining 256 codons in the 8^3 schema can be assigned to each of the 256 ASCII characters in addition to the codons depicted in Table 2 as a means of channel coding, allowing error-prone sequences of codons to be avoided by providing an alternative codon with which each character can be encoded. Additionally, this channel-coding measure allows for increased data fidelity through synthesis of one or more additional permuted variants from the codex for redundancy, and can be furthered by having the second codon be the complementary base pair sequence to the first.
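  • The idea of giving each character a primary and an alternative codon can be sketched as follows. Table 2 is not reproduced here, so the assignment is an assumption: byte b takes codon number b as its primary and codon number b + 256 as its alternative. The has_repeat test (two adjacent identical letters) is a hypothetical stand-in for whatever makes a sequence "error prone" in practice.

```python
ALPHABET = "ATGCZPXY"  # 8-base channel alphabet named in the text
CODON_LENGTH = 3       # 8^3 schema: 512 codons


def index_to_codon(index: int) -> str:
    """Write a codon number (0..511) as three base-8 digits over ALPHABET."""
    if not 0 <= index < len(ALPHABET) ** CODON_LENGTH:
        raise ValueError("codon index out of range")
    letters = []
    for _ in range(CODON_LENGTH):
        index, digit = divmod(index, len(ALPHABET))
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def codon_to_index(codon: str) -> int:
    index = 0
    for letter in codon:
        index = index * len(ALPHABET) + ALPHABET.index(letter)
    return index


def has_repeat(codon: str) -> bool:
    """Hypothetical 'error prone' test: any two adjacent identical letters."""
    return any(a == b for a, b in zip(codon, codon[1:]))


def encode(data: bytes) -> str:
    out = []
    for byte in data:
        primary = index_to_codon(byte)
        alternate = index_to_codon(byte + 256)
        # Prefer the alternate only when it avoids a repeat the primary has.
        use_alt = has_repeat(primary) and not has_repeat(alternate)
        out.append(alternate if use_alt else primary)
    return "".join(out)


def decode(sequence: str) -> bytes:
    codons = [sequence[i:i + CODON_LENGTH]
              for i in range(0, len(sequence), CODON_LENGTH)]
    # Both codons of a pair decode to the same byte.
    return bytes(codon_to_index(c) % 256 for c in codons)


print(encode(b"M"))  # PTP (the primary TTP contains a repeated T)
```

  • Because both codons of a pair reduce to the same byte value, the decoder needs no side information about which one the encoder chose.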
  • the current n-ary schema allows data and knowledge to be stored using expanded alphabets and subsequent lexicons that represent data and knowledge elements in their native form.
  • every ASCII character, word (from every known language), formula and other communication element (phonemes or higher level thoughts) is represented within a single table by a unique twelve-letter combination drawn from the twelve possible bases that comprise the channel alphabet.
  • One practical application of the invention is the ability to store knowledge more effectively.
  • the Library of Congress's books could be stored on 5.12e+11 strands of DNA without having to be converted to ASCII characters. That is a savings of over 94%.
  • Table 3 depicts one embodiment of an English language to base 12 DNA codex (truncated).
  • language elements are paired in a manner that allows knowledge created in one language to be read in all other languages.
  • American is represented with the letters AAAAAAAAGGGZ.
  • the English word American is also linked to the Spanish Norteamericano (VAAAAAAAGGGZ), French Americain (JAAAAAAAGGGZ), Polish amerykanski (KAAAAAAAGGGZ), etc. in the same table, allowing knowledge to be stored in its native (or optimal) format and accessed in other formats.
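  • The cross-language linkage can be pictured as a lookup keyed on the shared part of the codeword. The sketch below uses only the four entries quoted above; the reading that the first letter marks the language and the remaining eleven letters identify the concept is an inference from those examples, not something Table 3 is known to state.

```python
# Entries copied from the examples in the text; the full Table 3 is not reproduced here.
CODEX = {
    "AAAAAAAAGGGZ": ("English", "American"),
    "VAAAAAAAGGGZ": ("Spanish", "Norteamericano"),
    "JAAAAAAAGGGZ": ("French", "Americain"),
    "KAAAAAAAGGGZ": ("Polish", "amerykanski"),
}


def concept_key(code: str) -> str:
    """Assumed layout: first letter marks the language, the rest the concept."""
    return code[1:]


def translations(code: str) -> dict[str, str]:
    """Return every stored word that shares the concept of the given codeword."""
    key = concept_key(code)
    return {
        language: word
        for other, (language, word) in CODEX.items()
        if concept_key(other) == key
    }


print(translations("AAAAAAAAGGGZ")["Spanish"])  # Norteamericano
```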
  • the present invention provides a method of storing the over 120 million "other items” including Chinese wood-block prints as unique communication elements in DNA.
  • Existing methods are limited in scope and do not provide a means for storing "other items” as unique communication elements.
  • "Other items” have to be stored either as converted PNG files, similar to the file depicted in FIG. 1, or they have to be extrapolated into a modern language equivalent and stored as character strings.
  • these non-conforming items stored in DNA would be subject to transcription error as they are converted to image files, then ASCII characters and finally to letters representing the four nucleobases of natural DNA. This process must be replicated for each language.
  • the knowledge conveyed within these non-standard communication elements is not readily convertible to modern language equivalents, therefore each item requires translation into the native language of the repository prior to conversion.
  • each item requires extensive metadata be appended to it in order to be searchable as a knowledge element. To do anything less would be to reduce the value of the knowledge element to that of a simple image.
  • the present invention provides a method by which non-standard communication elements, such as Chinese wood-blocks, may be stored as knowledge kernels, which includes simultaneous storage as a PNG file, as its modern language equivalent and as a unique communication element. Storing unique communication elements as knowledge kernels greatly improves the methodology for storing knowledge intrinsic to higher level communication processes.
  • the present invention may include one or more additional steps for compressing and storing data or information in synthetic DNA using character substitution.
  • the invention enables replacing patterns within the source information or data, such as, but not limited to, sequentially repeating strings of characters, or combinations thereof, within incoming data or information with appreciably fewer characters.
  • Compressing using character substitution within data or information may be accomplished using lossless, lossy, machine learning or other data compression algorithms, or other character substitutions known in the art of data compression and storage.
  • character substitution is accomplished by, but not limited to, replacing sequential repetitions of a single character or patterned strings of characters in said data or information with characters representing the repetition tally of the sequentially repeating character or patterned strings of characters and said character or patterned strings of characters.
  • compression is accomplished by substituting repeating characters or patterned strings of characters within said data or information with a lesser number of representative characters.
  • compression is accomplished using n-gram analysis, a Huffman algorithm, pattern analysis and regression, arithmetic coding, forward error correction, reductionist compression, Lempel-Ziv-Welch, Brotli, LZX, probabilistic modelings, block-sorting of attributes, or combination or derivatives thereof.
  • compression methods and approaches may be optimized for the data or information being compressed, the channel or medium for transmitting the data or information as well as but not limited to the backbone on which it will be stored or the interference that the repository may be subjected to.
  • the incoming data or information is compressed by substituting a string of sequentially repeating characters within said data or information with one or more characters. Replacing sequential repetitions of a single character within said data or information with one or more characters can compress the amount of data to be stored without risk of loss.
  • null represents a single repetition and "0" represents 10 repetitions.
  • FIG. 4 represents one potential method in which the sequentially repeating characters within the PNG file contained in FIG. 1 can be compressed using the "BA" schema as described in paragraphs 60 through 66 above.
  • the first letter of the PNG file in FIG. 1 "M" sequentially repeats 15 times.
  • the 15 sequentially repeating M's are replaced with the characters 15M. This pattern of replacing sequentially repeating characters with a character or characters that represent the number of times said character repeats and said character is continued until the PNG file from FIG. 1 is converted into the file of FIG. 4.
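  • The run-length substitution described above can be sketched as follows. This is a simplified reading of the "BA" schema: a run of one is written as the bare character (the "null" tally) and longer runs are written as a decimal tally followed by the character, which reproduces the 15M example. It does not implement the rule that "0" represents ten, and it assumes the source text contains no digit characters, since digits would otherwise be ambiguous on decoding.

```python
from itertools import groupby

DIGITS = "0123456789"


def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with '<tally><character>'."""
    if any(ch in DIGITS for ch in text):
        raise ValueError("this sketch assumes the source contains no digits")
    parts = []
    for ch, run in groupby(text):
        tally = len(list(run))
        parts.append(ch if tally == 1 else f"{tally}{ch}")
    return "".join(parts)


def rle_decode(encoded: str) -> str:
    """Invert rle_encode: a missing tally means a single occurrence."""
    out = []
    tally = ""
    for ch in encoded:
        if ch in DIGITS:
            tally += ch
        else:
            out.append(ch * (int(tally) if tally else 1))
            tally = ""
    if tally:
        raise ValueError("tally with no following character")
    return "".join(out)


print(rle_encode("M" * 15))  # 15M
```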
  • the PNG file may be converted into a lexicon derived from a channel alphabet of bases such as depicted in Table 2 or it may be further manipulated. The substitution of terms is continued until the compressed PNG file of FIG. 4 is converted to the file depicted in FIG. 5 using a lexicon comprised from the 6 base channel alphabet as depicted in Table 5.
  • compression is accomplished with a lossless data compression algorithm.
  • the incoming data set is analyzed for discrete n-grams and their frequency distribution using a lossless data compression algorithm. From the list of discrete n-grams a uniquely decodable codex of letters, characters, symbols, markings and/or combinations thereof that correspond to bases and combinations thereof from within the channel alphabet is created using a Huffman Tree.
  • the incoming data or information is compressed by substituting the n-grams with a uniquely decodable combination of letters representing a base or sequence of bases in a manner such that the most commonly occurring n-grams are represented with the fewest bases.
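  • One way to obtain such a codex is an n-ary Huffman construction, in which each merge step joins as many nodes as there are letters in the channel alphabet. The sketch below is an illustration under stated assumptions, not the codex of Table 4: the source is split into non-overlapping n-grams, zero-weight dummy leaves pad the tree so every merge is full, and the function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def ngrams(text: str, n: int) -> list[str]:
    """Split text into consecutive, non-overlapping n-grams (last may be shorter)."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def huffman_codex(frequencies: dict[str, int], alphabet: str) -> dict[str, str]:
    """Build a prefix-free codex over `alphabet`; frequent symbols get short words."""
    base = len(alphabet)
    if base < 2:
        raise ValueError("channel alphabet needs at least two letters")
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {next(iter(frequencies)): alphabet[0]}
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), ("leaf", s)) for s, w in frequencies.items()]
    # Pad with zero-weight dummies so that every merge takes exactly `base` nodes.
    while (len(heap) - 1) % (base - 1) != 0:
        heap.append((0, next(tiebreak), ("dummy", None)))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(base)]
        weight = sum(child[0] for child in children)
        node = ("node", [child[2] for child in children])
        heapq.heappush(heap, (weight, next(tiebreak), node))

    codex: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        kind, payload = node
        if kind == "leaf":
            codex[payload] = prefix
        elif kind == "node":
            for letter, child in zip(alphabet, payload):
                walk(child, prefix + letter)

    walk(heap[0][2], "")
    return codex


def encode(text: str, n: int, alphabet: str) -> tuple[str, dict[str, str]]:
    grams = ngrams(text, n)
    codex = huffman_codex(Counter(grams), alphabet)
    return "".join(codex[g] for g in grams), codex


def decode(sequence: str, codex: dict[str, str]) -> str:
    reverse = {word: gram for gram, word in codex.items()}
    out, word = [], ""
    for letter in sequence:
        word += letter
        if word in reverse:  # safe because the codex is prefix-free
            out.append(reverse[word])
            word = ""
    if word:
        raise ValueError("sequence ends in the middle of a codeword")
    return "".join(out)
```

  • The codex must be stored or transmitted alongside the encoded sequence, since decoding depends on it.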
  • the incoming data set, having been converted to letters representing bases may be further manipulated or synthesized into nucleic acids or other analogous polymers for storage.
  • the channel alphabet selected is comprised of 24 letters representing 12 nucleobases, each with a discernable prime state achieved through modification.
  • the following letters are used: "A", "T", "G", "C", "S", "B", "V", "J", "K", "X", "Z", "P", "a", "t", "g", "c", "s", "b", "v", "j", "k", "x", "z", and "p", with each capital letter representing an unmodified synthetic nucleobase, and each lowercase letter representing the discernibly modified synthetic nucleobase.
  • the PNG file of FIG. 1 is analyzed for trigrams with a frequency analysis of said trigrams.
  • Each of said words of varying length is then used to represent one of the unique trigrams generated from said trigram analysis.
  • One embodiment of the resulting codex is depicted in Table 4.
  • the PNG file of FIG. 1 is converted to the specified 12 base prime channel alphabet lexicon by using the codex of Table 4 wherein each trigram is replaced with a unique base or sequence of bases represented by said lexicon.
  • the resulting data file is depicted in FIG. 6.
  • data or information may be compressed by adding one or more levels of abstraction.
  • data or information may be compressed by translating the data into other data systems or structures of information prior to compression for further optimization.
  • subsets of data may be encoded and compressed through means locally optimized for the subset and these sequences may be tagged to indicate this in addition to being able to be used for random access or search.
  • tagging may be accomplished through chemical modification, or physical modification, or any other means of distinguishing one or more bases or DNA string from another.
  • said tags may include one or more channel alphabets or lexicons different from the data stored on the remainder of the oligo.
  • Encryption is accomplished by encoding the data and information in a manner that it can be read only by the sender and the intended recipient in possession of the decoding, decompression, and/or decryption keys.
  • Encoding can be accomplished by character substitution (symmetrically or asymmetrically), hashing, adding one or more levels of abstraction, layering encodings, mechanical or formulaic ciphers, or any other means of disguising information, or combinations or derivatives thereof.
  • encryption is accomplished by substituting
  • encryption is accomplished by hashing, using variables, by adding one or more levels of abstraction, by converting the data into other data systems or structures of information, or combinations or derivatives thereof in accordance with a mechanical or formulaic cipher.
  • data may be encrypted through selectively inserting tags within the data storage regions of DNA sequences. These tags may include information or indicate as to which decoding, decompression, and/or decryption keys are to be used to access the dataset or subset thereof.
  • tags may be accomplished through chemical modification, or physical modification, or any other means of distinguishing one nucleobase or DNA string from another.
  • data may be encrypted by substituting words, phrases, n-grams, characters, symbols, and/or abstractions, and/or combinations thereof within the incoming data stream prior to conversion to characters representing bases or words from within the channel alphabet and its subsequent lexicon.
  • ASCII characters, words, formulas or other communication elements may be represented by more than one individual base or combinations thereof from within the channel alphabet.
  • Tagging is accomplished by appending or otherwise modifying bases or DNA strings for address, organization, random access, meta-tagging, or keyword searches that enables selectively sorting and filtering.
  • FIG. 7 further illustrates how a data sequence (green) encoded in DNA (line A) can be segmented into portions (line B), where each segment can include a prefix (orange) or suffix (blue) tag. These segments can also be varied in length and staggered (lines C & D). These encodings can also be paired with their complementary primers (line E, red - line G, blue) for PCR replication (line H) or isolation for further manipulation such as editing or deletion.
  • a dataset can be divided into segments of 100-150 bases in length. These segments can be assigned a prefix tag and/or a suffix tag. These tagging regions can consist of a combination of natural and/or non-standard bases.
  • the tags can contain information relevant to the sequential order of each data segment for reading, compression data or encryption motifs.
  • the tags can also provide the filename, data type, and/or meta tag information.
  • the tags can be used to readily access portions of data specifically through a random access process. Using a mixture of genetic alphabets, these sequences can be shorter and more efficient than naturally occurring DNA sequences alone. This method of retrieval can further be used to copy, edit or delete the data segment.
  • Tagging of nucleic acids or analogous polymers, or sequences thereof further comprises identification through chemical modification, or physical modification, or any other means of distinguishing one or more base or DNA string from another.
  • tagging might be accomplished by adding one or more sequences of bases to the beginning, end or amidst the data stored on an oligo for the purpose of organizing, addressing, meta-tagging, or identifying keywords or other elements contained within the data.
  • Said tags might be comprised of lexicons from within one or more channel alphabets that are optimized in a manner different from the data stored on the oligo, and may also include an indication as to the appropriate decoding, decompression, and/or decryption keys to be used to access the data on the strand.
  • High fidelity amplification of specific data sets, or portions thereof, is accomplished by introduction of the appropriate complementary primers of the prefix and suffix sequences.
  • prefixes or suffixes are appended to the DNA strings to allow for random access.
  • prefixes or suffixes may be appended to the DNA strings in the form [prefix] [data] [suffix] as depicted in FIG. 7.
  • a string of data is divided into segments of 150 nucleobases, each with a 25 base prefix and a 25 base suffix that carry sequence information of the data packet and metadata for search, and that also serve as a site for specific and highly selective random access.
  • the current invention improves amplification due to increased affinity of these mixed modified bases to themselves versus a data region which may comprise of only ATGC bases to simplify the DNA polymerization/replication paradigm.
  • the current invention targets the desired sequence with complementary 3'-5' and 5'-3' prefix and suffix primers.
  • targeting provides for isolation, editing, replication, random access reading of data and information within the repository.
  • editing and error correction can be accomplished by means of CRISPR (clustered regularly interspaced short palindromic repeats)/CAS 9, and newer generations of the technology, for site-specific modifications, additions or deletions to a sequence.
  • damage to the oligos can be repaired by a sequence of a series of enzymes including but not limited to glycosylases, endonucleases, polymerases and ligases.
  • the channel alphabets and the subsequent lexicons with which data or information is encoded in order to be stored can be locally optimized based on the source dataset or subset thereof by assigning weights of importance placed on one or more of the following variable design parameters, which may include but are not limited to, compression, encryption, compatibility, random access, search & filtering, channel coding, cost, universal access, self-healing, editability, re-writability, duplicability, and/or security.
  • variable design parameters may include but are not limited to, compression, encryption, compatibility, random access, search & filtering, channel coding, cost, universal access, self-healing, editability, re-writability, duplicability, and/or security.
  • Various embodiments of said local optimization are not limited to, and may include, one or more of the methods of encoding, compressing, and/or encrypting information described herein in order to achieve the optimal or one of many ideal storage and/or transfer methods.
  • FIGS. 1-11 are merely illustrative and may not be exhaustive.
  • the present invention also provides a method for storing data, structured or unstructured, and knowledge, books, papers or text.
  • the method enables storing data in DNA or analogous backbone structures by substituting the information in the incoming data stream with characters that represent a base.
  • Using the natural nucleotides of A, G, T, C it is possible to create up to 256 unique combinations of pairing. Beginning with AAAA and ending with CCCC, each character in the ASCII system is assigned a unique combination of four letters. Table 1 represents one potential way the nucleotides could be paired with the ASCII characters using the present limited state of the art.
  • Table 2 depicts one embodiment of an ASCII to an 8 base channel alphabet with a codex using an 8^3 schema to generate codons that comprise the encoding lexicon.
  • sequentially repeating characters within an incoming data stream are substituted with characters that represent the repetition tally of the sequentially repeating character and the character that is repeating.
  • the substituted characters can be encoded into the incoming data stream in the form "BA", where "A” represents the sequentially repeating character and "B” represents the number of times that said character repeats sequentially in a given instance.
  • null represents a single repetition and "0" represents 10 repetitions.
  • FIG. 4 represents one potential way the sequentially repeating characters within the PNG file contained in FIG. 1 can be compressed using the "BA" form. This pattern of replacing sequentially repeating characters with a character or characters that represent the number of times said character repeats and said character is continued until the PNG file from FIG. 1 is converted into the file of FIG. 4. Once the PNG file is compressed as depicted in FIG. 4, it is translated to a string of bases using Table 2. This string is further compressed by the generation of a table wherein the numbers 0-9 and the characters represented in the file are solely present. Using a table string consisting of the sequence of codons from Table
  • the present invention also provides a method of encrypting compressed data sets using n-grams and an n-ary codex then randomizing and/or substituting values.
  • compressed data sets are analyzed for frequency distribution of n-grams within the compressed data set to generate an optimal unambiguous lexicon of words within the selected n-ary channel through n-ary Huffman encoding.
  • These words are in turn represented by a specified base or sequence of bases in a manner which assigns the most commonly used words to require the least amount of bases.
  • the PNG file from FIG. 1 is used to generate a list of unambiguous trigrams.
  • a frequency analysis of these trigrams is run to create a codex wherein the most frequently occurring trigrams are represented by the least number of characters in the codex.
  • One such resulting codex is depicted in Table 4.
  • the PNG file from FIG. 1 is converted to the channel alphabet by symmetrically substituting each trigram from the incoming data set or information with representative characters from the codex of Table 4.
  • the resulting data file is depicted in FIG. 6.
  • the trigrams may be substituted asymmetrically.
  • the data may be further optimized or encoded by means of tagging of nucleic acids or analogous polymers, or sequences thereof.
  • further optimization or encoding comprises tagging through chemical modification, or physical modification, or any other means of distinguishing one or more bases or DNA string from another.
  • the present invention also provides a method for storing knowledge, books, papers or text in DNA.
  • FIG. 8 represents an excerpt from the children's book written by Hans Christian Andersen, "The Ugly Duckling."
  • FIG. 9 provides the file represented in FIG. 8 after the excerpt was encoded using a lexicon of codons in an 8^3 schema from Table 6 with lengths of 3 bases, from a channel alphabet of 8 bases (ATGCZPXY) and substituted back into the original data file.
  • FIG. 10 provides the file represented from FIG. 9 after compression by frequency analyzing the trigrams (codons) in the encoded excerpt then performing an optimal Huffman compression using the same channel alphabet to create a compressed lexicon.
  • FIG. 11 provides the file represented in FIG. 10 after encryption was performed by taking the already encoded, then compressed excerpt and performing asymmetric substitutions using a lexicon of uniquely decodable words, using the same channel alphabet as in the previous steps, generated by performing an expanded Huffman compression then randomizing the values assigned to each word in the previously compressed file's lexicon.
  • a codex is created to convey certain data and information.
  • a codex is generated using the human genome in such a way that a treating physician can scan said patient's DNA and generate a data repository depicting the patient's entire medical history, records and history of illness and injury.
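
The n-gram frequency analysis and n-ary Huffman encoding named in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the codex of Table 4: trigrams are taken without overlap, the 8-letter channel alphabet ATGCZPXY is borrowed from the FIG. 9 description, and the function names `trigram_counts` and `nary_huffman` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def trigram_counts(text):
    """Count non-overlapping trigrams (an assumption; the source does not specify overlap)."""
    return Counter(text[i:i + 3] for i in range(0, len(text), 3))


def nary_huffman(freqs, alphabet):
    """Return a prefix-free codex mapping each symbol to a word over `alphabet`."""
    n = len(alphabet)
    tie = count()  # unique tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    # Pad with zero-weight dummies so that every merge combines exactly n nodes.
    while (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        group = [heapq.heappop(heap) for _ in range(n)]
        total = sum(f for f, _, _ in group)
        heapq.heappush(heap, (total, next(tie), [node for _, _, node in group]))

    codex = {}

    def walk(node, word):
        if isinstance(node, list):          # internal node: one branch per letter
            for letter, child in zip(alphabet, node):
                walk(child, word + letter)
        elif node is not None:              # real symbol (dummies are skipped)
            codex[node] = word or alphabet[0]

    walk(heap[0][2], "")
    return codex


codex = nary_huffman(trigram_counts("ABCABCABCXYZABCQRS"), "ATGCZPXY")
```

Because the most frequent trigrams are merged last, they end up closest to the root and receive the shortest words, which is the property the list above asks of the codex.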

Landscapes

  • Physics & Mathematics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention solves the unmet need by providing a method for encoding and storing digital information in DNA or the sugar phosphate backbone of DNA or analogous structural mediums by use of expanded alphabet channels and derivative lexicons. Expanding the number of molecular vehicles used for storing information expands the universe of utilizable n-ary channel alphabets used to represent the expanded variable nucleobases which addresses the unmet need in addition to facilitating more efficient and effective secure transfer of digital information. The present invention further addresses the optimizing of encoding data or information into lexicons representative of expanded alphabet nucleic acid or analogous polymer libraries and thus allowing for greater degrees of compression, encryption, random access, compatibility and optimization of data storage across various types of digital information including but not limited to structured data, unstructured data, text files, video files, image files and audio files.

Description

DEVICE FOR INFORMATION ENCODING AND, STORAGE USING ARTIFICIALLY EXPANDED ALPHABETS OF NUCLEIC ACIDS AND OTHER ANALOGOUS POLYMERS
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims priority from U.S. Provisional Application No. 62/549,680, filed August 24, 2017, which is incorporated by reference herein in its entirety.
FIELD OF INVENTION
[0002] The present invention relates to a method and approach to encoding, storage and/or transfer of information or data using artificially expanded alphabets consisting of combinations of natural, synthetic, modified, non-natural nucleic acids, and/or other analogous polymers in sequence on a DNA backbone or analogous structural polymer.
BACKGROUND OF INVENTION
[0003] The Information Age's thirst for data and processing is rapidly evolving to the point where a binary system for storing and transmitting information on silicon will no longer be viable. Moreover, information stored on silicon is susceptible to complete loss due to electronic interference. Digitization of knowledge, interconnectivity of individuals and entities, and new and unique data capture devices have made the rapid capture and storage of extremely large data sets routine. As we move towards the interconnectivity of devices and artificial intelligence, the amount and rate at which data is collected and stored are likely to exceed the limitations of the current silicon-binary channel paradigm.
[0004] One solution being used to address the issue of expanding the ability to store information and data past the silicon-binary channel is the use of DNA as a storage medium. The process for storing information in DNA nucleobases existing in the art is a stepped process by which ASCII characters are matched with unique combinations of the characters A, T, G, and C which represent the naturally occurring DNA nucleobases adenine, thymine, guanine, and cytosine. Storing information with synthetic, modified, or non-natural nucleic acids, or other polymers or combinations thereof using data compression and encryption not only changes the medium on which data is stored, it changes the potential naming nomenclatures for encoding, storing, compressing, and transmitting information from the current Hindu-Arabic numerals and letters to be represented by physical chemical compounds without losing any of the data or the knowledge inherent in the information. Each nucleic acid base, modified or otherwise, and analogous backbone structures (molecular image) may represent a unique ASCII character, a word, a formula or other communication element (thought or phonemes).
[0005] No process or method of storing information using sets of expanded alphabets above and beyond the limits of the A, T, G, and C letters of naturally occurring DNA nucleobases currently exists. Use of such an expanded method of storing information would address the capacity limitations of a silicon-binary channel schema, as well as the limitations of storing data on strands of sequenced DNA subcomponents in their naturally occurring states. A relatively small mass of DNA has the capacity to store large amounts of data; an even larger amount of data can be stored on a smaller mass of an expanded alphabet channel that combines naturally occurring elements such as ATGC with modified and non-naturally occurring elements that would otherwise not be synthesized by nature or man. Using an ASCII to four base schema, one gram of DNA has the potential to store greater than 215 petabytes (215 million billion bytes) of data. It has been estimated that every bit of data ever recorded by humans could fit in a container the size and weight of two pick-up trucks. But a method limited to the 4 naturally occurring DNA nucleobases is an inefficient method of storing data or for creating knowledge repositories. Thus there remains an unmet need for a method or process of storing information using sets of expanded alphabets above and beyond the limits of the A, T, G, and C letters of naturally occurring DNA nucleobases to allow for the storage of greater than 215 petabytes (215 million billion bytes) of data.
[0006] Approaches currently available for storing information in non-modified, four nucleobase DNA with sequences 4 bases in length are limited to 256 characters (4^4 = 256). This requires information and data to be transmitted one character at a time, by converting them to ASCII characters and then translating the characters into characters representing natural DNA nucleotides encoded with codon lengths of four letters, each with four possible values.
[0007] Other approaches translate ASCII codes to binary (8 bits per ASCII code) and then to genetic lexicons which correspond to either a 1 or a 0.
[0008] These approaches require many oligo strands of 150 - 200 DNA nucleotides. Given that more data has been created in the last two years than in all of preceding history, it is not difficult to imagine a time in the near future where the capacity to generate, transmit and store DNA will be insufficient to meet the need. This is especially true as the rate of data creation and data capture continues to expand exponentially. Thus there remains an unmet need.
SUMMARY OF THE INVENTION
[0009] The present invention solves the unmet need by providing a method for encoding and storing digital information in DNA or the sugar phosphate backbone of DNA or analogous structural mediums by use of expanded alphabet channels and derivative lexicons. Expanding the number of molecular vehicles used for storing information expands the universe of utilizable n-ary channel alphabets used to represent the expanded variable nucleobases which addresses the unmet need in addition to facilitating more efficient and effective secure transfer of digital information. The present invention further addresses the optimizing of encoding data or information into lexicons representative of expanded alphabet nucleic acid or analogous polymer libraries and thus allowing for greater degrees of compression, encryption, random access, compatibility and optimization of data storage across various types of digital information including but not limited to structured data, unstructured data, text files, video files, image files and audio files.
[0010] The present invention further provides a method of storing information as ASCII characters, words, formulas or other communication elements (thoughts or phonemes) and combinations thereof as physical polymers in sequence on a structural medium forming codons in an X^n schema using an X base channel alphabet.
[0011] The present invention also provides methods of encoding and storing datasets or subsets which can be encoded and stored by optimizing around a set of specifications and/or considerations which may include compression, encryption, compatibility, random access, searchability, channel coding, cost, standardization, security, amongst other considerations that may be present with any data storage.
[0012] The present invention also provides a method of encrypting encoded data, or data to be encoded, in n-ary channels representing expanded alphabets of bases and derivative lexicons using various means of substitutions, hashing, and/or scrambling or combinations thereof or other similar methodologies.
[0013] The present invention also provides a method of tagging and addressing information stored on a DNA backbone or analogous structural medium. The use of molecular tags to label individual oligos aids in identification, decoding, navigation, random access, searching, encryption and error detection.
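
The tagging and addressing idea can be sketched as follows. This is a minimal illustration, assuming the [prefix] [data] [suffix] layout with 150 base data segments and 25 base tags; the tag letters "ZPXY", the use of the prefix as a segment index, the caller-supplied suffix, and all function names are hypothetical choices. The source does not state whether the 150 bases include the tags; this sketch assumes they do not.

```python
TAG_ALPHABET = "ZPXY"  # hypothetical tag letters, distinct from an ATGC data region
TAG_LEN = 25           # bases per prefix or suffix tag
SEGMENT_LEN = 150      # data bases per segment


def index_to_tag(index, alphabet=TAG_ALPHABET, width=TAG_LEN):
    """Write a segment index as a fixed-width word over the tag alphabet."""
    base = len(alphabet)
    letters = []
    for _ in range(width):
        index, digit = divmod(index, base)
        letters.append(alphabet[digit])
    if index:
        raise ValueError("index does not fit in the tag width")
    return "".join(reversed(letters))


def tag_to_index(tag, alphabet=TAG_ALPHABET):
    """Recover the segment index from a prefix tag."""
    value = 0
    for letter in tag:
        value = value * len(alphabet) + alphabet.index(letter)
    return value


def tag_segments(data, suffix):
    """Split `data` into segments and wrap each as prefix + data + suffix."""
    if len(suffix) != TAG_LEN:
        raise ValueError("suffix must be TAG_LEN bases long")
    return [
        index_to_tag(start // SEGMENT_LEN) + data[start:start + SEGMENT_LEN] + suffix
        for start in range(0, len(data), SEGMENT_LEN)
    ]


def reassemble(oligos):
    """Order oligos by their prefix index and strip the tags."""
    ordered = sorted(oligos, key=lambda o: tag_to_index(o[:TAG_LEN]))
    return "".join(o[TAG_LEN:-TAG_LEN] for o in ordered)
```

Keeping the tag alphabet disjoint from the data alphabet means a tag can be located without knowing the segment boundaries in advance, which is one reading of why the tag regions may use a different channel alphabet from the data.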
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] FIG. 1 provides a PNG file represented by ASCII characters.
[0015] FIG. 2 provides the PNG file represented in FIG. 1 wherein ASCII characters are substituted with unique sequence of 4-DNA bases using the values from Table 1, and substituted back into the original data file.
[0016] FIG. 3 provides the PNG file represented in FIG. 1 wherein ASCII characters are substituted with codons with lengths of 3 variable bases from an 8 base channel alphabet comprising an 8^3 schema using the values from Table 2, and substituted back into the original data file.
[0017] FIG. 4 provides the PNG file represented in FIG. 1 after compression wherein sequentially repeating character strings in the incoming data file are replaced by one or two characters and said characters are substituted into the incoming data file in place of the sequentially repeating characters. Said replacement characters are in the form "BA" wherein "A" represents the sequentially repeating character and "B" represents a tally of the number of sequential repetitions of said character. Additionally, null represents "1" and "0" represents ten.
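
The "BA" replacement described for FIG. 4 can be sketched as follows. This is a minimal illustration with hypothetical function names. Two points are assumptions not stated in the source: runs longer than ten are split into several "BA" units, and the input is assumed to contain no digit characters, since a literal digit would be indistinguishable from a tally.

```python
from itertools import groupby

DIGITS = "0123456789"


def rle_encode(text):
    """Replace each run of a repeated character with a tally digit plus the character."""
    out = []
    for char, run in groupby(text):
        if char in DIGITS:
            raise ValueError("digits in the input would be ambiguous with tallies")
        remaining = len(list(run))
        while remaining > 0:
            chunk = min(remaining, 10)
            if chunk == 1:
                out.append(char)               # null tally: a single occurrence
            elif chunk == 10:
                out.append("0" + char)         # "0" stands for ten repetitions
            else:
                out.append(str(chunk) + char)  # tallies 2 through 9
            remaining -= chunk
    return "".join(out)


def rle_decode(encoded):
    """Invert rle_encode."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch in DIGITS:
            tally = 10 if ch == "0" else int(ch)
            out.append(encoded[i + 1] * tally)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

For example, `rle_encode("AAAB")` gives `"3AB"` and a run of ten `"C"` characters gives `"0C"`.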
[0018] FIG. 5 provides the compressed PNG file represented in FIG. 4 wherein each character contained in the source data represented in FIG. 1 is substituted with an assigned sequence of letters representing a 6 base channel alphabet using the values from Table 5, and substituted back into the original data file.
[0019] FIG. 6 provides the compressed PNG file represented in FIG. 1 wherein each character contained in the source data is substituted with a uniquely decodable sequence of varying length derived from a Huffman encoding utilizing a channel alphabet of 8-DNA bases, each with a prime state achieved by adornment or modification of the nucleobase using the values from the codex in Table 4.
[0020] FIG. 7 provides a sample means of tagging oligos for random access and search using different channel alphabets on the tag regions.
[0021] FIG. 8 provides an excerpt of the children's story, "The Ugly Duckling" by Hans Christian Andersen.
[0022] FIG. 9 provides the file represented in FIG. 8 after the excerpt was encoded using codons in an 8^3 schema from Table 6 from within a channel alphabet of 8 bases (ATGCZPXY), and substituted back into the original data file.
[0023] FIG. 10 provides the file represented in FIG. 9 after compression by frequency analyzing the trigrams (codons) then performing an optimized Huffman compression using the same channel alphabet of ATGCZPXY.
[0024] FIG. 11 provides the file represented in FIG. 10 after encryption was performed by taking the already encoded then compressed excerpt and performing asymmetric substitutions using a set of uniquely decodable words created from the existing channel alphabet of ATGCZPXY by creating an expanded Huffman tree then randomizing the values to be assigned to represent each word.
[0025] FIG. 12 provides a structure representing, but not limiting, potential modification sites on a nucleoside heterocycle and carbohydrate.
[0026] FIG. 13 presents potential modification sites to a sugar phosphate backbone. In the absence of a heterocycle, an abasic site is present where R1=H. Here R2=H for DNA, R2=OH for RNA, R2=OR for an ether moiety, or R2=R for various linkers, decorations or fluorophores. The spacing of a sugar phosphate backbone can be maintained while eliminating a portion of the ribose carbohydrate where R3=H, R4=H and R5=H. Additionally, modification sites R3, R4, and R5 can host a variety of substitutions to create unique spacers.
DETAILED DESCRIPTION OF THE INVENTION
[0027] The present invention provides a method for encoding, compressing, encrypting, and storing digital information in a combination of natural, modified, and/or synthesized DNA, and/or analogous backbone structures and/or mirror image enantiomeric structures through use of an expanded lexicon or channel alphabet. The present invention also provides a method for storing structured or unstructured data and knowledge in the form of books, papers, text, formulas, or other communication elements in natural, modified, and/or synthesized DNA, and/or analogous backbone structures and/or enantiomers thereof.
[0028] The following detailed description is merely exemplary in nature and is in no way intended to limit the scope of the invention, its application, or uses, which may vary. The invention is described with relation to the non-limiting definitions and terminology included herein. These definitions and terminology are not designed to function as a limitation on the scope or practice of the invention, but are presented for illustrative and descriptive purposes only.
[0021] It is to be understood that in instances where a range of values is provided, the range is intended to encompass not only the end point values of the range but also intermediate values of the range as explicitly being included within the range and varying by the last significant figure of the range. By way of example, a recited range from 1 to 4 is intended to include 1-2, 1-3, 2-4, 3-4, and 1-4.
[0022] As used herein the word "base" shall mean any of the multitude of physical structures used to represent and/or store data or information which include the distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, or spacers in addition to the mirror image enantiomers of said structures that can be attached at a binding or bonding point on a DNA backbone or analogous structural medium.
[0023] As used herein "mirror image" shall mean enantiomers of the structures referred to.
[0024] As used herein "n-ary" schema shall mean a schema where n is the number of differentiable bases or representative characters, symbols, letters, markings or combinations thereof that the system uses for its channel alphabet and derivative lexicons. For example a binary system uses two digits (0,1) in its channel alphabet and the universe of all possible lexicons within it is limited to permutations of said characters.
[0025] As used herein "lexicon" shall mean a specified set of arrangements of bases or their representative letters, characters, symbols, markings, or combinations thereof derived from within a channel alphabet that are in turn used to represent symbols, characters, words, or other pieces of information, or combinations thereof from within the source information or data, or subset, or derivative thereof.
[0026] As used herein "n-gram analysis" shall mean an analysis of all combinations of adjacent symbols or characters or letters or words or other pieces of information of length n found in the source dataset or subset, or derivative thereof.
[0027] As used herein "channel alphabet" shall mean a particular n-ary set of characters representing the underlying set of distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, spacers or mirror images thereof used for encoding source data or information to be stored.
[0028] As used herein "X base system" shall mean an n-ary channel alphabet where X is equal to n. Thus a 6 base system would have a set of six (6) letters representing a set comprised of 6 distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, or spacers. For example one six (6) base system might include the letters A,T,G,C,Z,P.
[0029] As used herein "X^n schema" shall mean an X base system raised to a power of n, where n is the uniform length of a given codon in a channel lexicon comprised of that channel's alphabet. For example an 8^3 schema would represent codons with length of three (3) variable bases with eight (8) possible variables per position, for a total of 512 possible differentiable codons.
[0030] As used herein "codon" shall mean a coding comprised from the letters within a specified channel alphabet that allows for the creation of a uniquely decodable lexicon consisting solely of words of a uniform length.
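
The X base system, X^n schema, and codon definitions above can be made concrete with a short sketch. The function names are hypothetical; the alphabet ATGCZPXY is the 8 base example used for FIG. 9, and the ordering of letters within it is an assumption.

```python
from itertools import product


def fixed_length_codons(alphabet, n):
    """Enumerate every codon of length n over the channel alphabet (an X^n schema)."""
    return ["".join(letters) for letters in product(alphabet, repeat=n)]


def min_codon_length(alphabet_size, symbol_count):
    """Smallest n such that alphabet_size ** n covers symbol_count distinct symbols."""
    n, capacity = 1, alphabet_size
    while capacity < symbol_count:
        n += 1
        capacity *= alphabet_size
    return n


codons_8_3 = fixed_length_codons("ATGCZPXY", 3)  # 8^3 = 512 codons
codons_4_4 = fixed_length_codons("ATGC", 4)      # 4^4 = 256 codons
```

Under these definitions, covering 256 source symbols takes codons of length 4 with four bases but only length 3 with eight bases, since 8^2 = 64 falls short of 256 and 8^3 = 512 exceeds it.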
General
[0031] The present inventive method enables storing data in natural, modified, or synthesized DNA or analogous backbone structures or their mirror images by substituting the symbols, letters, characters, words, phrases, formulas, abstractions, and/or other knowledge elements (phonemes or thoughts), and/or combinations thereof from within the incoming data or information with characters from within one or more channel alphabets representing the expanded universe of natural, non-natural, modified, or synthesized DNA or analogous backbone structures or mirror images upon which the data is then physically stored. Said stored information can be organized by arranging sequences of physically stored information into larger structural matrixes such as dendrimers. Said arrangement may be accomplished by inserting structural or binding polymers into the sequence. Further organization can be accomplished by introducing histones or other analogous peptides into the storage medium.
[0032] The benefit of this new method and approach is clearly evident when compared to the existing art. In a channel alphabet comprised solely of the natural nucleotides of adenine (A), guanine (G), thymine (T), cytosine (C) it is possible to create a lexicon of up to 256 unique combinations of pairings using a 4^4 schema for codon generation. Beginning with AAAA and ending with CCCC, each character in the ASCII system is assigned a unique combination of the four nucleobases. Table 1 represents one potential way the nucleotides could be paired with the ASCII characters. To convert the PNG file contained in FIG. 1, each ASCII character is replaced by characters corresponding to one of the 256 codons comprised of the potential combinations of the 4 naturally occurring DNA nucleobases. For example, the first letter of the PNG file contained in FIG. 1, "M", is replaced with the characters from Table 1 that represent the unique combinations of four natural nucleotides correlating to the ASCII character "M". In this embodiment those four letters would be "TACT". Once all of the ASCII characters have been substituted with the corresponding nucleotide pairings from Table 1, FIG. 1 will have been converted into the representation depicted in FIG. 2.
[Table 1: image placeholders imgf000012_0001 through imgf000019_0001]
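
The ASCII to four-letter codon substitution described above can be sketched as follows. The actual pairings of Table 1 are not reproduced in this text, so the mapping here is an assumption: each character code is written as a four-digit base-4 number with the digit order A, T, G, C. That choice is consistent with the three facts the text does give (the table begins with AAAA, ends with CCCC, and maps "M" to "TACT"), but other pairings could satisfy them too. Function names are hypothetical.

```python
ALPHABET = "ATGC"  # assumed digit order: A=0, T=1, G=2, C=3
CODON_LEN = 4


def char_to_codon(ch):
    """Write a character code (0-255) as a four-letter codon."""
    value = ord(ch)
    if value >= len(ALPHABET) ** CODON_LEN:
        raise ValueError("character code does not fit in a 4^4 schema")
    letters = []
    for _ in range(CODON_LEN):
        value, digit = divmod(value, len(ALPHABET))
        letters.append(ALPHABET[digit])
    return "".join(reversed(letters))


def codon_to_char(codon):
    """Invert char_to_codon."""
    value = 0
    for letter in codon:
        value = value * len(ALPHABET) + ALPHABET.index(letter)
    return chr(value)


def encode(text):
    return "".join(char_to_codon(ch) for ch in text)


def decode(bases):
    return "".join(
        codon_to_char(bases[i:i + CODON_LEN]) for i in range(0, len(bases), CODON_LEN)
    )
```

With this assumed ordering, `char_to_codon("M")` returns `"TACT"`, matching the example in the text.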
[0033] One limitation with using natural nucleotides is that it restricts the incoming data to ASCII characters and the method of conversion to a 4^4 schema constrained by the channel alphabet of four nucleobases. As Table 1 illustrates each ASCII character is paired with one of the 256 four letter codons comprised of the combinations of the letters representing natural nucleotides A, G, T, C. Binary data streams must be converted to ASCII prior to conversion to four letter codons. The process of converting from binary to ASCII to four (4) letter codons that can later be used to synthesize the represented nucleobases presents size, scope and methodology issues.
[0034] For example, Harvard University has 17 million volumes, recordings, titles and digital files in its collection. The Library of Congress has over 16 million books and 120 million other items including Chinese wood-block prints. If one were to assume that a book of average length contains 40,000 words, and a word of average length is six letters long, and a synthetic DNA strand of average length is 150 nucleotides long, each oligo synthesized would on average be limited to just 6 words using the schema depicted in Table 1. Therefore the average book would require 6,667 oligos to store it using only naturally occurring nucleobases. Consequently the state of the art would require 1.0667e+11 strands of DNA to store all of the Library of Congress's 16 million books in DNA, without accounting for sequences of bases required for addressing to allow for random access which would significantly increase the amount of DNA needed to be synthesized.
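
The arithmetic in the preceding paragraph can be checked with a short sketch. The default values are the ones stated in the text; the function name and the final comparison against three-letter codons are illustrative additions, not figures from the source. Like the text, the sketch ignores spaces, punctuation and addressing overhead.

```python
def oligos_per_book(words_per_book=40_000, letters_per_word=6,
                    oligo_len=150, codon_len=4):
    """Whole words that fit on one oligo, then oligos needed per book (rounded up)."""
    chars_per_oligo = oligo_len // codon_len           # 150 // 4 = 37 characters
    words_per_oligo = chars_per_oligo // letters_per_word  # 37 // 6 = 6 words
    return -(-words_per_book // words_per_oligo)       # integer ceiling division


per_book = oligos_per_book()             # 6,667 oligos
library_total = per_book * 16_000_000    # 106,672,000,000, about 1.0667e+11

# Illustrative comparison only: three-letter codons, as in an 8^3 schema.
per_book_8_3 = oligos_per_book(codon_len=3)
```

Under the same assumptions, three-letter codons fit 8 words per oligo, so the illustrative comparison needs 5,000 oligos per book rather than 6,667.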
[0035] The present invention begins by receiving and/or analyzing the data or information that is to be stored in a sequence of DNA or analogous backbone structure. In some embodiments the source data or information to be stored may be letters, numbers, alphanumeric characters, binary, decimal, hexadecimal, words, phrases, abstractions, and/or other means of communicating data or information known in the art, and/or combinations thereof. The data or information to be stored may be pre-processed or translated into an alternative form, or may be stored directly.
[0036] The present method then includes the step of encoding the data or information by translating the incoming data or information into one or more optimized channel alphabets representing a combination of distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, spacers or mirror images of said structures for storage, synthesis, and/or transfer. Encoding may include substituting symbols, characters, words, phrases, n-grams, abstractions and/or combinations thereof from the data or information to be stored for the purpose of compressing or encrypting the data.
[0037] The present method optionally includes the additional step of synthesizing DNA or an analogous backbone structure based on the translated data or information using a combination of nucleobases (natural or synthetic), non-natural/artificial nucleobases, modified nucleobases, other analogous polymers, or spacers or their mirror images. Many DNA synthesis methods are known in the art, and nothing herein is intended to limit the methods of DNA synthesis to be used in the inventive processes described herein. It should be appreciated that data may be sent off for synthesis using any of the methods known in the art or it may be stored without synthesizing. In some embodiments, DNA synthesis may be accomplished by DNA replication, polymerase chain reaction (PCR), gene synthesis, oligonucleotide synthesis, base pair synthesis, peptide substitution, or combinations or derivatives thereof.
[0038] Finally, the present method includes the additional step of tagging and addressing DNA by means of chemical modification, or physical modification, or any other means of distinguishing one nucleobase or DNA strand from another for aid in identifying, decoding, navigating, randomly accessing, organizing, searching, filtering, encrypting and detecting errors within stored or synthesized encoded, compressed and encrypted data and information or any combination thereof.
Encoding
[0039] The present invention stores information on synthetic DNA backbones or analogous structural mediums by substituting incoming data or information with letters, symbols, characters, and/or markers correlating to distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, spacers or their mirror images. It provides a method for encoding the data or information by processing and/or analyzing said data or information, then dividing it into one or more sets of symbols, characters, words, phrases, n-grams, and/or abstractions, and/or combinations thereof, then translating those divisions and/or subdivisions into one or more lexicons, each derived from one or more channel alphabets selected to optimally represent these divisions with a combination of distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, or spacers, or mirror images thereof to facilitate storage once synthesized.
[0040] Encoding of the source data or information is performed by the creation of one or more n-ary channel alphabets representing distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, other polymers, or spacers to be used for storing data and information, from which the lexicons for encoding are generated.
[0041] The encoding process is intended to substitute each unique symbol, character, word, phrase, formula, combinations thereof or other similar knowledge from within said data or information with one or more unique letters or combination of letters that comprise the encoding lexicon from an n-ary channel alphabet of "n" letters expanded beyond the four base system representing or comprised of natural nucleic acids. These expanded channel alphabets include characters representing Artificially Expanded Genetic Information System (AEGIS) synthetic DNA, artificial or non-terrestrial nucleobases, modified natural, modified synthetic, modified AEGIS synthetic, other non-natural DNA nucleobases that have been further modified to be made differentiable from the corresponding unmodified nucleobase, other polymers, or spacers or mirror images of any of said structures.
[0042] The expanded channel alphabets allow for lexicons of greater fidelity, accessibility, security, capacity, and/or density by using nucleobases that may include, but are not limited to, combinations of natural nucleobases, modified natural nucleobases, artificial nucleobases, AEGIS bases covering the 2, 3 and 4-hydrogen bond electron pair donor and acceptor patterns, size pairing bases (with or without hydrogen bonding), spacers where a nucleobase is omitted but the backbone remains, spacers where a nucleobase is omitted and only a minimal carbohydrate or backbone mimic remains, modifications or decorations of both carbohydrates and carbohydrate mimics.
[0043] Natural and non-natural nucleobases may be modified with, but not limited to, pyrimidines decorated with various ligands at the 4 or 5 positions or other positions as necessary, purines decorated with various ligands at the 7 or 8 positions or other positions as necessary, or combinations or derivatives thereof. Examples of potential modification sites are shown in FIG. 12.
[0044] Modified, natural, and artificial nucleobases and other polymers may include, but are not limited to: any of the natural nucleotides and nucleosides of adenine (A), guanine (G), thymine (T), cytosine (C), and uracil (U); any of the AEGIS nucleotides of 4-amino-1-methylpyrimidin-2(1H)-one (S), 6-amino-1,9-dihydro-2H-purin-2-one (B), 6-amino-3-nitropyridin-2(1H)-one (V), 4-aminoimidazo[1,2-a][1,3,5]triazin-2(8H)-one (J), 2,4-diaminopyrimidine (K), 5-aza-7-deaza xanthosine (X), 6-amino-5-nitro-2(1H)-pyridone (Z), 2-amino-imidazo[1,2-a]-1,3,5-triazin-4(8H)-one (P); 7-(2-thienyl)-imidazo[4,5-b]pyridine (Ds) and 2-nitro-4-propynylpyrrole (Px); unnatural base pairs relying on hydrophobic and packing interactions including, but not limited to, d5SICS (6-methylisoquinoline-1(2H)-thione) - dNaM (2-methoxynaphthalene), dTPT3 (thieno[2,3-c]pyridine-7(6H)-thione) - dNaM (2-methoxynaphthalene); or any modification of the natural nucleobases or artificial nucleobases using any spacers, polymers, mirror image polymers or combinations thereof. Examples of potential modification sites on nucleoside heterocycles are demonstrated in FIG. 12.
[0045] Wherein said spacers or polymers are polyethers (for non-aqueous applications), locked nucleic acids (LNA), threose nucleic acids (TNA), Peptide nucleic acids (PNA), Ribonucleic acids (RNA) or combinations or derivatives or mirror images or modifications thereof. Examples of potential modifications to a phosphodiester backbone are demonstrated in FIG. 13.
[0046] FIG. 13 further provides potential modifications to a sugar phosphate backbone. In the absence of a heterocycle, an abasic site is present where R1=H. Here R2=H for DNA, R2=OH for RNA, R2=OR for an ether moiety, or R2=R for various linkers, decorations or fluorophores. The spacing of a sugar phosphate backbone can be maintained while eliminating a portion of the ribose carbohydrate where R3=H, R4=H and R5=H. Additionally, modification sites R3, R4, and R5 can host a variety of substitutions to create unique spacers.
[0047] In one embodiment each of the 256 ASCII characters is assigned a codon in an 8^3 schema, using a subset of 256 codons from the lexicon of 512 possible codons of this uniform length that can be formed from the channel alphabet of A, T, G, C, Z, P, X, Y, as depicted in Table 2. FIG. 3 represents an encoding of the PNG file represented in FIG. 1 using said 8^3 schema depicted in Table 2.
Table 2. Extended ASCII to 8 base channel alphabet in an 8^3 schema.
[Table 2: table images not reproduced]
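A minimal sketch of an 8^3 lexicon follows. The letter order of `ALPHABET` and the choice of the first 256 codons as the assigned subset are assumptions for illustration; the actual pairings are those of Table 2, which is not reproduced here.

```python
from itertools import product

ALPHABET = "ATGCZPXY"  # eight-letter channel alphabet named in the text

# All 8**3 = 512 three-letter codons in positional (base-8) order.
ALL_CODONS = ["".join(p) for p in product(ALPHABET, repeat=3)]
PRIMARY = ALL_CODONS[:256]  # hypothetical subset assigned to the 256 characters
LOOKUP = {codon: value for value, codon in enumerate(PRIMARY)}


def encode(data: bytes) -> str:
    """Replace each 8-bit character with a three-letter codon."""
    return "".join(PRIMARY[b] for b in data)


def decode(strand: str) -> bytes:
    """Read the strand three letters at a time and look each codon up."""
    return bytes(LOOKUP[strand[i:i + 3]] for i in range(0, len(strand), 3))
```

Each character now costs three letters instead of four, so a strand of fixed length carries a third more characters than under the 4^4 schema.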
[0048] In another embodiment the remaining 256 codons in the 8^3 schema can be assigned to each of the 256 ASCII characters in addition to the codons depicted in Table 2 as a means of channel coding, allowing error prone sequences of codons to be avoided by providing an alternative codon with which each character can be encoded. Additionally, this channel-coding measure allows for increased data fidelity through synthesis of one or more additional permuted variants from the codex for redundancy, and can be furthered by having the second codon be the complementary base pair sequence to the first.
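One way to picture the alternative-codon idea is sketched below. Several things here are assumptions rather than part of the disclosure: character `b` is given codon number `b` as its primary and codon number `b + 256` as its alternate, an "error prone" sequence is taken to mean a run of more than `MAX_RUN` identical letters, and the encoder simply prefers the primary codon unless only the alternate avoids such a run.

```python
from itertools import product

ALPHABET = "ATGCZPXY"
CODONS = ["".join(p) for p in product(ALPHABET, repeat=3)]  # 512 codons
INDEX = {codon: i for i, codon in enumerate(CODONS)}
MAX_RUN = 3  # assumed limit on identical consecutive letters


def longest_run(s: str) -> int:
    """Length of the longest run of one repeated letter in s."""
    best = run = 0
    prev = ""
    for ch in s:
        run = run + 1 if ch == prev else 1
        best = max(best, run)
        prev = ch
    return best


def encode(data: bytes) -> str:
    """Use the primary codon unless it makes a long run and the alternate does not."""
    strand = ""
    for b in data:
        primary, alternate = CODONS[b], CODONS[b + 256]
        tail = strand[-MAX_RUN:]
        if (longest_run(tail + primary) > MAX_RUN
                and longest_run(tail + alternate) <= MAX_RUN):
            strand += alternate
        else:
            strand += primary
    return strand


def decode(strand: str) -> bytes:
    """Both codons of a pair decode to the same character."""
    return bytes(INDEX[strand[i:i + 3]] % 256 for i in range(0, len(strand), 3))
```

Because either codon of a pair decodes to the same character, the decoder needs no knowledge of which choice the encoder made.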
[0049] The current n-ary schema allows data and knowledge to be stored using expanded alphabets and subsequent lexicons that represent data and knowledge elements in their native form. In one embodiment every ASCII character, word (from every known language), formula and other communication element (phonemes or higher level thoughts) is represented within a single table by a unique twelve letter combination drawn from the twelve possible bases that comprise the channel alphabet. One practical application of the invention is the ability to store knowledge more effectively. Using the previous analogy, the Library of Congress's books could be stored on 5.12e+11 strands of DNA without having to be converted to ASCII characters. That is a savings of over 94%. Table 3 depicts one embodiment of an English language to base 12 DNA codex (truncated). Additionally, language elements are paired in a manner that allows knowledge created in one language to be read in all other languages. For example, in one embodiment American is represented with the letters AAAAAAAAGGGZ. The English word American is also linked to the Spanish Norteamericano (VAAAAAAAGGGZ), French Americain (JAAAAAAAGGGZ), Polish amerykanski (KAAAAAAAGGGZ), etc. in the same table, allowing knowledge to be stored in its native (or optimal) format and accessed in other formats.
Table 3 - English Language to 12 Base Channel Alphabet in 12^12 Schema Codex (truncated)
[Table 3: table image not reproduced]
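The cross-language linkage in the "American" example can be sketched as a lookup in which the first letter of the twelve letter word selects the language and the remaining eleven letters identify the concept. That reading is an inference from the four words quoted above, not something the text states; `LANGUAGE_PREFIX`, `CODEX` and `translate` are hypothetical names, and only the four quoted entries are included.

```python
# Assumed: the leading letter marks the language, the other eleven the concept.
LANGUAGE_PREFIX = {"English": "A", "Spanish": "V", "French": "J", "Polish": "K"}

# The four entries quoted in the text.
CODEX = {
    "AAAAAAAAGGGZ": "American",
    "VAAAAAAAGGGZ": "Norteamericano",
    "JAAAAAAAGGGZ": "Americain",
    "KAAAAAAAGGGZ": "amerykanski",
}


def translate(word: str, language: str) -> str:
    """Swap the language letter and look the resulting word up in the codex."""
    return CODEX[LANGUAGE_PREFIX[language] + word[1:]]


print(translate("AAAAAAAAGGGZ", "Spanish"))
```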
[0050] Another issue that the present invention addresses is scope. In the Library of Congress example, the present invention provides a method of storing the over 120 million "other items" including Chinese wood-block prints as unique communication elements in DNA. Existing methods are limited in scope and do not provide a means for storing "other items" as unique communication elements. "Other items" have to be stored either as converted PNG files, similar to the file depicted in FIG. 1, or they have to be extrapolated into a modern language equivalent and stored as character strings. When a communication element is stored as an image it loses the intrinsic element of knowledge it was intended to represent. Similarly with each level of extrapolation, the risk of degradation of the knowledge inherent to the information is increased.
[0051] A third issue that the present embodiment addresses is methodology. In both scenarios, these non-conforming items stored in DNA would be subject to transcription error as they are converted to image files, then ASCII characters and finally to letters representing the four nucleobases of natural DNA. This process must be replicated for each language. Additionally, the knowledge conveyed within these non-standard communication elements is not readily convertible to modern language equivalents, therefore each item requires translation into the native language of the repository prior to conversion. In addition to translation, each item requires extensive metadata be appended to it in order to be searchable as a knowledge element. To do anything less would be to reduce the value of the knowledge element to that of a simple image.
[0052] The present invention provides a method by which non-standard communication elements, such as Chinese wood-blocks, may be stored as knowledge kernels, which include simultaneous storage as a PNG file, as its modern language equivalent, and as a unique communication element. Storing unique communication elements as knowledge kernels greatly improves the methodology for storing knowledge intrinsic to higher level communication processes.
Data Compression
[0053] In at least one embodiment, the present invention may include one or more additional steps for compressing and storing data or information in synthetic DNA using character substitution. In such an embodiment incorporating the additional steps of data compression, the invention enables replacing patterns within the source information or data, such as, but not limited to, sequentially repeating strings of characters, or combinations thereof, with appreciably fewer characters.
[0054] Compressing using character substitution within data or information, may be accomplished using lossless, lossy, machine learning or other data compression algorithms, or other character substitutions known in the art of data compression and storage.
[0055] In at least one embodiment, character substitution is accomplished by, but not limited to, replacing sequential repetitions of a single character or patterned strings of characters in said data or information with characters representing the repetition tally of the sequentially repeating character or patterned strings of characters and said character or patterned strings of characters.
[0056] In at least one embodiment, compression is accomplished by substituting repeating characters or patterned strings of characters within said data or information with a lesser number of representative characters.
[0057] In at least one embodiment, compression is accomplished using n-gram analysis, a Huffman algorithm, pattern analysis and regression, arithmetic coding, forward error correction, reductionist compression, Lempel-Ziv-Welch, Brotli, LZX, probabilistic modeling, block-sorting of attributes, or combinations or derivatives thereof.
[0058] It is appreciated that compression methods and approaches may be optimized for the data or information being compressed, the channel or medium for transmitting the data or information as well as but not limited to the backbone on which it will be stored or the interference that the repository may be subjected to.
[0059] In one embodiment the incoming data or information is compressed by substituting a string of sequentially repeating characters within said data or information with one or more characters. Replacing sequential repetitions of a single character within said data or information with one or more characters can compress the amount of data to be stored without risk of loss.
[0060] In this embodiment of the invention sequentially repeating characters within said data or information are substituted with characters that represent the repetition tally of the sequentially repeating character and the character that is repeating. The substituted characters are encoded into the incoming data stream in the form "BA", where "A" represents the sequentially repeating character and "B" represents the number of times that said character repeats sequentially in a given instance.
[0061] In this embodiment of the invention, null represents a single repetition and "0" represents 10 repetitions.
[0062] FIG. 4 represents one potential method in which the sequentially repeating characters within the PNG file contained in FIG. 1 can be compressed using the "BA" schema as described in paragraphs [0059] through [0061] above. In this example, the first letter of the PNG file in FIG. 1, "M", sequentially repeats 15 times. To compress it using the present invention, the 15 sequentially repeating M's are replaced with the characters 15M. This pattern of replacing sequentially repeating characters with a character or characters that represent the number of times said character repeats and said character is continued until the PNG file from FIG. 1 is converted into the file of FIG. 4.
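The "BA" substitution can be sketched as a run-length encoder. This is a minimal illustration under two assumptions: the tally is written as a decimal number, following the 15M example, and the source text contains no digit characters of its own, so that a tally can always be told apart from data. A single occurrence carries no tally, as in paragraph [0061]; the convention that "0" stands for 10 repetitions is not modelled.

```python
import re
from itertools import groupby


def compress(text: str) -> str:
    """Replace each run of a repeated character with '<tally><character>'."""
    pieces = []
    for ch, run in groupby(text):
        n = len(list(run))
        pieces.append(ch if n == 1 else f"{n}{ch}")
    return "".join(pieces)


def decompress(packed: str) -> str:
    """Expand '<tally><character>' pairs; a missing tally means one occurrence."""
    pairs = re.findall(r"(\d*)(\D)", packed)
    return "".join(ch * int(tally or 1) for tally, ch in pairs)


print(compress("M" * 15))
```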
[0063] Once the PNG file is compressed as depicted in FIG. 4, it may be converted into a lexicon derived from a channel alphabet of bases such as depicted in Table 2 or it may be further manipulated. The substitution of terms is continued until the compressed PNG file of FIG. 4 is converted to the file depicted in FIG. 5 using a lexicon comprised from the 6 base channel alphabet as depicted in Table 5.
[0064] In another embodiment of the invention compression is accomplished with a lossless data compression algorithm. In that embodiment of the invention, the incoming data set is analyzed for discrete n-grams and their frequency distribution using a lossless data compression algorithm. From the list of discrete n-grams a uniquely decodable codex of letters, characters, symbols, markings and/or combinations thereof that correspond to bases and combinations thereof from within the channel alphabet is created using a Huffman Tree. Using said codex, the incoming data or information is compressed by substituting the n-grams with a uniquely decodable combination of letters representing a base or sequence of bases in a manner such that the most commonly occurring n-grams are represented with the least amount of bases. The incoming data set, having been converted to letters representing bases may be further manipulated or synthesized into nucleic acids or other analogous polymers for storage.
Table 4 - N-gram analysis Encoding and Compression of FIG. 1
[Table 4: table images not reproduced]
[0065] In one embodiment of the invention the channel alphabet selected is comprised of 24 letters representing 12 nucleobases, each with a discernible prime state achieved through modification. In this embodiment the following letters are used: "A", "T", "G", "C", "S", "B", "V", "J", "K", "X", "Z", "P", "a", "t", "g", "c", "s", "b", "v", "j", "k", "x", "z", and "p", with each capital letter representing an unmodified nucleobase and each lowercase letter representing the discernibly modified nucleobase. In this embodiment of the invention, the PNG file of FIG. 1 is analyzed for trigrams with a frequency analysis of said trigrams. In this embodiment the universe of all possible combinations of natural and artificial nucleobases in the channel alphabet of 24 characters is reduced to a uniquely decodable lexicon of words comprised of one or more characters generated through an optimal n-ary Huffman compression where n=24. Each of said words of varying length is then used to represent one of the unique trigrams generated from said trigram analysis. One embodiment of the resulting codex is depicted in Table 4.
[0066] In this embodiment of the invention, the PNG file of FIG. 1 is converted to the specified 12 base prime channel alphabet lexicon by using the codex of Table 4, wherein each trigram is replaced with a unique base or sequence of bases represented by said lexicon. The resulting data file is depicted in FIG. 6.
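The trigram analysis and 24-ary Huffman step can be sketched as follows. The sketch assumes non-overlapping trigrams and uses the textbook n-ary Huffman construction (padding with zero-weight dummy leaves so that every merge takes exactly n nodes); the codex it builds is illustrative and is not the codex of Table 4. All function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

ALPHABET = "ATGCSBVJKXZPatgcsbvjkxzp"  # the 24 letters listed in the text


def trigrams(text):
    """Split text into consecutive, non-overlapping three-character blocks."""
    return [text[i:i + 3] for i in range(0, len(text), 3)]


def huffman_codex(freqs, alphabet=ALPHABET):
    """Build a prefix-free codex with an n-ary Huffman tree, n = len(alphabet)."""
    if not freqs:
        return {}
    n = len(alphabet)
    tie = count()  # unique tie-breaker so heap payloads are never compared
    heap = [(weight, next(tie), symbol) for symbol, weight in freqs.items()]
    # Pad with zero-weight dummy leaves so every merge takes exactly n nodes.
    while len(heap) > 1 and (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(tie), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(n)]
        total = sum(child[0] for child in children)
        heapq.heappush(heap, (total, next(tie), [child[2] for child in children]))

    codex = {}

    def walk(node, prefix):
        if isinstance(node, list):  # internal node: one letter per branch
            for letter, child in zip(alphabet, node):
                walk(child, prefix + letter)
        elif node is not None:  # real leaf (dummy leaves are skipped)
            codex[node] = prefix or alphabet[0]

    walk(heap[0][2], "")
    return codex


def encode(text, codex):
    """Replace each trigram with its word from the codex."""
    return "".join(codex[t] for t in trigrams(text))


def decode(strand, codex):
    """Read letters until they form a codex word, then emit its trigram."""
    inverse = {word: gram for gram, word in codex.items()}
    out, word = [], ""
    for letter in strand:
        word += letter
        if word in inverse:
            out.append(inverse[word])
            word = ""
    return "".join(out)
```

A codex for a given text would be built with `huffman_codex(Counter(trigrams(text)))`. Because no word is a prefix of another, the decoder can read the strand letter by letter without separators.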
[0067] In other embodiments of the invention, data or information may be compressed by adding one or more levels of abstraction.
[0068] In other embodiments of the invention, data or information may be compressed by translating the data into other data systems or structures of information prior to compression for further optimization.
[0069] In other embodiments of the invention subsets of data may be encoded and compressed through means locally optimized for the subset and these sequences may be tagged to indicate this in addition to being able to be used for random access or search. In said embodiments tagging may be accomplished through chemical modification, or physical modification, or any other means of distinguishing one or more bases or DNA string from another. Furthermore said tags may include one or more channel alphabets or lexicons different from the data stored on the remainder of the oligo.
Encryption
[0070] Encryption is accomplished by encoding the data and information in a manner that it can be read only by the sender and the intended recipient in possession of the decoding, decompression, and/or decryption keys. Encoding can be accomplished by character substitution (symmetrically or asymmetrically), hashing, adding one or more levels of abstraction, layering encodings, mechanical or formulaic ciphers, or any other means of disguising information, or combinations or derivatives thereof.
[0071] In one embodiment of the invention, encryption is accomplished by substituting (symmetrically or asymmetrically) characters, words, phrases, n-grams, abstractions, and/or strings of same within the data or information to be encrypted with representative characters, and/or combinations, and/or derivatives thereof in accordance with a mechanical or formulaic cipher.
[0072] In other embodiments of the invention, encryption is accomplished by hashing, using variables, by adding one or more levels of abstraction, by converting the data into other data systems or structures of information, or combinations or derivatives thereof in accordance with a mechanical or formulaic cipher.
[0073] In other embodiments of the invention, data may be encrypted through selectively inserting tags within the data storage regions of DNA sequences. These tags may include information about, or indicate, which decoding, decompression, and/or decryption keys are to be used to access the dataset or subset thereof. In said embodiments tagging may be accomplished through chemical modification, or physical modification, or any other means of distinguishing one nucleobase or DNA string from another.
[0074] In other embodiments of the invention data may be encrypted by substituting words, phrases, n-grams, characters, symbols, and/or abstractions, and/or combinations thereof within the incoming data stream prior to conversion to characters representing bases or words from within the channel alphabet and its subsequent lexicon.
[0075] In other embodiments of the invention ASCII characters, words, formulas or other communication elements may be represented by more than one individual base or combinations thereof from within the channel alphabet.
Tagging for Random Access, Organization, And Search
[0076] Tagging is accomplished by appending or otherwise modifying bases or DNA strings for addressing, organization, random access, meta-tagging, or keyword searches that enable selective sorting and filtering. One embodiment of a tagging method is depicted in FIG. 7. FIG. 7 illustrates how a data sequence (green) encoded in DNA (line A) can be segmented into portions (line B), where each segment can include a prefix (orange) or suffix (blue) tag. These segments can also be varied in length and staggered (lines C & D). These encodings can also be paired with their complementary primers (line E, red - line G, blue) for PCR replication (line H) or isolation for further manipulation such as editing or deletion. A dataset can be divided into segments of 100-150 bases in length. These segments can be assigned a prefix tag and/or a suffix tag. These tagging regions can consist of a combination of natural and/or non-standard bases. The tags can contain information relevant to the sequential order of each data segment for reading, compression data or encryption motifs. The tags can also provide the filename, data type, and/or meta tag information. The tags can be used to readily access portions of data specifically through a random access process. Using a mixture of genetic alphabets, these sequences can be shorter and more efficient than naturally occurring DNA sequences alone. This method of retrieval can further be used to copy, edit or delete the data segment.
[0077] Tagging of nucleic acids or analogous polymers, or sequences thereof, further comprises identification through chemical modification, or physical modification, or any other means of distinguishing one or more base or DNA string from another.
[0078] In some embodiments tagging might be accomplished by adding one or more sequences of bases to the beginning, end or amidst the data stored on an oligo for the purpose of organizing, addressing, meta-tagging, or identifying keywords or other elements contained within the data. Said tags might be comprised of lexicons from within one or more channel alphabets that are optimized in a manner different from the data stored on the oligo, and may also include an indication as to the appropriate decoding, decompression, and/or decryption keys to be used to access the data on the strand.
PCR Amplification
[0079] High fidelity amplification of specific data sets, or portions thereof, is accomplished by introduction of the appropriate complementary primers of the prefix and suffix sequences.
[0080] In one embodiment of the invention prefixes or suffixes are appended to the DNA strings to allow for random access. In one embodiment of the invention prefixes or suffixes may be appended to the DNA strings in the form [prefix] [data] [suffix] as depicted in FIG. 7.
[0081] In one embodiment of the invention, a string of data is divided into segments of 150 nucleobases, each with a 25 base prefix and a 25 base suffix that carry sequence information of the data packet and metadata for search, and that also serve as sites for specific and highly selective random access.
[0082] The current invention improves amplification due to the increased affinity of these mixed modified bases for themselves versus a data region, which may comprise only ATGC bases to simplify the DNA polymerization/replication paradigm.
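The [prefix] [data] [suffix] layout can be sketched in software as follows. Several details are assumptions made for illustration: the 150 bases are treated as payload exclusive of the tags, the prefix simply spells the segment number in a hypothetical four-letter tag alphabet, the suffix is a fixed 25 letter string supplied by the caller, and a string prefix match stands in for the selection that primers would perform in the laboratory.

```python
DATA_LEN = 150  # payload bases per segment (assumed to exclude the tags)
TAG_LEN = 25    # prefix and suffix length given in the text
TAG_ALPHABET = "ZPXY"  # hypothetical: tags written in non-natural letters


def index_tag(number):
    """Write a segment number as a fixed-width numeral over TAG_ALPHABET."""
    letters = []
    for _ in range(TAG_LEN):
        number, digit = divmod(number, len(TAG_ALPHABET))
        letters.append(TAG_ALPHABET[digit])
    return "".join(reversed(letters))


def tag_segments(strand, suffix):
    """Cut strand into payloads and wrap each as [prefix][data][suffix]."""
    if len(suffix) != TAG_LEN:
        raise ValueError("suffix must be TAG_LEN letters long")
    starts = range(0, len(strand), DATA_LEN)
    return [index_tag(n) + strand[s:s + DATA_LEN] + suffix
            for n, s in enumerate(starts)]


def fetch(pool, number):
    """Return the payload of the segment whose prefix encodes number."""
    wanted = index_tag(number)
    for oligo in pool:
        if oligo.startswith(wanted):
            return oligo[TAG_LEN:-TAG_LEN]
    raise KeyError(number)
```

Because each oligo carries its own position, the pool can be stored unordered and the original strand rebuilt by fetching segments 0, 1, 2, and so on.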
[0083] To mitigate potential replication errors which may occur with bases in systems where a machine learning system is not in place, the current invention targets the desired sequence with complementary 3'-5' and 5'-3' prefix and suffix primers.
[0084] In one embodiment of the invention targeting provides for isolation, editing, replication, random access reading of data and information within the repository.
Editing and Error Correction
[0085] In one embodiment of the invention editing and error correction can be accomplished by means of CRISPR (clustered regularly interspaced short palindromic repeats)/Cas9, and newer generations of the technology, for site-specific modifications, additions or deletions to a sequence.
[0086] In one embodiment of the invention damage to the oligos can be repaired by a series of enzymes including but not limited to glycosylases, endonucleases, polymerases and ligases.
Channel Alphabet and Lexicon Optimization
[0087] In some embodiments of the technology the channel alphabets and the subsequent lexicons with which data or information is encoded in order to be stored can be locally optimized based on the source dataset or subset thereof by assigning weights of importance to one or more of the following variable design parameters, which may include but are not limited to, compression, encryption, compatibility, random access, search & filtering, channel coding, cost, universal access, self-healing, editability, re-writability, duplicability, and/or security. Various embodiments of said local optimization are not limited to, and may include, one or more of the methods of encoding, compressing, and/or encrypting information described herein in order to achieve the optimal or one of many ideal storage and/or transfer methods.
EXAMPLES
[0088] It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.
EXAMPLE 1
[0089] The figures and tables shown herein depict example arrangements of the process and method of the invention. More or fewer than all the features available or contemplated may be present in an actual embodiment. It should also be understood that FIGS. 1-11 are merely illustrative and may not be exhaustive.
[0090] The present invention also provides a method for storing data, structured or unstructured, and knowledge, books, papers or text. The method enables storing data in DNA or analogous backbone structures by substituting the information in the incoming data stream with characters that represent a base. Using the natural nucleotides of A, G, T, C it is possible to create up to 256 unique four-letter combinations. Beginning with AAAA and ending with CCCC, each character in the ASCII system is assigned a unique combination of four letters. Table 1 represents one potential way the nucleotides could be paired with the ASCII characters using the present limited state of the art. To convert the PNG file contained in FIG. 1, each ASCII character is replaced by a corresponding DNA 4 base codon within the 4^4 schema. After all of the ASCII characters have been substituted with words from the channel's specified lexicon, FIG. 1 will have been converted into the representation depicted in FIG. 2.
[0091] Table 2 depicts one embodiment of an ASCII to 8 base channel alphabet codex using an 8^3 schema to generate the codons that comprise the encoding lexicon.
[0092] In one embodiment of the invention sequentially repeating characters within an incoming data stream are substituted with characters that represent the repetition tally of the sequentially repeating character and the character that is repeating. The substituted characters can be encoded into the incoming data stream in the form "BA", where "A" represents the sequentially repeating character and "B" represents the number of times that said character repeats sequentially in a given instance.
[0093] In one embodiment of the invention null represents a single repetition and "0" represents 10 repetitions.
[0094] FIG. 4 represents one potential way the sequentially repeating characters within the PNG file contained in FIG. 1 can be compressed using the "BA" schema. This pattern of replacing sequentially repeating characters with a character or characters that represent the number of times said character repeats and said character is continued until the PNG file from FIG. 1 is converted into the file of FIG. 4.
[0095] Once the PNG file is compressed as depicted in FIG. 4, it is translated to a string of bases using Table 2. This string is further compressed by the generation of a table wherein the numbers 0-9 and the characters represented in the file are solely present. Using a table string consisting of the sequence of codons from Table 2 (AXAAXTAXGAXCAXZAXPAXXAXKAKAAKTTTPTTXTZZTPATPPTPKTXCTKTAKGAPKAPC), the system translated that table into the representative 2 letter codons of the 6 base channel alphabet depicted in Table 5. Using Table 5, "15" is represented by ATTT. The substitution of terms is continued until the compressed PNG file of FIG. 4 is converted to the file depicted in FIG. 5.
Table 5. 6 Base Locally Optimized 6^2 Schema
[Table 5: table images not reproduced]
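The idea of a locally optimized lexicon can be sketched as follows: collect only the symbols that actually occur in the compressed file, together with the digits 0-9 needed for tallies, and give each a two letter codon. The six letters in `BASES` and the alphabetical order of assignment are assumptions, since Table 5 is not reproduced here, so the codons produced will not match Table 5 (where "15" is ATTT).

```python
from itertools import product

BASES = "ATGCZP"  # hypothetical six-letter channel alphabet


def local_lexicon(text):
    """Map the digits 0-9 and every character of text to a two-letter codon."""
    symbols = sorted(set("0123456789") | set(text))
    codons = ["".join(p) for p in product(BASES, repeat=2)]  # 6**2 = 36
    if len(symbols) > len(codons):
        raise ValueError("more than 36 distinct symbols; a 6^2 schema is too small")
    return dict(zip(symbols, codons))


def encode(text, lexicon):
    """Replace each character with its two-letter codon."""
    return "".join(lexicon[ch] for ch in text)
```

The schema only works while the file uses at most 36 distinct symbols, which is why the table is generated per file rather than fixed in advance.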
[0096] The present invention also provides a method of encrypting compressed data sets using n-grams and an n-ary codex, then randomizing and/or substituting values. In one embodiment of the invention, compressed data sets are analyzed for the frequency distribution of n-grams within the compressed data set to generate an optimal unambiguous lexicon of words within the selected n-ary channel through n-ary Huffman encoding. These words are in turn represented by a specified base or sequence of bases in a manner which assigns the most commonly used words the least number of bases.
[0097] In this embodiment of the invention, the PNG file from FIG. 1 is used to generate a list of unambiguous trigrams. A frequency analysis of these trigrams is run to create a codex wherein the most frequently occurring trigrams are represented by the least number of characters in the codex. One such resulting codex is depicted in Table 4.
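The frequency analysis and codex generation described in paragraphs [0096] and [0097] correspond to standard n-ary Huffman coding, which can be sketched in Python as follows. The sketch is an assumption-laden illustration rather than the procedure used to produce Table 4: the non-overlapping trigram split, the function names, and the channel letters passed in as `alphabet` are all hypothetical choices.

```python
import heapq
from collections import Counter
from itertools import count

def trigram_frequencies(text):
    """Count non-overlapping trigrams (the final chunk may be shorter)."""
    return Counter(text[i:i + 3] for i in range(0, len(text), 3))

def nary_huffman(freqs, alphabet):
    """Build a prefix-free codex over `alphabet` from symbol frequencies."""
    n = len(alphabet)
    if n < 2:
        raise ValueError("the channel alphabet needs at least two letters")
    tiebreak = count()  # unique counter so heap entries never compare nodes
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    # Pad with zero-frequency dummies (None) so every merge takes exactly n nodes.
    while len(heap) < 2 or (len(heap) - 1) % (n - 1) != 0:
        heap.append((0, next(tiebreak), None))
    heapq.heapify(heap)
    while len(heap) > 1:
        children = [heapq.heappop(heap) for _ in range(n)]
        total = sum(c[0] for c in children)
        heapq.heappush(heap, (total, next(tiebreak), [c[2] for c in children]))
    codex = {}
    def walk(node, prefix):
        if isinstance(node, list):            # internal node: one branch per letter
            for letter, child in zip(alphabet, node):
                walk(child, prefix + letter)
        elif node is not None:                # real symbol (dummies are skipped)
            codex[node] = prefix
    walk(heap[0][2], "")
    return codex
```

A codex for a text could then be obtained with `nary_huffman(trigram_frequencies(text), "ATGCZP")`, where the six-letter alphabet is again an assumption. More frequent trigrams never receive longer words than less frequent ones, and no word is a prefix of another, so the encoded string is uniquely decodable.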
[0098] In one embodiment of the invention, the PNG file from FIG. 1 is converted to the channel alphabet by symmetrically substituting each trigram from the incoming data set or information with representative characters from the codex of Table 4. The resulting data file is depicted in FIG. 6. In other embodiments the trigrams may be substituted asymmetrically.
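The substitution step of paragraph [0098] can be sketched as follows, assuming that "symmetric" means a one-to-one mapping in which the same codex is used for both encoding and decoding. That reading is an assumption, as are the function names and the toy codex in the usage note; the actual codex of Table 4 is not reproduced here. Decoding relies on the codex being prefix-free.

```python
def substitute(text, codex, n=3):
    """Replace each non-overlapping n-gram of text with its codex word."""
    grams = [text[i:i + n] for i in range(0, len(text), n)]
    return "".join(codex[g] for g in grams)

def restore(bases, codex):
    """Invert substitute() for a prefix-free codex by greedy matching."""
    inverse = {word: gram for gram, word in codex.items()}
    out = []
    buffer = ""
    for base in bases:
        buffer += base
        if buffer in inverse:     # a complete word has been read
            out.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bases do not form a codex word")
    return "".join(out)
```

With a hypothetical codex `{"abc": "A", "xyz": "TA"}`, the text "abcabcxyz" is substituted as "AATA", and `restore` recovers the original text from that string.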
[0099] The data may be further optimized or encoded by means of tagging of nucleic acids or analogous polymers, or sequences thereof. In other embodiments further optimization or encoding comprises tagging through chemical modification, or physical modification, or any other means of distinguishing one or more bases or DNA string from another.
EXAMPLE 2
[00100] The present invention also provides a method for storing knowledge, books, papers or text in DNA. FIG. 8 represents an excerpt from the children's story by Hans Christian Andersen, "The Ugly Duckling."
[00101] FIG. 9 provides the file represented in FIG. 8 after the excerpt was encoded using a lexicon of codons in an 8³ schema from Table 6, with lengths of 3 bases, from a channel alphabet of 8 bases (ATGCZPXY), and substituted back into the original data file.
[00102] FIG. 10 provides the file represented in FIG. 9 after compression by frequency analyzing the trigrams (codons) in the encoded excerpt and then performing an optimal Huffman compression using the same channel alphabet to create a compressed lexicon. FIG. 11 provides the file represented in FIG. 10 after encryption. The encryption takes the already encoded and compressed excerpt and performs asymmetric substitutions using a lexicon of uniquely decodable words drawn from the same channel alphabet as in the previous steps. This lexicon is generated by performing an expanded Huffman compression and then randomizing the assignment of the words of the new encrypted lexicon to the words in the previously compressed file's lexicon.
Table 6. Sequential Encoding, Compression, and Encryption of an Excerpt from "The Ugly Duckling" With Lexicons From Within an 8 Base Channel Alphabet
[Table 6 is reproduced as images in the original publication: imgf000046_0001, imgf000047_0001, imgf000048_0001, imgf000049_0001, imgf000050_0001]
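The randomization step of Example 2 can be illustrated with a short Python sketch in which the words of a uniquely decodable lexicon are reassigned at random. This is only one possible reading of the step: the seeded generator stands in for a shared secret, the function name `randomize_lexicon` is hypothetical, and Python's `random` module is not cryptographically secure, so the sketch illustrates the permutation and not a secure cipher.

```python
import random

def randomize_lexicon(lexicon, seed):
    """Return a lexicon with the same words randomly reassigned to the entries.

    The set of words is unchanged, so a uniquely decodable lexicon
    stays uniquely decodable after the shuffle.
    """
    rng = random.Random(seed)      # assumption: the seed acts as a shared key
    words = list(lexicon.values())
    rng.shuffle(words)
    return dict(zip(lexicon.keys(), words))
```

A holder of the same seed and the original lexicon can regenerate the shuffled assignment and invert it. Note that reassigning words of different lengths trades away some of the compression gained from the Huffman step.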
EXAMPLE 3
[00103] In a third example of the invention, the process is run in reverse. Beginning with the DNA sequence, a codex is created to convey certain data and information. In one embodiment of the invention, a codex is generated using the human genome in such a way that a treating physician can scan a patient's DNA and generate a data repository depicting the patient's entire medical history, records and history of illness and injury.
Other Embodiments
[00104] While at least one exemplary embodiment has been presented in the foregoing detailed description, it should be appreciated that a vast number of variations exist. For example, some uses do not require both ends of the apparatus to be secured to an object, and the apparatus may be hung or dangled from one end of an object. It should also be appreciated that the exemplary embodiment or exemplary embodiments are only examples, and are not intended to limit the scope, applicability, or configuration of the described embodiments in any way. Rather, the foregoing detailed description will provide those skilled in the art with a convenient road map for implementing the exemplary embodiment or exemplary embodiments. It should be understood that various changes can be made in the function and arrangement of elements without departing from the scope as set forth in the appended claims and the legal equivalents thereof.

Claims

1. A method of encoding data and information for physical storage on a backbone comprised of synthetic, natural, and/or modified DNA or analogous structural mediums, including but not limited to polymer backbones, using expanded alphabets of distinguishable nucleobases (natural, synthetic, non-natural, and/or modified) or analogous polymers or enantiomers thereof for representing a selected set of discrete values, the method comprising:
receiving data or information for the storage into a sequence of DNA or analogous backbone structure;
encoding the data or information by translating the data or information into a channel alphabet and lexicon comprised of letters, symbols, characters, or markings and/or combinations thereof that represent a combination of distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, analogous polymers, spacers, or enantiomers thereof for representing a selected set of discrete values by substituting symbols, characters, words, phrases, n-grams, random access phrasing or combinations thereof with a combination of nucleobases (natural or synthetic), modified nucleobases, artificial nucleobases, analogous polymers, spacers, enantiomers thereof, or alphanumeric representations thereof; and
synthesizing DNA or analogous backbone structure based on the translated data or information using a combination of nucleobases (natural or synthetic), non-natural/artificial nucleobases, modified nucleobases, other analogous polymers, spacers or enantiomers thereof.
2. The method of claim 1 further comprising compressing said received data or information, or the encoding thereof, using character substitution.
3. The method of any of claims 1 or 2 further comprising compressing said data or information, or the encoding thereof, using lossless, lossy, machine learning or other data compression algorithms.
4. The method of any of claims 1 to 3 further comprising encrypting said data or information, the encoding or compressions thereof.
5. The method of any of claims 1 to 4 wherein said encoding of the data or information is accomplished by substituting each distinct symbol, character, word, phrase or formula or combinations thereof with one or more uniquely decodable letters, or string of letters, representing an alphabet comprised of a combination of differentiable nucleobases (natural or synthetic), artificial nucleobases, modified nucleobases, other polymers, spacers, or enantiomers thereof in sequence on a DNA backbone or other analogous structural polymer.
6. The method of claim 5 wherein said encoding of the data or information is performed based on an artificially expanded n-ary channel alphabet representing distinguishable nucleobases (natural or synthetic), non-natural nucleobases, modified nucleobases, other polymers, spacers, or enantiomers thereof.
7. The method of claim 6 wherein said n-ary system substitutes each unique symbol, or character, or word, or phrase or formula or combinations thereof or other similar knowledge from within the data or information with one or more unique letters or combination of letters from a channel alphabet of n letters expanded beyond the four base system representing or comprised of natural nucleic acids, which may include AEGIS synthetic DNA, artificial or non-terrestrial nucleobases, modified natural, modified synthetic, modified AEGIS synthetic, other non-natural DNA nucleobases that have been further modified to be made differentiable from the corresponding unmodified nucleobase, other polymers, spacers, or enantiomers thereof.
8. The method of any of claims 2 through 7 wherein said character substitution further comprises replacing sequential repetitions of a single character in said data or information with characters representing the repetition tally of the sequentially repeating character or patterned strings of and said character.
9. The method of claim 8 further comprising substituting repeating characters or patterned strings of characters within said received data or information with a lesser number of representative characters.
10. The method of any of claims 3 through 9 wherein using lossless, lossy, machine learning data compression algorithms, or otherwise, further comprises using n-gram analysis, a Huffman algorithm, pattern analysis and regression, arithmetic coding, forward error correction, reductionist compression, Lempel-Ziv-Welch, Brotli, LZX, probabilistic modelings, block-sorting of attributes, or combination or derivatives thereof.
11. The method of claim 1 wherein encryption of the second compressed data is performed using n-gram analysis, substitution (symmetrical or asymmetrical), hashing, adding one or more levels of abstraction, layering encodings, mechanical or formulaic ciphers or any other means of disguising information, or combinations or derivatives thereof.
12. The method of claim 1 wherein said expanded lexicons of nucleobases include combinations of natural nucleobases, modified natural nucleobases, artificial nucleobases (Artificially Expanded Genetic Information System (AEGIS) bases covering the 2, 3 and 4-hydrogen bond electron pair donor and acceptor patterns, size pairing bases with or without hydrogen bonding), spacers where a nucleobase is omitted but the backbone remains, spacers where a nucleobase is omitted and only a minimal carbohydrate or backbone mimic remains, modifications and/or decorations of both carbohydrates and carbohydrate mimics and/or enantiomers thereof.
13. The method of claim 1 further comprising tagging nucleobases or DNA strings for address, decoding, organization, orientation, random access, meta-tagging, or keyword search that enables selectively sorting and filtering.
14. The method of any of claims 4 through 13 wherein said encryption of said data or information is converted and embedded into the compressed data stream asymmetrically, hashed, using variables, by means of substitutions, by adding one or more level of abstraction, by converting the data into other data systems or structures of information, or combinations or derivatives thereof.
15. The method of any of claims 13 or 14 wherein appending or otherwise modifying nucleic acids or analogous polymers, or sequences thereof, further comprises tagging through chemical modification, or physical modification, or any other means of distinguishing one nucleobase or analogous polymer, combinations thereof, or DNA string from another.
16. The method of any of claims 1 to 15 wherein natural nucleobases are modified with pyrimidines decorated with various ligands at the 4 or 5 positions or other positions as necessary, purines decorated with various ligands at the 7 or 8 positions or other positions as necessary, or combinations or derivatives thereof.
17. The method of any of claims 12 through 16 wherein said modified, natural, and artificial nucleobases and other polymers may include, but are not limited to: any of the natural nucleotides and nucleosides of adenine (A), guanine (G), thymine (T), cytosine (C), uracil (U); any of the AEGIS nucleotides of 4-amino-1-methylpyrimidin-2(1H)-one (S), 6-amino-1,9-dihydro-2H-purin-2-one (B), 6-amino-3-nitropyridin-2(1H)-one (V), 4-aminoimidazo[1,2-a][1,3,5]triazin-2(8H)-one (J), 2,4-diaminopyrimidine (K), 5-aza-7-deaza xanthosine (X), 6-amino-5-nitro-2(1H)-pyridone (Z), 2-amino-imidazo[1,2-a]-1,3,5-triazin-4(8H)-one (P); or any modification of the natural nucleobases or artificial nucleobases using any spacers, polymers, mirror polymers, mirror images or combinations thereof.
18. The method of any of claims 1 to 17 wherein said spacers or polymers are polyethers (for non-aqueous applications), locked nucleic acids (LNA), threose nucleic acids (TNA), Peptide nucleic acids (PNA), Ribonucleic acids (RNA) or combinations or derivatives or mirror images or modifications thereof.
19. The method of any of claims 1 to 18 further comprising appending said synthesized nucleic acids or analogous polymers to attach prefixes or suffixes to support high fidelity PCR amplification.
20. The method of claim 19 wherein prefixes or suffixes may be appended to the DNA strings in the form [prefix] [data] [suffix].
21. The method of claim 19 wherein editing and error correction is accomplished with CRISPR/Cas9 or other variants of CRISPR or combinations thereof.
22. The method of claim 1 wherein information or data is encoded to an existing sequence of distinguishable nucleobases (natural, synthetic, non-natural, and/or modified) or analogous polymers or enantiomers thereof with or without the methods of claim 15 or claim 21.
23. The method of claim 1 where the selection of channel alphabet and methods of claims 2-22 are locally optimized based on the source information or dataset or subset thereof by assigning weights of importance placed on one or more of the following variable design parameters, which may include but are not limited to, compression, encryption, compatibility, random access, search & filtering, channel coding, cost, universal access, self-healing, editability, re-writability, duplicability, and/or security.
24. A method of claim 23 further comprising securing stored information or data by including nucleobases (natural or synthetic), non-natural/artificial nucleobases, modified nucleobases, other analogous polymers, or spacers or enantiomers thereof, or combinations thereof at specified intervals (fixed distance, formulaic, or random) within the sequence, to identify tampering or alteration of the storage medium.
25. A method of claim 24 further comprising channel coding by identification of said bases at specified intervals.
26. A method of claim 1 further comprising arranging sequences of physically stored information into larger structural matrixes such as dendrimers. Said arrangement may be accomplished by inserting structural or binding polymers into the sequence. Further organization can be accomplished by introducing histones or other analogous peptides into the storage medium.
PCT/US2018/047957 2017-08-24 2018-08-24 Device for information encoding and, storage using artificially expanded alphabets of nucleic acids and other analogous polymers WO2019040871A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762549680P 2017-08-24 2017-08-24
US62/549,680 2017-08-24

Publications (1)

Publication Number Publication Date
WO2019040871A1 true WO2019040871A1 (en) 2019-02-28

Family

ID=65439279

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/047957 WO2019040871A1 (en) 2017-08-24 2018-08-24 Device for information encoding and, storage using artificially expanded alphabets of nucleic acids and other analogous polymers

Country Status (1)

Country Link
WO (1) WO2019040871A1 (en)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18848759

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18848759

Country of ref document: EP

Kind code of ref document: A1