WO2018149405A1 - Method for storing and reading information - Google Patents

Method for storing and reading information

Info

Publication number
WO2018149405A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna
sequence
information
sequences
output
Prior art date
Application number
PCT/CN2018/076721
Other languages
English (en)
Chinese (zh)
Inventor
杨平
蔡晓辉
李彦敏
齐金才
蔡锦雄
时逢宽
田净净
刁文一
王彬彬
Original Assignee
苏州泓迅生物科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州泓迅生物科技股份有限公司
Publication of WO2018149405A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data

Definitions

  • The present application relates to the field of information storage technology, for example, to a method for storing and reading information.
  • The present application proposes a method for storing and reading information that is broadly applicable, simplifies computation, improves the continuity, efficiency, and density of DNA information storage, and reduces both the error rate and the cost of sequence synthesis and detection.
  • The present application provides a method for storing information, which may include: converting original file information into a full DNA sequence represented by the four deoxyribonucleotides adenine (A), thymine (T), cytosine (C), and guanine (G); breaking the full DNA sequence into a plurality of DNA fragments, and constructing an output DNA sequence from each of the DNA fragments to obtain a plurality of output DNA sequences; and synthesizing the plurality of output DNA sequences into corresponding artificial DNA sequences and storing them. Converting the original file information into the full DNA sequence includes: reading binary information of the original file information, converting the binary information into quaternary information, and encoding the quaternary information into the full DNA sequence.
  • The application also provides a method for reading information, including:
  • converting the full DNA sequence into quaternary information, converting the quaternary information into binary information, and reading the binary information.
  • The application also provides a method for storing information, in which:
  • the original file information includes at least two kinds of character information, and the original file information is converted into a full DNA sequence represented by the four deoxyribonucleotides adenine (A), thymine (T), cytosine (C), and guanine (G);
  • the conversion comprises: reading binary information of the original file information, converting the binary information into hexadecimal Unicode information, and encoding the hexadecimal Unicode information into the full DNA sequence.
  • The application also provides a computer-readable storage medium storing computer-executable instructions for performing any of the methods described above.
  • The application also provides an information storage device including one or more processors, a memory, and one or more programs stored in the memory that, when executed by the one or more processors, perform the above method.
  • The present application also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the methods described above.
  • The application adopts a quaternary-based coding method to optimize the structure of the output DNA sequence, improve the continuity, efficiency, and density of information storage, reduce the error rates of DNA synthesis and data recovery, and reduce cost.
  • FIG. 1a is a schematic flowchart of a method for storing information according to an embodiment
  • FIG. 1b is a schematic flowchart of a method for storing and reading information according to an embodiment
  • FIG. 2 is a flow chart of constructing an output DNA sequence according to an embodiment
  • FIG. 3 is a schematic flowchart of a method for reading information according to an embodiment
  • FIG. 4 is a schematic flowchart of a method for storing information according to another embodiment
  • FIG. 5 is a schematic structural diagram of hardware of an information storage device according to an embodiment.
  • The nucleic acid sequences in the present application may be single-stranded or double-stranded.
  • Deoxyribonucleic acid (DNA) is a biological macromolecule composed of deoxyribose, phosphoric acid, and four bases (adenine (A), thymine (T), cytosine (C), and guanine (G)); its primary biological function is to store biological information. DNA carries the genetic instructions that guide biological development and the functioning of life, and these instructions are needed to construct other compounds in the cell. Owing to its high density and long-term stability (half-life > 500 years), DNA is a promising information storage medium.
  • Digital storage in DNA refers to converting digitized information into DNA base sequence information and storing it in the base sequence of synthetic DNA; the information stored in the synthetic DNA is then read by sequencing, and the DNA base sequence is finally converted back into digitized information on a computer.
  • DNA has many potential advantages as a new type of information storage medium. Its storage density is very high: in theory, each nucleotide (nt) of DNA can encode two bits, and each gram of single-stranded DNA can encode 455 exabytes. Its stability is strong: it can be stored for tens of thousands of years under cold, dry, and dark conditions, and even after degradation under non-ideal conditions the stored information is usually still readable. In addition, unlike other digital storage media, DNA storage is not limited to planar layers.
  • One encoding method for storing information in DNA includes the following main steps: a Huffman coding strategy is used to transcode the binary sequence of a file into a ternary sequence; an anti-homopolymer DNA coding strategy is then used to encode the ternary sequence as a DNA sequence; the resulting DNA sequence is cut with a quadruple overlapping step to obtain DNA fragments; and a header information region and forward and reverse primer tags are added to each fragment to obtain the final DNA sequence fragments.
  • The DNA fragments obtained above are synthesized by DNA oligonucleotide synthesis, and the synthesized fragments are stored as a dry powder or in solution; if the information needs to be copied, PCR amplification can be performed with primers that are reverse-complementary to the primer tags.
  • High-throughput sequencing is then used to read out the information stored in the DNA.
  • The parsing process includes sequencing the stored sequences, assembling the sequence fragments, translating the DNA sequence into a ternary file, restoring the ternary file to the original binary computer information, and other steps.
  • The quadruple overlapping step means that two adjacent 100 bp fragments share a 75 bp overlap. After a complete sequence is cut in this way, every position except those in the first and last 100 bp is repeated in 4 adjacent fragments.
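  • The overlap scheme described above can be illustrated with a short sketch. It assumes a window of 100 nt advanced 25 nt at a time (so neighbouring windows share 75 nt); the function names are illustrative only and are not taken from the cited method.

```python
def overlapping_fragments(seq, length=100, step=25):
    """Cut seq into windows of `length` nt whose start positions are `step` nt apart."""
    last_start = max(len(seq) - length, 0)
    return [seq[i:i + length] for i in range(0, last_start + 1, step)]


def coverage(seq_len, length=100, step=25):
    """Count how many windows contain each position of a sequence of seq_len nt."""
    counts = [0] * seq_len
    for start in range(0, max(seq_len - length, 0) + 1, step):
        for pos in range(start, min(start + length, seq_len)):
            counts[pos] += 1
    return counts
```

  • With a 400 nt sequence this gives 13 windows, and every position outside the first and last 100 nt is contained in 4 of them, which is why this layout multiplies the amount of DNA to be synthesized roughly fourfold.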
  • The coding method for DNA storage constructed by this patent method is basically similar to the method of the European Bioinformatics Institute.
  • The method uses direct encoding for DNA storage and reading, combining index tables that relate digitized information to base sequences with Unicode codes. However, it is less versatile: it is limited to text storage, can only convert English, Chinese, digits, and punctuation into DNA base sequences, and cannot store or read pictures and audio in DNA.
  • The present application provides a DNA storage scheme for digital information that combines synthetic DNA sequences with next-generation sequencing (NGS) technology to store and read digitized information in any format.
  • DNA synthesis and sequencing technologies are evolving at an exponential rate.
  • the technique of information storage based on synthetic DNA sequences proposed in this application will be a better method for high storage density and long-term information storage in the future.
  • This embodiment provides an information storage method. As shown in FIG. 1a, the method may include the following steps:
  • In step 110, binary information of the original file information is read, the binary information is converted into quaternary information, and the quaternary information is encoded into a full DNA sequence represented by the four deoxyribonucleotides adenine (A), thymine (T), cytosine (C), and guanine (G).
  • The original file information may include at least one form of information such as text, pictures, and audio.
  • The original file information may be in any binary, computer-readable file format, such as word, pdf, xls, rar, txt, ps, jpt, rft, jpg, jpeg, jpe, tif, png, and the like.
  • In step 120, the full DNA sequence is broken into a plurality of DNA fragments, and an output DNA sequence is constructed from each of the DNA fragments to obtain a plurality of output DNA sequences.
  • In step 130, the plurality of output DNA sequences are synthesized into corresponding artificial DNA sequences and stored.
  • Data exists in binary form in the computer, and lossless conversion can be performed between binary and quaternary.
  • The binary-to-quaternary encoding provided in this embodiment may be referred to as quaternary-based bit DNA (BitDNA) encoding.
  • The binary information can be converted into quaternary information and further converted into the corresponding base sequence information.
  • The base sequence information is divided into fragments, each of which is constructed into an output DNA sequence that can be 100 nt in length, and a DNA chip (the DNA storage medium) is synthesized according to the designed output DNA sequences.
  • The information stored in the DNA chip can be read by amplification, sequencing, and recovery of the sequencing results on a computer, for example by polymerase chain reaction (PCR) amplification, next-generation sequencing (NGS), and computer recovery of the sequencing results. Amplification and sequencing can also be performed by other means.
  • Converting binary information into quaternary BitDNA encoded data: the binary information of the original file information is read and BitDNA-encoded according to the base pair index table of Table 1, which converts the binary information into quaternary BitDNA coding sequence data (i.e., the full DNA sequence).
  • Table 1. Base pair index table
      Binary code   Quaternary code   Base pair code
      00            0                 A
      01            1                 T
      10            2                 C
      11            3                 G
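  • As an illustration of Table 1, the sketch below converts bytes into quaternary digits and then into bases. The function names are assumptions made for this example and do not appear in the application; only the 00/01/10/11 to 0/1/2/3 to A/T/C/G correspondence is taken from the table.

```python
# Table 1: quaternary digit -> base
DIGIT_TO_BASE = {"0": "A", "1": "T", "2": "C", "3": "G"}


def bytes_to_quaternary(data: bytes) -> str:
    """Read each byte as 8 bits and turn every 2-bit pair into one base-4 digit."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(str(int(bits[i:i + 2], 2)) for i in range(0, len(bits), 2))


def quaternary_to_dna(digits: str) -> str:
    """Map quaternary digits to bases using Table 1."""
    return "".join(DIGIT_TO_BASE[d] for d in digits)


def bytes_to_dna(data: bytes) -> str:
    """Binary -> quaternary -> DNA, i.e. four nucleotides per byte."""
    return quaternary_to_dna(bytes_to_quaternary(data))
```

  • For instance, the byte 0x48 is 01 00 10 00 in binary, which is 1020 in quaternary and TACA in bases.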
  • Constructing the output DNA sequences: the full DNA sequence is divided into multiple DNA fragments according to the output DNA format.
  • The correspondences among the binary codes 00, 01, 10, and 11, the quaternary codes 0, 1, 2, and 3, and the bases A, T, C, and G can be rearranged by permutation to obtain multiple code conversion relationships. The relationship used when reading the binary information of the original file information, converting it into quaternary information, and encoding the quaternary information into a full DNA sequence may be any one of them.
  • The conversion between binary and quaternary codes follows the ordinary conversion between number bases, that is, the binary codes 00, 01, 10, and 11 correspond to the quaternary codes 0, 1, 2, and 3, respectively.
  • The encoding conversion relationships include the following, in each of which the listed binary codes correspond to the listed quaternary codes and to the base pair codes A, T, C, and G, respectively:
  • binary codes 00, 01, 10, and 11; quaternary codes 0, 1, 2, and 3;
  • binary codes 00, 10, 01, and 11; quaternary codes 0, 2, 1, and 3;
  • binary codes 00, 01, 11, and 10; quaternary codes 0, 1, 3, and 2;
  • binary codes 00, 10, 11, and 01; quaternary codes 0, 2, 3, and 1;
  • binary codes 00, 11, 01, and 10; quaternary codes 0, 3, 1, and 2;
  • binary codes 00, 11, 10, and 01; quaternary codes 0, 3, 2, and 1;
  • binary codes 01, 00, 10, and 11; quaternary codes 1, 0, 2, and 3;
  • binary codes 01, 11, 00, and 10; quaternary codes 1, 3, 0, and 2;
  • binary codes 01, 11, 10, and 00; quaternary codes 1, 3, 2, and 0;
  • binary codes 10, 00, 01, and 11; quaternary codes 2, 0, 1, and 3;
  • binary codes 10, 00, 11, and 01; quaternary codes 2, 0, 3, and 1;
  • binary codes 10, 01, 00, and 11; quaternary codes 2, 1, 0, and 3;
  • binary codes 10, 01, 11, and 00; quaternary codes 2, 1, 3, and 0;
  • binary codes 10, 11, 00, and 01; quaternary codes 2, 3, 0, and 1;
  • binary codes 11, 10, 00, and 01; quaternary codes 3, 2, 0, and 1;
  • binary codes 11, 10, 01, and 00; quaternary codes 3, 2, 1, and 0.
  • Table 1 may also take other forms: with the order of the base pair codes held fixed, the order of the binary codes and of the quaternary codes in Table 1 may each be permuted to establish similar tables of 24 by 12 correspondences. For example, the binary codes 00, 10, 01, and 11 may correspond to the quaternary codes 1, 0, 2, and 3, respectively, which in turn correspond to the base pair codes A, T, C, and G, respectively; or the binary codes 11, 10, 01, and 00 may correspond to the quaternary codes 0, 1, 2, and 3, respectively, which in turn correspond to the base pair codes A, T, C, and G, respectively.
  • The binary code can also be converted directly into a base pair code. The binary codes that correspond to the base pair codes A, T, C, and G, respectively, may be given in any of the following orders: 00, 01, 10, and 11; 00, 10, 01, and 11; 00, 01, 11, and 10; 00, 10, 11, and 01; 00, 11, 01, and 10; 00, 11, 10, and 01; 01, 00, 10, and 11; 01, 00, 11, and 10; 01, 10, 00, and 11; 01, 10, 11, and 00; 01, 11, 00, and 10; 01, 11, 10, and 00; 10, 00, 01, and 11; 10, 00, 11, and 01; 10, 01, 00, and 11; 10, 01, 11, and 00; 10, 11, 00, and 01; 10, 11, 01, and 00; 11, 00, 01, and 10; 11, 00, 10, and 01; 11, 01, 00, and 10; 11, 01, 10, and 00; 11, 10, 00, and 01; or 11, 10, 01, and 00.
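  • The alternatives listed above are simply the 4! = 24 ways of assigning the four 2-bit codes to the four bases. A minimal sketch of how such mapping tables could be enumerated follows; the function name is illustrative and not part of the application.

```python
from itertools import permutations

BASES = "ATCG"                      # base order held fixed, as in the text
CODES = ("00", "01", "10", "11")    # the four 2-bit binary codes


def all_direct_mappings():
    """Return every assignment of the four binary codes to A, T, C, and G."""
    return [dict(zip(order, BASES)) for order in permutations(CODES)]
```

  • Whichever table is chosen for writing must also be used for reading, since decoding inverts the same assignment.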
  • The full DNA sequence can be divided without overlap into fragments of the same set length, such as 44 nt. Non-overlapping division means that every fragment is 44 nt in length (that is, contains 44 nucleotides), except for the last remaining fragment, which may be shorter than 44 nt. Each DNA fragment is then organized into a coding structure of 90-200 nt in length, that is, an output DNA sequence, which may also be a coding structure of 90-110 nt (e.g., 100 nt).
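  • A minimal sketch of the non-overlapping division, assuming the set length of 44 nt mentioned above; the function name is illustrative.

```python
def split_fragments(sequence, size=44):
    """Cut the full DNA sequence into consecutive, non-overlapping fragments.

    Every fragment has `size` nucleotides except possibly the last one.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]
```

  • Because the fragments do not overlap, joining them in order reproduces the full sequence exactly.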
  • As shown in FIG. 2, the coding structure (i.e., the output DNA sequence) consists of an inserted nucleotide coding sequence formed from a DNA fragment, which is used to store and retrieve the original file information (for example, the mixed Chinese and English text "Hello, World! Hello, World!"), together with flanking index sequences at both ends for amplification (e.g., PCR) and sequencing.
  • A paired index coding sequence is placed inside each of the flanking index sequences to indicate the position of the data block during information recovery.
  • Oligonucleotides are prepared with an oligonucleotide synthesizer (primer synthesizer) to write the digitized information, yielding ssDNA or dsDNA that carries the digitized information; this DNA is stored in a gene chip, a plasmid, living cells, or another container to preserve the original file information.
  • When the amount of digitized information is small, the encoded (converted) DNA sequence is short, and the artificial DNA sequence can be synthesized with an ordinary primer synthesizer; the oligonucleotides carrying the digitized information are stored in a gene chip or plasmid, or are amplified to obtain the DNA sequence, which is stored in living cells or another container. When the amount of digitized information is large, the encoded (converted) DNA sequence is long, and the oligonucleotides can be synthesized with a high-throughput synthesizer.
  • The high-throughput synthesizer can be an oligonucleotide chip synthesizer, an oligonucleotide microarray synthesizer, an oligonucleotide microfluidic synthesizer, or the like; it produces a gene chip containing the oligonucleotides, which completes the storage of the digitized information.
  • Synthesizing the plurality of output DNA sequences into corresponding artificial DNA sequences and storing them may include at least one of the following: preparing a gene chip with a high-throughput oligonucleotide synthesizer and storing the original file information on the gene chip;
  • preparing a gene chip with a high-throughput oligonucleotide synthesizer, amplifying the oligonucleotides eluted from the gene chip to obtain a DNA sequence, and storing the DNA sequence in a plasmid, a living cell, or another container.
  • When the amount of original file information to be stored is small, the oligonucleotides can be prepared with an ordinary oligonucleotide synthesizer or a high-throughput oligonucleotide synthesizer, and the resulting oligonucleotides are amplified to obtain a DNA sequence that is stored in a plasmid, a living cell, or another container.
  • each of the flanking index sequences may be 16-50 nt in length, and may also be 18-22 nt; each of the index encoding sequences may be 6-20 nt in length, and may also be 6-10 nt.
  • When the DNA fragment in one of the output DNA sequences is shorter than the set length, the end of the inserted nucleotide coding sequence of that output DNA sequence is padded with a random sequence.
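  • The output DNA sequence structure can be sketched as follows, using the 20 nt flanking index sequences, 8 nt index coding sequences, and 44 nt inserted coding sequence given as examples in this document (20 + 8 + 44 + 8 + 20 = 100 nt). Several details are assumptions made only for illustration: the two flank sequences are arbitrary placeholders, and the index coding sequence is taken to be the fragment number written as a fixed-width base-4 number in bases, which the application does not specify.

```python
import random

BASES = "ATCG"  # quaternary digit 0..3 -> base, following Table 1

# Hypothetical 20 nt flanking index sequences (placeholders, not from the source).
LEFT_FLANK = "ACGT" * 5
RIGHT_FLANK = "TGCA" * 5


def index_code(n, width=8):
    """Assumed index scheme: write n as a fixed-width base-4 number in bases."""
    if not 0 <= n < 4 ** width:
        raise ValueError("index does not fit in the index coding sequence")
    digits = []
    for _ in range(width):
        n, digit = divmod(n, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))


def build_output_sequence(fragment, index, payload_len=44, rng=random):
    """Return flank + index + fragment (+ random padding) + index + flank."""
    if len(fragment) > payload_len:
        raise ValueError("fragment longer than the set length")
    padding = "".join(rng.choice(BASES) for _ in range(payload_len - len(fragment)))
    code = index_code(index)
    return LEFT_FLANK + code + fragment + padding + code + RIGHT_FLANK
```

  • A short last fragment, such as a 36 nt fragment, still yields a 100 nt output sequence because the missing 8 nt are filled with random bases; the reader therefore needs some way to know where the real data ends, which this sketch leaves open.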
  • Breaking the full DNA sequence into a plurality of DNA fragments may include: dividing the full DNA sequence according to the same set sequence length to obtain the plurality of DNA fragments, where the set sequence length may be 36-156 nt, and may also be 36-52 nt.
  • each of the DNA fragments can be 156 nt in length.
  • each of the DNA fragments can be within 52 nt in length.
  • each of said DNA fragments is within 44 nt in length.
  • the length of each of the DNA fragments can be determined based on the ability of the DNA synthesis device to synthesize, such as each DNA fragment can also be greater than 156 nt in length.
  • Quaternary BitDNA encoding is more highly compressed, which increases the storage density of digitized information on DNA.
  • Data recovery errors caused by sequence errors can be significantly reduced without the need for a quadruple overlapping step structure.
  • For example, the full text of the Analects (21505 characters) was saved as 4017 output nucleotide sequences, and 21504 characters were correctly recovered after amplification (such as PCR) and sequencing (such as NGS); only 1 character was in error.
  • The output sequences used are shorter, which effectively reduces the cost of synthesis and sequencing and improves storage efficiency.
  • This embodiment further provides a method for reading information. As shown in FIG. 3, the method may include the following steps:
  • In step 310, a plurality of output DNA sequences are obtained.
  • In step 320, the plurality of output DNA sequences are sequenced, and the DNA fragment corresponding to each output DNA sequence is determined from the sequencing result; each DNA fragment is represented by the four deoxyribonucleotides adenine (A), thymine (T), cytosine (C), and guanine (G).
  • In step 330, the DNA fragments of the plurality of output DNA sequences are integrated to obtain the full DNA sequence.
  • In step 340, the full DNA sequence is converted into quaternary information, the quaternary information is converted into binary information, and the binary information is read.
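  • Step 340 is the inverse of the Table 1 encoding. The sketch below assumes the default correspondence A, T, C, G to 0, 1, 2, 3; the function names are illustrative.

```python
# Inverse of Table 1: base -> quaternary digit
BASE_TO_DIGIT = {"A": 0, "T": 1, "C": 2, "G": 3}


def dna_to_quaternary(seq):
    """Translate each base into its quaternary digit."""
    return [BASE_TO_DIGIT[base] for base in seq]


def quaternary_to_bytes(digits):
    """Pack every four quaternary digits (2 bits each) into one byte."""
    if len(digits) % 4:
        raise ValueError("sequence length must be a multiple of 4 nt")
    out = bytearray()
    for i in range(0, len(digits), 4):
        value = 0
        for digit in digits[i:i + 4]:
            value = value * 4 + digit
        out.append(value)
    return bytes(out)


def dna_to_bytes(seq):
    """DNA -> quaternary -> binary."""
    return quaternary_to_bytes(dna_to_quaternary(seq))
```

  • Decoding TACA, for example, gives the digits 1, 0, 2, 0 and therefore the byte 0x48.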
  • The method for reading information provided in this embodiment can improve reading efficiency and ensure the continuity of the information read.
  • Obtaining the plurality of output DNA sequences may include: eluting the DNA stored on the gene chip and amplifying the eluted DNA to obtain the plurality of output DNA sequences.
  • Each of the output DNA sequences can include: an inserted nucleotide coding sequence consisting of one of the DNA fragments; flanking index sequences located at both ends of the inserted nucleotide coding sequence and used for amplification and sequencing; and an index coding sequence located inside each of the flanking index sequences and used to indicate the position of the data block during information recovery.
  • Sequencing the plurality of output DNA sequences and determining the DNA fragment corresponding to each output DNA sequence according to the sequencing results includes:
  • identifying and extracting the DNA fragment of each of the output DNA sequences according to the flanking index sequences and the index coding sequences.
  • Integrating the DNA fragments of the plurality of output DNA sequences to obtain the full DNA sequence comprises: recovering the position of each data block according to the index coding sequence corresponding to each DNA fragment, and integrating the DNA fragments of the plurality of output DNA sequences in that order to obtain the full DNA sequence.
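  • The extraction and integration steps can be sketched as below. This is an illustration under stated assumptions rather than the application's procedure: the layout is taken to be flank (20 nt), index (8 nt), insert, index (8 nt), flank (20 nt); the index is assumed to be a base-4 number written with A, T, C, G as 0, 1, 2, 3; and the true length of the full sequence is assumed to be known so that random padding can be trimmed.

```python
BASE_TO_DIGIT = {"A": 0, "T": 1, "C": 2, "G": 3}


def decode_index(code):
    """Read an index coding sequence as a base-4 number (assumed scheme)."""
    value = 0
    for base in code:
        value = value * 4 + BASE_TO_DIGIT[base]
    return value


def extract_fragment(output_seq, flank_len=20, index_len=8):
    """Strip the flanks, check that the paired index codes agree, return (index, insert)."""
    inner = output_seq[flank_len:len(output_seq) - flank_len]
    left, right = inner[:index_len], inner[-index_len:]
    if left != right:
        raise ValueError("paired index codes disagree; read is unreliable")
    return decode_index(left), inner[index_len:-index_len]


def reassemble(output_seqs, total_len):
    """Order the inserts by index, join them, and trim padding beyond total_len."""
    fragments = dict(extract_fragment(seq) for seq in output_seqs)
    full = "".join(fragments[i] for i in sorted(fragments))
    return full[:total_len]
```

  • Comparing the two index copies is one simple way to use the paired index: a read whose copies differ can be rejected instead of being placed at the wrong position.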
  • After an oligonucleotide or oligonucleotide library (i.e., a gene chip) has been prepared with an oligonucleotide synthesizer to store the digitized information, the reading can be performed as follows.
  • The reading steps may be: eluting the DNA from the gene chip and amplifying the entire DNA library to obtain the target sequences; sequencing on the Illumina HiSeq high-throughput sequencing platform using NGS technology, reading only DNA sequences of the expected output DNA sequence length; extracting the sequencing results, removing the head and tail primer sequences and index sequences, and restoring the data block positions according to the index sequences; and, according to the BitDNA coding, converting the base sequence into quaternary information and then into binary computer information, which completes the recovery and reading of the DNA base sequence on the computer.
  • Alternatively, the reading steps may be: amplifying the oligonucleotide sequences to obtain the target sequences; reading, by gene sequencing, DNA sequences of the expected output DNA sequence length; extracting the sequencing results, removing the head and tail primer sequences and index sequences, and restoring the data block positions according to the index sequences; and, according to the BitDNA coding, converting the base sequence into quaternary information and then into binary computer information, which completes the recovery and reading of the DNA base sequence on the computer.
  • When storing digitized information, the quaternary-based BitDNA coding method provided by this embodiment avoids Huffman ternary coding and rotation coding, which reduces computational complexity and alleviates the problem of discontinuous storage after information is rewritten by rotation coding, thereby improving storage and reading efficiency.
  • Quaternary BitDNA encoding compresses information highly, increasing the storage density of digitized information on DNA media.
  • The coding sequences of related storage methods are generally long, costly to synthesize and sequence, and poorly reliable, whereas the inserted nucleotide coding sequence provided in this embodiment may be 44 nucleotides or longer; in addition to further reducing computational complexity and the time and expense of synthesis, detection, and reading, this length also makes preparation of the coding DNA library and information recovery more accurate.
  • For example, the full text of the Analects (21505 characters) is saved as 4017 output nucleotide sequences and can be recovered by PCR amplification and NGS sequencing; 21504 Chinese characters are finally recovered correctly, with only 1 character in error.
  • This embodiment also uses paired index sequences, which can reduce the index extraction errors, caused by gene synthesis or sequencing, that arise when only a single index is used for recovery.
  • The text file (26 B) "Hello, World! Hello, World!", which is composed of Chinese and English characters and punctuation, is converted into quaternary BitDNA coding sequence data (i.e., the full DNA sequence) according to the method provided in the above embodiment, as follows:
  • DNA fragment 3 GCTACGCACTTCGCTGCTTTCAGAGCGGCGGACAAT.
  • The above three DNA fragments were each constructed into a 100 nt DNA sequence according to the output DNA sequence format, giving three output DNA sequences. DNA fragment 3 is shorter than 44 nt, so even with the flanking index sequences (20 nt each) and the index coding sequences (8 nt each) the total length is less than 100 nt; a random sequence is therefore appended at the end of DNA fragment 3 to fill it out.
  • Because the DNA sequence is short, the oligonucleotides are prepared with an oligonucleotide synthesizer and stored in an E. coli plasmid, which completes the writing of the digitized information and yields a DNA storage medium carrying the digital information "Hello, World! Hello, World!".
  • When extracting the information, the plasmid can be amplified by PCR to obtain the DNA sequence; the DNA sequence carrying the encoded information is sequenced, and only sequences of the expected output DNA sequence length are read; the sequence required for decoding is extracted by removing the flanking index sequences and index coding sequences at the head and tail, and the sequence position is restored according to the position indicated by the index coding sequence; according to the BitDNA coding, the base sequence is converted into quaternary information on the computer and further into binary computer information, which completes the recovery and reading of "Hello, World! Hello, World!" on the computer.
  • the obtained DNA sequence was divided into 357 DNA fragments of 44 nt in length without overlapping, and the DNA fragments were constructed into 357 export DNA sequences of 100 nt in length according to the output DNA format (eg, in each DNA fragment). Adding a 20 nt flanking index sequence and an 8 nt index encoding sequence to each end, that is, completing the digitized information of the image to convert the DNA sequence; using the 357 output DNA sequences obtained above, using the oligonucleotide The synthesizer prepares a DNA library and stores it on the gene chip, thereby completing the writing of the digitized information, and obtaining a DNA storage medium having digitized information of the image "emoji.jpg".
  • The DNA on the gene chip is eluted, and the required DNA sequences can be obtained by PCR amplification; the DNA sequences carrying the coding information are then read on an Illumina sequencer.
  • Sequencing is performed on the Illumina HiSeq using NGS technology, and only DNA sequences of the expected 100 nt length are read. The sequence required for decoding is extracted from each DNA sequence by removing the head and tail primer sequences and the index sequences, and the sequence order is restored according to the position indicated by the index sequences.
  • According to the BitDNA coding, the base sequence is converted into quaternary information on a computer and then converted into binary computer information, which completes the computer reading of the digitized information of the "emoji.jpg" picture.
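  • The reading steps described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the digit-to-base table `BASE_TO_DIGIT`, the reading of the left index coding sequence as a base-4 position, and the `payload_bases` argument (the number of meaningful bases, used to discard random padding) are hypothetical and are not taken from the BitDNA coding itself.

```python
# Hypothetical base-to-digit table; the actual BitDNA table is not reproduced here.
BASE_TO_DIGIT = {"A": 0, "T": 1, "C": 2, "G": 3}

FLANK = 20   # nt of flanking (primer) sequence at each end
INDEX = 8    # nt of index coding sequence inside each flank
TOTAL = 100  # expected length of an output DNA sequence


def bases_to_int(bases: str) -> int:
    """Read a base string as a base-4 number, most significant digit first."""
    value = 0
    for base in bases:
        value = value * 4 + BASE_TO_DIGIT[base]
    return value


def decode_reads(reads, payload_bases: int) -> bytes:
    """Recover the original bytes from sequenced output DNA reads."""
    fragments = {}
    for read in reads:
        if len(read) != TOTAL:                   # keep only reads of the expected length
            continue
        core = read[FLANK:-FLANK]                # remove head and tail flanking sequences
        position = bases_to_int(core[:INDEX])    # assumed: left index encodes the position
        fragments[position] = core[INDEX:-INDEX] # remove both index coding sequences
    whole = "".join(fragments[i] for i in sorted(fragments))
    whole = whole[:payload_bases]                # drop random padding of the last fragment
    data = bytearray()
    for i in range(0, len(whole), 4):            # 4 quaternary digits make one byte
        data.append(bases_to_int(whole[i:i + 4]))
    return bytes(data)
```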
  • A sample audio file in MP3 format, "Example Audio-Laughter.mp3" (4.18 KB), is converted into hexadecimal BitDNA encoded data according to the encoding method provided in the above embodiment, and a whole sequence of 17148 bases is obtained (see Sequence 1 in the appendix).
  • The DNA sequence was divided, without overlap, into 389 DNA fragments of 44 nt in length and one DNA fragment of 32 nt in length.
  • The 390 DNA fragments were constructed into 390 output DNA sequences of 100 nt in length according to the output DNA format, for example by adding a flanking index sequence of 20 nt and an index coding sequence of 8 nt at each end of each DNA fragment, where the output sequence containing the 32 nt DNA fragment is padded to 100 nt with a random sequence. This completes the conversion of the digitized information into DNA sequences. Using the 390 output DNA sequences obtained above, a DNA library is prepared with an oligonucleotide synthesizer and stored on a gene chip, which completes the writing of the digitized information of the audio and yields DNA sequences carrying the digitized information of the audio "Example Audio-Laughter.mp3".
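  • The fragment counts above follow from simple arithmetic: 17148 = 389 × 44 + 32. A minimal Python sketch of the non-overlapping split and the random padding of the short last fragment is given below; the function names and the stand-in sequence are illustrative assumptions, not part of the described method.

```python
import random


def split_fragments(whole: str, size: int = 44):
    """Cut the whole sequence into consecutive, non-overlapping pieces."""
    return [whole[i:i + size] for i in range(0, len(whole), size)]


def pad_fragment(fragment: str, size: int = 44, rng=random) -> str:
    """Fill a short final fragment with random bases up to `size`."""
    missing = size - len(fragment)
    return fragment + "".join(rng.choice("ATCG") for _ in range(missing))


# Stand-in sequence with the length reported for the audio file.
whole = "A" * 17148
fragments = split_fragments(whole)
print(len(fragments), len(fragments[-1]))  # 390 32
```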
  • In step 410, the binary information of the original file information is read, the binary information is converted into hexadecimal Unicode information, and the hexadecimal Unicode information is converted into a whole DNA sequence represented by the four deoxyribonucleotides adenine (A), thymine (T), cytosine (C), and guanine (G).
  • In step 420, the whole DNA sequence is broken into a plurality of DNA fragments, and a plurality of output DNA sequences are constructed from the respective DNA fragments.
  • In step 430, the plurality of output DNA sequences are synthesized into corresponding artificial DNA sequences and stored.
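  • Step 410 can be illustrated with a short Python sketch. The mapping used here is an assumption made for illustration only: text is taken as UTF-16BE code units written in hexadecimal, and each hexadecimal digit is mapped to two bases (since 16 = 4 × 4) using the hypothetical digit order `BASES = "ATCG"`. The actual hexadecimal-to-base table of the embodiment is not reproduced in this section.

```python
BASES = "ATCG"  # hypothetical digit order, not the patent's own table


def hex_to_bases(hex_text: str) -> str:
    """Map every hexadecimal digit to a pair of bases."""
    out = []
    for ch in hex_text:
        value = int(ch, 16)
        out.append(BASES[value // 4] + BASES[value % 4])
    return "".join(out)


def bases_to_hex(bases: str) -> str:
    """Inverse of hex_to_bases: every pair of bases becomes one hexadecimal digit."""
    digits = []
    for i in range(0, len(bases), 2):
        value = BASES.index(bases[i]) * 4 + BASES.index(bases[i + 1])
        digits.append(format(value, "x"))
    return "".join(digits)


def text_to_dna(text: str) -> str:
    """Text -> hexadecimal Unicode (assumed UTF-16BE) -> whole DNA sequence."""
    return hex_to_bases(text.encode("utf-16-be").hex())


def dna_to_text(bases: str) -> str:
    """Whole DNA sequence -> hexadecimal Unicode -> text."""
    return bytes.fromhex(bases_to_hex(bases)).decode("utf-16-be")


print(text_to_dna("A"))  # AAAATAAT
```

    Under these assumptions every character in the Basic Multilingual Plane occupies a fixed 8 bases, so mixed-language text round-trips through `text_to_dna` and `dna_to_text` unchanged.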
  • The information storage method provided in this embodiment can be applied to file information that mixes multiple languages: all digitized information can be transcoded into Unicode code and then encoded from the Unicode code into the corresponding nucleotide sequence, which ensures coding efficiency, provides reliable encoding, and improves the accuracy of subsequent decoding.
  • A correspondence between the characters specified by the American Standard Code for Information Interchange (ASCII) and base codes is provided.
  • As shown in Table 2, the binary information of each ASCII character is converted into hexadecimal Unicode code, and the hexadecimal Unicode code is then converted into a base code, which gives the correspondence between ASCII characters and base codes shown in Table 2.
  • the original file information includes at least two kinds of text information
  • The text information may be Chinese, Traditional Chinese, English, Arabic, Amharic, Azeri, Irish, Estonian, Basque, Belarusian, Bulgarian, Icelandic, Polish, Laun, Persian, Boolean, Danish, German, Russian, French, Filipino, Finnish, Khmer, Georgian, Tamili, Kazakh, Haitian Creole, Korean, Hausa, Dutch, Kyrgyz, Galician, Catalan, Czech, Kannada, Corsican, Wegn, Kurdish, Latin, Lithuanian, Lao, Lithuanian, Luxembourgish, Romanian, Malagasy, Maltese, Marathi, Malayalam, Malay, Ardian, Maori, Mongolian, Bengali, Miao, South African Kossa, South African Zulu, Nepali, Norwegian, Punjabi, Portuguese, Pashto, Chichewa, Japanese, Swedish, Samoan, Serbian, Sesotho, Sinhalese, Esperanto, S
  • "People's Republic of China" is expressed in 16 languages and mixed with symbols (such as brackets and commas) and numbers to obtain the original document information as follows:
  • The 16 languages are Chinese, English, Arabic, Irish, Persian, German, Russian, French, Khmer, Korean, Kannada, Japanese, Esperanto, Hebrew, Greek, and Hungarian.
  • The original document information composed of "People's Republic of China" in these 16 languages together with the symbols and numbers is transcoded to obtain the corresponding output DNA sequences.
  • The output DNA sequences are stored, and the stored DNA sequences are then decoded and read; the experiment shows a decoding accuracy of 100%.
  • The above-mentioned 102 languages are used to express "People's Republic of China". The expressions in the 102 languages are transcoded to obtain the corresponding DNA sequences, which are stored; the stored DNA sequences are then decoded and read, and the experiment shows a decoding accuracy of 100%.
  • Correctly encoded text information is subjected to simulated mutation (that is, bases in the correct base sequence are manually rewritten), and the mutated coding sequence is decoded back into text information.
  • The text information is a brief introduction to the Tang Dynasty poet Li Bai: Li Bai (701-762), courtesy name Taibai, art name Qinglian Jushi, was a great romantic poet of the Tang Dynasty, hailed by later generations as the "Poet Immortal".
  • The coding sequence after artificial mutation is as follows, in which 5 bases are mutated (rewritten), and the mutated bases are marked in bold and underlined:
  • The text information decoded from the mutated coding sequence is as follows.
  • The text information corresponding to the mutated part of the sequence is bolded and underlined: Li Bai (701 ⁇ -762 ), the word is too white, and the number is Qinglian. Jin Wei Tang Dynasty Romantic poet,? Later generations are known as "Poetry."
  • Compared with the original text, the information decoded after the mutation differs in 3 characters, and the text information corresponding to the unmutated regions is not affected.
  • The encoding method provided in this embodiment thus increases the stability and flexibility of the encoding of text information.
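  • The locality of such errors can be illustrated with a small Python simulation. It is a sketch under assumptions rather than the experiment itself: the text is encoded as UTF-16BE with a hypothetical fixed mapping of 8 bases per character, and the substituted positions are chosen arbitrarily. Because every character occupies its own fixed block of bases, a substitution can only change the character whose block it falls in.

```python
BASES = "ATCG"  # hypothetical digit order, for illustration only


def encode(text: str) -> str:
    """Encode text as UTF-16BE bytes, four bases per byte."""
    out = []
    for byte in text.encode("utf-16-be"):
        out.append("".join(BASES[(byte >> shift) & 3] for shift in (6, 4, 2, 0)))
    return "".join(out)


def decode(dna: str) -> str:
    """Decode four bases per byte, then UTF-16BE bytes to text."""
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = byte * 4 + BASES.index(base)
        data.append(byte)
    return data.decode("utf-16-be", errors="replace")


def substitute(dna: str, positions) -> str:
    """Rewrite the base at each given position to a different base."""
    seq = list(dna)
    for p in positions:
        seq[p] = BASES[(BASES.index(seq[p]) + 1) % 4]
    return "".join(seq)


text = "Li Bai (701-762)"
mutated = substitute(encode(text), [7, 23, 39, 44, 47])  # 5 rewritten bases
decoded = decode(mutated)
changed = [i for i, (a, b) in enumerate(zip(text, decoded)) if a != b]
print(changed)  # [0, 2, 4, 5]
```

    In this run two of the five substitutions fall in the same character, so five rewritten bases alter four characters and every other character decodes unchanged.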
  • Each of the output DNA sequences can include: an inserted nucleotide coding sequence consisting of one of the DNA fragments; flanking index sequences located at both ends of the inserted nucleotide coding sequence and used for amplification and sequencing, respectively; and an index coding sequence located inside each of the flanking index sequences and used to indicate the location of the data block during information recovery.
  • each of the output DNA sequences can be from 90 to 200 nt in length.
  • each of the output DNA sequences can be from 90 to 110 nt in length.
  • each of the output DNA sequences can be 100 nt in length.
  • each of the flanking index sequences may be 16-50 nt in length, and each of the index encoding sequences may be 6-20 nt in length.
  • each of the flanking index sequences may be 18-22 nt in length, and each of the index encoding sequences may be 6-10 nt in length.
  • A random sequence can be used to pad the end of the inserted nucleotide coding sequence of one of the output DNA sequences.
  • Breaking the whole DNA sequence into a plurality of DNA fragments may include: breaking the whole DNA sequence according to the same set sequence length to obtain the plurality of DNA fragments, wherein the set sequence length can be 36-156 nt.
  • Breaking the whole DNA sequence into a plurality of DNA fragments may include: breaking the whole DNA sequence according to the same set sequence length to obtain the plurality of DNA fragments, wherein the set sequence length can be 36-52 nt.
  • the interruption may be a non-overlapping interruption.
  • each of the DNA fragments can be 156 nt in length.
  • each of the DNA fragments can be within 52 nt in length.
  • each of said DNA fragments is within 44 nt in length.
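  • The length ranges listed above appear consistent with a simple budget: the inserted coding sequence receives whatever remains of the output DNA sequence after two flanking index sequences and two index coding sequences. The Python sketch below checks this arithmetic; the function name is illustrative, and pairing these particular range endpoints with each other is an interpretation, not something stated in the text.

```python
def payload_length(total: int, flank: int, index: int) -> int:
    """Bases left for the inserted coding sequence after two flanks and two indexes."""
    return total - 2 * flank - 2 * index


print(payload_length(100, 20, 8))   # 44
print(payload_length(100, 22, 10))  # 36
print(payload_length(100, 18, 6))   # 52
print(payload_length(200, 16, 6))   # 156
```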
  • Breaking the whole DNA sequence into a plurality of DNA fragments and constructing a plurality of output DNA sequences from them may include: setting a sequence length of 44 nt and breaking the whole DNA sequence without overlap according to the set sequence length, so that the inserted nucleotide coding sequence of each obtained DNA fragment is within 44 nt in length; and setting the length of each output DNA sequence to 100 nt, the length of each flanking index sequence to 20 nt, and the length of each index coding sequence to 8 nt. When the sum of the lengths of the inserted nucleotide coding sequence, the two flanking index sequences, and the two index coding sequences of an output DNA sequence is less than 100 nt, a random sequence may be added at the end of the inserted nucleotide coding sequence of the DNA fragment.
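  • A possible construction of one output DNA sequence with these lengths is sketched below in Python. Several details are assumptions made for illustration: the position is written as an 8-digit base-4 number using the hypothetical digit order `BASES`, the same index is placed at both ends, and the flanking sequences passed in are placeholders rather than real primer sequences.

```python
import random

BASES = "ATCG"  # hypothetical digit order for the index coding sequence


def int_to_bases(n: int, width: int) -> str:
    """Write n as a base-4 number of exactly `width` bases."""
    digits = []
    for _ in range(width):
        digits.append(BASES[n % 4])
        n //= 4
    return "".join(reversed(digits))


def build_output(fragment, position, left_flank, right_flank,
                 total=100, index_len=8, rng=random):
    """Assemble flank + index + fragment (+ random padding) + index + flank."""
    payload_len = total - len(left_flank) - len(right_flank) - 2 * index_len
    if len(fragment) > payload_len:
        raise ValueError("fragment is longer than the payload slot")
    # Pad a short fragment at its end with random bases.
    padding = "".join(rng.choice(BASES) for _ in range(payload_len - len(fragment)))
    index = int_to_bases(position, index_len)
    return left_flank + index + fragment + padding + index + right_flank


# A 32 nt fragment at position 389, with placeholder 20 nt flanks.
seq = build_output("ACGT" * 8, 389, "A" * 20, "T" * 20)
print(len(seq))  # 100
```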
  • Synthesizing the plurality of output DNA sequences into the corresponding artificial DNA sequences and storing them may include: preparing a gene chip with a high-throughput oligonucleotide synthesizer and storing the original file information on the gene chip.
  • Synthesizing the plurality of output DNA sequences into the corresponding artificial DNA sequences and storing them may include at least one of the following: preparing a gene chip with a high-throughput oligonucleotide synthesizer and storing the original file information on the gene chip; or preparing a gene chip with a high-throughput oligonucleotide synthesizer, eluting the oligonucleotides from the gene chip, amplifying them, and storing the resulting DNA sequences in a plasmid, a living cell, or another container.
  • When the amount of original file information to be stored is small, the oligonucleotides can be prepared with an ordinary oligonucleotide synthesizer or a high-throughput oligonucleotide synthesizer, and the DNA sequences obtained by amplifying the oligonucleotides are stored in a plasmid, a living cell, or another container.
  • The embodiment can also provide an information reading method for DNA sequences stored by converting binary information into hexadecimal information. The information reading method is the inverse process of the above information storage method and is similar to the above process of reading information from DNA sequences stored by converting binary information into quaternary information.
  • the embodiment further provides a computer readable storage medium storing computer executable instructions for performing any of the above methods.
  • FIG. 5 is a hardware structure diagram of an information storage device provided by this embodiment. As shown in FIG. 5, the device includes a processor 510, a memory 520, a communication interface (Communications Interface) 530, and a bus 540.
  • the processor 510, the memory 520, and the communication interface 530 can complete communication with each other through the bus 540.
  • Communication interface 530 can be used for information transfer.
  • Processor 510 can invoke logic instructions in memory 520 to perform any of the methods of the above-described embodiments.
  • the memory 520 may include a storage program area and a storage data area, and the storage program area may store an operating system and an application required for at least one function.
  • the storage data area can store data and the like created according to the use of the device.
  • The memory may include a volatile memory, such as a random access memory, and may also include a non-volatile memory, for example at least one disk storage device, flash memory device, or other non-transitory solid state storage device.
  • When the logic instructions in the memory 520 described above are implemented in the form of software functional units and sold or used as separate products, the logic instructions can be stored in a computer readable storage medium.
  • The technical solution of the present disclosure may be embodied in the form of a computer software product, which may be stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in this embodiment.
  • the storage medium may be a non-transitory storage medium or a transitory storage medium.
  • The non-transitory storage medium may include a U disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
  • All or part of the methods provided by the foregoing embodiments may be completed by a computer program instructing related hardware. The program may be stored in a non-transitory computer readable storage medium, and when the program is executed, it may include the flow of the method embodiments described above.
  • The present application provides a method for information storage and reading that has good versatility, simplifies calculation, improves the continuity, efficiency, and density of DNA information storage, and can reduce the error rate as well as the cost of sequence synthesis and detection.

Abstract

Disclosed is an information storage and reading method comprising: reading binary information of original file information, converting the binary information into quaternary information, and converting the quaternary information into a whole DNA sequence represented by four deoxyribonucleotides, namely adenine (A), thymine (T), cytosine (C), and guanine (G); breaking the whole DNA sequence into a plurality of DNA fragments and constructing a plurality of output DNA sequences from the respective DNA fragments; and synthesizing the plurality of output DNA sequences into corresponding artificial DNA sequences and storing the artificial DNA sequences.
PCT/CN2018/076721 2017-02-17 2018-02-13 Information storage and reading method WO2018149405A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710086096.1 2017-02-17
CN201710086096.1A CN106845158A (zh) 2017-02-17 2017-02-17 Method for information storage using DNA

Publications (1)

Publication Number Publication Date
WO2018149405A1 true WO2018149405A1 (fr) 2018-08-23

Family

ID=59128444

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/076721 WO2018149405A1 (fr) 2017-02-17 2018-02-13 Information storage and reading method

Country Status (2)

Country Link
CN (1) CN106845158A (fr)
WO (1) WO2018149405A1 (fr)

Also Published As

Publication number Publication date
CN106845158A (zh) 2017-06-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18753930

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18753930

Country of ref document: EP

Kind code of ref document: A1