WO2023015550A1 - DNA data storage method and apparatus, device, and readable storage medium - Google Patents

DNA data storage method and apparatus, device, and readable storage medium

Info

Publication number
WO2023015550A1
WO2023015550A1 PCT/CN2021/112465 CN2021112465W WO2023015550A1 WO 2023015550 A1 WO2023015550 A1 WO 2023015550A1 CN 2021112465 W CN2021112465 W CN 2021112465W WO 2023015550 A1 WO2023015550 A1 WO 2023015550A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
dna
fragments
base
search
Prior art date
Application number
PCT/CN2021/112465
Other languages
English (en)
Chinese (zh)
Inventor
戴俊彪
黄小罗
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Priority to PCT/CN2021/112465
Publication of WO2023015550A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Definitions

  • the present application relates to the technical field of data storage, and in particular to a DNA data storage method, device, equipment and readable storage medium.
  • Deoxyribonucleic acid (DNA), an information storage medium developed in recent years, is considered to be one of the most promising media for future information storage.
  • DNA molecules have four bases: adenine (A), cytosine (C), guanine (G) and thymine (T).
  • A adenine
  • C cytosine
  • G guanine
  • T thymine
  • the DNA-based data storage technology uses sequences of the above four bases to represent data series composed of binary "0" and "1". Compared with traditional storage media, DNA data storage offers high storage density, long storage time, low maintenance cost, and good biocompatibility.
  • 1 g of DNA can store more than one million high-definition movies, and its data storage density is more than 7 orders of magnitude higher than that of current silicon-based storage media such as traditional hard disks; at the same time, DNA can store data stably for more than a thousand years, which is more than a hundred times the storage time of existing storage media.
  • the maintenance cost of DNA is low: the maintenance cost of storing data for a hundred years is only one ten-thousandth of that of existing media.
  • the DNA data storage process usually includes the following steps: (1) extract binary information from computer information such as pictures, videos, and texts; (2) according to the preset correspondence between binary and the bases A, T, C, and G, convert the binary sequence information into an A/T/C/G sequence (that is, a DNA sequence) that is encoded with the bases A, T, C, and G and stores the data information; (3) use DNA synthesis technology or other technologies to convert the encoded A/T/C/G sequence into a DNA chemical polymer molecule, and store it in a suitable environment.
  • when the data is read, the following steps can be performed: (4) use DNA sequencing technology to read the stored DNA chemical polymer molecules as A/T/C/G sequences; (5) use an appropriate decoding method to convert the A/T/C/G sequences into binary information; (6) convert the binary information into computer information such as pictures, videos, and texts.
  • the data encoding problem is the core problem in the current DNA data storage methods.
  • One of the purposes of the embodiments of the present application is to provide a DNA data storage method, apparatus, device and readable storage medium, aiming at solving the data encoding problem in the DNA data storage technology.
  • a method for storing DNA data including:
  • each sequence unit includes a plurality of segmented sequence fragments, wherein the S sequence units contain K sequence fragments in total, the length of each sequence fragment is n, and n, S and K are all integers greater than or equal to 2;
  • the K sequence fragments and the S sequence units are marked by using preset index information to obtain K marker sequence fragments and S marker sequence units, wherein the index information includes a first search sequence used to represent the arrangement order of the S sequence units in the base sequence, and a second search sequence used to represent the arrangement order, within a sequence unit, of the multiple sequence fragments belonging to that sequence unit; the K marker sequence fragments are used to synthesize K first DNA molecules storing the target data.
  • using the second retrieval sequence to mark multiple sequence fragments belonging to the same sequence unit includes:
  • the search base groups are spliced simultaneously on both sides of the sequence fragment, and the search base groups on both sides form the second search sequence.
  • the first search sequence includes i DNA sequence fragments, where i is an integer greater than or equal to 1, and each of the DNA sequence fragments includes a first base sequence used as an index mark and a second base sequence used to mark the sequence unit number.
  • the DNA sequence fragments corresponding to the first search sequence and the second search sequence are obtained using DNA synthesis technology.
  • DNA synthesis techniques include, but are not limited to, enzymatic synthesis, phosphoramidite synthesis, and the like.
  • the DNA sequence fragments corresponding to the first search sequence and the second search sequence can be amplified from a pre-synthesized universal DNA molecular library, for example by PCR.
  • the storage method also includes:
  • the first DNA molecules corresponding to the marker sequence fragments belonging to the same sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to marker sequence fragments that do not belong to the same sequence unit are stored in different first physical spaces.
  • the S first physical spaces are integrated into one DNA hard disk.
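  • as a non-limiting illustration, the assignment of first DNA molecules to first physical spaces can be modelled in software as grouping marker sequence fragments by sequence unit number. The function name `assign_to_spaces`, the `(unit_number, fragment)` tuple layout, and the sample fragments are assumptions made for this sketch and do not appear in the application.

```python
from collections import defaultdict


def assign_to_spaces(marked_fragments):
    """Group (unit_number, marker_fragment) pairs by sequence unit.

    Each key of the returned dict stands for one first physical space;
    the dict as a whole stands for the DNA hard disk. This is only a
    software model of the physical separation described above.
    """
    spaces = defaultdict(list)
    for unit_number, fragment in marked_fragments:
        spaces[unit_number].append(fragment)
    return dict(spaces)


# Hypothetical marker sequence fragments from two sequence units.
fragments = [(1, "AAATCT"), (1, "ACTCTG"), (2, "AAATCT"), (2, "ACTCTG")]
dna_hard_disk = assign_to_spaces(fragments)
```

  • because fragments of different sequence units never share a space in this model, the same second search sequence (here AA and AC) can be reused in every sequence unit.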
  • the storage method also includes:
  • the second DNA molecule storing the index information.
  • the decoding method for the K first DNA molecules includes:
  • the base sequence is converted into the target data.
  • a DNA data storage device including a data processing module,
  • the data processing module is used to obtain the base sequence corresponding to the binary sequence of the target data; segment the base sequence to obtain K sequence fragments with a length of n, the K sequence fragments being divided into S sequence units, where S and K are both integers greater than or equal to 2; and use preset index information to mark the K sequence fragments and the S sequence units to obtain K marker sequence fragments and S marker sequence units, wherein the index information includes a first search sequence used to represent the arrangement order of the S sequence units in the base sequence, and a second search sequence used to represent the arrangement order, within a sequence unit, of the multiple sequence fragments belonging to that sequence unit; the K marker sequence fragments are used to synthesize the K first DNA molecules storing the target data.
  • the device further includes: a DNA synthesis module, configured to synthesize K first DNA molecules storing the target data from the K marker sequence fragments.
  • the device further includes: a DNA molecule storage module, configured to store the K first DNA molecules in S first physical spaces, wherein the first DNA molecules corresponding to the marker sequence fragments belonging to one sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to marker sequence fragments that do not belong to the same sequence unit are stored in different first physical spaces.
  • the DNA molecule storage module is also used to store the second DNA molecule in the second physical space.
  • it also includes a DNA molecule sequencing module, configured to sequence the plurality of first DNA molecules stored in each of the first physical spaces to obtain a plurality of marker sequence fragments; the data processing module is further configured to splice, according to the second retrieval sequence, the sequence fragments corresponding to the marker sequence fragments belonging to the same marker sequence unit, to obtain the sequence unit;
  • the base sequence is converted into the target data.
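  • as a non-limiting illustration of the splicing step, the sketch below reorders the marker sequence fragments read from one first physical space and removes their index bases. It assumes that the second retrieval sequence is a q-base prefix on the left end of each fragment and that index codes are ordered as base-4 numbers with A=0, C=1, G=2, T=3; the function names and the sample pool are hypothetical.

```python
BASES = "ACGT"  # assumed digit order: A=0, C=1, G=2, T=3


def index_value(code):
    """Interpret an index code such as 'AC' as a zero-based position."""
    value = 0
    for base in code:
        value = value * 4 + BASES.index(base)
    return value


def reassemble_unit(marked_fragments, q):
    """Sort fragments by their q-base prefix, strip it, and splice."""
    ordered = sorted(marked_fragments, key=lambda f: index_value(f[:q]))
    return "".join(fragment[q:] for fragment in ordered)


# Sequencing returns the fragments of one space in arbitrary order.
pool = ["AGTTCC", "AAATCT", "ACTCTG"]
unit = reassemble_unit(pool, q=2)
```

  • the sequence units recovered in this way would then be ordered according to the first search sequence and concatenated before being converted back into the target data.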
  • a DNA data storage device, including a terminal device, wherein the terminal device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method for storing DNA data as described in the first aspect is realized.
  • a computer-readable storage medium, which stores a computer program.
  • when the computer program is executed by a processor, the method for storing DNA data in the first aspect is implemented.
  • a DNA hard disk including a plurality of physical spaces, the physical spaces are made of physical materials, and each of the physical spaces is used to store DNA molecules.
  • the physical material is at least one of SiO2, metal oxides, and polymer materials. These physical materials form a physical space that wraps DNA molecules and, at the same time, isolates different DNA molecules.
  • the beneficial effect of the DNA data storage method, apparatus, device and readable storage medium provided by the embodiments of the present application is as follows: the present application divides the base sequence corresponding to the binary sequence of the target data into S sequence units, each sequence unit including a plurality of segmented sequence fragments; the S sequence units contain K sequence fragments in total, the length of each sequence fragment is n, and n, S and K are all integers greater than or equal to 2. The preset index information is used to mark the position information of the sequence fragments and sequence units, and the marked sequence fragments are synthesized into DNA molecules and stored separately.
  • the length of the first retrieval sequence is different from that of the marker sequence fragment (the sequence fragment with the second retrieval sequence), and the first retrieval sequence and the marker sequence fragment are distinguished from the sequence units by length difference.
  • m represents the number of bases in the marker sequence fragment
  • q represents the number of bases in the second retrieval sequence
  • i represents the number of DNA sequence fragments in the first retrieval sequence
  • p represents the number of bases in the second base sequence of the first retrieval sequence
  • the length of the first search sequence is the same as the length of the marker sequence fragment (the sequence fragment with the second search sequence); the first base sequence used as an index mark in the first search sequence can be a part of the second search sequence, and the number of bases in the first base sequence is the same as that in the second search sequence.
  • with m representing the number of bases in the marker sequence fragment, q the number of bases in the second search sequence, i the number of DNA sequence fragments in the first search sequence, and p the number of bases in the second base sequence of the first search sequence, the method provided by this application can store DNA data containing D bases, where D is calculated as follows:
  • Figure 1 is a schematic diagram of the composition of the DNA storage device provided by the embodiment of the present application.
  • Fig. 2 is the process flow diagram of the storage and writing of DNA data provided by the embodiment of the present application.
  • Fig. 3 is a schematic diagram of a sequence unit comprising multiple sequence fragments formed after the base sequence in S101 provided in the embodiment of the present application is segmented;
  • Fig. 4 is a schematic diagram of the information sequence units, each containing multiple information sequence fragments, obtained after the sequence units in S103 provided by the embodiment of the present application are marked with the first search sequence and each sequence fragment in the sequence units is marked with the second search sequence;
  • Fig. 5 is a schematic diagram of DNA storage after K first DNA molecules are stored in S different first physical spaces provided by the embodiment of the present application;
  • Fig. 6 is the flow chart of the interpretation process of the DNA data provided by the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • references to “one embodiment” or “some embodiments” or the like described in the specification of the present application mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application .
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” “in other embodiments,” etc. in various places in this specification are not necessarily All refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.
  • the present application provides a method for storing DNA data.
  • the method first converts the binary sequence of the data to be stored into the corresponding base sequence, and then divides the base sequence twice.
  • the base sequence is divided into S sequence units, and each sequence unit includes a plurality of segmented sequence fragments, wherein the S sequence units contain K sequence fragments in total.
  • preset index information is then used to mark the K sequence fragments and the S sequence units.
  • the K marker sequence fragments are synthesized into DNA molecules and then stored. Based on the data encoding method provided by this application, DNA data storage can be effectively realized.
  • the data encoding method provided by the present application has obvious advantages in terms of data storage capacity.
  • the storage method is realized by the DNA data storage device shown in FIG. 1 .
  • the DNA data storage device includes a data processing module, a DNA molecule synthesis module, a DNA molecule storage module, and a DNA molecule sequencing module.
  • the data processing module is used to implement data encoding and decoding.
  • the data to be stored is converted into binary information, and the binary information is converted into a base sequence according to the preset correspondence between binary data and bases.
  • the base sequence is encoded according to the preset index information to obtain the base sequence that is finally used to generate the DNA molecule.
  • the DNA molecule synthesis module is used to synthesize DNA molecules according to the encoded base sequence.
  • the DNA molecule storage module can store DNA molecules.
  • the DNA molecular sequencing module is used to translate DNA molecules into base sequences.
  • the data processing module can also decode the base sequence obtained by sequencing in the DNA molecule according to the index information, and obtain the data stored in the DNA molecule through data conversion.
  • the DNA data storage device may be a complete DNA data storage device, that is, a device integrated with multiple functional modules.
  • the device can realize the complete process of data encoding, DNA molecule synthesis, DNA molecule storage, DNA molecule sequencing and data decoding.
  • the DNA data storage device may also be a system composed of individual devices.
  • the data processing module may be computer equipment such as a computer, a server, or a robot, and is used to implement data encoding and decoding.
  • the DNA molecular synthesis module can be a DNA molecular synthesizer, which is used to synthesize DNA molecules according to the coded base sequence.
  • the DNA molecule storage module may be a DNA hard disk capable of storing DNA molecules.
  • the DNA molecular sequencing module may be a DNA molecular sequencer, capable of realizing the function of DNA molecular sequencing.
  • Fig. 2 is a schematic diagram of the implementation flow of a DNA data storage method provided by the embodiment of the present application, specifically including:
  • obtaining the base sequence corresponding to the binary sequence of the data to be stored refers to converting the binary sequence of the data to be stored into a base sequence formed by A, T, C, and G codes and storing data information .
  • obtaining the base sequence corresponding to the binary sequence of the data to be stored includes:
  • the data to be stored is any data information that may exist in the terminal device, and may include text, pictures, sound, video, software, programs and other information, but is not limited thereto.
  • the coding information corresponding to the data to be stored may be obtained, and the corresponding coding information is converted into binary coding information, thereby obtaining the corresponding binary sequence.
  • exemplarily, the text in the text information can be converted into the corresponding ASCII (American Standard Code for Information Interchange) encoding or Unicode (Universal Character Set) encoding, and the encoded information is then converted into a binary sequence.
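  • as a non-limiting illustration, the conversion from text to a binary sequence can be sketched as follows. The function names `text_to_bits` and `bits_to_text` are assumptions made for this sketch, and UTF-8 is chosen here only as one common byte encoding of Unicode text; the application does not prescribe a particular encoding.

```python
def text_to_bits(text, encoding="utf-8"):
    """Encode text to bytes, then write each byte as 8 binary digits."""
    return "".join(f"{byte:08b}" for byte in text.encode(encoding))


def bits_to_text(bits, encoding="utf-8"):
    """Inverse of text_to_bits: regroup the bits into bytes and decode."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)


ascii_bits = text_to_bits("Hi", encoding="ascii")
```

  • every byte contributes exactly 8 bits, so the resulting binary sequence always has an even length and can be mapped to bases two bits at a time.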
  • the preset mapping rule refers to the preset mapping rule between binary and base.
  • according to the preset mapping rule, the binary sequence consisting of 0/1 is converted into a base sequence that is formed by the bases A, T, C, and G and stores the data information.
  • exemplarily, the preset correspondence between the binary sequence and the bases A, T, C, and G is as follows: a base A represents a 00, a base T represents a 01, a base C represents a 10, and a base G represents an 11.
  • the binary data information is converted into the DNA sequence AGTCCGACCGTCGAAGACT.
  • the preset correspondence between the binary code and the bases A, T, C, and G is not limited to the above example; for example, it can also be: a base T represents a 00, a base A represents a 01, a base G represents a 10, and a base C represents an 11, but it is not limited thereto.
  • the preset mapping rules between binary sequences and bases A, T, C, and G only need to be able to convert binary sequences into base sequences according to the preset mapping rules, and are not limited to the above examples.
  • exemplarily, the mapping rule between binary and base is: a base A represents an 11, a base T represents a 10, a base C represents a 01, and a base G represents a 00. Under this rule, the data to be stored is first converted into the binary sequence "11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 10101010 11101000 10011101 10110100 11101000 10011101 10110110 11100011 10000000 10000010", which is then converted into a base sequence beginning "ATCT TCTG TTCC ATAA TAAG".
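  • as a non-limiting illustration, the two mapping rules given as examples above can be sketched as lookup tables applied to each pair of bits. The names `RULE_1`, `RULE_2`, `bits_to_bases` and `bases_to_bits` are assumptions made for this sketch.

```python
# Rule from the first example: A=00, T=01, C=10, G=11.
RULE_1 = {"00": "A", "01": "T", "10": "C", "11": "G"}
# Rule from the later example: A=11, T=10, C=01, G=00.
RULE_2 = {"11": "A", "10": "T", "01": "C", "00": "G"}


def bits_to_bases(bits, rule):
    """Map each pair of bits to one base; spaces in the input are ignored."""
    bits = bits.replace(" ", "")
    if len(bits) % 2:
        raise ValueError("bit string must contain an even number of bits")
    return "".join(rule[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bits(bases, rule):
    """Inverse mapping, used when decoding a sequenced base sequence."""
    inverse = {base: pair for pair, base in rule.items()}
    return "".join(inverse[base] for base in bases.replace(" ", ""))


first_bytes = "11100110 10011000 10100101 11101111 10111100"
first_bases = bits_to_bases(first_bytes, RULE_2)
```

  • with `RULE_2`, the first five bytes of the example binary sequence map to ATCT TCTG TTCC ATAA TAAG, which matches the start of the example base sequence.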
  • each sequence unit includes a plurality of segmented sequence fragments, wherein the S sequence units contain K sequence fragments in total, the length of each sequence fragment is n, and n, S and K are all integers greater than or equal to 2.
  • the base sequence is segmented into S sequence units, and each sequence unit includes a plurality of segmented sequence fragments, thereby reducing the sequence length and allowing subsequent steps to store the sequence units separately.
  • the length referred to in the embodiments of the present application refers to the base length, which can be understood as the number of bases.
  • each sequence unit obtained after base sequence segmentation includes multiple sequence fragments with a length of n, therefore, the length of the sequence unit is an integer multiple of the sequence fragments.
  • the sequence units formed after segmentation have the same length, that is, the base sequence is divided into S sequence units of the same length, and the S sequence units of the same length contain the same number of sequence fragments of length n.
  • the lengths of the sequence units formed after segmentation are different.
  • the base sequence is sequentially segmented according to the preset sequence unit length to obtain S-1 sequence units, and the length of the remaining base sequence is less than the preset length. In this case, the remaining base sequence is regarded as a sequence unit, and its length is less than the length of the other sequence units.
  • the base sequence can be segmented according to other preset segmentation rules to obtain S sequence units with different sequence unit lengths.
  • the base sequence is segmented to obtain S sequence units, and each sequence unit includes a plurality of segmented sequence fragments, including:
  • when the base sequence is divided into multiple sequence units, starting from one end of the base sequence, the base sequence is divided at intervals of the preset sequence unit length to obtain S sequence units, wherein the preset length is an integer multiple of n.
  • the base sequence can also be segmented starting from other positions in the base sequence. For example, starting from the middle position of the base sequence, the base sequence is sequentially divided towards both ends at the same time, to obtain sequence units whose length is an integer multiple of n.
  • dividing each sequence unit into multiple sequence fragments with a length of n includes: starting from one end of the sequence unit, dividing the sequence unit every length n to obtain multiple sequence fragments.
  • the sequence unit can also be segmented starting from other positions of the sequence unit. For example, starting from the middle position of the sequence unit, the sequence unit is sequentially segmented towards both ends at the same time to obtain sequence fragments with a length of n.
  • exemplarily, for the base sequence in S101, according to the standard that the length of a sequence unit is 60 bases, the base sequence is sequentially segmented from one end of the base sequence, and three sequence units with a length of 60 bases and one sequence unit with a length of 12 bases are obtained.
  • each sequence unit of 60 bases is then divided into 15 sequence fragments, and the sequence unit of 12 bases is divided into 3 sequence fragments.
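  • as a non-limiting illustration, the two-level segmentation (sequence units first, then sequence fragments) can be sketched as follows, segmenting from the left end only. The function names `chunks` and `segment` are assumptions made for this sketch, and the 192-base input is a placeholder rather than the sequence of S101.

```python
def chunks(sequence, size):
    """Cut a sequence into consecutive pieces of `size`, left to right."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def segment(base_sequence, unit_length, n):
    """Split into sequence units, then split each unit into fragments.

    The last unit may be shorter than unit_length when the sequence
    length is not an exact multiple of it.
    """
    if unit_length % n:
        raise ValueError("unit length must be an integer multiple of n")
    return [chunks(unit, n) for unit in chunks(base_sequence, unit_length)]


# Placeholder sequence of 192 bases, unit length 60, fragment length 4.
units = segment("ATCG" * 48, unit_length=60, n=4)
fragment_counts = [len(unit) for unit in units]
```

  • for a 192-base input this yields three units of 15 fragments and one unit of 3 fragments, the same shape as the example above.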
  • the base sequence is segmented to obtain S sequence units, and each sequence unit includes multiple sequence fragments with a length of n, including:
  • sequence fragments are combined according to a preset combination rule to obtain S sequence units.
  • when the base sequence is divided into multiple sequence fragments with a length of n, starting from one end of the base sequence, the base sequence is divided every length n to obtain multiple sequence fragments.
  • the preset combination rule refers to the rule for assigning multiple sequence fragments to a sequence unit, including the positions in the base sequence of the sequence fragments assigned to a sequence unit, the number of sequence fragments, and the arrangement order of the sequence fragments when they are combined into a sequence unit.
  • the lengths of the combined sequence units can be the same or different, but they are all integer multiples of n. Exemplarily, according to the order of the sequence fragments in the base sequence, 20 sequence fragments of 6 bases each are sequentially combined to form a sequence unit.
  • exemplarily, the length of the sequence fragment is 4, and the sequence unit includes 15 sequence fragments.
  • the base sequence of step S101, "ATCT TCTG TTCC ATAA TAAG TGAG ATCC TACA TAGT ATCG TATG TGAC ATCC TGCT TGAC ATCT TCTG TTAA ATCT TGGA TAGA ATTG TAGC TTGC ATCG TATC TGTA ATCC TTCG TCCT ATCA TCTT TGCG ATTC TGGT TTGA ATCC TGAA TTTT ATTG TCAC TACG ATTG TCAC TACT ATGA TGGG TGGT", is segmented to obtain 48 sequence fragments, namely: ATCT, TCTG, TTCC, ATAA, TAAG, TGAG, ATCC, TACA, TAGT, ATCG, TATG, TGAC, ATCC, TGCT, TGAC, ATCT, TCTG, TTAA, ATCT, TGGA, TAGA, ATTG, TAGC, TTGC, ATCG, TATC, ...
  • the first sequence unit includes the following 15 sequence fragments: ATCT, TCTG, TTCC, ATAA, TAAG, TGAG, ATCC, TACA, TAGT, ATCG, TATG, TGAC, ATCC, TGCT, TGAC;
  • the second sequence unit includes the following 15 sequence fragments: ATCT, TCTG, TTAA, ATCT, TGGA, TAGA, ATTG, TAGC, TTGC, ATCG, TATC, TGTA, ATCC, TTCG, TCCT;
  • the third sequence unit includes the following 15 sequence fragments: ATCA, TCTT, TGCG, ATTC, TGGT, TTGA, ATCC, TGAA, TTTT, ATTG, TCAC, TACG, ATTG, TCAC, TACT;
  • the fourth sequence unit includes the following three sequence fragments: ATGA, TGGG, TGGT.
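  • as a non-limiting illustration, the fragment-first variant (cut into fragments of length n, then combine consecutive fragments into sequence units) can be sketched as follows. The function names are assumptions made for this sketch; `SEQUENCE` is the concatenation of the four sequence units listed above, and "15 consecutive fragments per unit" is the combination rule used in this example only.

```python
def split_fragments(base_sequence, n):
    """Cut the base sequence into fragments of length n from the left."""
    return [base_sequence[i:i + n] for i in range(0, len(base_sequence), n)]


def combine(fragments, per_unit):
    """Combine consecutive fragments, in order, into sequence units."""
    return [fragments[i:i + per_unit]
            for i in range(0, len(fragments), per_unit)]


SEQUENCE = (
    "ATCT TCTG TTCC ATAA TAAG TGAG ATCC TACA TAGT ATCG TATG TGAC ATCC TGCT TGAC "
    "ATCT TCTG TTAA ATCT TGGA TAGA ATTG TAGC TTGC ATCG TATC TGTA ATCC TTCG TCCT "
    "ATCA TCTT TGCG ATTC TGGT TTGA ATCC TGAA TTTT ATTG TCAC TACG ATTG TCAC TACT "
    "ATGA TGGG TGGT"
).replace(" ", "")

fragments = split_fragments(SEQUENCE, n=4)
units = combine(fragments, per_unit=15)
```

  • other combination rules, such as units of different sizes or a different fragment order, would only change the `combine` step.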
  • S sequence units are obtained, and the S sequence units contain K sequence fragments with a length of n in total.
  • the K marker sequence fragments are used to synthesize the K first DNA molecules storing the target data.
  • the preset index information is used to mark the sequence unit, so as to facilitate the decoding of subsequent DNA storage data.
  • Using the preset index information to mark the sequence units includes: using the first search sequence to mark the arrangement order of the S sequence units in the base sequence, and using the second search sequence to mark the sequence units belonging to the same sequence unit The order in which the multiple sequence fragments are arranged in the sequence unit is marked to obtain K marked sequence fragments and S marked sequence units. In this case, the order of the S sequence units in the base sequence is recorded, and at the same time, the order of a plurality of sequence fragments belonging to the same sequence unit is also recorded.
  • the second search sequence is a sequence formed of bases, and the base sequence of the second search sequence can be determined according to preset rules.
  • exemplarily, a single base can be used to represent the second search sequence: sequence number 1 corresponds to the single base A, sequence number 2 to the single base C, sequence number 3 to the single base G, and sequence number 4 to the single base T.
  • the correspondence between the sequence number and the base is not limited to this method.
  • exemplarily, two bases may be used to represent the second search sequence: sequence number 1 corresponds to the two bases AA, 2 to AC, 3 to AG, 4 to AT, 5 to CA, 6 to CC, 7 to CG, 8 to CT, 9 to GA, 10 to GC, 11 to GG, 12 to GT, 13 to TA, 14 to TC, 15 to TG, and 16 to TT.
  • the number of bases in the second search sequence is not limited to 2.
  • as the number of sequence fragments in a sequence unit increases, the number of bases in the second search sequence increases correspondingly. For example, when the number of sequence fragments in the sequence unit is less than or equal to 64, three bases can be used to represent the second search sequence, and so on, following the rule that 4 raised to the power of the number of bases in the second search sequence is greater than or equal to the number of sequence fragments in the sequence unit.
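  • as a non-limiting illustration, the rule that 4 raised to the number of index bases must be at least the number of sequence fragments, together with the example correspondences above (1 to AA, 2 to AC, ..., 16 to TT), can be sketched as a base-4 numbering with the digit order A, C, G, T. The function names are assumptions made for this sketch.

```python
BASES = "ACGT"  # digit order implied by the examples: A=0, C=1, G=2, T=3


def min_index_length(fragment_count):
    """Smallest q such that 4**q >= fragment_count."""
    q = 1
    while 4 ** q < fragment_count:
        q += 1
    return q


def index_code(sequence_number, q):
    """Write (sequence_number - 1) as a q-digit base-4 number in bases."""
    value = sequence_number - 1
    if not 0 <= value < 4 ** q:
        raise ValueError("sequence number does not fit in q bases")
    code = ""
    for _ in range(q):
        value, digit = divmod(value, 4)
        code = BASES[digit] + code
    return code


q = min_index_length(15)                      # 15 fragments per unit
codes = [index_code(k, q) for k in range(1, 16)]
```

  • for the 15-fragment sequence units of the earlier example, two index bases are sufficient and the codes run from AA to TG.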
  • the second search sequence when used to mark the sequence fragment, can be spliced at a specific position of the sequence fragment.
  • using the second search sequence to mark multiple sequence fragments belonging to the same sequence unit includes: splicing the second search sequence on either side of the sequence fragment.
  • the second search sequence is spliced at the start end, that is, the left end of the sequence fragment; or, the second search sequence is spliced at the end end, that is, the right end of the sequence fragment.
  • the method of using the second search sequence to mark multiple sequence fragments belonging to the same sequence unit includes: simultaneously splicing base groups on both sides of the sequence fragment, and the base groups on both sides form the second search sequence .
  • K sequence fragments are marked by the second search sequence to form K marked sequence fragments, which are also called information sequence fragments.
  • exemplarily, when AA is marked before the sequence fragment ATGC, the marked sequence fragment AAATGC is formed.
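  • as a non-limiting illustration, the three splicing options described above (left end, right end, or both sides) can be sketched as follows. The function name `mark_fragment` and the `position` parameter are assumptions made for this sketch; in particular, splitting the code into two halves for the two-sided case is only one possible way of letting the base groups on both sides form the second search sequence.

```python
def mark_fragment(fragment, code, position="left"):
    """Splice a second search sequence onto a sequence fragment."""
    if position == "left":
        return code + fragment
    if position == "right":
        return fragment + code
    if position == "both":
        # Assumed convention: first half on the left, rest on the right.
        half = len(code) // 2
        return code[:half] + fragment + code[half:]
    raise ValueError("position must be 'left', 'right' or 'both'")


marked = mark_fragment("ATGC", "AA")
```

  • whichever position is chosen, the decoder has to use the same convention in order to strip the index bases again.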
  • the position of the S sequence units in the base sequence is marked by using the first search sequence.
  • the first search sequence includes i DNA sequence fragments, where i is an integer greater than or equal to 1, that is, the first search sequence may be one DNA sequence fragment or multiple DNA sequence fragments.
  • each DNA sequence fragment includes a first base sequence used as an index mark and a second base sequence used to indicate the sequence unit number.
  • the first base sequence can be placed in a specific position of the first search sequence according to preset settings.
  • exemplarily, the first base sequence is located at the initial segment (left end) of the first search sequence; exemplarily, the first base sequence is located at the termination segment (right end) of the first search sequence; exemplarily, the first base sequence is located at a specific position in the first search sequence, such as the third and fourth bases of the first search sequence, but it is not limited thereto.
  • the first base sequence can be set in advance.
  • exemplarily, TT is placed as the first base sequence at the initial segment of the first search sequence and is used as an index mark, indicating that a DNA sequence fragment starting with TT in the sequence unit is the first search sequence.
  • the first base sequence is different from the initial base sequence of the second search sequence, so as to avoid misidentifying a sequence fragment as the first search sequence during the identification process.
  • the second base sequence can also be preset.
  • the sequence number 1 corresponds to the four-base AAAA
  • the sequence number 2 corresponds to the four-base AAAC
  • the sequence number 3 corresponds to the four-base AAAG
  • the sequence number 4 corresponds to the four-base AAAT.
  • the correspondence between the sequence number and the base is not limited to this
  • the number of bases in the reference base set is not limited to four.
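The example numbering above (1 to AAAA, 2 to AAAC, 3 to AAAG, 4 to AAAT) is consistent with counting in base 4 using the digit order A, C, G, T. The sketch below extends that pattern to other numbers, which is an assumption: the text only gives the first four values and states that the correspondence is not limited to them. The function names and the default `TT` index mark (taken from the earlier example) are illustrative.

```python
BASES = "ACGT"  # assumed digit order A=0, C=1, G=2, T=3, inferred from the example


def unit_number_to_bases(number: int, width: int = 4) -> str:
    """Encode a 1-based sequence unit number as a fixed-width base string."""
    if not 1 <= number <= len(BASES) ** width:
        raise ValueError("unit number does not fit in the given width")
    value = number - 1
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, len(BASES))
        digits.append(BASES[remainder])
    return "".join(reversed(digits))


def first_search_fragment(number: int, index_mark: str = "TT") -> str:
    """Index mark (first base sequence) followed by the unit number (second base sequence)."""
    return index_mark + unit_number_to_bases(number)
```

Under these assumptions, unit 1 is labelled `TTAAAA` and unit 4 is labelled `TTAAAT`.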
  • the sequence unit is marked by the first search sequence
  • each sequence fragment in the sequence unit is marked by the second search sequence
  • the marked sequence unit is obtained, which is also called the information sequence unit. Referring to Fig. 4, after the sequence units of step S102 (shown on the left side of the arrow in Fig. 4) are marked with the first search sequence, and each sequence fragment in each sequence unit is marked with the second search sequence, four marked sequence units, each containing multiple information sequence fragments, are obtained (shown on the right side of the arrow in Fig. 4).
  • the DNA sequence fragments corresponding to the first search sequence and the second search sequence are obtained using DNA synthesis technology.
  • DNA synthesis techniques include, but are not limited to, enzymatic synthesis, phosphoramidite synthesis, and the like.
  • the DNA sequence fragments corresponding to the first search sequence and the second search sequence can be amplified from a pre-synthesized universal DNA molecular library, for example by PCR.
  • the storage method also includes:
  • the K marker sequence fragments are respectively synthesized into K first DNA molecules storing the target data, which is realized by the DNA synthesis module in the device shown in FIG. 1 .
  • the K marker sequence fragments can be synthesized respectively through the existing synthesis technology to obtain K first DNA molecules.
  • a schematic diagram of DNA storage after the K first DNA molecules are stored in S different first physical spaces is shown in FIG. 5, wherein each small square represents a first physical space.
  • the synthesized K first DNA molecules are respectively stored in different first physical spaces, so as to realize the separate storage of each information sequence unit.
  • the S first physical spaces are integrated into one DNA hard disk to realize the storage of the base sequence storing the target data.
  • the K first DNA molecules in the S first physical spaces form a complete whole to be preserved, and it is not easy to omit and lose information during the preservation process, which is conducive to improving the integrity of data preservation.
  • the DNA base can be completely recovered and the integrity of the data can be maintained.
  • the storage method further includes: storing the second DNA molecule in a second physical space corresponding to the first physical space, where index information is stored in the second DNA molecule.
  • the second DNA molecule storing the index information is stored in a second physical space different from the first physical space to realize preservation of the index information.
  • the base sequence corresponding to the binary sequence of the target data is divided into S sequence units, each sequence unit including multiple segmented sequence fragments; the S sequence units contain K sequence fragments in total, the length of each sequence fragment is n, and n, S and K are all integers greater than or equal to 2. The preset index information marks the position information of the sequence fragments and sequence units, and the marked sequence fragments are synthesized into DNA molecules and stored separately.
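The two-level segmentation summarized above can be sketched as follows. The parameter `fragments_per_unit` is an assumption made for illustration; the text only requires that each sequence unit contain multiple fragments of length n. The input sequence is an arbitrary example whose length divides evenly.

```python
def split_into_units(base_sequence: str, n: int, fragments_per_unit: int) -> list[list[str]]:
    """Cut a base sequence into fragments of length n, then group them into sequence units."""
    fragments = [base_sequence[i:i + n] for i in range(0, len(base_sequence), n)]
    return [
        fragments[i:i + fragments_per_unit]
        for i in range(0, len(fragments), fragments_per_unit)
    ]


units = split_into_units("ATCTTCTGTTCCAAGG", n=4, fragments_per_unit=2)
S = len(units)                         # number of sequence units
K = sum(len(unit) for unit in units)   # total number of sequence fragments
```

Here the 16-base sequence yields S = 2 sequence units and K = 4 fragments of length n = 4.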
  • the length of the first retrieval sequence differs from the length of the marker sequence fragment, so the first retrieval sequence and the marker sequence fragments of a sequence unit can be distinguished by their length difference.
  • m represents the base number of the marker sequence fragment
  • q represents the base number of the second retrieval sequence
  • i represents the number of DNA sequence fragments in the first retrieval sequence
  • p represents the number of DNA sequence fragments in the first retrieval sequence
  • the length of the first search sequence is the same as the length of the marker sequence fragment (the sequence fragment with the second search sequence); the first base sequence used as an index mark in the first search sequence can be a part of the second search sequence and have the same number of bases as the second search sequence.
  • even if the sequence fragment contains only 8 bases, data storage far exceeding the current storage capacity can be realized.
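The two ways of telling a first search sequence apart from a marker sequence fragment (by length when the lengths differ, by index mark when they are equal) can be combined in one check. This is a hypothetical sketch: the function name, the return labels and the default `TT` index mark are illustrative, and it assumes that no second search sequence begins with the index mark.

```python
def classify_read(read: str, marker_length: int, index_mark: str = "TT") -> str:
    """Label a sequenced read as a first search sequence or a marker sequence fragment."""
    if len(read) != marker_length:
        # Lengths differ: the read cannot be a marker sequence fragment.
        return "first_search"
    if read.startswith(index_mark):
        # Same length: fall back to the index mark at the initial segment.
        return "first_search"
    return "marker_fragment"
```

For example, with 8-base marker fragments, the 6-base read `TTAAAA` is classified by length and the 8-base read `TTAAAAAA` by its index mark.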
  • the method can successfully realize data recovery from sequence fragment to sequence unit, and from sequence unit to base sequence.
  • the first DNA molecules are stored separately, and amplified and stored according to actual needs.
  • the backup information of the first DNA molecule can be extracted according to actual needs, thereby avoiding the trouble of re-synthesis each time and greatly reducing storage costs.
  • the decoding method of the K first DNA molecules includes:
  • the method of sequencing the K first DNA molecules includes any method that can read DNA products, such as second-generation sequencing, third-generation sequencing, etc., to obtain multiple information sequence units to be decoded.
  • K first DNA molecules are sequenced to obtain K marker sequence fragments respectively.
  • the sequencing further includes: sequencing the second DNA molecule to obtain the index information of the first retrieval sequence and the second retrieval sequence.
  • the above step S201 can be implemented by the DNA sequencing module in the storage device shown in FIG. 1 .
  • S202: splicing the sequence fragments corresponding to the marker sequence fragments belonging to the same marker sequence unit according to the second retrieval sequence to obtain S sequence units; splicing the obtained S sequence units according to the first retrieval sequence to obtain the base sequence.
  • the K marker sequence fragments belonging to the S sequence units are spliced into the base sequence using the retrieval information.
  • the position information of the marker sequence unit in the base sequence is obtained according to the first search sequence; the position information or numbering information of multiple marker sequence fragments belonging to the same sequence unit is obtained according to the second search sequence.
  • the multiple sequence fragments are spliced into a sequence unit; according to the obtained position information or numbering information of the sequence units, S sequence units are spliced into a base sequence.
  • the sequence fragments corresponding to each marker sequence fragment belonging to the same marker sequence unit are spliced to obtain S sequence units.
  • the sequence fragment corresponding to each marker sequence fragment, and the position of the sequence fragment in the sequence unit; according to the position of the sequence fragment in the sequence unit, the sequence fragments are spliced into a sequence unit.
  • the obtained S sequence units are spliced according to the first search sequence to obtain a base sequence, including:
  • the S sequence units are spliced into a base sequence.
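The two-stage splicing of step S202 can be sketched as below. Several details are assumptions for illustration rather than the application's scheme: each physical space is modelled as a pair of (first search sequence, list of marked fragments), the second search sequence is a left-end tag of q bases, and both tags encode positions as base-4 numbers with the digit order A, C, G, T.

```python
BASES = "ACGT"  # assumed digit order A=0, C=1, G=2, T=3


def bases_to_number(tag: str) -> int:
    """Read a base string as a base-4 number (0-based position)."""
    value = 0
    for base in tag:
        value = value * len(BASES) + BASES.index(base)
    return value


def reassemble_unit(marked_fragments: list[str], q: int) -> str:
    """Order the fragments of one unit by their q-base tag, strip the tags and splice."""
    ordered = sorted(marked_fragments, key=lambda frag: bases_to_number(frag[:q]))
    return "".join(frag[q:] for frag in ordered)


def reassemble_sequence(spaces: list[tuple[str, list[str]]], q: int, index_mark: str = "TT") -> str:
    """Splice units in the order given by the unit number in each first search sequence."""
    units = []
    for first_search, marked_fragments in spaces:
        if not first_search.startswith(index_mark):
            raise ValueError("first search sequence lacks the index mark")
        unit_position = bases_to_number(first_search[len(index_mark):])
        units.append((unit_position, reassemble_unit(marked_fragments, q)))
    return "".join(unit for _, unit in sorted(units))


# Two physical spaces, deliberately given out of order.
spaces = [
    ("TTAAAC", ["ACAAGG", "AATTCC"]),
    ("TTAAAA", ["ACTCTG", "AAATCT"]),
]
recovered = reassemble_sequence(spaces, q=2)
```

In this example the fragments and units are supplied shuffled, and the tags alone restore the order, giving `ATCTTCTGTTCCAAGG`.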
  • the base sequence can be converted into a binary sequence by a preset mapping relationship that matches the data written in S101. For example, according to the mapping A→11, T→10, C→01, G→00, the base sequence of S102 is converted into the binary sequence: “11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10100100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 1010101010 11101000 10011101 10110100 11101000 10011101 10110100 11101000 10011101 10110100 11101000 10011101 10110100 11101000 10011101 10110100 11101000 10011101 10110100
  • a sequence of computer information is then generated from the binary sequence.
  • the binary sequence can be converted into corresponding data files, including files such as pictures, text, programs, audio, and video.
  • a computer program is used to recover the binary sequence obtained in the above step S203 into the text information of "Chun, no longer the unimaginable butterfly.”
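The conversion from bases to bits to text can be sketched with the mapping given above (A→11, T→10, C→01, G→00). The text encoding is an assumption: the application does not name it, but the first three bytes of the example binary sequence (11100110 10011000 10100101) form a valid UTF-8 character, so UTF-8 is used here. The 12-base input is simply those 24 bits mapped back to bases; the function names are illustrative.

```python
BASE_TO_BITS = {"A": "11", "T": "10", "C": "01", "G": "00"}  # mapping stated in the text


def bases_to_bits(base_sequence: str) -> str:
    """Convert a base sequence into its binary string, two bits per base."""
    return "".join(BASE_TO_BITS[base] for base in base_sequence)


def bits_to_text(bits: str, encoding: str = "utf-8") -> str:
    """Group bits into bytes and decode them (the UTF-8 default is an assumption)."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)


bits = bases_to_bits("ATCTTCTGTTCC")
text = bits_to_text(bits)
```

These 12 bases decode to the single character 春, which matches the opening word of the recovered text.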
  • the embodiment of the present application provides a DNA data storage device, including a data processing module configured to: obtain the base sequence corresponding to the binary sequence of the target data; segment the base sequence to obtain S sequence units, each sequence unit including a plurality of segmented sequence fragments, wherein the S sequence units contain K sequence fragments in total, the length of the sequence fragments is n, and n, S and K are all integers greater than or equal to 2; and use preset index information to mark the K sequence fragments and the S sequence units to obtain K marker sequence fragments and S marker sequence units.
  • the index information includes a first search sequence used to indicate the arrangement order of the S sequence units in the base sequence, and a second search sequence used to indicate the arrangement order, within the sequence unit, of multiple sequence fragments belonging to the same sequence unit; the K marker sequence fragments are used to synthesize K first DNA molecules storing the target data.
  • the data processing module is further configured to splice the sequence fragments corresponding to the marker sequence fragments belonging to the same marker sequence unit according to the second search sequence to obtain the sequence units; splice the obtained S sequence units according to the first search sequence to obtain the base sequence; and convert the base sequence into the target data.
  • the storage device further includes a DNA synthesis module for synthesizing the K marker sequence fragments into K first DNA molecules storing target data.
  • the storage device further includes a DNA molecule storage module, configured to store the K first DNA molecules in S first physical spaces, wherein the first DNA molecules corresponding to the marker sequence fragments belonging to the same sequence unit are stored in the same first physical space, and the first DNA molecules corresponding to marker sequence fragments that do not belong to the same sequence unit are stored in different first physical spaces.
  • the DNA molecule storage module is also used to store a second DNA molecule in a second physical space.
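The placement rule of the DNA molecule storage module (same sequence unit, same first physical space) amounts to grouping by unit. The sketch below models a physical space as a list keyed by unit number; this data model and the function name are assumptions for illustration, and the separate second physical space for the index information is not modelled.

```python
from collections import defaultdict


def assign_to_spaces(marked_fragments: list[tuple[int, str]]) -> dict[int, list[str]]:
    """Group (unit_number, marked_fragment) pairs so each unit gets its own space."""
    spaces: dict[int, list[str]] = defaultdict(list)
    for unit_number, fragment in marked_fragments:
        spaces[unit_number].append(fragment)
    return dict(spaces)


layout = assign_to_spaces([
    (1, "AAATCT"), (2, "AATTCC"), (1, "ACTCTG"), (2, "ACAAGG"),
])
```

With K = 4 marked fragments from S = 2 units, `layout` holds two spaces of two molecules each.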
  • the storage device further includes a DNA molecule sequencing module, which sequences the plurality of first DNA molecules stored in each first physical space to obtain a plurality of marker sequence fragments.
  • the device containing the above modules in FIG. 1 can use DNA for data storage and writing, which corresponds to the method for DNA data storage and writing shown in FIG. 2 .
  • the embodiment of the present application also provides a DNA data storage device.
  • the terminal device 70 provided in this embodiment includes: a processor 710 , a memory 720 , and a computer program 721 stored in the memory 720 and operable on the processor 710 .
  • when the processor 710 executes the computer program 721, the steps in the above-mentioned embodiments of the method for storing DNA data are implemented, such as steps S101 to S103 shown in FIG. 2.
  • the computer program 721 can be divided into one or more modules/units, and one or more modules/units are stored in the memory 720 and executed by the processor 710 to complete the present application.
  • One or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments may be used to describe the execution process of the computer program 721 in the terminal device.
  • the computer program 721 may be divided into data processing modules.
  • the computer program 721 can also be divided into a data processing module, a DNA molecule synthesis module, a DNA molecule storage module and a DNA molecule sequencing module, and the specific functions of each module are as described above. In order to save space, details are omitted here.
  • FIG. 7 is only an example of the terminal device 70 and does not constitute a limitation on the terminal device 70, which may include more or fewer components than shown in the figure, combine certain components, or use different components.
  • the processor 710 can be a central processing unit (Central Processing Unit, CPU), and can also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory 720 may be an internal storage unit of the terminal device 70, such as a hard disk or memory of the terminal device 70.
  • the memory 720 can also be an external storage device of the terminal device 70, such as a plug-in hard disk equipped on the terminal device 70, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash memory card (Flash Card) and so on. Further, the memory 720 may also include both an internal storage unit of the terminal device 70 and an external storage device.
  • the memory 720 is used to store a computer program 721 and other programs and data required by the terminal device 70 .
  • the memory 720 can also be used to temporarily store data that has been output or will be output.
  • An embodiment of the present application also provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the processing methods of the foregoing embodiments are implemented.
  • the embodiment of the present application also provides a computer program product, which, when the computer program product is run on the terminal device, enables the terminal device to execute the methods for storing DNA data in the foregoing embodiments.
  • An embodiment of the present application provides a DNA hard disk. Referring to FIG. 5 , it includes multiple physical spaces made of physical materials, and each physical space is used to store DNA molecules.
  • the physical space made of physical materials wraps DNA molecules, and the DNA molecules are isolated by physical materials.
  • the shape of the physical space is not strictly limited, and may be circular, square, or any other shape.
  • the DNA molecules stored in the physical space include the above-mentioned first DNA molecule, and in this case, the physical space is the first physical space.
  • the first DNA molecule includes the second search sequence and sequence fragments.
  • Each first physical space also stores the above-mentioned second retrieval sequence.
  • the DNA molecules stored in the physical space include the above-mentioned second DNA molecules, and in this case, the physical space is the second physical space.
  • the physical material is at least one selected from SiO2, metal oxides, and polymer materials.
  • the high molecular polymer material includes resin, but is not limited to resin.

Abstract

The present invention relates to a DNA data storage method, comprising: acquiring a base sequence corresponding to a binary sequence of target data; segmenting the base sequence to obtain S sequence units, each sequence unit comprising a plurality of segmented sequence fragments, the S sequence units containing K sequence fragments in total, and the length of the sequence fragments being n; and using preset index information to mark the K sequence fragments and the S sequence units to obtain K marked sequence fragments and S marked sequence units, the index information comprising a first search sequence used to indicate the arrangement order of the S sequence units in the base sequence, and a second search sequence used to indicate the arrangement order, within the sequence unit, of multiple sequence fragments belonging to the same sequence unit. The method described in the present invention enables large-scale storage of data information in DNA.
PCT/CN2021/112465 2021-08-13 2021-08-13 DNA data storage method and apparatus, device, and readable storage medium WO2023015550A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/112465 WO2023015550A1 (fr) 2021-08-13 2021-08-13 DNA data storage method and apparatus, device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2023015550A1 true WO2023015550A1 (fr) 2023-02-16

Family

ID=85199755

Country Status (1)

Country Link
WO (1) WO2023015550A1 (fr)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21953165

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21953165

Country of ref document: EP

Kind code of ref document: A1