WO2022109879A1 - Encoding and decoding method and encoding and decoding device from binary information to base sequences for DNA data storage - Google Patents

Encoding and decoding method and encoding and decoding device from binary information to base sequences for DNA data storage

Info

Publication number
WO2022109879A1
Authority
WO
WIPO (PCT)
Prior art keywords
binary
base
encoding
data
base sequence
Prior art date
Application number
PCT/CN2020/131511
Inventor
黄小罗
戴俊彪
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Priority to PCT/CN2020/131511
Publication of WO2022109879A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures

Definitions

  • The present application relates to the technical field of data storage, and in particular to an encoding and decoding method and an encoding and decoding device from binary information to base sequences for DNA data storage.
  • DNA: deoxyribonucleic acid.
  • DNA molecules contain four bases: adenine (A), cytosine (C), guanine (G) and thymine (T).
  • DNA-based data storage technology uses sequences of these four bases, A/C/G/T, to represent a data sequence composed of binary "0" and "1". This process is called encoding.
  • The reverse process of converting the base sequence back into binary data is called decoding.
  • After encoding, a single-stranded DNA molecule is generated by DNA synthesis technology for preservation.
  • The base sequence of the generated single-stranded DNA molecule can then be read by DNA sequencing technology, and the final binary data stream is obtained through the decoding mechanism.
  • Existing coding methods are limited in variety and offer few selectable coding rules, which leads to uneven GC content, single-base repeats and similar problems in the resulting base sequences, and these in turn hinder the synthesis and sequencing of the base sequences.
  • One purpose of the embodiments of this application is to provide an encoding and decoding method and an encoding and decoding device from binary information to base sequences for DNA data storage, aiming to solve the problem that existing binary data processing methods are limited in variety and offer few selectable coding rules, which results in uneven GC content and single-base repeats in the obtained base sequences and thus hinders their synthesis and sequencing.
  • The application provides a method for encoding and decoding binary information to base sequences for DNA data storage, the method comprising:
  • using a reference base set to represent a reference binary unit, and constructing a mapping coding rule library based on the reference base set-reference binary unit, wherein the number of bases in the reference base set is M, the number of bits in the binary unit is 2M, and M is an integer greater than or equal to 2;
  • acquiring binary data to be encoded, wherein the binary data includes a plurality of binary units;
  • encoding the plurality of binary units using N different mapping coding rules to obtain the base sequence corresponding to the binary data, where N is an integer greater than or equal to 2, and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping coding rules are selected from the rules in the reference base set-reference binary unit mapping coding rule library.
  • In some embodiments, two binary units that are separated by N-1 binary units among the plurality of binary units are encoded using the same mapping coding rule among the N different mapping coding rules.
  • In some embodiments, N = 2, and the N different mapping coding rules include a first mapping coding rule and a second mapping coding rule; encoding the plurality of binary units using the N different mapping coding rules to obtain the base sequence corresponding to the binary data includes:
  • using the first mapping coding rule to encode the binary units at odd positions in the binary data, and using the second mapping coding rule to encode the binary units at even positions in the binary data, to obtain the base sequence corresponding to the binary data.
  • the same reference base group included in the N different mapping coding rules corresponds to N different reference binary units.
  • In the base sequence, the number of single base repeats is less than 6, and G and C together account for 40%-60% of the bases.
  • the base sequence includes J subsequences of bases.
  • the base subsequence is provided with an index marker for marking the position of the base subsequence in the base sequence.
  • the base subsequences are provided with error correction codes and/or linker sequences.
  • The method further includes:
  • interpreting the base sequence according to the mapping coding rules to obtain the binary data, including:
  • the base sequence includes multiple base sets, and each base set corresponds to a binary unit;
  • decoding the base sets using the N different mapping coding rules to obtain the binary data.
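
A minimal sketch of this decoding step, assuming M = 2 and assuming the decoder knows the N rules and applies them cyclically by position (one of the arrangements described later in this document). The `shifted_rule` family is a hypothetical stand-in for rules drawn from the rule library; none of these names come from the source.

```python
import itertools

# The 16 two-base groups and the 16 four-bit units (M = 2).
GROUPS = ["".join(p) for p in itertools.product("ATCG", repeat=2)]
UNITS = [format(i, "04b") for i in range(16)]

def shifted_rule(shift):
    """Hypothetical rule: the i-th unit maps to the (i + shift)-th group."""
    return {u: GROUPS[(i + shift) % 16] for i, u in enumerate(UNITS)}

def decode(sequence, rules, m=2):
    """Decode base groups in order, cycling through the N rules."""
    if len(sequence) % m:
        raise ValueError("sequence length must be a multiple of m")
    # Invert each rule so that it maps base group -> binary unit.
    inverses = [{g: u for u, g in rule.items()} for rule in rules]
    groups = [sequence[i:i + m] for i in range(0, len(sequence), m)]
    return "".join(inverses[i % len(rules)][g] for i, g in enumerate(groups))

rules = [shifted_rule(0), shifted_rule(5)]
print(decode("GCCGCTGT", rules))  # -> 1110011010011000
```

Because each rule is a one-to-one mapping, inverting it is unambiguous; the only extra information the decoder needs is which rule was used at which position.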
  • When the base sequence includes J base subsequences, obtaining the base sequence from the synthesized DNA by sequencing includes:
  • splicing the J base subsequences into the base sequence.
  • When the base subsequences are provided with index markers, splicing the J base subsequences into the base sequence includes:
  • splicing the J base subsequences into the base sequence according to the index markers.
  • The present application provides an encoding and decoding device from binary data to base sequences for DNA data storage, the encoding and decoding device comprising:
  • a construction module, used to represent the reference binary unit by the reference base set and to construct a mapping coding rule library based on the reference base set-reference binary unit, wherein the number of bases in the reference base set is M, the number of bits of the binary unit is 2M, and M is an integer greater than or equal to 2;
  • an acquisition module, used to acquire binary data to be encoded, wherein the binary data includes a plurality of binary units;
  • an encoding module, used to encode the plurality of binary units using N different mapping coding rules to obtain the base sequence corresponding to the binary data, where N is an integer greater than or equal to 2, and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping coding rules are selected from the rules in the reference base set-reference binary unit mapping coding rule library.
  • The encoding and decoding device further includes:
  • a sequencing module, used to obtain the base sequence from the synthesized DNA by sequencing;
  • a decoding module, used to decode the base sequence according to the N different mapping coding rules to obtain the binary data.
  • The present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of the first aspect when executing the computer program.
  • the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, characterized in that, when the computer program is executed by a processor, the method of the first aspect is implemented.
  • The encoding and decoding method from binary information to base sequences for DNA data storage provided by the present application first uses the correspondence between reference base sets and reference binary units to construct a mapping coding rule library based on the reference base set-reference binary unit, and then encodes the binary data using a plurality of different mapping coding rules from that library.
  • Because the number of bases in the reference base set is greater than or equal to 2, the number of mapping coding rules in the resulting library can be very large. This gives the encoding of binary information into base sequences a flexible, selectable rule library and greatly enriches the available encoding methods.
  • At the same time, each coding rule uses the M bases of a reference base set to represent 2M bits of binary information, achieving an information coding efficiency of 2 bits/nt, which is the theoretical coding limit. Representing 2M bits by M bases also makes it possible to encode one or more pieces of binary information into a single base sequence, which further expands the available encoding methods. With these flexible encoding methods, a variety of coding rules or encoding modes can be selected during encoding to prevent single-base repeats and uneven GC content in the base sequence, improving the synthesizability and sequencing convenience of the base sequence. Moreover, encoding binary data with N different mapping coding rules, or with mixed encoding, offers more possibilities for data encryption and secure storage and improves data security.
  • FIG. 1 is a schematic flowchart of encoding binary information to base sequences for DNA data storage provided by an embodiment of the present application;
  • FIG. 2 is a schematic diagram of combination modes for combined encoding of binary data provided by an embodiment of the present application;
  • FIG. 3A is a schematic diagram of the information composition of a base subsequence provided in an embodiment of the present application;
  • FIG. 3B is a schematic diagram of the information composition of another base subsequence provided in an embodiment of the present application;
  • FIG. 4 is a schematic flowchart of the decoding process of the binary-information-to-base-sequence method for DNA data storage provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of an encoding unit of an encoding and decoding device from binary data to base sequences for DNA data storage provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of a decoding unit of an encoding and decoding device from binary data to base sequences for DNA data storage provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • References to "one embodiment" or "some embodiments" and the like in the specification of this application mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the application.
  • Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise.
  • The terms "including", "comprising", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.
  • Information storage technology based on DNA as a medium usually includes the following four steps:
  • Information encoding, that is, the encoding process from the 0/1 binary data of computer-stored data to a DNA base sequence.
  • This process converts the binary data into a base sequence composed of the bases A, T, C and G, which stores the data information.
  • Exemplarily, the method published by George Church et al. in Science in 2012 represents each bit by one of two bases: 0 is represented by A or C, and 1 is represented by G or T. In the base sequence obtained by this method each base carries little information, so the coding density is low, only 1 bit/nt.
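
The 1 bit/nt density of such a scheme can be illustrated with a minimal sketch. The way of choosing between the two allowed bases (take whichever does not repeat the previous base) and the function names are assumptions made for illustration; the source does not describe how the choice is made.

```python
def church_encode(bits):
    """Encode a bit string at 1 bit/nt: 0 -> A or C, 1 -> G or T.

    Assumption: of the two allowed bases, pick the one that does not
    repeat the previously emitted base.
    """
    choices = {"0": "AC", "1": "GT"}
    out = []
    for bit in bits:
        first, second = choices[bit]
        out.append(second if out and out[-1] == first else first)
    return "".join(out)

def church_decode(sequence):
    """Decode: A or C -> 0, G or T -> 1."""
    return "".join("0" if base in "AC" else "1" for base in sequence)

print(church_encode("0011"))  # -> ACGT
```

Four bits become four bases here, whereas a 2 bit/nt scheme would need only two bases for the same data.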
  • DNA synthesis, that is, the DNA base sequence is synthesized to obtain DNA.
  • A high-throughput DNA synthesizer, including in combination with enzymatic splicing technology, is used to synthesize DNA sequences that store the data information.
  • Information decoding, that is, the process of extracting the 0/1 binary data of the stored data from the DNA base sequence.
  • This step applies the coding rules to decode the DNA base sequence into 0/1 binary data.
  • The information encoding and decoding of steps 1 and 4, together with the mapping coding rules they rely on, are the core of the entire storage process and directly determine the difficulty of steps 2 and 3 and the efficiency of DNA storage technology.
  • the current coding method has low coding density and few optional coding rules, resulting in GC inhomogeneity and single nucleotide repetition in the obtained base sequence, thus affecting the synthesis and sequencing of the base sequence.
  • The embodiments of the present application provide an encoding and decoding method and a processing device from binary information to base sequences for DNA data storage. By making the M bases of a reference base set correspond to the 2M bits of a reference binary unit, a coding density of 2 bits/nt is achieved and a flexible, selectable coding rule library is generated, which enriches the ways of encoding binary information into base sequences.
  • Encoding the multiple binary units in the binary data with N different mapping coding rules from the reference base set-reference binary unit mapping coding rule library helps obtain base sequences with suitable GC content and fewer single-base repeats, and increases security in the coding process.
  • The encoding and decoding method from binary information to base sequences for DNA data storage includes an encoding method from binary information to base sequences for DNA data storage and a corresponding decoding method.
  • FIG. 1 is a schematic flowchart of a method for encoding binary information to base sequences for DNA data storage provided by an embodiment of the present application.
  • the method includes at least step S101 and step S103, and the details are as follows:
  • Step S101, using the reference base set to represent the reference binary unit, and constructing a mapping coding rule library based on the reference base set-reference binary unit, wherein the number of bases in the reference base set is M, the number of bits of the binary unit is 2M, and M is an integer greater than or equal to 2.
  • The reference base set-reference binary unit mapping coding rule library in the embodiment of the present application is the set of all mapping coding rules obtained by using reference base sets to represent reference binary units according to the correspondence between the reference base sets and the reference binary units.
  • Each mapping coding rule in the reference base set-reference binary unit mapping coding rule library is a mapping coding rule formed by the reference base set and the reference binary unit according to a preset one-to-one correspondence.
  • The number of bases in the reference base set is M, where M is an integer greater than or equal to 2; that is, a reference base set is a combination of two or more single bases.
  • Exemplarily, the reference base set may consist of two bases such as AT, three bases such as ACT, four bases such as AGCT, or even more bases. It should be understood that the more bases a reference base set contains, the more reference base sets can be formed by arranging and combining the bases, and the more mappings exist between the reference base sets and the reference binary units.
  • The number of bits of the binary unit is twice the number of bases of the reference base set, so that the coding density equals 2 bit/nt, that is, each base corresponds to two bits, which improves the information coding density.
  • Exemplarily, when the reference base set contains two bases, the binary unit has 4 bits; when the reference base set contains three bases, the binary unit has 6 bits. In both cases the coding density is 2 bit/nt.
  • In the following, the mapping coding rules provided by the present application are described using the example in which a 4-bit binary unit is represented by two bases.
  • As shown in Table 1, the four bases A, T, C and G form 16 two-base combinations, that is, 16 reference base sets: AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, GG.
  • The 16 reference base sets correspond to 16 reference binary units.
  • The set of all possible mappings between the 16 reference base sets and the 16 reference binary units constitutes the reference base set-reference binary unit mapping coding rule library of the embodiment of the present application.
  • The mapping between the 16 reference base sets and the 16 reference binary units may be any one-to-one assignment and is not limited to the mapping listed above. The 16 base sets can therefore generate A(16, 16) = 16! permutations, for a total of 20922789888000 mapping coding rules.
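
The size of the rule library and the shape of a single rule can be sketched as follows. This is an illustrative sketch only: the function names, the use of a seeded shuffle to pick a rule, and the dictionary representation are assumptions, not part of the source.

```python
import itertools
import math
import random

BASES = "ATCG"

def reference_groups(m=2):
    """All 4**m ordered base groups of length m (AA, AT, AC, AG, TA, ...)."""
    return ["".join(p) for p in itertools.product(BASES, repeat=m)]

def reference_units(m=2):
    """All binary units of 2*m bits, as strings ('0000', '0001', ...)."""
    return [format(i, f"0{2 * m}b") for i in range(4 ** m)]

def random_rule(m=2, seed=None):
    """One mapping coding rule: a random bijection binary unit -> base group."""
    groups = reference_groups(m)
    random.Random(seed).shuffle(groups)
    return dict(zip(reference_units(m), groups))

def library_size(m=2):
    """Number of distinct one-to-one mappings, A(4**m, 4**m) = (4**m)!."""
    return math.factorial(4 ** m)

rule = random_rule(seed=0)
print(len(rule), library_size(2))  # -> 16 20922789888000
```

The printed count matches the figure of 20922789888000 rules stated above for the two-base case.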
  • When the number of bases in the reference base set is greater than 2 (such as 3, 4 or even more), the number of reference base sets is larger, and there are correspondingly more mapping coding rules between the reference base sets and the reference binary units.
  • Exemplarily, when the reference base set is formed by a combination of 3 bases, the reference binary unit may have 6 bits, but is not limited to this. The four bases A, T, C and G then form 64 reference base sets, which correspond to 64 reference binary units, giving A(64, 64) = 64! permutations as mapping coding rules, a number that far exceeds 2 trillion.
  • By encoding binary data with multiple selectable mapping coding rules, the embodiment of the present application greatly enriches the coding rules from binary data to base sequences, makes full use of the freedom of base sequences in representing binary data, improves the flexibility of encoding binary data into base sequences, and provides a rich set of selectable mapping coding rules for the secure encryption of DNA data storage.
  • Step S102 acquiring binary data to be encoded, the binary data having multiple binary units.
  • the data information to be stored needs to be converted into corresponding binary data, for example, text
  • the corresponding binary data is "11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 1010101010 11101000 10011101 10110100 11101000 10011101 10110100 11101000 10011101 1011011 10000000 10000010".
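
A minimal sketch of converting text into a bit string and back. The source does not name the character encoding, so UTF-8 is an assumption here, as are the function names.

```python
def text_to_bits(text, encoding="utf-8"):
    """Encode text to bytes (UTF-8 assumed) and write each byte as 8 bits."""
    return "".join(format(byte, "08b") for byte in text.encode(encoding))

def bits_to_text(bits, encoding="utf-8"):
    """Inverse of text_to_bits; the bit string length must be a multiple of 8."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode(encoding)

print(text_to_bits("春"))  # -> 111001101001100010100101
```

The output for the single character 春 coincides with the first three bytes of the example bit string above, which is consistent with, but does not prove, a UTF-8 source encoding.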
  • the binary data is divided into a plurality of binary units according to the number of bits of the reference binary unit corresponding to the reference base group in the selected mapping coding rule for encoding the binary data. It should be understood that the division of the binary unit in the binary data is related to the number of bits of the reference binary unit in the corresponding mapping coding rule for encoding the binary unit. Specifically, the number of bits of the binary unit is the same as the number of bits of the reference binary unit in the corresponding mapping coding rule for encoding the binary unit.
  • In one possible implementation, when the reference binary units in all the mapping coding rules used to encode the binary data have the same number of bits, all binary units in the binary data have that same number of bits.
  • In another possible implementation, when the reference binary units in the N different mapping coding rules used to encode the binary data do not all have the same number of bits, binary units encoded with rules whose reference binary units have the same number of bits have the same number of bits, and binary units encoded with rules whose reference binary units have different numbers of bits have different numbers of bits. In yet another possible implementation, when the reference binary units in the N different mapping coding rules all have different numbers of bits, binary units encoded with different rules have different numbers of bits.
  • Preferably, all binary units in the binary data have the same number of bits, which reduces the complexity of decoding the resulting base sequence while still giving the encoding good security.
  • Exemplarily, with a binary unit of 4 bits, the binary data "11100110 10011000 10100101" above is divided in turn into the binary units "1110", "0110", "1001", "1000", "1010", "0101", and so on.
  • When the binary data is divided into binary units and the tail of the binary data to be encoded has fewer bits than one binary unit, the end of the data is padded with 0 or 1 until the last binary unit has the same number of bits as the reference binary unit in the corresponding mapping coding rule, so that this binary unit can be encoded into the corresponding bases.
  • The padding of 0 or 1 at the end may additionally be recorded in a computer program or other text as key information for the decoding step.
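
The division and padding described above can be sketched as follows. The function name and the choice to return the pad length alongside the units (so it can be recorded for decoding) are illustrative assumptions.

```python
def split_units(bits, width=4, pad="0"):
    """Split a bit string into units of `width` bits.

    The tail is padded with `pad` ('0' or '1') up to a full unit; the number
    of padded bits is returned so that it can be recorded for decoding.
    """
    bits = bits.replace(" ", "")
    pad_len = (-len(bits)) % width
    bits += pad * pad_len
    units = [bits[i:i + width] for i in range(0, len(bits), width)]
    return units, pad_len

units, pad_len = split_units("11100110 10011000 10100101")
print(units, pad_len)
# -> ['1110', '0110', '1001', '1000', '1010', '0101'] 0
```

On decoding, dropping the last `pad_len` bits of the recovered bit string restores the original data.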
  • When the binary data to be encoded is relatively large, it can be split into multiple binary information files, each marked to indicate its order among the files, and in the encoding process of the following steps each binary information file is encoded separately.
  • Step S103, using N different mapping coding rules to encode the plurality of binary units to obtain a base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2, and the base sequence is used for synthesizing DNA that stores the data information corresponding to the binary data; the N different mapping coding rules are selected from the rules in the reference base set-reference binary unit mapping coding rule library.
  • The embodiment of the present application uses N different mapping coding rules to encode the multiple binary units.
  • By using N different mapping coding rules, coding rules that favor GC uniformity and fewer single-base repeats can be selected flexibly, thereby improving GC uniformity in the base sequence and reducing the number and probability of single-base repeats.
  • In addition, when multiple mapping coding rules are mixed to encode binary data, the security of the encoded information can be improved.
  • According to the N different mapping coding rules used, the binary data can be divided into several binary unit sets, where all binary units encoded with the same mapping coding rule are called a binary unit set.
  • The rule for forming a binary unit set is not strictly limited. Accordingly, the number of binary units in a binary unit set is not strictly limited: it can be one, two, the total number of binary units divided by N, or more, and each binary unit may even constitute a binary unit set on its own (in which case each binary unit is encoded with a different mapping coding rule). It should be understood that, since the reference binary units within one coding rule have the same number of bits, the binary units within one binary unit set have the same number of bits.
  • In some embodiments, each binary unit in the binary data is encoded using a different mapping coding rule.
  • Exemplarily, each binary unit is encoded with the mapping coding rule of the corresponding number:
  • the first binary unit corresponds to the first mapping coding rule,
  • the second binary unit corresponds to the second mapping coding rule,
  • the third binary unit corresponds to the third mapping coding rule, and so on. When there are enough mapping coding rules, this can effectively improve the GC uniformity of the base sequence and reduce the number and probability of single-base repeats, thereby improving the synthesizability and sequencing readability of the DNA.
  • The encoding security is also improved accordingly.
  • In some embodiments, different mapping coding rules are used to encode any two adjacent binary units in the binary data.
  • Exemplarily, the first binary unit adopts the first mapping coding rule,
  • the second binary unit adopts any mapping coding rule other than the first mapping coding rule,
  • and the third binary unit adopts any mapping coding rule other than the one adopted by the second binary unit, and so on.
  • two binary units separated by N-1 binary units among the multiple binary units are encoded by using the same mapping encoding rule among N different mapping encoding rules, that is, in the binary data, each The binary units separated by N-1 binary units form a binary unit set, which is encoded by the same mapping encoding rule among N different mapping encoding rules.
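
The cyclic arrangement just described amounts to encoding the unit at position i with rule number i mod N. A minimal sketch follows, assuming M = 2; the `shifted_rule` family is a hypothetical stand-in for N rules selected from the rule library.

```python
import itertools

# The 16 two-base groups and the 16 four-bit units (M = 2).
GROUPS = ["".join(p) for p in itertools.product("ATCG", repeat=2)]
UNITS = [format(i, "04b") for i in range(16)]

def shifted_rule(shift):
    """Hypothetical rule: the i-th unit maps to the (i + shift)-th group."""
    return {u: GROUPS[(i + shift) % 16] for i, u in enumerate(UNITS)}

def encode(units, rules):
    """Units separated by N-1 other units share a rule: unit i uses rule i % N."""
    n = len(rules)
    return "".join(rules[i % n][unit] for i, unit in enumerate(units))

rules = [shifted_rule(0), shifted_rule(5)]  # N = 2
print(encode(["1110", "0110", "1001", "1000"], rules))  # -> GCCGCTGT
```

With N = 2 this reduces to the odd/even scheme: units 1 and 3 use the first rule, units 2 and 4 the second.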
  • Exemplarily, the binary units may be grouped by odd and even position in the binary data into two binary unit sets, and the odd-position and even-position binary units are encoded with two mapping coding rules respectively.
  • Exemplarily, three binary unit sets can be formed from the binary units at positions S-2, S-1 and S in the binary data (where S is an integer multiple of 3), and the binary units in the three sets are encoded with three mapping coding rules respectively.
  • Exemplarily, four binary unit sets can be formed from the binary units at positions L-3, L-2, L-1 and L in the binary data (where L is an integer multiple of 4), and the binary units in the four sets are encoded with four mapping coding rules respectively, and so on; the grouping is not limited to the above embodiments.
  • In some embodiments, the N different mapping coding rules include a first mapping coding rule and a second mapping coding rule; encoding the multiple binary units with the N different mapping coding rules to obtain the base sequence corresponding to the binary data includes: using the first mapping coding rule to encode the binary units at odd positions in the binary data, and using the second mapping coding rule to encode the binary units at even positions in the binary data, to obtain the corresponding base sequence.
  • On this basis, the N different mapping coding rules used to encode the binary units can be screened so that the obtained base sequence has the desired sequence features, such as uniform GC content, no single-base repeats longer than 3 bases, minimal local secondary structure, and no long repeated fragments.
  • The preferred criteria for screening the mapping coding rules (coding rules for short) used for encoding in the embodiment of the present application are minimal base repetition and uniform GC content.
  • In some embodiments, the N different mapping coding rules used to encode the binary data satisfy the following: in the base sequence obtained by encoding, the number of single base repeats is less than 6, and G and C account for 40%-60% of the bases, which improves the synthesizability and sequencing feasibility of the obtained base sequence.
  • Exemplarily, when the binary data uses 4 bits as a binary unit and two binary unit sets are formed from the odd and even positions of the binary units, different combinations of coding rules for the odd and even positions can be tried repeatedly in encoding tests until the desired sequence features are obtained, such as single-base repeats of fewer than 6, fewer than 5 or fewer than 4 bases, and a G and C percentage of 40%-60%, 45%-55%, 48%-52%, etc.
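
The repeated encoding test can be sketched as a search over pairs of candidate rules. This is an illustration under assumptions: the candidate pool is the hypothetical `shifted_rule` family rather than the full library, the search order is arbitrary, and "fewer than 6 repeats" is read as a longest single-base run of at most 5.

```python
import itertools

GROUPS = ["".join(p) for p in itertools.product("ATCG", repeat=2)]
UNITS = [format(i, "04b") for i in range(16)]

def shifted_rule(shift):
    """Hypothetical rule: the i-th unit maps to the (i + shift)-th group."""
    return {u: GROUPS[(i + shift) % 16] for i, u in enumerate(UNITS)}

def encode(units, rules):
    return "".join(rules[i % len(rules)][u] for i, u in enumerate(units))

def max_run(sequence):
    """Length of the longest run of one repeated base."""
    best = run = 0
    previous = None
    for base in sequence:
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best

def gc_fraction(sequence):
    return sum(base in "GC" for base in sequence) / len(sequence)

def screen(units, candidates, max_repeat=5, gc_range=(0.40, 0.60)):
    """Return the first (odd, even) rule index pair meeting both constraints."""
    for i, j in itertools.permutations(range(len(candidates)), 2):
        seq = encode(units, [candidates[i], candidates[j]])
        if max_run(seq) <= max_repeat and gc_range[0] <= gc_fraction(seq) <= gc_range[1]:
            return (i, j), seq
    return None

candidates = [shifted_rule(s) for s in range(16)]
print(screen(["0000"] * 8, candidates))  # -> ((0, 10), 'AACCAACCAACCAACC')
```

Tightening `max_repeat` or `gc_range` corresponds to the stricter targets listed above (for example fewer than 4 repeats or 48%-52% GC).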
  • the same reference base group included in N different mapping coding rules corresponds to N different reference binary units. That is, in different mapping coding rules, the reference binary units corresponding to the same reference base group are different. In this case, when the base sequence is subsequently decoded, the mapping coding rule corresponding to the corresponding base can be more clearly defined, thereby further improving the decoding accuracy.
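
Whether a chosen set of rules has this property can be checked mechanically. A minimal sketch, assuming rules are represented as dictionaries from binary unit to base group; the `shifted_rule` family and the function name are hypothetical.

```python
import itertools

GROUPS = ["".join(p) for p in itertools.product("ATCG", repeat=2)]
UNITS = [format(i, "04b") for i in range(16)]

def shifted_rule(shift):
    """Hypothetical rule: the i-th unit maps to the (i + shift)-th group."""
    return {u: GROUPS[(i + shift) % 16] for i, u in enumerate(UNITS)}

def groups_map_to_distinct_units(rules):
    """True if every base group decodes to N different units under the N rules."""
    inverses = [{g: u for u, g in rule.items()} for rule in rules]
    return all(
        len({inverse[g] for inverse in inverses}) == len(rules)
        for g in GROUPS
    )

print(groups_map_to_distinct_units([shifted_rule(0), shifted_rule(5)]))  # -> True
print(groups_map_to_distinct_units([shifted_rule(0), shifted_rule(0)]))  # -> False
```

Two rules that differ on only a few units fail this check, because most base groups would still decode to the same unit under both.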
  • each binary sub-data may be encoded according to the above method to obtain a plurality of base sub-sequences. That is, at least two different mapping encoding rules are used to encode each binary subdata, so as to balance the GC content in each base sequence obtained and reduce single-base repeats.
  • In some embodiments, 2, 3, 4 or even more pieces of binary sub-data can be mixed into one piece of binary data according to preset rules, and the mixed binary data is then encoded using the above-mentioned coding rules.
  • As shown in FIG. 2, from left to right, the three columns show 2, 3 and 4 pieces of binary sub-data, respectively, being mixed into one piece of binary data in odd-position and even-position form and then encoded into a single base sequence.
  • Exemplarily, mixing 2 pieces of binary data into one piece of binary data in odd-position and even-position form and then encoding it into a base sequence includes: marking the 2 pieces of binary data as the first binary sub-data and the second binary sub-data; dividing the first and second binary sub-data into sub-units of 2 bits each and ordering the sub-units; combining the sub-units at the same position in the first and second binary sub-data, so that the two pieces of binary sub-data are integrated into third binary data; dividing the third binary data into binary units of 4 bits each; and, after ranking the binary units as odd or even, using the first mapping coding rule to encode the binary units in odd ranks and the second mapping coding rule to encode the binary units in even ranks, to obtain the base sequence.
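
The mixing step for two pieces of binary sub-data can be sketched as follows. The requirement that both inputs have equal length (a multiple of the sub-unit size) is an assumption; the source does not say how unequal lengths are handled, and the function names are illustrative.

```python
def interleave(first, second, chunk=2):
    """Merge two bit strings by alternating `chunk`-bit sub-units.

    Assumption: both inputs have the same length, a multiple of `chunk`.
    """
    if len(first) != len(second) or len(first) % chunk:
        raise ValueError("inputs must have equal length, a multiple of chunk")
    parts = []
    for i in range(0, len(first), chunk):
        parts.append(first[i:i + chunk])   # sub-unit from the first sub-data
        parts.append(second[i:i + chunk])  # same-position sub-unit from the second
    return "".join(parts)

def deinterleave(mixed, chunk=2):
    """Inverse of interleave: recover the two original bit strings."""
    pieces = [mixed[i:i + chunk] for i in range(0, len(mixed), chunk)]
    return "".join(pieces[0::2]), "".join(pieces[1::2])

print(interleave("1110", "0101"))  # -> 11011001
```

Each resulting 4-bit binary unit then carries 2 bits from each of the two sub-data, so neither input can be read from the base sequence without undoing the mix.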
  • the encoded base sequence can be split into J small sequence fragments according to the length that can be synthesized by the synthesis technology, that is, the base sequence includes J base subsequences.
  • J is a positive integer greater than 0 and less than 200 nt.
  • the base subsequence is provided with an index marker for marking the position of the base subsequence in the base sequence.
  • Index markers can be added to the left or right of the split segment.
  • a linker sequence, preferably of 16-25 bases, may be added to both ends of each of the J split sequence fragments, wherein the linker sequences representing different sequence fragments may be different.
  • FIG. 3A is a schematic diagram of the information composition of a base subsequence.
  • a left linker sequence and second index information are added to the left side of the base subsequence, and a right linker sequence is added to the right side of the base subsequence.
  • an error correction code, for example a Reed-Solomon code or a Hamming code, may also be added to the split sequence fragments.
  • FIG. 3B is a schematic diagram of the information composition of another base subsequence, in which the left linker sequence and the second index information are added to the left side of the base subsequence, and the error correction code and the right linker sequence are added to the right side of the base subsequence.
  • the base sequences obtained in the examples of the present application are used to synthesize DNA in which data information corresponding to binary data is stored.
  • the base sequences provided in the examples of this application can all be synthesized by chemical DNA synthesis method or enzymatic DNA synthesis method.
  • the form of the DNA synthesized in the examples of the present application is not limited, and may be any form such as primers, genes, or plasmids.
  • the storage medium for synthetic DNA is also not strictly limited.
  • the storage medium for synthetic DNA can be a centrifuge tube, glass, or can be inserted into plant/microorganism/animal cells for stable storage by transformation.
  • when the synthetic DNA is stored as in vitro synthesized DNA, it can be kept as a dry powder in a refrigerator at -20°C/-80°C; when the synthetic DNA is stored in cells, it can be stored according to the specific preservation requirements of the plant/microbial/animal cells.
  • referring to FIG. 4, in some embodiments a decoding method for the binary-information-to-base-sequence coding used in DNA data storage is provided, further comprising steps S201 and S202, as follows:
  • step S201 the base sequence is obtained from the synthesized DNA by sequencing.
  • the synthesized DNA needs to be sequenced to obtain the base sequence in the DNA.
  • the sequencing method used includes, but is not limited to, any existing first-generation, next-generation, or third-generation sequencing method. After the DNA is sequenced, its base sequence is obtained.
  • primers can be used to amplify the synthesized DNA to enrich the DNA content to ensure sufficient sequencing volume.
  • when the base sequence includes J base subsequences, obtaining the base sequence from the synthesized DNA by sequencing includes:
  • obtaining the J base subsequences from the synthesized DNA by sequencing;
  • splicing the J base subsequences into the base sequence.
  • where the base subsequences are provided with index marks, splicing the J base subsequences into the base sequence includes: determining the positions of the J base subsequences in the base sequence according to the index marks in the J base subsequences; and splicing the J base subsequences into the base sequence according to those positions.
  • obtaining the base sequence from the synthesized DNA by sequencing includes: performing PCR amplification on the J base subsequences, respectively, and reading the amplified fragments; sorting the amplified fragment sequences according to the index marks , spliced into a complete base sequence.
  • in this case, PCR amplification increases the concentration of each base subsequence and improves the sequencing recognition efficiency of each base subsequence.
  • Step S202 Decode the base sequence according to N different mapping coding rules to obtain binary data.
  • the decoding process of the base sequence is a reverse process of encoding the binary data corresponding to the base sequence. Therefore, when the base sequence is obtained by using N mapping coding rules to encode binary data, the base sequence needs to be decoded correspondingly using N mapping coding rules.
  • decoding the base sequence according to the mapping coding rules to obtain binary data includes:
  • the base sequence includes multiple base groups, and each base group corresponds to a binary unit;
  • the base groups are decoded using N different mapping coding rules to obtain binary data.
  • that is, when N different mapping coding rules were used to separately encode the N groups of sub-data formed from the binary data according to a preset ranking rule, then during decoding the base groups in the base sequence are determined according to the number of bases in the reference base group of the mapping coding rules, and, based on the same ranking rule as for the binary units, each base group is decoded with the mapping coding rule that encoded the corresponding binary unit, so that the base sequence is decoded into binary data.
  • when a different mapping coding rule was used to encode each binary unit in the binary data, then during decoding the multiple base groups that make up the base sequence are determined according to the mapping coding rules used and their order, and each base group is decoded with the mapping coding rule that converted the corresponding binary unit into that base group.
  • when two adjacent binary units in the binary data were encoded using different mapping coding rules, then during decoding the multiple base groups that make up the base sequence are first determined according to the mapping coding rules used and their order, and two adjacent base groups among them are then decoded using different mapping coding rules; that is, because adjacent data units in the binary data were encoded with different mapping coding rules, the corresponding adjacent base groups are likewise decoded with different mapping coding rules.
  • when binary units separated by N-1 binary units were encoded with the same mapping coding rule, then during decoding the multiple base groups that make up the base sequence are first determined according to the mapping coding rules used and their order, and base groups that are separated by N-1 base groups are then decoded with the same mapping coding rule.
  • for example, when the odd-ranked binary units in the binary data were encoded using the first mapping encoding rule and the even-ranked binary units were encoded using the second mapping encoding rule, then during decoding the multiple base groups that make up the base sequence are first determined according to the number of bases in the reference base group of the mapping coding rules used, and the odd-ranked and even-ranked base groups in the base sequence are identified; the odd-ranked base groups are decoded using the first mapping encoding rule and the even-ranked base groups are decoded using the second mapping encoding rule, so that the base sequence is decoded into binary data.
  • when the base sequence was obtained by encoding a plurality of pieces of binary sub-data, then during decoding the multiple base groups that make up the base sequence are first determined according to the number of bases in the reference base group of the mapping coding rules used; the base groups are then decoded with the mapping coding rules that were used to encode the binary data, to obtain the binary data; finally, following the rule by which the multiple pieces of binary sub-data were integrated into one piece of binary data, the binary data is split back into the multiple pieces of binary sub-data.
  • if padding was added during encoding, the recorded number of padding bits needs to be deleted from the data unit corresponding to the last base group.
  • a worked example is provided in which the sentence "Spring, no longer the butterfly beyond imagination." is encoded and stored in DNA and then decoded from the DNA sequence using the corresponding decoding method, including the following steps:
  • Step 1 Data encoding
  • TA CT CT TG GG CC TA AA CG AG CA AG TA CC CG CA CG GT TA CG CG TG CA AT TA CC CA CT CA AT TA CT CT TG GG AA TA CT CA GA CG GA TA TG CG GC GG GC TA CG TC CA TT TA CC GG CG CT CT TA CA CT TA CA CG TA TC CA GT GG GA TA CC CA AA TA TA TG CT AT CG CG TA TG CT AT CG CT TA GA CA GG CA GT”
  • Dissolve the freeze-dried plasmid take an appropriate amount according to the sample volume requirements of the sequencing company, and use next-generation sequencing to determine the base sequence of the freeze-dried plasmid as follows: "TA CT CT TG GG CC TA AA CG AG CA AG TA CC CG CA CG GT TA CG CG TG CA AT TA CC CA CT CA AT TA CT CT TG GG AA TA CT CA GA CG GA TA TG CG GC GG GC TA CG TC CA TT TA CC GG CG CT CT TA CA CT TA CA CG TA TC CA GT GG GA TA CC CA AA TA TA TG CT AT CG CG TA TG CT AT CG CT TA GA CA GG CA GT”
  • decoding uses the first mapping encoding rule for the odd-ranked binary units, the second mapping encoding rule for the even-ranked binary units, and the number of bases in the reference base group of the first and second mapping encoding rules.
  • the information coding density of the DNA sequence is 2 bits/nt, which reaches the theoretical limit of DNA data storage; the GC content of the encoded DNA sequence is 51% and the longest single-base repeat is 3 bases, which effectively achieves the goal of "single-base repeat < 6, GC content 40%-60%" and facilitates subsequent synthesis and sequencing.
  • the above method provided by the embodiment of the present invention not only achieves the theoretical limit of information storage density, but also avoids the problems of synthesis and sequencing, and is simpler and easier to implement in practical operation and application.
  • the embodiments of the present application further provide device embodiments for implementing the above method embodiments.
  • FIG. 5 is a schematic structural diagram of the encoding unit of the encoding and decoding apparatus from binary data to base sequences for DNA data storage provided by an embodiment of the present application; the included modules are used to execute the steps in the embodiment corresponding to FIG. 1.
  • the encoding and decoding apparatus 5 from binary data to base sequences for DNA data storage includes an encoding unit, and the encoding unit includes:
  • the building module 51, used to represent reference binary units by reference base groups and construct a mapping coding rule library based on reference base group-reference binary unit, wherein the number of bases of the reference base group is M, the number of bits of the binary unit is 2M, and M is an integer greater than or equal to 2;
  • an acquisition module 52 for acquiring binary data to be encoded, the binary data includes a plurality of binary units;
  • the encoding module 53 is used to encode multiple binary units using N different mapping encoding rules to obtain a base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2, and the base sequence is used for synthesis and storage DNA with data information corresponding to binary data; N different mapping coding rules are selected from the rules in the base set-reference binary unit mapping coding rule library.
  • the encoding module is configured to: encode each binary unit in the binary data using different mapping encoding rules.
  • the encoding module is configured to: encode two adjacent binary units in the binary data using different mapping encoding rules.
  • the encoding module is configured to: encode two binary units separated by N-1 binary units among the multiple binary units using the same mapping encoding rule in N different mapping encoding rules.
  • the encoding module is configured to: use the first mapping encoding rule to encode the binary units located in the odd ranks in the binary data, and use the second mapping encoding rule to encode the binary units located in the even ranks in the binary data, to obtain the base sequence corresponding to the binary data.
  • the encoding module is specifically configured to: when the binary data includes multiple binary sub-data, perform encoding processing on each binary sub-data to obtain multiple base sub-sequences.
  • the encoding module is further used for:
  • the base sequence is divided into J base subsequences, and an index mark is added to each base subsequence to mark the position of the base subsequence in the base sequence.
  • the encoding module is further used for:
  • a linker sequence is added at both ends of the J base subsequence, and the linker sequence is used for splicing.
  • FIG. 6 is a schematic structural diagram of the decoding unit of the encoding and decoding apparatus from binary data to base sequences for DNA data storage provided by an embodiment of the present application; the included modules are used to execute the steps in the embodiment corresponding to FIG. 4.
  • the encoding and decoding apparatus further includes a decoding unit, and the decoding unit includes:
  • the sequencing module 61 is used to obtain the base sequence from the synthesized DNA by sequencing
  • the decoding module 62 is used for decoding the base sequence according to N different mapping coding rules to obtain binary data.
  • the decoding module is specifically used to:
  • mapping encoding rules are used for decoding.
  • the decoding module is specifically used to:
  • mapping coding rules are used to decode.
  • the decoding module is specifically used to:
  • the first mapping coding rule is used to decode the odd-ranked base group among the multiple base groups
  • the second mapping coding rule is used to decode the even-ranked base group among the multiple base groups.
  • the base sequence includes N pieces, and the first mapping coding rules adopted by adjacent two of the N pieces are different, and/or the second mapping coding rules adopted by adjacent two of the N pieces are different , and N is an integer greater than 2.
  • FIG. 7 is a schematic diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 7 of this embodiment includes: a processor 70 , a memory 71 , and a computer program 72 , such as a coding program, stored in the memory 71 and executable on the processor 70 .
  • when the processor 70 executes the computer program 72, the steps in each of the foregoing embodiments of the encoding method based on the DNA base sequence are implemented, for example, steps 101-102 shown in FIG. 1 .
  • alternatively, when the processor 70 executes the computer program 72, the functions of the modules/units in the above device embodiments are implemented, for example, the functions of the modules 51-53 shown in FIG. 5 and the functions of the modules 61-62 shown in FIG. 6 .
  • the computer program 72 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 71 and executed by the processor 70 to complete the present application.
  • the one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 72 in the terminal device 7 .
  • the computer program 72 may be divided into a data unit acquisition unit 51 and an encoding module 52.
  • for the specific functions of each module, please refer to the relevant descriptions in the embodiment corresponding to FIG. 1, which will not be repeated here.
  • the terminal device may include, but is not limited to, the processor 70 and the memory 71 .
  • FIG. 7 is only an example of the terminal device 7 and does not constitute a limitation on the terminal device 7, which may include more or fewer components than shown, combine some components, or use different components.
  • the terminal device may also include an input and output device, a network access device, a bus, and the like.
  • the processor 70 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 71 may be an internal storage unit of the terminal device 7 , such as a hard disk or a memory of the terminal device 7 .
  • the memory 71 may also be an external storage device of the terminal device 7, for example, a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash memory card (Flash Card) equipped on the terminal device 7, etc.
  • the memory 71 may also include both an internal storage unit of the terminal device 7 and an external storage device.
  • the memory 71 is used to store computer programs and other programs and data required by the terminal device.
  • the memory 71 can also be used to temporarily store data that has been output or is to be output.
  • the embodiments of the present application also provide a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the above-mentioned encoding and decoding method from binary information to base sequences for DNA data storage is implemented.
  • the embodiments of the present application provide a computer program product; when the computer program product runs on a terminal device, the terminal device implements the above-mentioned encoding method based on DNA base sequences; when the computer program product runs on a decoding device, the decoding device implements the above-mentioned encoding and decoding method from binary information to base sequences for DNA data storage.
  • the above functions may be completed by different functional units or modules, that is, the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.
  • each functional unit and module in the embodiment may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit; the above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units.
  • the specific names of the functional units and modules are only for the convenience of distinguishing from each other, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above-mentioned system, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

Abstract

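The present application discloses an encoding and decoding method from binary information to base sequences for DNA data storage, including: representing reference binary units by reference base groups, and constructing a mapping coding rule library of reference base group-reference binary unit; acquiring binary data to be encoded, the binary data including a plurality of binary units; and encoding the plurality of binary units using N different mapping coding rules to obtain a base sequence corresponding to the binary data. The encoding and decoding method provided by the present application can selectively prevent single-base repeats and uneven GC content in the base sequence, thereby improving the synthesizability of the base sequence and the convenience of sequencing; it also provides more possibilities for data encryption and secure storage.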

Description

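Encoding and decoding method and encoding and decoding device from binary information to base sequences for DNA data storage
Technical Field
The present application relates to the technical field of data storage, and in particular to an encoding and decoding method and an encoding and decoding device from binary information to base sequences for DNA data storage.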
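Background
The rapid development of fields such as the Internet and big data has led to explosive growth of information in today's society. Traditional storage media, including hard disks, magnetic tapes, and optical discs, have short storage lifetimes, high maintenance costs, and low storage densities, and fall far short of the needs of future large-scale data storage. Deoxyribonucleic acid (DNA), an information storage medium developed in recent years, offers high storage density, long storage time, and low maintenance cost, and is considered one of the most promising media for future information storage.
DNA molecules have four bases: adenine (A), cytosine (C), guanine (G), and thymine (T). DNA-based data storage uses sequences of these four bases, A/C/G/T, to represent data sequences composed of binary "0" and "1"; this process is called encoding. The process of converting a base sequence back into binary data is called decoding. After binary data is converted into a base sequence, DNA synthesis technology is used to generate single-stranded DNA molecules for preservation. The base sequence composed of the four bases in the generated single-stranded DNA molecules can be detected by DNA sequencing technology, and the final binary data stream is obtained through the decoding mechanism.
At present, existing encoding methods are limited and offer few selectable coding rules, which in turn leads to uneven GC content and single-base repeats in the resulting base sequences and hampers base sequence synthesis and sequencing.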
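Technical Problem
One of the purposes of the embodiments of the present application is to provide an encoding and decoding method and an encoding and decoding device from binary information to base sequences for DNA data storage, aiming to solve the problem that existing binary data processing methods are limited and offer few selectable coding rules, resulting in uneven GC content and single-base repeats in the obtained base sequences, which hampers base sequence synthesis and sequencing.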
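Technical Solution
To solve the above technical problem, the embodiments of the present application adopt the following technical solutions:
In a first aspect, the present application provides an encoding and decoding method from binary information to base sequences for DNA data storage, the method including:
representing reference binary units by reference base groups, and constructing a mapping coding rule library based on reference base group-reference binary unit, wherein the number of bases in the reference base group is M, the number of bits in the binary unit is 2M, and M is an integer greater than or equal to 2;
acquiring binary data to be encoded, the binary data including a plurality of binary units;
encoding the plurality of binary units using N different mapping coding rules to obtain a base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2, and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping coding rules are selected from the rules in the reference base group-reference binary unit mapping coding rule library.
In some embodiments, two binary units separated by N-1 binary units among the plurality of binary units are encoded using the same one of the N different mapping coding rules.
In some embodiments, N=2, and the N different mapping coding rules include a first mapping coding rule and a second mapping coding rule; encoding the plurality of binary units using the N different mapping coding rules to obtain the base sequence corresponding to the binary data includes:
encoding the odd-ranked binary units in the binary data using the first mapping coding rule, and encoding the even-ranked binary units in the binary data using the second mapping coding rule, to obtain the base sequence corresponding to the binary data.
In some embodiments, the same reference base group contained in the N different mapping coding rules corresponds to N different reference binary units.
In some embodiments, in the encoded base sequence, single-base repeats are fewer than 6, and G and C account for 40%-60% of the bases in the base sequence.
In some embodiments, the base sequence includes J base subsequences.
In some embodiments, the base subsequence is provided with an index marker for marking the position of the base subsequence in the base sequence.
In some embodiments, the base subsequence is provided with an error correction code and/or a linker sequence.
In some embodiments, the processing method further includes:
obtaining the base sequence from the synthesized DNA by sequencing;
decoding the base sequence according to the N different mapping coding rules to obtain the binary data.
In some embodiments, decoding the base sequence according to the mapping coding rules to obtain the binary data includes:
the base sequence includes a plurality of base groups, each base group corresponding to a binary unit;
decoding the base groups using the N different mapping coding rules to obtain the binary data.
In some embodiments, when the base sequence includes J base subsequences, obtaining the base sequence from the synthesized DNA by sequencing includes:
obtaining the J base subsequences from the synthesized DNA by sequencing;
splicing the J base subsequences into the base sequence.
In some embodiments, the base subsequences are provided with index markers, and splicing the J base subsequences into the base sequence includes:
determining the position of each base subsequence in the base sequence according to its index marker;
splicing the J base subsequences into the base sequence according to their positions in the base sequence.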
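In a second aspect, the present application provides an encoding and decoding device from binary data to base sequences for DNA data storage, the encoding and decoding device including:
a construction module, configured to represent reference binary units by reference base groups and construct a mapping coding rule library based on reference base group-reference binary unit, wherein the number of bases in the reference base group is M, the number of bits in the binary unit is 2M, and M is an integer greater than or equal to 2;
an acquisition module, configured to acquire binary data to be encoded, the binary data including a plurality of binary units;
an encoding module, configured to encode the plurality of binary units using N different mapping coding rules to obtain a base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2, and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping coding rules are selected from the rules in the reference base group-reference binary unit mapping coding rule library.
In some embodiments, the encoding and decoding device further includes:
a sequencing module, configured to obtain the base sequence from the synthesized DNA by sequencing;
a decoding module, configured to decode the base sequence according to the N different mapping coding rules to obtain the binary data.
In a third aspect, the present application provides a terminal device including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program, characterized in that the computer program implements the method of the first aspect when executed by a processor.
In the encoding and decoding method from binary information to base sequences for DNA data storage provided by the present application, the correspondence between reference base groups and reference binary units is first used to construct a mapping coding rule library based on reference base group-reference binary unit; the binary data is then encoded using multiple different mapping coding rules drawn from that library. In the present application, the number of bases in the reference base group is greater than or equal to 2, so the number of mapping coding rules in the resulting library can be very large, which provides a flexible, selectable library of coding rules for encoding binary information into base sequences and greatly enriches the available encoding schemes. At the same time, the coding rule uses the M bases of a reference base group to represent 2M bits of binary information, achieving an information coding efficiency of 2 bits/nt, which reaches the theoretical coding limit. Representing 2M bits with M bases also makes it possible to encode one or more pieces of binary information into a single base sequence, further extending the ways binary information can be encoded into base sequences. Based on these flexible encoding schemes, multiple coding rules or encoding schemes can be chosen during encoding to prevent single-base repeats and uneven GC content in the base sequence, improving the synthesizability of the base sequence and the convenience of sequencing; moreover, encoding the binary data with N different mapping coding rules or with mixed encoding provides more possibilities for data encryption and secure storage and improves data security.
It can be understood that, for the beneficial effects of the second to fourth aspects, reference may be made to the relevant description of the first aspect, which is not repeated here.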
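Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of encoding from binary information to base sequences for DNA data storage provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the combination modes for combined encoding of binary data provided by an embodiment of the present application;
FIG. 3A is a schematic diagram of the information composition of a base subsequence provided by an embodiment of the present application;
FIG. 3B is a schematic diagram of the information composition of another base subsequence provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of decoding from binary information to base sequences for DNA data storage provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the encoding unit of the encoding and decoding device from binary data to base sequences for DNA data storage provided by an embodiment of the present application;
FIG. 6 is a schematic structural diagram of the decoding unit of the encoding and decoding device from binary data to base sequences for DNA data storage provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.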
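Embodiments of the Invention
In the following description, for purposes of illustration rather than limitation, specific details such as particular system structures and technologies are set forth to provide a thorough understanding of the embodiments of the present application. However, it will be clear to those skilled in the art that the present application can also be practiced in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.
It should be understood that the term "and/or" used in the specification and the appended claims of the present application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations. In addition, in the description of the specification and the appended claims, the terms "first", "second", "third", and so on are used only to distinguish descriptions and cannot be understood as indicating or implying relative importance.
It should also be understood that references to "one embodiment" or "some embodiments" in the specification mean that a particular feature, structure, or characteristic described in connection with that embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "including", "comprising", "having", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.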
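At present, information storage technology based on DNA media usually includes the following four steps:
1) Information encoding, that is, the process of encoding the 0/1 binary data stored by a computer into a DNA base sequence.
In this process, binary data is converted, according to a preset correspondence between binary codes and the bases A, T, C, and G, into a DNA base sequence that is composed of the bases A, T, C, and G and stores the data. For example, the method published by George Church et al. in Science in 2012 represents each bit by one of two bases: 0 by A or C, and 1 by G or T. In base sequences obtained this way, each base carries little information, so the coding density is low, only 1 bit/nt.
2) DNA synthesis, that is, the process of synthesizing DNA according to the DNA base sequence.
In this step, DNA is synthesized on the basis of the DNA base sequence obtained in step 1). For example, a high-throughput DNA synthesizer, optionally combined with enzymatic assembly technology, is used to synthesize the DNA sequence that stores the data.
3) DNA sequencing, that is, the process of reading the base sequence of the synthesized DNA.
4) Information decoding, that is, the process of extracting the stored 0/1 binary data from the DNA base sequence.
In this step, the DNA base sequence is decoded into 0/1 binary data according to the coding rules.
In the above method, the information encoding and the mapping coding rules of steps 1 and 4 are the core of the entire storage workflow and directly determine the difficulty of steps 2 and 3 and the efficiency of DNA storage technology. However, current encoding methods have low coding density and few selectable coding rules, resulting in uneven GC content and single-base repeats in the obtained base sequences, which hampers base sequence synthesis and sequencing.
To increase the information coding density and the GC uniformity of the base sequence and to reduce single-base repeats, the embodiments of the present application provide an encoding and decoding method and a processing device from binary information to base sequences for DNA data storage. By having the M bases of a reference base group correspond to the 2M binary bits of a reference binary unit, a coding density of 2 bits/nt is achieved, and a flexible, selectable library of coding rules is produced, enriching the ways binary information can be encoded into base sequences. N different mapping coding rules from the reference base group-reference binary unit mapping coding rule library are used to encode the multiple data units in the binary data; mixing mapping coding rules across data units in this way helps to obtain base sequences with suitable GC content and fewer single-base repeats, and increases security during encoding.
The encoding and decoding method and device from binary information to base sequences for DNA data storage provided by the present application are described below by way of specific embodiments.
It should be noted that the encoding and decoding method from binary information to base sequences for DNA data storage includes an encoding method from binary information to base sequences for DNA data storage and a decoding method from binary information to base sequences for DNA data storage.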
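Referring to FIG. 1, FIG. 1 is a schematic flowchart of an encoding method from binary information to base sequences for DNA data storage provided by an embodiment of the present application; the method includes at least steps S101 to S103, as follows:
S101. Represent reference binary units by reference base groups, and construct a mapping coding rule library based on reference base group-reference binary unit, wherein the number of bases in the reference base group is M, the number of bits in the binary unit is 2M, and M is an integer greater than or equal to 2.
The reference base group-reference binary unit mapping coding rule library in the embodiments of the present application is the set of all mapping coding rules formed by representing reference binary units with reference base groups according to the correspondence between them. Each mapping coding rule in the library is formed by a preset one-to-one correspondence between reference base groups and reference binary units.
In the embodiments of the present application, the number of bases in the reference base group is M, and M is an integer greater than or equal to 2; that is, a reference base group is a combination of two or more single bases. In some embodiments, the reference base group has two, three, four, or even more bases; correspondingly, two-base, three-base, four-base, or longer groups may be used as reference base groups. For example, a reference base group contains two bases such as AT, three bases such as ACT, four bases such as AGCT, or even more bases. It should be understood that the more bases there are in the reference base group, the more reference base groups can be formed by permuting the bases, and the more mappings there are between reference base groups and reference binary units.
In the embodiments of the present application, the number of bits in the binary unit is greater than the number of bases in the reference base group, specifically twice that number, so that the coding density equals 2 bits/nt; that is, each base corresponds to two bits, which increases the information coding density. For example, when the reference base group contains two bases, the binary unit has 4 bits; when the reference base group consists of three bases, the binary unit has 6 bits. In both cases the coding density is 2 bits/nt.
The mapping coding rules provided by the present application are explained below, taking a two-base group representing a 4-bit binary unit as an example. As shown in Table 1, the four bases A, T, C, and G give 16 two-base combinations, that is, 16 reference base groups: AA, AT, AC, AG, TA, TT, TC, TG, CA, CT, CC, CG, GA, GT, GC, and GG. The 16 reference base groups correspond to 16 reference binary units. The set of all possible mappings from the 16 reference base groups to the 16 reference binary units constitutes the reference base group-reference binary unit mapping coding rule library of the embodiments of the present application. For example, the first mapping coding rule may be: AA=0000, AT=0001, AC=0010, AG=0011, TA=0100, TT=0101, TC=0110, TG=0111, CA=1000, CT=1001, CC=1010, CG=1011, GA=1100, GT=1101, GC=1110, GG=1111; as another example, the second mapping coding rule may be: AT=0000, AC=0001, AG=0010, AA=0011, TA=0100, TT=0101, TC=0110, TG=0111, CA=1000, CT=1001, CC=1010, CG=1011, GA=1100, GT=1101, GC=1110, GG=1111, and so on. It should be understood that within each mapping coding rule there is a one-to-one correspondence between the 16 reference base groups and the 16 reference binary units, but the mapping can be any combination and is not limited to those listed above. The 16 reference base groups can therefore produce A(16,16) = 16! permutations, each corresponding to a mapping coding rule, for a total of 20922789888000 rules.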
Table 1
[Table 1: the 16 two-base combinations of A, T, C, and G (image not reproduced)]
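It should be noted that when the number of bases in the reference base group is greater than 2 (for example 3, 4, or more), more reference base groups are produced, and hence more mapping coding rules between reference base groups and reference binary units. For example, when the reference base group is a combination of 3 bases, the reference binary unit may have, but is not limited to, 6 bits; the four bases A, T, C, and G form 64 reference base groups, which correspond to 64 reference binary units and produce A(64,64) = 64! permutations, each corresponding to a mapping coding rule, a number far greater than 2 trillion.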
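The rule library described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a rule is modeled as a dictionary from base group to binary unit, and the base order A, T, C, G follows the order in which the text lists the 16 groups. With the identity permutation the sketch reproduces the first example rule (AA=0000 ... GG=1111).

```python
from itertools import product
from math import factorial

BASES = "ATCG"  # base order used when the text lists AA, AT, AC, AG, TA, ...

def reference_base_groups(m):
    """All 4**m base groups of length m, in the order AA, AT, AC, AG, TA, ..."""
    return ["".join(p) for p in product(BASES, repeat=m)]

def reference_binary_units(m):
    """All 2m-bit binary units in ascending order, e.g. 0000 ... 1111 for m=2."""
    return [format(i, f"0{2 * m}b") for i in range(4 ** m)]

def rule_from_permutation(m, perm):
    """One mapping coding rule: base group i is paired with binary unit perm[i]."""
    groups = reference_base_groups(m)
    units = reference_binary_units(m)
    return {groups[i]: units[perm[i]] for i in range(len(groups))}

def library_size(m):
    """Number of one-to-one mappings between 4**m groups and 4**m units."""
    return factorial(4 ** m)

# Identity permutation: the first example rule quoted in the text.
first_rule = rule_from_permutation(2, list(range(16)))
```

Any other permutation of `range(16)` yields another rule from the same library, which is why the library holds 16! rules for M=2.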
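By encoding binary data with multiple selectable mapping coding rules, the embodiments of the present application greatly enrich the rules for encoding binary data into base sequences, make full use of the freedom that base sequences have in representing binary data, increase the flexibility of encoding binary data into base sequences, and provide a rich, selectable set of mapping coding rules for secure, encrypted DNA data storage.
Step S102: acquire binary data to be encoded, the binary data including a plurality of binary units.
In the embodiments of the present application, when data such as text, images, speech, or audio and video is to be stored in DNA, the data must first be converted into the corresponding binary data. For example, the text "春,已不再是想象之外的那只蝴蝶。" ("Spring, no longer the butterfly beyond imagination.") corresponds to the binary data "11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 10101010 11101000 10011101 10110100 11101000 10011101 10110110 11100011 10000000 10000010".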
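A minimal sketch of the text-to-binary step follows. The source does not name the character encoding; UTF-8 is an assumption here, chosen because the first three bytes of the quoted binary data match the UTF-8 encoding of the character 春. The function names are hypothetical.

```python
def text_to_bits(text):
    """Encode text as UTF-8 (an assumption) and return a string of '0'/'1'."""
    return "".join(format(byte, "08b") for byte in text.encode("utf-8"))

def bits_to_text(bits):
    """Inverse of text_to_bits; expects a bit string whose length is a multiple of 8."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")
```

Images, audio, and other files would be read as raw bytes instead of being passed through a text encoding.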
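In the embodiments of the present application, the binary data is divided into multiple binary units according to the number of bits in the reference binary unit that corresponds to a reference base group in the mapping coding rule selected for encoding the binary data. It should be understood that how the binary data is divided into binary units depends on the number of bits in the reference binary unit of the mapping coding rule used to encode each binary unit. More precisely, the number of bits in a binary unit is the same as the number of bits in the reference binary unit of the mapping coding rule used to encode it. In one possible implementation, when the reference binary units in all mapping coding rules used to encode the binary data have the same number of bits, all binary units in the binary data have the same number of bits, equal to that of the reference binary unit. In another possible implementation, when the reference binary units of the N different mapping coding rules do not all have the same number of bits, binary units encoded with reference binary units of the same bit length have the same number of bits, and binary units encoded with reference binary units of different bit lengths have different numbers of bits. In another possible implementation, when the reference binary units of the N different mapping coding rules all have different numbers of bits, binary units encoded with different rules have different numbers of bits.
In some embodiments, all binary units in the binary data have the same number of bits, which reduces the complexity of the decoding procedure for the resulting base sequence while still giving the encoding good security.
In some embodiments, the binary units in the binary data have the same number of bits. For example, with 4-bit binary units, the binary data "11100110 10011000 10100101..." above is divided in order into the binary units "1110", "0110", "1001", "1000", "1010", "0101", and so on.
In one possible implementation, when the binary data is divided into binary units and the end of the binary data to be encoded is shorter than one binary unit, then, so that this binary unit can be encoded into bases, the end of the binary data is padded with 0s or 1s until the last binary unit has the same number of bits as the reference binary unit of its mapping coding rule. To improve decoding accuracy, the number of padding bits added to this last data unit must be recorded during encoding. In some embodiments, the padding of the end with 0s or 1s can be recorded separately in a computer program or other text as part of the key for the decoding step.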
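The division into binary units and the end padding can be sketched as follows. The function names and the choice to return the padding count next to the units are assumptions for illustration; the text only requires that the count be recorded somewhere so that decoding can remove it.

```python
def split_into_units(bits, unit_bits=4, pad_bit="0"):
    """Split a bit string into fixed-size units, padding the end if needed.

    Returns (units, pad_count); pad_count must be kept for decoding.
    """
    pad_count = (-len(bits)) % unit_bits
    padded = bits + pad_bit * pad_count
    units = [padded[i:i + unit_bits] for i in range(0, len(padded), unit_bits)]
    return units, pad_count

def strip_padding(bits, pad_count):
    """Remove the recorded number of padding bits from the end after decoding."""
    return bits[:len(bits) - pad_count] if pad_count else bits
```

For example, `split_into_units("111001101")` yields the units `1110`, `0110`, `1000` with a padding count of 3.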
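In one possible implementation, when the binary data to be encoded is large, it can be split into multiple binary information files, each of which is labeled to indicate its order, and in the encoding process of the following step each binary information file is encoded separately.
Step S103: encode the plurality of binary units using N different mapping coding rules to obtain a base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2, and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping coding rules are selected from the rules in the reference base group-reference binary unit mapping coding rule library.
In this step, to improve GC uniformity in the encoded base sequence and to reduce the number and probability of single-base repeats, the embodiments of the present application encode the multiple binary units using N different mapping coding rules. Using N different rules makes it possible to flexibly choose rules that favor GC uniformity and fewer single-base repeats. In addition, encoding binary data with a mixture of several mapping coding rules can improve the security of the encoded information.
In the embodiments of the present application, according to the preset number of mapping coding rules, that is, the value of N, the binary data can be divided into several binary unit sets, where all binary units encoded with the same mapping coding rule are called one binary unit set. How the binary unit sets are formed is not strictly limited, so the number of binary units in a set is not strictly limited either; it may be one, two, the total number of binary units divided by N, or more, and each binary unit may even form its own set (in which case every binary unit is encoded with a different mapping coding rule). It should be understood that, because the reference binary units within one coding rule have the same number of bits, the binary units in one binary unit set have the same number of bits.
In one possible implementation, each binary unit in the binary data is encoded with a different mapping coding rule. For example, following the order of the binary units, the rule with the corresponding number is used: the first binary unit uses the first mapping coding rule, the second binary unit uses the second mapping coding rule, the third binary unit uses the third mapping coding rule, and so on. When enough mapping coding rules are available, this can effectively improve the GC uniformity of the base sequence and reduce the number and probability of single-base repeats, thereby improving the synthesizability and sequencing readability of the DNA. In addition, because each data unit is encoded with a different mapping coding rule, the security of the encoding is also improved.
In one possible implementation, two adjacent binary units in the binary data are encoded with different mapping coding rules. For example, the first binary unit uses the first mapping coding rule, the second binary unit uses any mapping coding rule other than the first, the third binary unit uses any mapping coding rule other than the one used by the second binary unit, and so on.
In one possible implementation, two binary units separated by N-1 binary units among the plurality of binary units are encoded with the same one of the N different mapping coding rules; that is, the binary units spaced N-1 binary units apart form one binary unit set and are encoded with the same mapping coding rule.
In some embodiments, the binary units in the binary data can be arranged by odd and even position to form two binary unit sets, and two mapping coding rules are used to encode the odd-ranked and even-ranked binary units respectively. In some embodiments, the binary units can be arranged by positions S-2, S-1, and S (where S is an integer multiple of 3) to form three binary unit sets, which are encoded with three mapping coding rules. In some embodiments, the binary units can be arranged by positions L-3, L-2, L-1, and L (where L is an integer multiple of 4) to form four binary unit sets, which are encoded with four mapping coding rules, and so on, without being limited to the above embodiments.
For example, N=2, and the N different mapping coding rules include a first mapping coding rule and a second mapping coding rule; encoding the plurality of binary units using the N different mapping coding rules to obtain the base sequence corresponding to the binary data includes: encoding the odd-ranked binary units in the binary data using the first mapping coding rule, and encoding the even-ranked binary units in the binary data using the second mapping coding rule, to obtain the base sequence corresponding to the binary data.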
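The periodic assignment of rules (unit i uses rule i mod N, so that units separated by N-1 units share a rule) can be sketched as below. The two rules are the first and second example rules quoted earlier in the description, written here as dictionaries from binary unit to base group; the function name `encode` and the requirement that the input already be padded are assumptions of this sketch.

```python
from itertools import product

GROUPS = ["".join(p) for p in product("ATCG", repeat=2)]  # AA, AT, AC, AG, TA, ...
UNITS = [format(i, "04b") for i in range(16)]             # 0000 ... 1111

# First example rule from the text: AA=0000, AT=0001, AC=0010, AG=0011, ...
RULE_1 = {unit: group for group, unit in zip(GROUPS, UNITS)}
# Second example rule from the text: AT=0000, AC=0001, AG=0010, AA=0011, rest unchanged.
RULE_2 = dict(RULE_1)
RULE_2.update({"0000": "AT", "0001": "AC", "0010": "AG", "0011": "AA"})

def encode(bits, rules, unit_bits=4):
    """Encode unit i with rules[i % N]; N=2 gives the odd/even scheme."""
    if len(bits) % unit_bits:
        raise ValueError("pad the binary data to a whole number of units first")
    units = [bits[i:i + unit_bits] for i in range(0, len(bits), unit_bits)]
    return "".join(rules[i % len(rules)][unit] for i, unit in enumerate(units))
```

With these two rules, `encode("00000000", [RULE_1, RULE_2])` gives `AAAT` instead of the four-base repeat `AAAA` that a single rule would produce, which is the effect the mixed-rule scheme aims for.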
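When the N different mapping coding rules are used to encode the multiple binary units, the N rules used can be screened so that the resulting base sequence has the desired sequence characteristics, such as uniform GC content, no single-base repeats longer than 3 bases, minimal local secondary structure, and no large repeated fragments. The criterion for screening the mapping coding rules used for encoding (coding rules for short) in the embodiments of the present application is preferably: fewest base repeats and uniform GC content. In some embodiments, the N different mapping coding rules used to encode the binary data satisfy the following: in the encoded base sequence, single-base repeats are fewer than 6, and G and C account for 40%-60% of the bases in the base sequence, which improves the synthesizability and sequencing feasibility of the resulting base sequence. For example, when the binary data uses 4-bit binary units and two binary unit sets are formed by odd and even order, different combinations of odd-position and even-position coding rules can be tried repeatedly in encoding tests until other desired sequence characteristics are obtained, such as single-base repeats fewer than 6, fewer than 5, or fewer than 4, and a GC percentage of 40%-60%, 45%-55%, or 48%-52%.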
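The screening loop described above can be sketched as a simple random search. The source does not say how candidate rule combinations are chosen; drawing random permutations with a fixed seed is an assumption made here for illustration, and all function names are hypothetical. The input is assumed to be non-empty and already padded to a multiple of 4 bits.

```python
import random
from itertools import groupby, product

GROUPS = ["".join(p) for p in product("ATCG", repeat=2)]  # 16 two-base groups
UNITS = [format(i, "04b") for i in range(16)]             # 16 four-bit units

def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)

def longest_run(seq):
    """Length of the longest single-base repeat in seq."""
    return max(len(list(run)) for _, run in groupby(seq))

def random_rule(rng):
    """A random one-to-one rule from binary unit to base group."""
    shuffled = GROUPS[:]
    rng.shuffle(shuffled)
    return dict(zip(UNITS, shuffled))

def encode(bits, rules):
    """Encode 4-bit unit i with rules[i % N]."""
    units = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(rules[i % len(rules)][u] for i, u in enumerate(units))

def search_rules(bits, n_rules=2, max_run=5, gc_range=(0.40, 0.60), tries=1000, seed=0):
    """Try random rule combinations until the sequence meets both constraints."""
    rng = random.Random(seed)
    for _ in range(tries):
        rules = [random_rule(rng) for _ in range(n_rules)]
        seq = encode(bits, rules)
        if longest_run(seq) <= max_run and gc_range[0] <= gc_fraction(seq) <= gc_range[1]:
            return rules, seq
    return None  # no acceptable combination found within the given number of tries
```

Tightening `max_run` or `gc_range` corresponds to the stricter targets mentioned in the text (for example, repeats fewer than 4 or GC content of 48%-52%).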
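In some embodiments, the same reference base group contained in the N different mapping coding rules corresponds to N different reference binary units. That is, in different mapping coding rules, the reference binary units corresponding to the same reference base group all differ. In this case, when the base sequence is later decoded, the mapping coding rule that applies to a given base group can be identified more clearly, which further improves decoding accuracy.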
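One way to check this property, and one simple way to build rules that satisfy it, is sketched below. The cyclic-shift construction is an assumption chosen for brevity and is not taken from the source; rules are modeled as dictionaries from base group to binary unit, and the function names are hypothetical.

```python
from itertools import product

GROUPS = ["".join(p) for p in product("ATCG", repeat=2)]  # AA, AT, AC, AG, TA, ...
UNITS = [format(i, "04b") for i in range(16)]             # 0000 ... 1111

def shifted_rule(k):
    """Rule in which base group i is paired with binary unit (i + k) mod 16."""
    return {GROUPS[i]: UNITS[(i + k) % 16] for i in range(16)}

def same_group_always_differs(rules):
    """True if every base group maps to a different binary unit in each rule."""
    return all(len({rule[g] for rule in rules}) == len(rules) for g in GROUPS)
```

Rules that share some pairs, such as two rules that both map TA to 0100, do not have this property.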
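In one possible implementation, when the binary data includes multiple pieces of binary sub-data, each piece of binary sub-data can be encoded separately by the above method to obtain multiple base subsequences. That is, at least two different mapping coding rules are used to encode each piece of binary sub-data, to balance the GC content in each resulting base sequence and to reduce single-base repeats.
In one possible implementation, 2, 3, 4, or even more pieces of binary sub-data can be mixed into one piece of binary data according to a preset rule, and the mixed binary data is then encoded with the coding rules described above. For example, as shown in FIG. 2, the three columns from left to right show 2, 3, and 4 pieces of binary sub-data being mixed into one piece of binary data by odd and even positions and then encoded into one base sequence. Taking the 2 pieces of binary sub-data in the leftmost column as an example, mixing them into one piece of binary data by odd and even positions and then encoding them into one base sequence includes: labeling the 2 pieces of binary sub-data as the first binary sub-data and the second binary sub-data; dividing each of them into sub-units of 2 bits and ordering the sub-units; merging the sub-units at the same order position in the first binary sub-data and the second binary sub-data, so that the two pieces of binary sub-data are integrated into third binary data; dividing the third binary data into binary units of 4 bits; and, after ranking the binary units as odd or even, encoding the odd-ranked binary units with the first mapping coding rule and the even-ranked binary units with the second mapping coding rule, to obtain the base sequence. It should be understood that, when encoding different pieces of binary sub-data, one combination of odd-position and even-position coding rules may be used for some pieces and another combination for others. Of course, several coding rules may also be mixed. When the end of the binary data to be converted is shorter than one binary unit, it is padded with 0s or 1s before encoding. Similarly, 3, 4, or more pieces of binary sub-data can be encoded into one base sequence in an analogous way. It should be understood that encoding multiple pieces of binary data into the same base sequence further increases the flexibility of encoding binary information into base sequences and, in practical DNA data storage applications, makes it more feasible to convert to sequences in whatever way is desired, for example a way that favors synthesis and sequencing. At the same time, encoding multiple pieces of binary data into the same base sequence and mixing several coding rules can further improve the security of the encoded information. In addition, encoding multiple pieces of binary data into the same base sequence makes it possible to merge multiple data files into one base sequence, extending the ways DNA data storage can be applied.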
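The mixing step for 2-bit sub-units can be sketched as follows, together with its inverse for use after decoding. The function names are hypothetical, and the sketch assumes that all pieces of sub-data have the same length, already padded to a multiple of the sub-unit size; the source does not say how pieces of unequal length are handled.

```python
def interleave(sub_data, sub_bits=2):
    """Merge sub-units with the same order position from each piece of sub-data."""
    lengths = {len(s) for s in sub_data}
    if len(lengths) != 1 or lengths.pop() % sub_bits:
        raise ValueError("sub-data must share one length that is a multiple of sub_bits")
    chunks = [[s[i:i + sub_bits] for i in range(0, len(s), sub_bits)] for s in sub_data]
    return "".join("".join(group) for group in zip(*chunks))

def deinterleave(mixed, count, sub_bits=2):
    """Split mixed binary data back into `count` pieces of sub-data."""
    pieces = [mixed[i:i + sub_bits] for i in range(0, len(mixed), sub_bits)]
    return ["".join(pieces[k::count]) for k in range(count)]
```

With two pieces of sub-data, each 4-bit binary unit of the mixed data holds one 2-bit sub-unit from each piece, so the mixed data can be passed directly to the odd/even encoding step.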
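In some embodiments, because the binary data is very long, the encoded base sequence is correspondingly long and may exceed the maximum length that the selected DNA synthesis technology can synthesize. In this case, the encoded base sequence may be split into J short sequence fragments according to the length that the synthesis technology can produce; that is, the base sequence includes J base subsequences. In some embodiments, J is a positive integer greater than 0 and less than 200 nt.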
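In some embodiments, each base subsequence carries an index tag that marks its position in the base sequence, for example AAAA=1, AAAT=2 ... AGCT=101, and so on. The index tag may be added to the left or right of the split fragment. In some embodiments, to facilitate sequencing and random access of the synthesized DNA, adapter sequences, preferably of 16-25 bases, may also be added to both ends of the J split fragments, and the adapter sequences representing different fragments may differ. FIG. 3A is a schematic diagram of the information layout of one base subsequence, in which a left adapter sequence and second index information are added on the left side of the base subsequence and a right adapter sequence is added on the right side. In some embodiments, to reduce errors arising during encoding and improve decoding accuracy, an error-correcting code, for example a Reed-Solomon code or a Hamming code, may also be added to the split fragments. FIG. 3B is a schematic diagram of the information layout of one base subsequence, in which a left adapter sequence and second index information are added on the left side and an error-correcting code and a right adapter sequence are added on the right side.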
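A minimal sketch of splitting a base sequence into indexed fragments follows. The index scheme is an assumption: a fixed-width base-4 number with digit order A=0, T=1, C=2, G=3, chosen only because it reproduces AAAA=1 and AAAT=2; the text does not specify the full ordering, so later tags may not match the numbering used in this application. `index_tag` and `split_with_index` are hypothetical names, and adapters are passed in as plain strings.

```python
INDEX_BASES = "ATCG"  # assumed digit order: A=0, T=1, C=2, G=3


def index_tag(n, width=4):
    """Convert a 1-based fragment number into a fixed-width base-4 tag."""
    value = n - 1
    if not 0 <= value < 4 ** width:
        raise ValueError("fragment number does not fit in the tag width")
    digits = []
    for _ in range(width):
        value, d = divmod(value, 4)
        digits.append(INDEX_BASES[d])
    return "".join(reversed(digits))


def split_with_index(seq, payload_len, left="", right=""):
    """Split `seq` into fragments: left adapter + index tag + payload + right adapter."""
    payloads = [seq[i:i + payload_len] for i in range(0, len(seq), payload_len)]
    return [left + index_tag(k) + p + right for k, p in enumerate(payloads, 1)]


for fragment in split_with_index("ACGTACGTAC", payload_len=4):
    print(fragment)
```

An error-correcting code, if used, would be computed per fragment and placed before the right adapter, as in the FIG. 3B layout; it is omitted here to keep the sketch short.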
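The base sequence obtained in the embodiments of this application is used to synthesize DNA that stores the data information corresponding to the binary data. The base sequences provided by the embodiments of this application can all be synthesized by chemical DNA synthesis or enzymatic DNA synthesis.
It should be understood that the form of the DNA synthesized in the embodiments of this application is not limited; it may be a primer, a gene, a plasmid or any other form. The storage medium for the synthesized DNA is not strictly limited either; for example, it may be a centrifuge tube or glass, or the DNA may be introduced by transformation into plant, microbial or animal cells for stable storage.
In some embodiments, when the synthesized DNA is stored in vitro, it may be kept as a dry powder in a -20°C/-80°C freezer; in some embodiments, when the synthesized DNA is stored in cells, it may be stored according to the specific preservation requirements of the plant, microbial or animal cells.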
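Referring to FIG. 4, some embodiments provide a decoding method for binary information encoded into base sequences for DNA data storage; the method further includes step S201 and step S202, as follows:
Step S201: obtain the base sequence from the synthesized DNA by sequencing.
In the embodiments of this application, before the base sequence to be decoded is obtained, the synthesized DNA must be sequenced to obtain its base sequence. The sequencing method used includes, but is not limited to, any existing first-generation, second-generation or third-generation sequencing method. Sequencing the DNA yields its base sequence.
In some embodiments, primers may be used to amplify the synthesized DNA to increase the amount of DNA and thereby ensure that enough material is available for sequencing.
When the base sequence includes J base subsequences, obtaining the base sequence from the synthesized DNA by sequencing includes:
obtaining the J base subsequences from the synthesized DNA by sequencing;
assembling the J base subsequences into the base sequence.
Here, each base subsequence carries an index tag, and assembling the J base subsequences into the base sequence includes: determining the positions of the J base subsequences in the base sequence according to their index tags; and assembling the J base subsequences into the base sequence according to those positions.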
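Reassembly by index tag can be sketched as follows. The tag layout is an assumption made for illustration: a 4-base tag at the start of each fragment, read as a base-4 number with digit order A=0, T=1, C=2, G=3 (which reproduces AAAA=1 and AAAT=2 but is not specified in full by the text). `tag_value` and `reassemble` are hypothetical names, and adapters are assumed to have known fixed lengths.

```python
INDEX_BASES = "ATCG"  # assumed digit order: A=0, T=1, C=2, G=3


def tag_value(tag):
    """Read an index tag as a 1-based fragment number."""
    value = 0
    for base in tag:
        value = value * 4 + INDEX_BASES.index(base)
    return value + 1


def reassemble(fragments, tag_len=4, left_len=0, right_len=0):
    """Strip adapters, order fragments by index tag, and join the payloads."""
    parsed = []
    for frag in fragments:
        core = frag[left_len:len(frag) - right_len]  # drop both adapters
        parsed.append((tag_value(core[:tag_len]), core[tag_len:]))
    parsed.sort()  # sequencing reads arrive in arbitrary order
    return "".join(payload for _, payload in parsed)


reads = ["AAATACGT", "AAACAC", "AAAAACGT"]  # fragments 2, 3, 1
print(reassemble(reads))  # ACGTACGTAC
```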
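In some embodiments, obtaining the base sequence from the synthesized DNA by sequencing includes: amplifying each of the J base subsequences by PCR and reading each amplified fragment; then ordering the amplified fragment sequences according to their index tags and assembling them into the complete base sequence. In this case, PCR amplification raises the concentration of each base subsequence and improves the efficiency with which each one is identified by sequencing.
Step S202: decode the base sequence according to the N different mapping encoding rules to obtain the binary data.
In this step, decoding the base sequence is the reverse of encoding the corresponding binary data. Therefore, when the base sequence was obtained by encoding the binary data with N mapping encoding rules, the same N mapping encoding rules must be used to decode it.
In some embodiments, decoding the base sequence according to the mapping encoding rules to obtain the binary data includes:
the base sequence includes multiple base groups, and the base groups correspond to the binary units;
decoding the base groups with the N different mapping encoding rules to obtain the binary data.
That is, when N different mapping encoding rules were used during encoding to encode, respectively, the N groups of sub-data formed from the binary data according to a preset ranking rule, then during decoding the base groups that make up the base sequence are determined from the number of bases in the reference base groups of the mapping encoding rules, and, using the same ranking rule as for the binary units, each base group is decoded with the mapping encoding rule that encoded the corresponding binary unit, so that the base sequence is decoded into the binary data.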
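In a possible implementation, when each binary unit in the binary data was encoded with a different mapping encoding rule, then during decoding the base groups that make up the base sequence are determined from the mapping encoding rules used and their order, and each base group is decoded with the mapping encoding rule that converted the corresponding binary unit into that base group.
In a possible implementation, two adjacent binary units in the binary data are encoded with different mapping encoding rules. During decoding, the base groups that make up the base sequence are first determined from the mapping encoding rules used and their order, and then two adjacent base groups are decoded with different mapping encoding rules. In other words, when two adjacent data units in the binary data corresponding to the base sequence were encoded with different mapping encoding rules, the two adjacent base groups are correspondingly decoded with different mapping encoding rules.
In a possible implementation, when two binary units separated by N-1 binary units are encoded with the same one of the N different mapping encoding rules, then during decoding the base groups that make up the base sequence are first determined from the mapping encoding rules used and their order, and then base groups separated by N-1 base groups are decoded with the same mapping encoding rule.
In some embodiments, when the first mapping encoding rule was used during encoding for the odd-positioned binary units in the binary data and the second mapping encoding rule for the even-positioned binary units, then during decoding the base groups that make up the base sequence are first determined from the number of bases in the reference base groups of the mapping encoding rules used; then, based on the odd and even positions of the base groups in the base sequence, the odd-positioned base groups are decoded with the first mapping encoding rule and the even-positioned base groups with the second mapping encoding rule, so that the base sequence is decoded into the binary data.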
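Decoding by position can be sketched by inverting each rule table and cycling through the inverses. `RULE_1` and `RULE_2` below are hypothetical bijections invented for this sketch (not the rules of this application), and `decode` is an assumed name; the group length of 2 matches a reference base group of two bases.

```python
from itertools import product

UNITS = [format(i, "04b") for i in range(16)]
PAIRS = ["".join(p) for p in product("ACGT", repeat=2)]

# Hypothetical rules, each a one-to-one map from 4-bit units to 2-base groups.
RULE_1 = dict(zip(UNITS, PAIRS))
RULE_2 = dict(zip(UNITS, PAIRS[5:] + PAIRS[:5]))


def decode(seq, rules, group_len=2):
    """Decode a base sequence, using rules[0] for odd-positioned groups,
    rules[1] for even-positioned groups, and so on cyclically."""
    if len(seq) % group_len:
        raise ValueError("sequence length is not a whole number of base groups")
    inverse = [{group: unit for unit, group in rule.items()} for rule in rules]
    groups = [seq[i:i + group_len] for i in range(0, len(seq), group_len)]
    return "".join(inverse[i % len(rules)][g] for i, g in enumerate(groups))


print(decode("AACC", [RULE_1, RULE_2]))  # 00000000
print(decode("TGGT", [RULE_1, RULE_2]))  # 11100110
```

Because each rule is a bijection, inversion is always well defined; the only information the decoder needs beyond the tables is the position rule (here, alternation).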
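In a possible implementation, when the base sequence was obtained by encoding multiple pieces of binary sub-data, then during decoding the base groups that make up the base sequence are first determined from the number of bases in the reference base groups of the mapping encoding rules used; the base groups are then decoded with the mapping encoding rules that were used to encode the binary data, giving the binary data; finally, the binary data is separated into the multiple pieces of binary sub-data according to the rule by which they were combined into one piece of binary data.
In some embodiments, when the data unit corresponding to the last base group has a zero-padding record, the recorded number of padded zeros must be removed from that data unit.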
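Padding and its removal can be sketched as follows, assuming the number of padded zeros is recorded alongside the data (how that record is stored is not specified in the text). `pad_bits` and `strip_padding` are hypothetical names.

```python
def pad_bits(bits, unit_bits=4):
    """Pad with zeros to a whole number of units; return (padded, pad_count)."""
    n_pad = (-len(bits)) % unit_bits
    return bits + "0" * n_pad, n_pad


def strip_padding(bits, n_pad):
    """Remove the recorded number of padded zeros after decoding."""
    return bits[:len(bits) - n_pad] if n_pad else bits


padded, n_pad = pad_bits("101101")
print(padded, n_pad)                 # 10110100 2
print(strip_padding(padded, n_pad))  # 101101
```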
Some embodiments provide a worked example of encoding binary data: the sentence “春,已不再是想象之外的那只蝴蝶。” is stored in DNA and then read back from the DNA sequence with the corresponding decoding method. The process includes the following steps:
Step 1: Data encoding
1. A computer program is used to extract the binary data of “春,已不再是想象之外的那只蝴蝶。”: “11100110 10011000 10100101 11101111 10111100 10001100 11100101 10110111 10110010 11100100 10111000 10001101 11100101 10000110 10001101 11100110 10011000 10101111 11100110 10000011 10110011 11101000 10110001 10100001 11100100 10111001 10001011 11100101 10100100 10010110 11100111 10011010 10000100 11101001 10000010 10100011 11100101 10001111 10101010 11101000 10011101 10110100 11101000 10011101 10110110 11100011 10000000 10000010”.
2. Taking 4 bits as one unit, the binary data is divided into binary units: “1110 0110 1001 1000 1010 0101 1110 1111 1011 1100 1000 1100 1110 0101 1011 0111 1011 0010 1110 0100 1011 1000 1000 1101 1110 0101 1000 0110 1000 1101 1110 0110 1001 1000 1010 1111 1110 0110 1000 0011 1011 0011 1110 1000 1011 0001 1010 0001 1110 0100 1011 1001 1000 1011 1110 0101 1010 0100 1001 0110 1110 0111 1001 1010 1000 0100 1110 1001 1000 0010 1010 0011 1110 0101 1000 1111 1010 1010 1110 1000 1001 1101 1011 0100 1110 1000 1001 1101 1011 0110 1110 0011 1000 0000 1000 0010”; the binary units are then ranked in order by odd and even position.
3. Using “single-base repeats < 6, GC content 40%-60%” as the criterion, the two mapping encoding rules shown in Table 2 were selected, and the above binary information was encoded into one complete base sequence:
“TA CT CT TG GG CC TA AA CG AG CA AG TA CC CG CA CG GT TA CG CG TG CA AT TA CC CA CT CA AT TA CT CT TG GG AA TA CT CA GA CG GA TA TG CG GC GG GC TA CG CG TC CA TT TA CC GG CG CT CT TA CA CT TA CA CG TA TC CA GT GG GA TA CC CA AA GG TA TA TG CT AT CG CG TA TG CT AT CG CT TA GA CA GG CA GT”
Table 2 (image not reproduced): the first and second mapping encoding rules.
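4. The above base sequence was sent to a gene synthesis company to be synthesized as double-stranded DNA and cloned into a plasmid vector; the plasmid was stored lyophilized in a centrifuge tube at -20°C.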
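Step 1 of the encoding can be reproduced with a short sketch. The first bytes of the binary string (11100110 10011000 10100101 for 春, then 11101111 10111100 10001100) are consistent with UTF-8 and with a fullwidth comma (U+FF0C), so UTF-8 is assumed here; the text itself does not name the character encoding. `text_to_bits` and `bits_to_text` are hypothetical names.

```python
def text_to_bits(text):
    """Return the UTF-8 bytes of `text` as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def bits_to_text(bits):
    """Inverse of text_to_bits."""
    return bytes(int(group, 2) for group in bits.split()).decode("utf-8")


sentence = "春\uff0c已不再是想象之外的那只蝴蝶。"  # \uff0c is the fullwidth comma
bits = text_to_bits(sentence)
print(bits[:26])                 # 11100110 10011000 10100101
print(len(bits.split()) * 8)     # number of bits to encode
print(bits_to_text(bits) == sentence)
```

Splitting each 8-bit group in half gives the 4-bit binary units used in step 2.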
Step 2: Data decoding
1. The lyophilized plasmid was dissolved, an appropriate amount was taken according to the sequencing company's sample requirements, and first-generation sequencing was used to determine the base sequence of the plasmid, as follows: “TA CT CT TG GG CC TA AA CG AG CA AG TA CC CG CA CG GT TA CG CG TG CA AT TA CC CA CT CA AT TA CT CT TG GG AA TA CT CA GA CG GA TA TG CG GC GG GC TA CG CG TC CA TT TA CC GG CG CT CT TA CA CT TA CA CG TA TC CA GT GG GA TA CC CA AA GG TA TA TG CT AT CG CG TA TG CT AT CG CT TA GA CA GG CA GT”
2. Using the above encoding rules (the first mapping encoding rule encodes the odd-positioned binary units, the second mapping encoding rule encodes the even-positioned binary units, and the reference base groups of both rules contain two bases), the sequence obtained by sequencing was decoded into the binary data “1110 0110 1001 1000 1010 0101 1110 1111 1011 1100 1000 1100 1110 0101 1011 0111 1011 0010 1110 0100 1011 1000 1000 1101 1110 0101 1000 0110 1000 1101 1110 0110 1001 1000 1010 1111 1110 0110 1000 0011 1011 0011 1110 1000 1011 0001 1010 0001 1110 0100 1011 1001 1000 1011 1110 0101 1010 0100 1001 0110 1110 0111 1001 1010 1000 0100 1110 1001 1000 0010 1010 0011 1110 0101 1000 1111 1010 1010 1110 1000 1001 1101 1011 0100 1110 1000 1001 1101 1011 0110 1110 0011 1000 0000 1000 0010”.
3. The information “春,已不再是想象之外的那只蝴蝶。” was extracted from the above binary data.
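The method of this embodiment successfully stored a sentence in a DNA sequence, which verifies the feasibility of the method described in the embodiments of the present invention. In this embodiment, the information encoding density of the DNA sequence is 2 bits/nt, which reaches the theoretical limit for DNA data storage; the encoded DNA sequence has a GC content of 51% and at most 3 consecutive identical bases, so the target of “single-base repeats < 6, GC content 40%-60%” is met, which facilitates subsequent synthesis and sequencing. Compared with conventional methods, the above method provided by the embodiments of the present invention not only reaches the theoretical limit of information storage density and avoids synthesis and sequencing problems, but is also simpler to carry out in practice.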
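The figures quoted for the worked example can be checked directly against the published base sequence. The sketch below only measures the sequence as printed in the example; the helper names and the constant `N_BITS` (48 bytes of 8 bits) are assumptions made for this check.

```python
from itertools import groupby

# Base sequence from the worked example (spaces separate 2-base groups).
SEQ = (
    "TA CT CT TG GG CC TA AA CG AG CA AG TA CC CG CA CG GT TA CG CG TG CA AT "
    "TA CC CA CT CA AT TA CT CT TG GG AA TA CT CA GA CG GA TA TG CG GC GG GC "
    "TA CG CG TC CA TT TA CC GG CG CT CT TA CA CT TA CA CG TA TC CA GT GG GA "
    "TA CC CA AA GG TA TA TG CT AT CG CG TA TG CT AT CG CT TA GA CA GG CA GT"
).replace(" ", "")

N_BITS = 48 * 8  # the example's binary data is 48 bytes long

density = N_BITS / len(SEQ)                         # bits per nucleotide
gc = sum(base in "GC" for base in SEQ) / len(SEQ)   # GC fraction
longest = max(len(list(g)) for _, g in groupby(SEQ))  # longest single-base run

print(f"length: {len(SEQ)} nt")
print(f"density: {density} bits/nt")
print(f"GC content: {gc:.0%}")
print(f"longest single-base repeat: {longest}")
```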
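It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of this application in any way.
Based on the encoding method, decoding method and device based on DNA base sequences provided in the above embodiments, the embodiments of this application further provide device embodiments that implement the above method embodiments.
Referring to FIG. 5, FIG. 5 is a schematic diagram of the encoding and decoding device for binary data to base sequences for DNA data storage provided by an embodiment of this application. The modules it includes are used to perform the steps of the embodiment corresponding to FIG. 1; see the description of that embodiment for details. For ease of explanation, only the parts related to this embodiment are shown. Referring to FIG. 5, the encoding and decoding device 5 for binary data to base sequences for DNA data storage includes an encoding unit, and the encoding unit includes:
a construction module 51, configured to use reference base groups to represent reference binary units and construct a library of mapping encoding rules based on reference base group-reference binary unit pairs, wherein the number of bases in the reference base group is M, the number of bits in the binary unit is 2M, and M is an integer greater than or equal to 2;
an acquisition module 52, configured to acquire the binary data to be encoded, the binary data including a plurality of binary units;
an encoding module 53, configured to encode the plurality of binary units with N different mapping encoding rules to obtain the base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2 and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping encoding rules are selected from the rules in the library of mapping encoding rules based on reference base group-reference binary unit pairs.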
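What the construction module builds can be sketched as a list of bijections between all 2M-bit units and all M-base groups. Generating the rules by seeded random shuffling is an assumption made purely for illustration; the text does not say how the rules in the library are produced. `build_rule_library` is a hypothetical name.

```python
import random
from itertools import product


def build_rule_library(m=2, n_rules=4, seed=0):
    """Build `n_rules` one-to-one maps from 2m-bit units to m-base groups."""
    if m < 2:
        raise ValueError("M must be an integer greater than or equal to 2")
    units = [format(i, f"0{2 * m}b") for i in range(4 ** m)]       # 4**m units
    groups = ["".join(p) for p in product("ACGT", repeat=m)]        # 4**m groups
    rng = random.Random(seed)  # seeded so the library is reproducible
    library = []
    for _ in range(n_rules):
        shuffled = groups[:]
        rng.shuffle(shuffled)
        library.append(dict(zip(units, shuffled)))
    return library


library = build_rule_library(m=2, n_rules=3)
print(len(library), len(library[0]))  # 3 rules, 16 entries each
```

Since M bases can take 4^M values and 2M bits can take 2^(2M) = 4^M values, every rule is a full bijection, which is what gives the 2 bits/nt density.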
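In a possible implementation, the encoding module is configured to encode each binary unit in the binary data with a different mapping encoding rule.
In a possible implementation, the encoding module is configured to encode two adjacent binary units in the binary data with different mapping encoding rules.
In a possible implementation, the encoding module is configured to encode two binary units separated by N-1 binary units with the same one of the N different mapping encoding rules.
In a possible implementation, the encoding module is configured to encode the binary units at odd positions in the binary data with the first mapping encoding rule and the binary units at even positions with the second mapping encoding rule, to obtain the base sequence corresponding to the binary data.
In a possible implementation, the encoding module is specifically configured to: when the binary data includes multiple pieces of binary sub-data, encode each piece of binary sub-data separately to obtain multiple base subsequences.
In a possible implementation, the encoding module is further specifically configured to:
divide the base sequence into J base subsequences.
In a possible implementation, the encoding module is further specifically configured to:
divide the base sequence into J base subsequences and add to each base subsequence an index tag that marks its position in the base sequence.
In another possible implementation, the encoding module is further specifically configured to:
add adapter sequences to both ends of the J base subsequences, the adapter sequences being used for assembly.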
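In some embodiments, referring to FIG. 6, FIG. 6 is a schematic diagram of the encoding and decoding device for binary data to base sequences for DNA data storage provided by an embodiment of this application. The modules it includes are used to perform the steps of the embodiment corresponding to FIG. 4; see the description of that embodiment for details. As shown in FIG. 6, the encoding and decoding device further includes a decoding unit, and the decoding unit includes:
a sequencing module 61, configured to obtain the base sequence from the synthesized DNA by sequencing;
a decoding module 62, configured to decode the base sequence according to the N different mapping encoding rules to obtain the binary data.
In some possible implementations, the decoding module is specifically configured to:
decode each of the multiple base groups with a different mapping encoding rule.
In some possible implementations, the decoding module is specifically configured to:
decode two adjacent base groups among the multiple base groups with different mapping encoding rules.
In some possible implementations, the decoding module is specifically configured to:
decode the base groups at odd positions among the multiple base groups with the first mapping encoding rule and the base groups at even positions with the second mapping encoding rule.
In some embodiments, there are N base sequences; the first mapping encoding rules used by two adjacent sequences among the N are different, and/or the second mapping encoding rules used by two adjacent sequences among the N are different, where N is an integer greater than 2.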
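FIG. 7 is a schematic diagram of the terminal device provided by an embodiment of this application. As shown in FIG. 7, the terminal device 7 of this embodiment includes a processor 70, a memory 71, and a computer program 72, such as an encoding program, stored in the memory 71 and executable on the processor 70. When executing the computer program 72, the processor 70 implements the steps of the above embodiments of the encoding method based on DNA base sequences, for example steps 101-102 shown in FIG. 1. Alternatively, when executing the computer program 72, the processor 70 implements the functions of the modules/units in the above device embodiments, for example the functions of modules 51-53 shown in FIG. 5 and modules 61-62 shown in FIG. 6.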
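For example, the computer program 72 may be divided into one or more modules/units, which are stored in the memory 71 and executed by the processor 70 to carry out this application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program 72 in the terminal device 7. For example, the computer program 72 may be divided into an acquisition module 52 and an encoding module 53; for the specific functions of each module, see the description of the embodiment corresponding to FIG. 1, which is not repeated here.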
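The terminal device may include, but is not limited to, the processor 70 and the memory 71. Those skilled in the art will understand that FIG. 7 is only an example of the terminal device 7 and does not limit it; the terminal device may include more or fewer components than shown, combine certain components, or use different components. For example, it may also include input/output devices, network access devices, buses, and so on.
The processor 70 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 71 may be an internal storage unit of the terminal device 7, such as a hard disk or internal memory of the terminal device 7. The memory 71 may also be an external storage device of the terminal device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the terminal device 7. Further, the memory 71 may include both an internal storage unit of the terminal device 7 and an external storage device. The memory 71 is used to store the computer program and the other programs and data required by the terminal device. The memory 71 may also be used to temporarily store data that has been output or is about to be output.
The embodiments of this application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above encoding and decoding method for binary information to base sequences for DNA data storage on the terminal device.
The embodiments of this application provide a computer program product which, when run on a terminal device, causes the terminal device to implement the above encoding method based on DNA base sequences, and which, when run on a decoding device, causes the decoding device to implement the above encoding and decoding method for binary information to base sequences for DNA data storage.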
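Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules above is only an example. In practice, the functions may be assigned to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to perform all or some of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in hardware or as a software functional unit. In addition, the specific names of the functional units and modules serve only to distinguish them from one another and are not intended to limit the scope of protection of this application. For the specific working process of the units and modules in the above system, refer to the corresponding process in the foregoing method embodiments, which is not repeated here.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed or recorded in one embodiment, refer to the related descriptions of the other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and all fall within the scope of protection of this application.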

Claims (16)

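  1. An encoding and decoding method for binary information to base sequences for DNA data storage, characterized in that the method comprises:
    using reference base groups to represent reference binary units, and constructing a library of mapping encoding rules based on reference base group-reference binary unit pairs, wherein the number of bases in the reference base group is M, the number of bits in the binary unit is 2M, and M is an integer greater than or equal to 2;
    acquiring binary data to be encoded, the binary data comprising a plurality of binary units;
    encoding the plurality of binary units with N different mapping encoding rules to obtain a base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2 and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping encoding rules are selected from the rules in the library of mapping encoding rules based on reference base group-reference binary unit pairs.
  2. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 1, characterized in that two binary units separated by N-1 binary units among the plurality of binary units are encoded with the same one of the N different mapping encoding rules.
  3. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 2, characterized in that N=2 and the N different mapping encoding rules comprise a first mapping encoding rule and a second mapping encoding rule; encoding the plurality of binary units with the N different mapping encoding rules to obtain the base sequence corresponding to the binary data comprises:
    encoding the binary units at odd positions in the binary data with the first mapping encoding rule and the binary units at even positions in the binary data with the second mapping encoding rule, to obtain the base sequence corresponding to the binary data.
  4. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 1, characterized in that the same reference base group corresponds to N different reference binary units across the N different mapping encoding rules.
  5. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 1, characterized in that, in the encoded base sequence, single-base repeats are shorter than 6 bases and G and C together account for 40%-60% of the bases in the base sequence.
  6. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 1, characterized in that the base sequence comprises J base subsequences.
  7. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 6, characterized in that each base subsequence carries an index tag that marks the position of the base subsequence in the base sequence.
  8. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 7, characterized in that the base subsequence carries an error-correcting code and/or an adapter sequence.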
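  9. The encoding and decoding method for binary information to base sequences for DNA data storage according to any one of claims 1 to 8, characterized in that the method further comprises:
    obtaining the base sequence from the synthesized DNA by sequencing;
    decoding the base sequence according to the N different mapping encoding rules to obtain the binary data.
  10. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 8, characterized in that decoding the base sequence according to the mapping encoding rules to obtain the binary data comprises:
    the base sequence comprises a plurality of base groups, and the base groups correspond to the binary units;
    decoding the base groups with the N different mapping encoding rules to obtain the binary data corresponding to the base sequence.
  11. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 9, characterized in that, when the base sequence comprises J base subsequences, obtaining the base sequence from the synthesized DNA by sequencing comprises:
    obtaining the J base subsequences from the synthesized DNA by sequencing;
    assembling the J base subsequences into the base sequence.
  12. The encoding and decoding method for binary information to base sequences for DNA data storage according to claim 9, characterized in that each base subsequence carries an index tag, and assembling the J base subsequences into the base sequence comprises:
    determining the position of each base subsequence in the base sequence according to the index tag;
    assembling the J base subsequences into the base sequence according to the positions of the base subsequences in the base sequence.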
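  13. An encoding and decoding device for binary data to base sequences for DNA data storage, characterized in that the encoding and decoding device comprises:
    a construction module, configured to use reference base groups to represent reference binary units and construct a library of mapping encoding rules based on reference base group-reference binary unit pairs, wherein the number of bases in the reference base group is M, the number of bits in the binary unit is 2M, and M is an integer greater than or equal to 2;
    an acquisition module, configured to acquire binary data to be encoded, the binary data comprising a plurality of binary units;
    an encoding module, configured to encode the plurality of binary units with N different mapping encoding rules to obtain a base sequence corresponding to the binary data, wherein N is an integer greater than or equal to 2 and the base sequence is used to synthesize DNA that stores the data information corresponding to the binary data; the N different mapping encoding rules are selected from the rules in the library of mapping encoding rules based on reference base group-reference binary unit pairs.
  14. The encoding and decoding device according to claim 13, characterized in that the encoding and decoding device further comprises:
    a sequencing module, configured to obtain the base sequence from the synthesized DNA by sequencing;
    a decoding module, configured to decode the base sequence according to the N different mapping encoding rules to obtain the binary data.
  15. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 12.
  16. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 12.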
PCT/CN2020/131511 2020-11-25 2020-11-25 Encoding and decoding method and encoding and decoding device for binary information to base sequences for DNA data storage WO2022109879A1 (zh)

Publications (1)

Publication Number Publication Date
WO2022109879A1 (zh) 2022-06-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20962770

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20962770

Country of ref document: EP

Kind code of ref document: A1