WO2022082573A1 - Method and device for processing DNA sequences containing data information - Google Patents

Method and device for processing DNA sequences containing data information

Info

Publication number
WO2022082573A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
base
dna
information
compressed
Prior art date
Application number
PCT/CN2020/122721
Other languages
English (en)
French (fr)
Inventor
黄小罗
戴俊彪
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Priority to PCT/CN2020/122721 priority Critical patent/WO2022082573A1/zh
Publication of WO2022082573A1 publication Critical patent/WO2022082573A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models

Definitions

  • the present application relates to the technical field of biological information, in particular to the technical field of DNA information storage, and more particularly to a method and device for processing DNA sequences containing data information.
  • IDC: Internet Data Center
  • Deoxyribonucleic acid is a macromolecular polymer composed of deoxynucleotides, which are composed of bases, deoxyribose sugars and phosphates. There are four bases that make up deoxynucleotides, namely adenine (A), guanine (G), thymine (T), and cytosine (C).
  • DNA-based data storage technology uses sequences of the above four bases to represent data series composed of binary "0" and "1". Compared with traditional storage media, DNA data storage offers high storage density, long storage time, low maintenance cost and good biocompatibility. According to theoretical calculations, 1 g of DNA can store 455 EB of data, which is 6-7 orders of magnitude higher than traditional media.
  • DNA can store data stably for more than a thousand years while requiring very few maintenance resources, such as land and electricity. Since DNA itself is the genetic material in nature, DNA-stored data can also be placed into animal, plant, and microbial cells to achieve permanent data storage that is passed down from generation to generation.
  • the DNA data storage process usually includes the following steps: (1) according to the preset correspondence between binary codes and the bases A, T, C, G, converting the binary data information into a code formed by the bases A, T, C, and G; (2) using a high-throughput DNA synthesizer, combined with enzyme splicing technology, to synthesize the above-mentioned DNA sequences storing the data information; (3) using a first- or second-generation high-throughput sequencer to sequence the synthesized DNA; (4) converting the DNA sequence formed by A/T/C/G back into binary data information according to the preset correspondence.
  • the number of bases in the converted DNA sequence is directly related to the number of DNA sequences synthesized in step (2) and to the storage density of the entire DNA data storage.
  • when the size of the binary data information is fixed, the more bases there are in the DNA sequence, the less data information a single base carries on average and the lower the storage density of the DNA data storage; conversely, the fewer bases there are in the DNA sequence, the more data information a single base carries on average and the higher the storage density of the DNA data storage.
  • reported methods achieve data storage densities of ≤ 2 bits/nt (bits/nt means bits per base) after converting binary 0/1 information into A/T/C/G DNA sequence information, but higher data storage densities have not been reported.
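  • The relationship between base count and storage density described above can be illustrated with a minimal sketch. The function name `storage_density` and the 12-base figure are hypothetical and are not taken from the application; only the bits-per-base ratio itself follows from the text.

```python
def storage_density(num_bits: int, num_bases: int) -> float:
    """Average number of bits carried per base (bits/nt)."""
    if num_bases <= 0:
        raise ValueError("num_bases must be positive")
    return num_bits / num_bases


# 38 bits written at two bits per base occupy 19 bases.
direct = storage_density(38, 19)

# Hypothetical figure: the same 38 bits represented with only 12 bases.
shorter = storage_density(38, 12)

print(direct, shorter > direct)
```

  For a fixed amount of binary data, any reduction in the number of bases raises the ratio, which is the effect the application aims for.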
  • One of the purposes of the embodiments of the present application is to provide a method and device for processing DNA sequences containing data information, aiming to solve the problem of low data storage density in the existing DNA data storage technology.
  • a method for processing a DNA sequence stored with data information comprising:
  • the DNA sequence is converted according to the data information to be stored.
  • the DNA sequence includes M base repeat segments, and each base repeat segment includes continuous and repeating base units, where M ≥ 1 and M is an integer;
  • each coding fragment includes a corresponding base unit, a reference base group representing the number of repetitions of the corresponding base unit, and at least one marker, and the marker is used to mark the start position and end position of the base unit in the coding fragment;
  • the compressed sequence is split to obtain a decoded sequence and an information sequence.
  • the decoded sequence includes the reference base sets in the M coding segments, and the information sequence includes the content of the compressed sequence other than the reference base sets.
  • the decoded sequence and the information sequence are used to synthesize DNA that stores data information.
  • the label is a modified base.
  • the label is a methylated base C.
  • the method further includes:
  • the information sequence is divided into J first sub-segments, and the decoding sequence is divided into K second sub-segments; wherein, J and K are both positive integers greater than 0 and less than 200 nt.
  • each first sub-segment is provided with a first index mark, which is used to mark the position of the first sub-segment in the information sequence; and each second sub-segment is provided with a second index mark, which is used to mark the position of the second sub-segment in the decoded sequence.
  • the first index label and the second index label are label units formed by one or more of the four bases.
  • the method further includes:
  • the compressed sequence is obtained
  • the compressed sequence is decompressed according to the corresponding relationship to obtain the DNA sequence.
  • the decoding sequence includes K second sub-segments; wherein, J and K are both positive integers greater than 0 and less than 200 nt,
  • Decoding sequences and informative sequences are obtained from the synthesized DNA by sequencing, including:
  • the J first sub-segments are spliced into an information sequence; according to the positional correspondence between the K second sub-segments, the K second sub-segments are spliced into decoding sequence.
  • a compressed sequence is obtained according to the decoding sequence and the information sequence, including:
  • the decoded sequence and the information sequence are combined to obtain a compressed sequence.
  • the compressed sequence is decompressed according to the corresponding relationship to obtain a DNA sequence, including:
  • the coding segment in the compressed sequence is decoded into the base repeat segment to obtain the DNA sequence.
  • a processing device for storing a DNA sequence of data information includes:
  • the acquisition module is used to acquire the DNA sequence to be compressed.
  • the DNA sequence is converted according to the data information to be stored.
  • the DNA sequence includes M base repeat segments, and each base repeat segment includes continuous and repeating base units, where M ≥ 1 and M is an integer;
  • the encoding module is used to encode the DNA sequence according to the preset correspondence between the number of repetitions and the reference base group to obtain a compressed sequence, where the compressed sequence includes M coding fragments and the M base repeat segments are in one-to-one correspondence with the M coding fragments; each coding fragment includes a corresponding base unit, a reference base group representing the number of repetitions of the corresponding base unit, and at least one marker, and the marker is used to mark the start position and end position of the base unit in the coding fragment;
  • the splitting module is used for splitting the compressed sequence to obtain a decoded sequence and an information sequence.
  • the decoded sequence includes the reference base groups in the M coding fragments;
  • the information sequence includes the base units, other bases, and labels of the compressed sequence other than the reference base groups; the decoded sequence and the information sequence are used to synthesize DNA in which data information is stored.
  • the processing system further includes:
  • Sequencing module for obtaining decoded sequences and informative sequences from synthetic DNA by sequencing
  • the decoding module is used to obtain the compressed sequence according to the decoding sequence and the information sequence;
  • the decompression module is used to decompress the compressed sequence according to the corresponding relationship to obtain the DNA sequence.
  • a terminal device including a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the processing method of the first aspect when the processor executes the computer program.
  • a computer-readable storage medium where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the processing method according to the first aspect is implemented.
  • an embodiment of the present application provides a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to execute the processing method of the above-mentioned first aspect.
  • in the method for processing a DNA sequence storing data information provided by the present application, one or more base repeat segments in the DNA sequence to be compressed are reduced to coding segments according to the preset correspondence between the number of repetitions and the reference base group, yielding a compressed sequence.
  • This process is equivalent to encrypting the DNA data storage, which can improve the data security of the DNA data storage information;
  • the number of bases in the DNA sequence to be compressed is greatly reduced, so the DNA synthesis and sequencing time can be saved in the subsequent DNA storage process, and the storage synthesis efficiency and identification efficiency of DNA data storage are improved.
  • FIG. 1 is a schematic flowchart of a method for compressing a DNA sequence provided with stored data information according to an embodiment of the present application
  • FIG. 2 is a schematic flow chart of a method for decompressing a DNA sequence that stores data information according to an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a method for compressing DNA sequences provided with stored data information according to an embodiment of the present application
  • FIG. 4 is a schematic flowchart of a method for decompressing a DNA sequence stored with data information according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a processing system for storing a DNA sequence with data information provided by an embodiment of the present application
  • FIG. 6 is a schematic diagram of a processing system for storing a DNA sequence with data information provided by another embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined” or “in response to the determination” or “once the [described condition or event] is detected” or “in response to detection of the [described condition or event]”.
  • references in this specification to “one embodiment” or “some embodiments” and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically emphasized otherwise.
  • the terms “including”, “comprising”, “having” and their variants mean “including but not limited to” unless specifically emphasized otherwise.
  • DNA sequences that store data information are usually obtained by the following methods:
  • the data type is any text, picture, sound, video, software, program and other information that can be displayed in the terminal device, but is not limited to this.
  • the binary data information is converted into a DNA sequence encoded by the bases A, T, C, and G and storing the data information.
  • the preset correspondence between binary codes and bases A, T, C, and G is: a base A represents 00, a base T represents 01, a base C represents 10, and a base G represents 11.
  • the binary data information is 00110110101100101011011011000011001001
  • the binary data information is converted into a DNA sequence with the base sequence AGTCCGACCGTCGAAGACT.
  • the preset correspondence between binary codes and bases A, T, C, and G is not limited to the above examples.
  • for example, a base T represents 00, a base A represents 01, a base G represents 10, and a base C represents 11, but the correspondence is not limited thereto.
  • the preset correspondence between the binary code and the bases A, T, C, and G only needs to be able to convert the binary data information into a DNA sequence according to the preset correspondence, and is not limited to the above examples.
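  • The first example correspondence (A = 00, T = 01, C = 10, G = 11) can be sketched as a pair of lookup tables. The function names `bits_to_dna` and `dna_to_bits` are hypothetical; the mapping and the sample bit string are the ones given in the example.

```python
# Example correspondence from the text: A=00, T=01, C=10, G=11.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_dna(bits: str) -> str:
    """Convert a binary string to bases, two bits per base."""
    if len(bits) % 2 != 0:
        raise ValueError("expected an even number of bits")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Convert bases back to the binary string."""
    return "".join(BASE_TO_BITS[base] for base in dna)


bits = "00110110101100101011011011000011001001"
dna = bits_to_dna(bits)
print(dna)  # AGTCCGACCGTCGAAGACT
```

  Swapping in a different correspondence only requires changing `BITS_TO_BASE`, which matches the statement that the correspondence is not limited to the examples.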
  • With this method, data information can be stored in DNA for a long time.
  • the DNA sequence obtained by representing a binary code by one base has a large number of bases, so the storage density is not high.
  • the present application provides a method for processing a DNA sequence with data information.
  • the DNA sequence storing data information is simplified and the number of bases in the DNA storing the data information is reduced, so that the long-sequence DNA storing data information is simplified into a short compressed sequence; the storage density per base is improved, and so is the storage density of the resulting DNA.
  • Some embodiments of the present application provide a method for processing DNA stored with data information, including a method for compressing and decompressing a DNA sequence with stored data information.
  • a method for compressing DNA sequences with stored data information including:
  • the DNA sequence is converted according to the data information to be stored.
  • the DNA sequence includes M base repeat segments, and each base repeat segment includes continuous and repeating base units, where M ≥ 1 and M is an integer.
  • the DNA sequence to be compressed is obtained by conversion according to the data information to be stored and the preset correspondence between the binary code and the bases A, T, C, and G.
  • the preset correspondence between the binary code and the bases A, T, C, and G follows the rule of maximizing the number of bases that can be repeated in the DNA sequence, so that the DNA sequence is encoded into a compressed sequence with the smallest number of bases, which maximizes the amount of stored information carried by a single base and improves the storage density of DNA data.
  • the DNA sequence includes M repeating segments, where M is an integer greater than or equal to 1. It should be understood that the number of base repeats in the DNA sequence may be M or more than M. In some embodiments, the DNA sequence contains N base repeats, and N is greater than M; in some embodiments, the DNA sequence has and only M base repeats.
  • Each base repeat segment includes continuous and repeating base units, wherein the base units can be mono-base, di-base units, tri-base units and tetra-base units.
  • when the base repeat segment includes continuous and repeated single bases, it refers to a repeated arrangement of several identical single bases, such as AAAAA, TTTTTTTTTT, GGGGGGGG, and CCCC; the corresponding base units are single base A, single base T, single base G, and single base C.
  • the number of base repeats in the base repeat segment formed by continuous and repeated single bases is not limited to the above examples.
  • when the base repeat segment includes continuous and repeated dibasic units, it refers to the repeated arrangement of dibasic units formed by combining any two of the four bases A, T, G, and C, such as ATATATAT, TCTCTCTCTCTCTCTCTC, GCGCGCGCGCGCGC, and CTCTCTCT; the corresponding base units are dibasic unit AT, dibasic unit TC, dibasic unit GC, and dibasic unit CT.
  • the combination type of dibases and the number of repetitions of dibases in the second sequence fragment are not limited to the above examples.
  • when the base repeat segment includes continuous and repeated three-base units, it refers to the repeated arrangement of three-base units formed by combining any two or three of the four bases A, T, G, and C into three bases, such as AGTAGTAGTAGT, TCATCATCATCATCATCATCATCA, GTCGTCGTCGTCGTCGTCGTC, CGTCGTCGTCGT, and AATAATAAAT; the corresponding base units include three-base unit AGT, three-base unit TCA, three-base unit GTC, and three-base unit CGT.
  • the combination type of three bases and the number of repetitions of three bases in the third sequence fragment are not limited to the above examples.
  • when the base repeat segment includes continuous and repeated four-base units, it refers to a repeating arrangement of four-base units formed by combining any two of the four bases A, T, G, and C into four bases in a non-ABAB manner, by combining any three of the four bases A, T, G, and C into four bases, or by combining all four bases A, T, G, and C in any order, such as AGGAAGGAA, ATCAATCA, AGTCAGTCAGTC, TGCATGCATGCATGCA, GATCGATCGATC, and CGATCGATCGATCGAT; the corresponding base units are, respectively, four-base unit AGGA, four-base unit ATCA, four-base unit AGTC, four-base unit TGCA, four-base unit GATC, and four-base unit CGAT.
  • the combination type of four bases and the number of repetitions of four bases in the fourth sequence fragment are not limited to the above examples.
  • the M base repeat segments correspond, respectively, to the repeat segments in the DNA sequence formed by continuous and repeating single-base units, two-base units, and three-base units. That is, the M base repeat segments do not include repeat segments formed by continuous and repeated four-base units.
  • the following step S20 can then use a reference base group with fewer bases to encode the repetition number of a single base or base unit in the DNA sequence, for example a reference base group consisting of two bases, thereby further reducing the number of bases in the resulting compressed sequence and increasing the average storage density per base.
  • each coding fragment includes a corresponding base unit, a reference base group representing the number of repetitions of the corresponding base unit, and at least one marker, and the marker is used to mark the start position and the end position of the base unit in the coding fragment.
  • the DNA sequence is encoded according to the preset correspondence between the number of repetitions and the reference base group, and the M base repeat segments in the DNA sequence are encoded as M coding fragments with fewer bases, thereby reducing long-sequence DNA that stores data information into short-sequence DNA.
  • the number of repetitions refers to the number of repetitions of base units in the base repeat segment.
  • for example, in the base repeat segment TTTTTTTTTT, the number of repetitions of the base unit T is 10; in the base repeat segment ATATATAT, the number of repetitions of the base unit AT is 4; in the base repeat segment TGCTGCTGCTGCTGCTGC, the number of repetitions of the base unit TGC is 6; in the base repeat segment ATCGATCG, the number of repetitions of the base unit ATCG is 2.
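  • The repetition counts in these examples can be checked with a short sketch. The helper name `repetition_count` is hypothetical, and the sketch assumes the segment is a pure repeat of a known base unit.

```python
def repetition_count(segment: str, unit: str) -> int:
    """Return how many times `unit` repeats back to back to form `segment`."""
    if not unit or len(segment) % len(unit) != 0:
        raise ValueError("segment is not a whole number of units")
    count = len(segment) // len(unit)
    if unit * count != segment:
        raise ValueError("segment is not a pure repeat of unit")
    return count


for segment, unit in [
    ("TTTTTTTTTT", "T"),
    ("ATATATAT", "AT"),
    ("TGCTGCTGCTGCTGCTGC", "TGC"),
    ("ATCGATCG", "ATCG"),
]:
    print(unit, repetition_count(segment, unit))
```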
  • the type and number of bases in the reference base set may be selected according to the maximum repetition number of base units in the M base repeating fragments to be encoded in the DNA sequence.
  • the reference base set can be selected from single base, double base, three bases or four bases.
  • the smaller the number of bases in the reference base set the more conducive to reducing the number of bases in the compressed sequence, and thus the more conducive to improving the storage density of unit bases.
  • the number of bases in the reference base set is 1, that is, when a single base is selected for the reference base set, the number of bases in the compressed sequence obtained is the least, and correspondingly, the storage density of the unit base is the highest.
  • the reference base set can be selected from two bases, three bases or four bases.
  • the number of bases in the reference base set is 2, that is, when double bases are selected for the reference base set, the number of bases in the compressed sequence obtained is the least, and correspondingly, the storage density of unit bases is the highest.
  • the obtained compressed sequence has the least number of bases, where s is an integer greater than or equal to 3.
  • the corresponding relationship between the preset number of repetitions and the reference base set refers to the equivalent relationship between the preset number of repetitions of the base unit in the base repeat segment and the reference base set.
  • the number of bases in the reference base group is 2, and the correspondence between the number of repetitions and the reference base group can be preset as follows: 5 corresponds to AT, 6 corresponds to AC, 7 corresponds to AG, 8 corresponds to TA, 9 corresponds to TC, 10 corresponds to TG, 11 corresponds to CA, 12 corresponds to CT, 13 corresponds to CG, 14 corresponds to GA, 15 corresponds to GT, 16 corresponds to GC, and so on.
  • the number of bases in the reference base group is not limited to 2
  • the correspondence between the number of repetitions and the reference base group is not limited to the above-mentioned correspondence.
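  • The listed example correspondence (5 to AT through 16 to GC) happens to enumerate the ordered pairs of two different bases taken in the order A, T, C, G. A minimal sketch under that observation follows; the table names are hypothetical, and how the correspondence continues beyond 16 is not specified in the text, so the sketch stops there.

```python
from itertools import permutations

# Ordered pairs of two different bases, taken in the order A, T, C, G:
# AT, AC, AG, TA, TC, TG, CA, CT, CG, GA, GT, GC.
PAIRS = ["".join(p) for p in permutations("ATCG", 2)]

# Repetition counts 5..16 mapped to the twelve pairs, as in the example.
COUNT_TO_GROUP = {count: pair for count, pair in enumerate(PAIRS, start=5)}
GROUP_TO_COUNT = {pair: count for count, pair in COUNT_TO_GROUP.items()}

print(COUNT_TO_GROUP[8], GROUP_TO_COUNT["GC"])
```

  Keeping both directions of the table makes the same preset correspondence usable for compression and for decompression.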
  • the DNA sequence is encoded according to the corresponding relationship between the preset number of repetitions and the reference base set to obtain a compressed sequence, including:
  • a number n is used to represent the number of repetitions of the base unit in each of the M base repeat segments in the DNA sequence; n replaces all repeated base units in the base repeat segment except one, that is, one base unit is retained in each of the M base repeat segments.
  • the code when the base repeat segment is AAAAAAAA, the code is 8A; when the base repeat segment is GCGCGCGCGCGCGC, the code is 7GC; when the base repeat segment is AGTAGTAGTAGT, the code is 4AGT; when the base repeat segment is AGCTAGCTAGCTAGCTAGCT , the code is 5AGCT.
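  • The intermediate "count plus retained unit" form in these examples (8A, 7GC, 4AGT, 5AGCT) can be sketched as follows. The function names are hypothetical, and the sketch assumes the length of the base unit is already known; it does not include the reference base group or the markers.

```python
import re


def encode_repeat(segment: str, unit_len: int) -> str:
    """Encode a pure repeat as '<count><unit>', keeping one copy of the unit."""
    if unit_len <= 0:
        raise ValueError("unit_len must be positive")
    unit = segment[:unit_len]
    count, remainder = divmod(len(segment), unit_len)
    if remainder or unit * count != segment:
        raise ValueError("segment is not a pure repeat of the given unit length")
    return f"{count}{unit}"


def decode_repeat(code: str) -> str:
    """Expand '<count><unit>' back into the base repeat segment."""
    match = re.fullmatch(r"(\d+)([ATCG]+)", code)
    if match is None:
        raise ValueError("expected '<count><unit>'")
    return match.group(2) * int(match.group(1))


print(encode_repeat("GCGCGCGCGCGCGC", 2))  # 7GC
print(decode_repeat("4AGT"))               # AGTAGTAGTAGT
```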
  • a marker is used to mark the start position and the end position of the base unit retained in the encoded M-base repeat fragment, and the marker is a marker with synthesizable and identifiable properties.
  • the markers in the test sequence can be automatically identified during the sequencing process, and then the type of the base repeat fragment can be identified, so as to realize the decompression of the DNA sequence.
  • the label is a modified base.
  • the modified base refers to a base obtained by modifying a base.
  • the label is a synthesizable and identifiable modified base.
  • the modified base is the methylated base C.
  • marking the start position and the end position of the base unit retained in the n-encoded M base repeat segments includes: using different numbers or types of modified bases to mark the start and end positions of the base units retained in the M base repeat segments.
  • when the base unit is a single base, one marker is inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment;
  • when the base unit is a two-base unit, two markers are inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment;
  • when the base unit is a three-base unit, three markers are inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment;
  • when the base unit is a four-base unit, four markers are inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment.
  • the starting position of the base unit can be determined according to the insertion position of the label; the end position of the base unit can be determined according to the quantity of the label.
  • the correspondence between the number of labels and the types of base units is not limited to the above examples.
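  • The count-based marking scheme above can be sketched as follows. The symbol `m` is a hypothetical stand-in for a modified base such as a methylated C, and placing the markers directly before the retained unit is only one of the preset positions the text allows.

```python
MARKER = "m"  # hypothetical symbol standing in for a modified base


def mark_unit(unit: str) -> str:
    """Prefix the retained base unit with as many markers as it has bases."""
    if not 1 <= len(unit) <= 4:
        raise ValueError("base unit must have 1 to 4 bases")
    return MARKER * len(unit) + unit


def read_unit(text: str, start: int) -> tuple[str, int]:
    """Read a marked unit beginning at `start`; return (unit, next_index).

    The insertion position gives the start of the unit, and the number of
    markers gives its length and therefore its end.
    """
    n = 0
    while start + n < len(text) and text[start + n] == MARKER:
        n += 1
    if n == 0:
        raise ValueError("no marker at start position")
    unit_start = start + n
    unit = text[unit_start:unit_start + n]
    if len(unit) != n:
        raise ValueError("sequence ends before the unit is complete")
    return unit, unit_start + n


print(mark_unit("GC"))            # mmGC
print(read_unit("mmGCTT", 0))     # ('GC', 4)
```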
  • the label is a modified base; in some embodiments, the label is a methylated base C.
  • when the base unit is a single base, one or more markers are inserted before and after the single base retained in the n-encoded base repeat segment, respectively;
  • when the base unit is a two-base unit, one or more markers are inserted at the starting position and the ending position of the base unit retained in the n-encoded base repeat segment, respectively;
  • when the base unit is a three-base unit, one or more markers are inserted at the starting position and the ending position of the base unit retained in the n-encoded base repeat segment, respectively;
  • when the base unit is a four-base unit, one or more markers are inserted at the starting position and the ending position of the base unit retained in the n-encoded base repeat segment, respectively.
  • the starting point position and the ending point position of the base unit can be directly determined based on the insertion position of the label.
  • the types of labels inserted into the base units retained in the M base repeat segments may be the same or different; similarly, the number of markers may be the same or different.
  • the label is a modified base.
  • the start position and end position of the base units retained in the n-encoded M base repeat segments are marked.
  • when the base unit is a single base, a first preset marker is inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment;
  • when the base unit is a two-base unit, a second preset marker is inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment;
  • when the base unit is a three-base unit, a third preset marker is inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment;
  • when the base unit is a four-base unit, a fourth preset marker is inserted at the starting position (or another preset position) of the base unit retained in the n-encoded base repeat segment.
  • the types of the four preset labels are four bases that are identifiable and different from each other.
  • the starting position of the base unit can be determined according to the insertion position of the label; the end position of the base unit can be determined according to the type of the label.
  • the label is a modified base.
  • the position, quantity and type of the inserted markers can be determined according to the preset insertion rule.
  • the insertion rule can be to insert the marker before the first base of the n-encoded fragment, after the last base of the n-encoded fragment, or between bases at other positions, that is, at other preset positions.
  • each coding fragment includes a corresponding base unit, a reference base group representing the number of repetitions of the corresponding base unit, and at least one marker, and the marker is used to mark the start position and the end position of the base unit in the coding fragment.
  • the decoded sequence includes the reference base sets in the M coding segments, the information sequence includes the content of the compressed sequence other than the reference base sets, and the decoded sequence and the information sequence are used to synthesize DNA storing data information.
  • splitting the compressed sequence includes: extracting the reference base set in the compressed sequence, and encoding the extracted reference base set into a decoded sequence, where the decoded sequence includes the reference base set in the M coding segments.
  • encoding the extracted reference base groups into a decoding sequence includes: presetting an arrangement order for the reference base groups, and arranging the reference base groups according to that preset order to obtain the decoding sequence.
  • the arrangement order of the reference base group can be set in advance according to the setting rule.
  • the reference base groups may be arranged in the order in which they appear in the compressed sequence to obtain the decoded sequence.
  • the reference base groups may be arranged in an order opposite to the order of appearance of the reference base groups to obtain the decoded sequence.
  • the sequence of the reference base groups that appear in sequence can be arranged according to other preset sequences to obtain the decoded sequence.
  • for example, the first reference base group is ranked first, the second reference base group is ranked third, the third reference base group is ranked fifth, the last reference base group is ranked second, the second-to-last reference base group is ranked fourth, and so on.
  • splitting the compressed sequence includes: extracting the content of the compressed sequence other than the reference base groups, and encoding it into an information sequence.
  • the information sequence includes the base units, other bases and labels of the compressed sequence other than the reference base sets. Here, other bases refer to the remaining bases in the compressed sequence, excluding the reference base groups and base units.
  • the information sequence includes the base units and labels of the compressed sequence other than the reference base sets. That is, the DNA sequence is composed of base repeats and does not contain bases other than base repeats.
  • the decoded sequences and information sequences obtained in the examples of the present application are used to synthesize DNA storing data information.
  • the decoding sequences and information sequences provided in the examples of the present application can be synthesized by chemical DNA synthesis method or enzymatic DNA synthesis method.
  • the method further includes: dividing the information sequence into J first sub-fragments, and dividing the decoding sequence into K second sub-fragments, where J and K are both positive integers greater than 0 and less than 200 nt.
  • the first index label and the second index label are label units formed from one or more of the four bases.
  • for example, 2 bases can represent 16 label units; 3 bases, 64 label units; 4 bases, 256 label units; 5 bases, 1024 label units; and so on.
  • when a label unit is formed from at least two of the four bases, one or more of the bases may be repeated within the label unit, so that the label unit can contain more than 4 bases. For example, a label unit of 5 bases may be AATGC.
  • for example, each first sub-fragment carries a first index label that marks the position of the first sub-fragment in the information sequence, and each second sub-fragment carries a second index label that marks the position of the second sub-fragment in the decoding sequence; after the sub-fragments are synthesized separately, the first sub-fragments are joined according to the first index labels to obtain the information sequence, and the second sub-fragments are joined according to the second index labels to obtain the decoding sequence.
  • the labeling order of the first and second sub-fragments is determined by the position of each sub-fragment in the information sequence or the decoding sequence.
  • for example, the decoding sequence or the information sequence is split into 256 fragments for synthesis, and 4 bases are used to encode the order positions 1 to 256 of those fragments.
  • adaptor sequences are attached to both ends of each first sub-fragment and each second sub-fragment, and the adaptor sequences are used for amplifying the synthesized fragments.
  • the adaptor sequence is a sequence of 16 to 20 bases.
  • the synthesized decoding sequence and the information sequence are stored separately.
  • the decoded sequences and information sequences can be stored in organic or inorganic container media, such as polypropylene centrifuge tubes, but can also be stored in other formats.
  • in this way, the DNA that stores the data information is reduced from one long sequence containing M base repeat fragments to two short sequences: a decoding sequence consisting of the reference base groups, and an information sequence consisting of the base units, the other bases, and the markers.
  • on the one hand, the number of bases in the sequence is greatly reduced, which increases the average amount of data stored per base;
  • on the other hand, the DNA sequence corresponding to the information sequence can be recovered only with the decoding sequence, which enhances the security of the data information.
  • a method for decompressing a DNA sequence stored with data information comprising:
  • sequencing technology is used to read the decoding sequence and the information sequence of the synthetic DNA, respectively.
  • the Sanger sequencing technology can be used to read the decoding sequence or the information sequence
  • the second-generation high-throughput sequencing technology can also be used to read the decoding sequence or the information sequence.
  • reading the information sequence of the synthetic DNA includes reading the markers in the information sequence.
  • each marker is read with the technique appropriate to its type.
  • for example, when the marker is methylated cytosine (C), a technique for reading methylated cytosine is used to read the modified base.
  • in some embodiments, the sequence to be read is treated with bisulfite when the methylated cytosine is read.
  • in one embodiment, if the information sequence includes J first sub-fragments and the decoding sequence includes K second sub-fragments, where J and K are both positive integers greater than 0 and less than 200 nt, then obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing includes:
  • obtaining the J first sub-fragments and the K second sub-fragments from the synthesized DNA by sequencing;
  • splicing the J first sub-fragments into the information sequence according to the positional correspondence among them, and splicing the K second sub-fragments into the decoding sequence according to the positional correspondence among them.
  • in some embodiments, obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing includes: performing PCR amplification on the J first sub-fragments and the K second sub-fragments separately, and reading the base sequences of the amplified fragments; sorting the amplified first sub-fragments according to the first index labels and splicing them into the complete information sequence; and sorting the amplified second sub-fragments according to the second index labels and splicing them into the complete decoding sequence.
  • the concentration of each sub-fragment is increased by PCR amplification, and the sequencing and identification efficiency of each sub-fragment is improved.
  • according to the decoding sequence and the marker information of the information sequence, the reference base groups in the decoding sequence are inserted into the information sequence to obtain the compressed sequence.
  • in one embodiment, obtaining the compressed sequence from the decoding sequence and the information sequence includes: merging the decoding sequence and the information sequence according to the arrangement order of the reference base groups in the decoding sequence and the positions of the markers in the information sequence.
  • in some embodiments, obtaining the compressed sequence from the decoding sequence and the information sequence includes: obtaining the reference base groups from the decoding sequence; determining the marker information in the information sequence; and inserting a reference base group at the position corresponding to each marked base unit in the information sequence, so that the decoding sequence and the information sequence are merged into the compressed sequence.
  • according to the correspondence, the coding fragments in the compressed sequence are decoded into base repeat fragments to obtain the DNA sequence.
  • decompressing the compressed sequence according to the correspondence to obtain the DNA sequence includes: according to the preset correspondence between numbers of repetitions and reference base groups, restoring the base units encoded by the M reference base groups in the compressed sequence to M base repeat fragments, thereby obtaining the DNA sequence.
  • the DNA sequence is converted back into a 0/1 binary sequence according to the coding rule between 0/1 binary codes and bases. Further, the 0/1 binary sequence can be converted by a conversion program into the corresponding picture, text, video, or other information.
  • the decompression method provided in the embodiments of the present application sequences and reads the decoding sequence and the information sequence produced by the above compression method, determines the position and type of each repeat fragment from the markers in the sequence, and combines the decoding sequence and the information sequence to reconstruct the DNA sequence storing the data information.
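  • An embodiment provides a process of compressing and decompressing DNA that encodes the following data information: "011100000000000000000000010010111100110011001100110011001100000110000110000110000110000110000110". The compression process includes the following steps:
  • Step 1: Encode the data information with the rule "A=00, T=01, C=10, G=11" to obtain the DNA sequence TGAAAAAAAAAATACGGAGAGAGAGAGAGAATCATCATCATCATCATC, and extract the single-base repeat "AAAAAAAAAA", the two-base repeat "GAGAGAGAGAGAGA", and the three-base repeat "ATCATCATCATCATCATC".
  • Step 2: Use 10A to represent the repeat "AAAAAAAAAA", 7GA to represent the repeat "GAGAGAGAGAGAGA", and 6ATC to represent the repeat "ATCATCATCATCATCATC", forming the new sequence TG10ATACG7GA6ATC.
  • Step 3: Insert 1 methylated cytosine C* between 10 and A (methylated cytosine can currently be both synthesized and sequenced), 2 methylated cytosines C* between 7 and GA, and 3 methylated cytosines C* between 6 and ATC, forming a new sequence named the DNA template sequence: TG10C*ATACG7C*C*GA6C*C*C*ATC.
  • Step 4: Establish the correspondence "5=AT; 6=AC; 7=AG; 8=TA; 9=TC; 10=TG; 11=CA; 12=CT; 13=CG; 14=GA; 15=GT; 16=GC", and use TG for the 10, AG for the 7, and AC for the 6 in the DNA template sequence, forming the new sequence TGTGC*ATACGAGC*C*GAACC*C*C*ATC. Extract, in order, the two-base groups that stand for the numbers and combine them into a decoding sequence: TGAGAC.
  • Step 5: Delete the numbers from the DNA template sequence of Step 3 to obtain the information sequence: TGC*ATACGC*C*GAC*C*C*ATC.
  • The decoding sequence and the information sequence obtained in this way can be used in the subsequent DNA synthesis process.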
  • The decompression process includes the following steps:
  • Step 1: Take the decoding sequence "TGAGAC" obtained by sequencing two bases at a time and insert each two-base group, in order, into the information sequence "TGC*ATACGC*C*GAC*C*C*ATC" obtained by sequencing, at the boundary between an unmodified base and the run of modified bases that follows it, forming the new sequence TGTGC*ATACGAGC*C*GAACC*C*C*ATC.
  • Step 2: Using the correspondence "5=AT; 6=AC; 7=AG; 8=TA; 9=TC; 10=TG; 11=CA; 12=CT; 13=CG; 14=GA; 15=GT; 16=GC", replace TG with 10, AG with 7, and AC with 6, forming the sequence TG10C*ATACG7C*C*GA6C*C*C*ATC.
  • Step 3: Determine each repeated base unit from the number of modified bases and, together with the repetition count obtained, restore the complete sequence.
  • 10C*A means the single base A repeated 10 times and is restored to "AAAAAAAAAA";
  • 7C*C*GA means the two-base unit GA repeated 7 times and is restored to "GAGAGAGAGAGAGA";
  • 6C*C*C*ATC means the three-base unit ATC repeated 6 times and is restored to "ATCATCATCATCATCATC". The complete sequence TGAAAAAAAAAATACGGAGAGAGAGAGAGAATCATCATCATCATCATC is thus recovered and is then translated into the 0/1 binary sequence with the rule "A=00; T=01; C=10; G=11".
  • In this embodiment, 96 bits of binary information are first stored in a DNA sequence of 48 bases, and the 48 bases are then compressed into 24 bases by the compression method disclosed in this application.
  • This 50% compression doubles the data storage density from 2 bits/nt to 4 bits/nt; at the same time, the compressed sequence contains no single-base, two-base, or three-base repeats, which benefits subsequent synthesis and sequencing.
  • some embodiments of the present application provide a processing device 5 for storing a DNA sequence of data information.
  • the processing device 5 includes:
  • the obtaining module 51 is used to obtain the DNA sequence to be compressed, where the DNA sequence is converted from the data information to be stored and includes M base repeat fragments, each base repeat fragment includes consecutive repeated base units, M ≥ 1, and M is an integer;
  • the encoding module 52 is configured to encode the DNA sequence according to the preset correspondence between numbers of repetitions and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding fragments, the M base repeat fragments correspond one-to-one to the M coding fragments, and each coding fragment includes a corresponding base unit, a reference base group representing the number of repetitions of the corresponding base unit, and at least one marker used to mark the start and end positions of the base unit in the coding fragment;
  • the splitting module 53 is configured to split the compressed sequence to obtain a decoding sequence and an information sequence, where the decoding sequence includes the reference base groups in the M coding fragments, the information sequence includes the base units, other bases, and markers of the compressed sequence (everything except the reference base groups), and the decoding sequence and the information sequence are used to synthesize DNA that stores the data information.
  • the processing system 5 further includes:
  • the sequencing module 54 is used to obtain the decoding sequence and the information sequence from the synthetic DNA by sequencing
  • the decoding module 55 is used for obtaining the compressed sequence according to the decoding sequence and the information sequence;
  • the decompression module 56 is configured to decompress the compressed sequence according to the corresponding relationship to obtain the DNA sequence.
  • the terminal device 70 provided in this embodiment includes: a processor 710 , a memory 720 , and a computer program 721 stored in the memory 720 and running on the processor 710 .
  • when the processor 710 executes the computer program 721, the steps in each of the above embodiments of the processing method are implemented, for example, steps S10 to S30 shown in FIG. 1.
  • the computer program 721 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 720 and executed by the processor 710 to complete the present application.
  • One or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments may be used to describe the execution process of the computer program 721 in the terminal device.
  • the computer program 721 can be divided into an acquisition module, an encoding module and a splitting module, and the specific functions of each module are as follows:
  • the acquisition module is used to acquire the DNA sequence to be compressed.
  • the DNA sequence is converted according to the data information to be stored.
  • the DNA sequence includes M base repeat fragments, each base repeat fragment includes consecutive repeated base units, M ≥ 1, and M is an integer;
  • the encoding module is used to encode the DNA sequence according to the preset correspondence between numbers of repetitions and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding fragments, the M base repeat fragments correspond one-to-one to the M coding fragments, and each coding fragment includes a corresponding base unit, a reference base group representing the number of repetitions of the corresponding base unit, and at least one marker used to mark the start position and the end position of the base unit in the coding fragment;
  • the splitting module is used to split the compressed sequence to obtain a decoding sequence and an information sequence.
  • the decoding sequence includes the reference base groups in the M coding fragments;
  • the information sequence includes the base units, other bases, and markers of the compressed sequence (everything except the reference base groups), and the decoding sequence and the information sequence are used to synthesize DNA in which the data information is stored.
  • the computer program 721 can also be divided into a sequencing module, a decoding module, and a decompression module, and the specific functions of each module are as follows:
  • the sequencing module is used to obtain the decoding sequence and the information sequence from the synthetic DNA by sequencing;
  • the decoding module is used to obtain the compressed sequence according to the decoding sequence and the information sequence;
  • the decompression module is used to decompress the compressed sequence according to the corresponding relationship to obtain the DNA sequence.
  • the terminal device 70 may include, but is not limited to, a processor 710 and a memory 720 .
  • FIG. 7 is only an example of the terminal device 70 and does not limit it; the terminal device 70 may include more or fewer components than shown, combine some components, or use different components.
  • the processor 710 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 720 may be an internal storage unit of the terminal device 70 , such as a hard disk or a memory of the terminal device 70 .
  • the memory 720 may also be an external storage device of the terminal device 70, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal device 70.
  • the memory 720 may also include both an internal storage unit of the terminal device 70 and an external storage device.
  • the memory 720 is used to store the computer program 721 and other programs and data required by the terminal device 70 .
  • the memory 720 may also be used to temporarily store data that has been output or will be output.
  • Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processing methods of the foregoing embodiments are implemented.
  • the embodiments of the present application also provide a computer program product, which enables the terminal device to execute the processing methods of the foregoing embodiments when the computer program product runs on the terminal device.

Abstract

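A method and device for processing a DNA sequence storing data information, relating to the technical field of biological information. The method includes: obtaining a DNA sequence to be compressed, where the DNA sequence is converted from the data information to be stored and includes M base repeat fragments (S10); encoding the DNA sequence according to a preset correspondence between numbers of repetitions and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding fragments and the M base repeat fragments correspond one-to-one to the M coding fragments (S20); and splitting the compressed sequence to obtain a decoding sequence and an information sequence, where the decoding sequence includes the reference base groups in the M coding fragments and the information sequence includes the content of the compressed sequence other than the reference base groups (S30). The method can improve the information coding density and the data security of DNA data storage.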

Description

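Method and device for processing a DNA sequence storing data information
Technical Field
The present application relates to the technical field of biological information, in particular to the technical field of DNA information storage, and more particularly to a method and device for processing a DNA sequence storing data information.
Background
With the development of the Internet, artificial intelligence, and big data, the amount of data generated by human society every day is growing explosively. The Internet Data Center (IDC) predicted that by 2020 the total amount of data worldwide would reach 44 ZB (44×10^12 GB). Traditional storage media such as magnetic tape, optical discs, and hard disks consume a lot of power, have short storage lifetimes, and are costly; at the same time, the silicon resources used for information storage are being rapidly depleted. Finding an alternative to silicon-based storage that achieves low-cost, efficient, stable, and long-term data storage is therefore particularly important for the rapid development of today's information society.
Deoxyribonucleic acid (DNA) is a macromolecular polymer composed of deoxynucleotides, each of which consists of a base, deoxyribose, and phosphate. There are four bases: adenine (A), guanine (G), thymine (T), and cytosine (C). DNA-based data storage uses sequences of these four bases to represent data series composed of binary "0" and "1". Compared with traditional storage media, DNA data storage offers high storage density, long storage time, low maintenance cost, and good biocompatibility. According to theoretical estimates, 1 g of DNA can store 455 EB of data, 6 to 7 orders of magnitude more than traditional media. DNA can also store data stably for more than a thousand years while requiring very little in maintenance resources such as floor space and electricity. Because DNA is itself the genetic material of nature, DNA-stored data can also be placed in animal, plant, or microbial cells to achieve permanent data storage passed on from generation to generation.
The DNA data storage workflow usually includes the following steps: (1) according to a preset correspondence between binary codes and the bases A, T, C, and G, converting binary data information into a DNA sequence that is encoded by the bases A, T, C, and G and stores the data information; (2) synthesizing this DNA sequence using a high-throughput DNA synthesizer combined with enzymatic assembly; (3) sequencing the synthesized DNA sequence with a first-generation or second-generation high-throughput sequencer; (4) converting the DNA sequence formed of A/T/C/G back into binary data information according to the preset correspondence. In step (1), the information held in binary form is carried by the converted DNA sequence, so the number of bases in the converted DNA sequence directly determines the amount of DNA to be synthesized in step (2) and the storage density of the whole DNA data storage. For a given size of binary data, the more bases the DNA sequence has, the less data each base carries on average and the lower the storage density; conversely, the fewer bases the DNA sequence has, the more data each base carries on average and the higher the storage density. The methods reported so far achieve a data storage density of < 2 bits/nt (bits per nucleotide) after converting binary 0/1 information into A/T/C/G sequence information; a higher data storage density has not yet been reported.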
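Technical Problem
One objective of the embodiments of the present application is to provide a method and device for processing a DNA sequence storing data information, aiming to solve the problem that existing DNA data storage technology has a low data storage density.
Technical Solution
To solve the above technical problem, the embodiments of the present application adopt the following technical solutions:
In a first aspect, a method for processing a DNA sequence storing data information is provided, the method including:
obtaining a DNA sequence to be compressed, where the DNA sequence is converted from the data information to be stored and includes M base repeat fragments, each base repeat fragment includes consecutive repeated base units, M ≥ 1, and M is an integer;
encoding the DNA sequence according to a preset correspondence between numbers of repetitions and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding fragments, the M base repeat fragments correspond one-to-one to the M coding fragments, and each coding fragment includes the corresponding base unit, a reference base group representing the number of repetitions of that base unit, and at least one marker used to mark the start position and the end position of the base unit in the coding fragment;
splitting the compressed sequence to obtain a decoding sequence and an information sequence, where the decoding sequence includes the reference base groups in the M coding fragments, the information sequence includes the content of the compressed sequence other than the reference base groups, and the decoding sequence and the information sequence are used to synthesize DNA storing the data information.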
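In one embodiment, the marker is a modified base.
In one embodiment, the marker is methylated cytosine (C).
In one embodiment, the method further includes:
dividing the information sequence into J first sub-fragments and dividing the decoding sequence into K second sub-fragments, where J and K are both positive integers greater than 0 and less than 200 nt.
In one embodiment, each first sub-fragment carries a first index label that marks the position of the first sub-fragment in the information sequence, and each second sub-fragment carries a second index label that marks the position of the second sub-fragment in the decoding sequence.
In one embodiment, the first index label and the second index label are label units formed from one or more of the four bases.
In one embodiment, the method further includes:
obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing;
obtaining the compressed sequence from the decoding sequence and the information sequence;
decompressing the compressed sequence according to the correspondence to obtain the DNA sequence.
In one embodiment, if the information sequence includes J first sub-fragments and the decoding sequence includes K second sub-fragments, where J and K are both positive integers greater than 0 and less than 200 nt,
then obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing includes:
obtaining the J first sub-fragments and the K second sub-fragments from the synthesized DNA by sequencing;
splicing the J first sub-fragments into the information sequence according to the positional correspondence among them, and splicing the K second sub-fragments into the decoding sequence according to the positional correspondence among them.
In one embodiment, obtaining the compressed sequence from the decoding sequence and the information sequence includes:
merging the decoding sequence and the information sequence according to the arrangement order of the reference base groups in the decoding sequence and the positions of the markers in the information sequence, to obtain the compressed sequence.
In one embodiment, decompressing the compressed sequence according to the correspondence to obtain the DNA sequence includes:
decoding the coding fragments in the compressed sequence into base repeat fragments according to the correspondence, to obtain the DNA sequence.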
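In a second aspect, a device for processing a DNA sequence storing data information is provided, the processing device including:
an obtaining module, used to obtain a DNA sequence to be compressed, where the DNA sequence is converted from the data information to be stored and includes M base repeat fragments, each base repeat fragment includes consecutive repeated base units, M ≥ 1, and M is an integer;
an encoding module, used to encode the DNA sequence according to a preset correspondence between numbers of repetitions and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding fragments, the M base repeat fragments correspond one-to-one to the M coding fragments, and each coding fragment includes the corresponding base unit, a reference base group representing the number of repetitions of that base unit, and at least one marker used to mark the start position and the end position of the base unit in the coding fragment;
a splitting module, used to split the compressed sequence to obtain a decoding sequence and an information sequence, where the decoding sequence includes the reference base groups in the M coding fragments, the information sequence includes the base units, other bases, and markers of the compressed sequence (everything except the reference base groups), and the decoding sequence and the information sequence are used to synthesize DNA storing the data information.
In one embodiment, the processing device further includes:
a sequencing module, used to obtain the decoding sequence and the information sequence from the synthetic DNA by sequencing;
a decoding module, used to obtain the compressed sequence from the decoding sequence and the information sequence;
a decompression module, used to decompress the compressed sequence according to the correspondence to obtain the DNA sequence.
In a third aspect, a terminal device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the processing method of the first aspect when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium stores a computer program, and the processing method of the first aspect is implemented when the computer program is executed by a processor.
In a fifth aspect, an embodiment of the present application provides a computer program product that, when run on a terminal device, causes the terminal device to execute the processing method of the first aspect.
The method provided by the present application for processing a DNA sequence storing data information uses a preset correspondence between numbers of repetitions and reference base groups to reduce one or more base repeat fragments in the DNA sequence to be compressed to coding fragments, yielding a compressed sequence. Repeated base sequences are thereby reduced to coding fragments, so that a long DNA sequence storing data information becomes a short compressed sequence; the number of bases in the DNA storing the data information drops significantly, and the storage density per base, and hence of the resulting DNA, increases. Further, splitting the compressed sequence into two parts, the decoding sequence and the information sequence, is equivalent to encrypting the DNA data storage and can improve the security of the stored data. Moreover, because the decoding sequence and the information sequence contain far fewer bases than the DNA sequence to be compressed, DNA synthesis and sequencing time can be saved in the subsequent DNA storage workflow, improving the synthesis efficiency and the read-out efficiency of DNA data storage.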
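Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the exemplary technology are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for compressing a DNA sequence storing data information provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for decompressing a DNA sequence storing data information provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for compressing a DNA sequence storing data information provided by an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for decompressing a DNA sequence storing data information provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a system for processing a DNA sequence storing data information provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a system for processing a DNA sequence storing data information provided by another embodiment of the present application;
FIG. 7 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.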
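Embodiments of the Invention
To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application and not to limit it.
It should be understood that, when used in this specification and the appended claims, the term "including" indicates the presence of the described features, wholes, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or sets thereof.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
In addition, in the description of this specification and the appended claims, the terms "first", "second", "third", "fourth", and so on are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
References in this specification to "one embodiment", "some embodiments", and the like mean that one or more embodiments of the present application include the particular feature, structure, or characteristic described in connection with that embodiment. Thus the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "including", "comprising", "having", and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
To explain the technical solutions of the present application, a detailed description is given below with reference to the specific drawings and embodiments.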
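At present, a DNA sequence storing data information is usually obtained as follows:
Obtain the data to be stored. For example, the data may be any text, picture, sound, video, software, program, or other information that can be presented on a terminal device, but is not limited to these.
Convert the data to be stored into 0/1 binary code to obtain binary data information. Examples of binary codes are 00, 01, 10, and 11.
According to a preset correspondence between binary codes and the bases A, T, C, and G, convert the binary data information into a DNA sequence that is encoded by the bases A, T, C, and G and stores the data information. For example, the preset correspondence may be: one base A represents 00, one base T represents 01, one base C represents 10, and one base G represents 11. When the binary data information is 00110110101100101011011011000011001001, it is converted under this correspondence into a DNA sequence with the base sequence AGTCCGACCGTCGAAGACT. Of course, the preset correspondence is not limited to this example; it could also be, for instance: one base T represents 00, one base A represents 01, one base G represents 10, and one base C represents 11. It should be understood that any preset correspondence between binary codes and the bases A, T, C, and G is acceptable as long as binary data information can be converted into a DNA sequence according to it.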
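The two-bits-per-base conversion described above can be sketched in a few lines of Python. This is only an illustration of the example correspondence (A=00, T=01, C=10, G=11); the function names `bits_to_dna` and `dna_to_bits` are hypothetical and do not come from the application.

```python
# Example correspondence from the text: A=00, T=01, C=10, G=11.
# Any other one-to-one assignment of the four 2-bit codes would work equally well.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_dna(bits: str) -> str:
    """Convert a 0/1 string of even length into a base sequence."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bits(dna: str) -> str:
    """Convert a base sequence back into its 0/1 string."""
    return "".join(BASE_TO_BITS[base] for base in dna)


if __name__ == "__main__":
    # Binary example quoted in the text.
    print(bits_to_dna("00110110101100101011011011000011001001"))
```

Because every base carries exactly two bits, this direct conversion alone cannot exceed 2 bits/nt.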
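Synthesize the above DNA sequence storing the data information using a high-throughput DNA synthesizer, including in combination with enzymatic assembly.
With this method, data information can be stored for a long time in DNA. However, a DNA sequence obtained by letting one base represent one binary code contains a large number of bases, so the storage density is not high.
For this reason, the present application provides a method for processing a DNA sequence storing data information. By encoding the base repeat fragments in the DNA sequence, the DNA sequence storing the data information is simplified and its number of bases is reduced, so that a long DNA sequence storing data information becomes a short compressed sequence, the storage density per base increases, and the storage density of the resulting DNA increases.
To explain the technical solutions of the present application, a detailed description is given below with reference to the specific drawings and embodiments.
Some embodiments of the present application provide a method for processing DNA storing data information, including a compression method and a decompression method for a DNA sequence storing data information.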
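With reference to FIG. 1, the method for compressing a DNA sequence storing data information includes:
S10. Obtain a DNA sequence to be compressed, where the DNA sequence is converted from the data information to be stored and includes M base repeat fragments, each base repeat fragment includes consecutive repeated base units, M ≥ 1, and M is an integer.
In this step, as described above, the DNA sequence to be compressed is obtained by converting the data information to be stored according to the preset correspondence between binary codes and the bases A, T, C, and G.
In some embodiments, the preset correspondence between binary codes and the bases A, T, C, and G is chosen so that as many bases as possible of the DNA sequence fall within base repeat fragments. The DNA sequence can then be encoded into a compressed sequence with the fewest bases, which maximizes the amount of stored information carried per base and raises the DNA data storage density.
In the embodiments of the present application, the DNA sequence includes M base repeat fragments, where M is an integer greater than or equal to 1. It should be understood that the number of base repeat fragments in the DNA sequence may be exactly M or more than M. In some embodiments, the DNA sequence contains N base repeat fragments, with N greater than M; in some embodiments, the DNA sequence contains exactly M base repeat fragments.
Each base repeat fragment includes consecutive repeated base units, where the base unit may be a single base, a two-base unit, a three-base unit, or a four-base unit.
For example, a base repeat fragment of consecutive repeated single bases is a run of identical bases, such as AAAAA, TTTTTTTTTT, GGGGGGGG, or CCCC; the corresponding base units are the single bases A, T, G, and C. Of course, the number of repetitions in such a fragment is not limited to these examples.
For example, a base repeat fragment of consecutive repeated two-base units is a repetition of a two-base unit formed from any two of the four bases A, T, G, and C, such as ATATATAT, TCTCTCTCTCTCTCTCTC, GCGCGCGCGCGCGC, or CTCTCTCT; the corresponding base units are the two-base units AT, TC, GC, and CT. Of course, the combination of bases in the two-base unit and its number of repetitions are not limited to these examples.
For example, a base repeat fragment of consecutive repeated three-base units is a repetition of a three-base unit formed from any two or three of the four bases A, T, G, and C, such as AGTAGTAGTAGT, TCATCATCATCATCATCATCATCATCA, GTCGTCGTCGTCGTCGTCGTC, CGTCGTCGTCGT, or AATAATAAT; the corresponding base units are the three-base units AGT, TCA, GTC, CGT, and AAT. Of course, the combination of bases in the three-base unit and its number of repetitions are not limited to these examples.
For example, a base repeat fragment of consecutive repeated four-base units is a repetition of a four-base unit formed from any two of the four bases A, T, G, and C combined in a non-ABAB pattern, or from any three of the four bases, or from a random combination of the four bases. Examples are AGGAAGGA, ATCAATCA, AGTCAGTCAGTCAGTC, TGCATGCATGCATGCA, GATCGATCGATC, and CGATCGATCGATCGAT; the corresponding base units are the four-base units AGGA, ATCA, AGTC, TGCA, GATC, and CGAT. Of course, the combination of bases in the four-base unit and its number of repetitions are not limited to these examples.
In some embodiments, the M base repeat fragments of the DNA sequence correspond to repeats of a single base, repeats of a two-base unit, and repeats of a three-base unit; that is, the M base repeat fragments do not include repeats of a four-base unit. In this case, step S20 below can encode the number of repetitions of the single base or base unit with a reference base group that has few bases, for example a reference base group of two bases, which further reduces the number of bases in the resulting compressed sequence and increases the average amount of data stored per base.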
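S20. Encode the DNA sequence according to a preset correspondence between numbers of repetitions and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding fragments, the M base repeat fragments correspond one-to-one to the M coding fragments, and each coding fragment includes the corresponding base unit, a reference base group representing the number of repetitions of that base unit, and at least one marker used to mark the start position and the end position of the base unit in the coding fragment.
In this step, the DNA sequence is encoded according to the preset correspondence between numbers of repetitions and reference base groups, and the M base repeat fragments in the DNA sequence are encoded as M coding fragments with fewer bases, so that a long DNA sequence storing data information is reduced to a short one.
In the embodiments of the present application, the number of repetitions is the number of times the base unit is repeated in a base repeat fragment. For example, in the base repeat fragment TTTTTTTTTT the base unit T is repeated 10 times; in ATATATAT the base unit AT is repeated 4 times; in TGCTGCTGCTGCTGCTGC the base unit TGC is repeated 6 times; in ATCGATCG the base unit ATCG is repeated 2 times.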
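As a small illustration of the definition above, the following sketch recovers the base unit and its number of repetitions from a base repeat fragment. The helper name `repeat_unit` is hypothetical, and choosing the shortest possible unit is an assumption made here for definiteness.

```python
def repeat_unit(fragment: str) -> tuple[str, int]:
    """Return (unit, count) for the shortest unit whose repetition yields fragment."""
    n = len(fragment)
    for size in range(1, n + 1):
        # A candidate unit must tile the fragment exactly.
        if n % size == 0 and fragment[:size] * (n // size) == fragment:
            return fragment[:size], n // size
    raise ValueError("fragment must not be empty")


if __name__ == "__main__":
    for fragment in ("T" * 10, "AT" * 4, "TGC" * 6, "ATCG" * 2):
        print(fragment, repeat_unit(fragment))
```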
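In the embodiments of the present application, the type and number of bases in the reference base group can be chosen according to the maximum number of repetitions of a base unit among the M base repeat fragments to be encoded. In some embodiments, when the maximum number of repetitions is less than or equal to 4^1, the reference base group may be a single base, two bases, three bases, or four bases. The fewer bases the reference base group has, the fewer bases the compressed sequence has and the higher the storage density per base. Therefore, when the reference base group has 1 base, the compressed sequence has the fewest bases and the storage density per base is highest. In some embodiments, when the maximum number of repetitions is less than or equal to 4^2, that is 16, the reference base group may be two bases, three bases, or four bases; in this case a reference base group of 2 bases gives the fewest bases in the compressed sequence and the highest storage density per base. By analogy, when the maximum number of repetitions is less than or equal to 4^s, a reference base group of s bases gives the fewest bases in the compressed sequence, where s is an integer greater than or equal to 3.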
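The sizing rule above amounts to picking the smallest s with 4^s at least equal to the largest repetition count. A minimal sketch follows; the function name is hypothetical, and it assumes all 4^s groups of s bases are usable, whereas a concrete correspondence table may use fewer of them.

```python
def reference_group_length(max_repeats: int) -> int:
    """Smallest number of bases s such that 4**s >= max_repeats."""
    if max_repeats < 1:
        raise ValueError("max_repeats must be at least 1")
    length = 1
    while 4 ** length < max_repeats:
        length += 1
    return length


if __name__ == "__main__":
    for n in (4, 5, 16, 17, 64):
        print(n, reference_group_length(n))
```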
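The preset correspondence between numbers of repetitions and reference base groups is a predetermined equivalence between the number of repetitions of the base unit in a base repeat fragment and a reference base group. For example, with 2 bases per reference base group, the correspondence may be preset as follows: 5 corresponds to AT, 6 to AC, 7 to AG, 8 to TA, 9 to TC, 10 to TG, 11 to CA, 12 to CT, 13 to CG, 14 to GA, 15 to GT, 16 to GC, and so on. Of course, the number of bases in the reference base group is not limited to 2, and the correspondence is not limited to the one above.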
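The example correspondence can be held in a lookup table. The sketch below rebuilds it programmatically; this relies on the observation (an inference from the listed pairs, not something the text states) that the twelve groups are exactly the ordered pairs of two different bases taken in the order A, T, C, G. The names `build_table`, `COUNT_TO_GROUP`, and `GROUP_TO_COUNT` are hypothetical.

```python
from itertools import permutations


def build_table(first_count: int = 5, bases: str = "ATCG") -> dict[int, str]:
    """Map repetition counts to two-base reference groups made of two different bases."""
    groups = ["".join(pair) for pair in permutations(bases, 2)]
    return {first_count + i: group for i, group in enumerate(groups)}


COUNT_TO_GROUP = build_table()
GROUP_TO_COUNT = {group: count for count, group in COUNT_TO_GROUP.items()}

if __name__ == "__main__":
    for count in sorted(COUNT_TO_GROUP):
        print(count, COUNT_TO_GROUP[count])
```

A side effect of using only pairs of different bases is that a reference base group never forms a homopolymer on its own.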
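In some embodiments, encoding the DNA sequence according to the preset correspondence between numbers of repetitions and reference base groups to obtain the compressed sequence includes:
Using a number n to represent the number of repetitions of the base unit in each of the M base repeat fragments of the DNA sequence, so that n stands for all the repeated base units except one; that is, one base unit of each of the M base repeat fragments is retained.
For example, the base repeat fragment AAAAAAAA is encoded as 8A; GCGCGCGCGCGCGC is encoded as 7GC; AGTAGTAGTAGT is encoded as 4AGT; AGCTAGCTAGCTAGCTAGCT is encoded as 5AGCT.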
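The count encoding can be sketched as a greedy left-to-right scan. This is one possible implementation under stated assumptions, not the procedure of the application: the threshold `min_repeats`, the limit `max_unit`, and the preference for the shortest unit are all choices made here for illustration, and `count_encode` is a hypothetical name.

```python
def count_encode(dna: str, min_repeats: int = 5, max_unit: int = 3) -> str:
    """Replace tandem repeats by '<count><unit>', scanning greedily from the left."""
    out = []
    i = 0
    while i < len(dna):
        found = None
        for size in range(1, max_unit + 1):  # prefer the shortest unit
            unit = dna[i:i + size]
            if len(unit) < size:
                break
            count = 1
            while dna[i + count * size:i + (count + 1) * size] == unit:
                count += 1
            if count >= min_repeats:
                found = (unit, count)
                break
        if found:
            unit, count = found
            out.append(f"{count}{unit}")
            i += count * len(unit)
        else:
            out.append(dna[i])  # base outside any repeat fragment
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(count_encode("A" * 8, min_repeats=2))                # 8A
    print(count_encode("AGCT" * 5, min_repeats=2, max_unit=4))  # 5AGCT
```

Note that a string such as 10ATACG does not by itself say whether the unit is A or AT; that ambiguity is what the markers in the following paragraphs resolve.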
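Marking the start position and the end position of the base unit retained in each of the M base repeat fragments after encoding with n.
In some embodiments, markers are used to mark the start and end positions of the retained base unit in each of the M encoded base repeat fragments, and the markers can be both synthesized and recognized. In this case, when the compressed DNA sequence is decompressed, the markers can be recognized automatically during sequencing, so that the type of each base repeat fragment is identified and the DNA sequence can be decompressed.
In some embodiments, the marker is a modified base, that is, a base obtained by modifying a base. Specifically, the marker is a modified base that can be synthesized and recognized. For example, the modified base is methylated cytosine (C).
In some embodiments, marking the start and end positions of the retained base unit in each of the M encoded base repeat fragments includes: using different numbers or types of modified bases to mark the start and end positions of the retained base units.
In one possible implementation, the start and end positions of the retained base unit are marked by inserting different numbers of markers at the start position for different types of base unit. For example: if the base unit is a single base, one marker is inserted at the start position (or another preset position) of the retained base unit; if it is a two-base unit, two markers are inserted; if it is a three-base unit, three markers; if it is a four-base unit, four markers. The start position of the base unit can then be determined from where the markers are inserted, and the end position from the number of markers. Of course, the correspondence between the number of markers and the type of base unit is not limited to this example. In some embodiments, the marker is a modified base; in some embodiments, the marker is methylated cytosine (C).
In one possible implementation, markers are inserted at both the start position and the end position of the base unit, directly marking the start and end positions of the retained base unit. For example: if the base unit is a single base, one or more markers are inserted immediately before and immediately after the retained single base; if it is a two-base, three-base, or four-base unit, one or more markers are inserted at the start position and at the end position of the retained base unit. The start and end positions of the base unit can then be determined directly from where the markers are inserted. In this implementation, the markers inserted for the M retained base units may be of the same type or of different types, and likewise their numbers may be the same or different. In some embodiments, the marker is a modified base.
In one possible implementation, the start and end positions of the retained base unit are marked by inserting different types of markers at the start position for different types of base unit. For example: if the base unit is a single base, a first preset marker is inserted at the start position (or another preset position) of the retained base unit; if it is a two-base unit, a second preset marker; if it is a three-base unit, a third preset marker; if it is a four-base unit, a fourth preset marker. It should be understood that the four preset markers are four recognizable and mutually different bases. The start position of the base unit can then be determined from where the marker is inserted, and the end position from the type of marker. In some embodiments, the marker is a modified base.
In the above embodiments, the position, number, and type of the inserted markers can be determined by an insertion rule set in advance. For example, the rule may insert the markers before the first base of the encoded fragment, after its last base, or between bases at another position, that is, at another preset position.
Thus the compressed sequence obtained by encoding the DNA sequence according to the preset correspondence between numbers of repetitions and reference base groups includes M coding fragments, the M base repeat fragments correspond one-to-one to the M coding fragments, and each coding fragment includes the corresponding base unit, a reference base group representing the number of repetitions of that base unit, and at least one marker used to mark the start position and the end position of the base unit in the coding fragment.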
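Putting the pieces of S20 together, a coding fragment can be sketched as "reference base group + one marker per base of the unit + the retained unit". The sketch below assumes the marker-count variant with methylated cytosine written as `C*`, and the two-base example table; the segment list and all names are illustrative, not taken from the application.

```python
from itertools import permutations

MARKER = "C*"  # methylated cytosine, written as in the worked example of the text

# Example table 5=AT, 6=AC, ..., 16=GC (pairs of two different bases in A, T, C, G order).
COUNT_TO_GROUP = {5 + i: "".join(p) for i, p in enumerate(permutations("ATCG", 2))}


def encode_fragment(unit: str, count: int) -> str:
    """Reference base group, then len(unit) markers, then the retained base unit."""
    return COUNT_TO_GROUP[count] + MARKER * len(unit) + unit


def build_compressed(segments: list[tuple[str, int]]) -> str:
    """segments holds (text, 1) for bases outside repeats and (unit, count) for repeats."""
    parts = []
    for text, count in segments:
        parts.append(text if count == 1 else encode_fragment(text, count))
    return "".join(parts)


if __name__ == "__main__":
    # TG + A x10 + TACG + GA x7 + ATC x6
    segments = [("TG", 1), ("A", 10), ("TACG", 1), ("GA", 7), ("ATC", 6)]
    print(build_compressed(segments))
```

For these segments the result has 24 symbols when each `C*` is counted as one base, and it keeps the three markers in front of ATC.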
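S30. Split the compressed sequence to obtain a decoding sequence and an information sequence, where the decoding sequence includes the reference base groups in the M coding fragments, the information sequence includes the content of the compressed sequence other than the reference base groups, and the decoding sequence and the information sequence are used to synthesize DNA storing the data information.
In this step, splitting the compressed sequence includes: extracting the reference base groups from the compressed sequence and assembling the extracted reference base groups into the decoding sequence, which includes the reference base groups in the M coding fragments.
In some embodiments, assembling the extracted reference base groups into the decoding sequence includes: presetting an arrangement order for the reference base groups and arranging them in that order to obtain the decoding sequence. The arrangement order can be set in advance according to a setting rule. In some embodiments, the reference base groups are arranged in the order in which they appear in the compressed sequence. In some embodiments, they are arranged in the reverse of their order of appearance in the compressed sequence. In some embodiments, the reference base groups, taken in their order of appearance in the compressed sequence, are arranged according to another preset order. For example, the first reference base group is placed first, the second third, the third fifth, the last second, the second-to-last fourth, and so on.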
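A preset arrangement order is simply a permutation that both sides know. The sketch below is an assumption about how such an order could be represented: `order[i]` is the 0-based target position of the i-th reference base group in order of appearance; both function names are hypothetical.

```python
def arrange_groups(groups: list[str], order: list[int]) -> list[str]:
    """Place the i-th group (in order of appearance) at position order[i]."""
    if sorted(order) != list(range(len(groups))):
        raise ValueError("order must be a permutation of the group positions")
    arranged = [""] * len(groups)
    for source, target in enumerate(order):
        arranged[target] = groups[source]
    return arranged


def restore_groups(arranged: list[str], order: list[int]) -> list[str]:
    """Undo arrange_groups when the same order is known."""
    return [arranged[target] for target in order]


if __name__ == "__main__":
    groups = ["g1", "g2", "g3", "g4", "g5"]
    # first -> 1st, second -> 3rd, third -> 5th, second-to-last -> 4th, last -> 2nd
    order = [0, 2, 4, 3, 1]
    print(arrange_groups(groups, order))
```

Keeping the order of appearance corresponds to `order = [0, 1, 2, ...]`, and the reversed arrangement to `order = [n - 1, ..., 1, 0]`.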
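In this step, splitting the compressed sequence also includes: extracting the content of the compressed sequence other than the reference base groups and assembling it into the information sequence. In one possible implementation, the information sequence includes the base units, the other bases, and the markers of the compressed sequence, that is, everything except the reference base groups; here, other bases are the bases remaining in the compressed sequence after the reference base groups and the base units are excluded. In another possible implementation, the information sequence includes only the base units and the markers of the compressed sequence; in that case the DNA sequence consists entirely of base repeat fragments and contains no bases outside them.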
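The split can be sketched on a string representation of the compressed sequence. The sketch assumes the layout used in the worked example of the text: each reference base group has two bases and sits immediately before a run of `C*` markers, and the groups are kept in order of appearance. The function name is hypothetical.

```python
import re

TOKEN = re.compile(r"C\*|[ACGT]")  # "C*" (methylated C) is one symbol
MARKER = "C*"
GROUP_LEN = 2  # bases per reference base group in the example table


def split_compressed(compressed: str) -> tuple[str, str]:
    """Return (decoding_sequence, information_sequence)."""
    tokens = TOKEN.findall(compressed)
    groups = []
    group_positions = set()
    for i, token in enumerate(tokens):
        starts_run = token == MARKER and (i == 0 or tokens[i - 1] != MARKER)
        if starts_run:
            if i < GROUP_LEN:
                raise ValueError("marker run without a preceding reference base group")
            groups.append("".join(tokens[i - GROUP_LEN:i]))
            group_positions.update(range(i - GROUP_LEN, i))
    decoding = "".join(groups)
    information = "".join(t for i, t in enumerate(tokens) if i not in group_positions)
    return decoding, information


if __name__ == "__main__":
    print(split_compressed("TGTGC*ATACGAGC*C*GAACC*C*C*ATC"))
```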
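The decoding sequence and the information sequence obtained in the embodiments of the present application are used to synthesize DNA storing the data information. Both can be synthesized by chemical DNA synthesis or by enzymatic DNA synthesis.
In one embodiment, the method further includes: dividing the information sequence into J first sub-fragments and dividing the decoding sequence into K second sub-fragments, where J and K are both positive integers greater than 0 and less than 200 nt. Splitting the information sequence and the decoding sequence into small fragments makes synthesis easier. In one embodiment, each first sub-fragment carries a first index label that marks the position of the first sub-fragment in the information sequence, and each second sub-fragment carries a second index label that marks the position of the second sub-fragment in the decoding sequence. In one embodiment, the first index label and the second index label are label units formed from one or more of the four bases. For example, 2 bases can represent 16 label units; 3 bases, 64 label units; 4 bases, 256 label units; 5 bases, 1024 label units; and so on. It should be understood that when a label unit is formed from at least two of the four bases, one or more of the bases may be repeated within the label unit, so that the label unit can contain more than 4 bases. For example, a label unit of 5 bases may be AATGC. Generally, the more sub-fragments the information sequence and the decoding sequence are divided into, the more index labels are needed and, correspondingly, the more bases each label unit has.
For example, the information sequence is divided into J first sub-fragments and the decoding sequence is divided into K second sub-fragments, where J and K are both positive integers greater than 0 and less than 200 nt; each first sub-fragment carries a first index label that marks its position in the information sequence, and each second sub-fragment carries a second index label that marks its position in the decoding sequence. After the first and second sub-fragments are synthesized separately, the first sub-fragments are joined according to the first index labels to obtain the information sequence, and the second sub-fragments are joined according to the second index labels to obtain the decoding sequence. The labeling order of the first and second sub-fragments is determined by the position of each sub-fragment in the decoding sequence or the information sequence. For example, the decoding sequence or the information sequence is split into 256 fragments for synthesis, and 4 bases are used to encode the order positions 1 to 256 of those fragments.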
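An index label of w bases can distinguish 4^w sub-fragments, which matches the 16/64/256/1024 figures above. A minimal sketch follows, assuming the label is the fragment's 0-based position written as a base-4 number with the digits A, T, C, G and placed in front of the payload; these conventions and both function names are assumptions made for illustration. It operates on plain base strings such as the decoding sequence.

```python
BASES = "ATCG"  # digit values 0, 1, 2, 3 (an arbitrary but fixed choice)


def index_label(index: int, width: int) -> str:
    """Write a 0-based index as a base-4 number using `width` bases."""
    if not 0 <= index < 4 ** width:
        raise ValueError("index does not fit in the label width")
    digits = []
    for _ in range(width):
        index, digit = divmod(index, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))


def split_with_index(sequence: str, payload_len: int, width: int) -> list[str]:
    """Cut a sequence into sub-fragments and prefix each with its index label."""
    chunks = [sequence[i:i + payload_len] for i in range(0, len(sequence), payload_len)]
    if len(chunks) > 4 ** width:
        raise ValueError("too many sub-fragments for the label width")
    return [index_label(i, width) + chunk for i, chunk in enumerate(chunks)]


if __name__ == "__main__":
    print(split_with_index("TGAGAC", payload_len=2, width=2))
```

A plain positional code like this can itself contain runs such as AAAA, so a practical labeling scheme might restrict which labels are used.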
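In some embodiments, adaptor sequences are attached to both ends of each first sub-fragment and each second sub-fragment, and the adaptor sequences are used for amplifying the synthesized fragments. In some embodiments, the adaptor sequence is a sequence of 16 to 20 bases.
In the embodiments of the present application, the synthesized decoding sequence and information sequence are stored separately. In some embodiments, the decoding sequence and the information sequence can be stored in an organic or inorganic container medium, such as a polypropylene centrifuge tube, or in other forms.
In this way, the DNA storing the data information is reduced from one long sequence containing M base repeat fragments to two short sequences: a decoding sequence consisting of the reference base groups, and an information sequence consisting of the base units, the other bases, and the markers. On the one hand, the number of bases in the sequence is greatly reduced, which increases the average amount of data stored per base; on the other hand, the DNA sequence corresponding to the information sequence can be recovered only with the decoding sequence, which enhances the security of the data information.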
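With reference to FIG. 2, some embodiments provide a method for decompressing a DNA sequence storing data information, the method further including:
S40. Obtain the decoding sequence and the information sequence from the synthesized DNA by sequencing.
In this step, sequencing technology is used to read the decoding sequence and the information sequence of the synthetic DNA separately. For example, Sanger sequencing or second-generation high-throughput sequencing can be used to read the decoding sequence or the information sequence.
Reading the information sequence of the synthetic DNA includes reading the markers in the information sequence. Each marker is read with the technique appropriate to its type. For example, when the marker is methylated cytosine (C), a technique for reading methylated cytosine is used to read the modified base. In some embodiments, the sequence to be read is treated with bisulfite when the methylated cytosine is read.
In one embodiment, if the information sequence includes J first sub-fragments and the decoding sequence includes K second sub-fragments, where J and K are both positive integers greater than 0 and less than 200 nt,
then obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing includes:
obtaining the J first sub-fragments and the K second sub-fragments from the synthesized DNA by sequencing;
splicing the J first sub-fragments into the information sequence according to the positional correspondence among them, and splicing the K second sub-fragments into the decoding sequence according to the positional correspondence among them.
In some embodiments, obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing includes: performing PCR amplification on the J first sub-fragments and the K second sub-fragments separately, and reading the base sequences of the amplified fragments; sorting the amplified first sub-fragments according to the first index labels and splicing them into the complete information sequence; and sorting the amplified second sub-fragments according to the second index labels and splicing them into the complete decoding sequence. In this case, PCR amplification raises the concentration of each sub-fragment and improves the efficiency with which each sub-fragment is sequenced and identified.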
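Sorting and splicing by index label can be sketched as follows. As an assumption for illustration, each read is taken to be "label + payload" with the label a base-4 number over the digits A, T, C, G; adaptor sequences and sequencing errors are ignored, and the function names are hypothetical.

```python
BASES = "ATCG"  # digit values 0, 1, 2, 3 (assumed labeling convention)


def label_to_index(label: str) -> int:
    """Read an index label as a base-4 number."""
    value = 0
    for base in label:
        value = value * 4 + BASES.index(base)
    return value


def reassemble(fragments: list[str], width: int) -> str:
    """Sort sub-fragments by their index label and splice the payloads together."""
    ordered = sorted(fragments, key=lambda f: label_to_index(f[:width]))
    indices = [label_to_index(f[:width]) for f in ordered]
    if indices != list(range(len(ordered))):
        raise ValueError("missing or duplicated sub-fragment")
    return "".join(f[width:] for f in ordered)


if __name__ == "__main__":
    # Reads arrive in arbitrary order; the labels restore the original order.
    print(reassemble(["ACAC", "AATG", "ATAG"], width=2))
```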
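S50. Obtain the compressed sequence from the decoding sequence and the information sequence.
In this step, according to the decoding sequence and the marker information of the information sequence, the reference base groups in the decoding sequence are inserted into the information sequence to obtain the compressed sequence.
In one embodiment, obtaining the compressed sequence from the decoding sequence and the information sequence includes:
merging the decoding sequence and the information sequence according to the arrangement order of the reference base groups in the decoding sequence and the positions of the markers in the information sequence, to obtain the compressed sequence.
In some embodiments, obtaining the compressed sequence from the decoding sequence and the information sequence includes:
obtaining the reference base groups from the decoding sequence;
determining the marker information in the information sequence, inserting a reference base group at the position corresponding to each marked base unit in the information sequence, and merging the decoding sequence and the information sequence to obtain the compressed sequence.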
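The merge step can be sketched as the inverse of the split: walk through the information sequence and put the next reference base group in front of every run of markers. The sketch assumes two-base groups kept in order of appearance and `C*` as the marker symbol; `merge` is a hypothetical name.

```python
import re

TOKEN = re.compile(r"C\*|[ACGT]")  # "C*" (methylated C) is one symbol
MARKER = "C*"


def merge(decoding: str, information: str, group_len: int = 2) -> str:
    """Insert the reference base groups before each marker run of the information sequence."""
    tokens = TOKEN.findall(information)
    groups = [decoding[i:i + group_len] for i in range(0, len(decoding), group_len)]
    out = []
    next_group = 0
    for i, token in enumerate(tokens):
        starts_run = token == MARKER and (i == 0 or tokens[i - 1] != MARKER)
        if starts_run:
            if next_group >= len(groups):
                raise ValueError("more marker runs than reference base groups")
            out.append(groups[next_group])
            next_group += 1
        out.append(token)
    if next_group != len(groups):
        raise ValueError("more reference base groups than marker runs")
    return "".join(out)


if __name__ == "__main__":
    print(merge("TGAGAC", "TGC*ATACGC*C*GAC*C*C*ATC"))
```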
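S60. Decompress the compressed sequence according to the correspondence to obtain the DNA sequence.
In this step, the coding fragments in the compressed sequence are decoded into base repeat fragments according to the correspondence, to obtain the DNA sequence.
In one embodiment, decompressing the compressed sequence according to the correspondence to obtain the DNA sequence includes: according to the preset correspondence between numbers of repetitions and reference base groups, restoring the base units encoded by the M reference base groups in the compressed sequence to M base repeat fragments, thereby obtaining the DNA sequence.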
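Decoding a coding fragment needs three pieces of information: the repetition count (from the reference base group), the unit length (here, the number of markers), and the unit itself (the bases after the markers). The sketch below assumes the marker-count variant, two-base groups, and the example table 5=AT to 16=GC; the function name is hypothetical.

```python
import re
from itertools import permutations

TOKEN = re.compile(r"C\*|[ACGT]")  # "C*" (methylated C) is one symbol
MARKER = "C*"
# Example table: AT=5, AC=6, AG=7, TA=8, TC=9, TG=10, ..., GC=16.
GROUP_TO_COUNT = {"".join(p): 5 + i for i, p in enumerate(permutations("ATCG", 2))}


def decompress(compressed: str, group_len: int = 2) -> str:
    """Expand every coding fragment back into its base repeat fragment."""
    tokens = TOKEN.findall(compressed)
    out = []
    i = 0
    while i < len(tokens):
        group = tokens[i:i + group_len]
        is_fragment = (i + group_len < len(tokens)
                       and tokens[i + group_len] == MARKER
                       and MARKER not in group)
        if is_fragment:
            count = GROUP_TO_COUNT["".join(group)]
            j = i + group_len
            unit_len = 0
            while j < len(tokens) and tokens[j] == MARKER:
                unit_len += 1  # number of markers = number of bases in the unit
                j += 1
            unit = "".join(tokens[j:j + unit_len])
            out.append(unit * count)
            i = j + unit_len
        else:
            out.append(tokens[i])  # base outside any coding fragment
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(decompress("TGTGC*ATACGAGC*C*GAACC*C*C*ATC"))
```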
在一些实施例中,按照0/1二进制与碱基之间转换的编码规则,将DNA序列编码还原为0/1二进制序列。进一步的,0/1二进制序列可以通过转换程序转换为对应的图片/文字/视频等信息。
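The decompression method provided by the embodiments of the present application sequences and reads the decoding sequence and the information sequence obtained by the above compression method, determines the positions and types of the repeat segments from the markers in the sequences, and combines the decoding sequence and information sequence thus read to reassemble the DNA sequence storing the data information.
One embodiment provides a compression and decompression process for DNA that encodes the following data information: "011100000000000000000000010010111100110011001100110011001100000110000110000110000110000110000110". The process includes the following steps:
(1) Compression process, as shown in FIG. 3, including:
Step 1: Encode the above data information using the encoding rule "A=00, T=01, C=10, G=11" to obtain the DNA sequence: TGAAAAAAAAAATACGGAGAGAGAGAGAGAATCATCATCATCATCATC; extract from the DNA sequence the single-base repeat unit "AAAAAAAAAA", the two-base repeat unit "GAGAGAGAGAGAGA", and the three-base repeat unit "ATCATCATCATCATCATC".
Step 2: Use 10A to represent the repeat unit "AAAAAAAAAA" in the DNA sequence, 7GA to represent the repeat unit "GAGAGAGAGAGAGA" in the DNA sequence, and 6ATC to represent the repeat unit "ATCATCATCATCATCATC" in the DNA sequence, forming the new sequence TG10ATACG7GA6ATC.
Step 3: Insert 1 methylated cytosine C* into 10A, between the number and the base (methylated cytosine can currently be both synthesized and sequenced), insert 2 methylated cytosines C* into 7GA, and insert 3 methylated cytosines C* into 6ATC, forming a new sequence named the DNA template sequence: TG10C*ATACG7C*C*GA6C*C*C*ATC
Step 4: Establish the correspondence "5=AT; 6=AC; 7=AG; 8=TA, 9=TC, 10=TG, 11=CA, 12=CT, 13=CG, 14=GA, 15=GT, 16=GC"; use TG to represent the 10 in the DNA template sequence, AG to represent the 7, and AC to represent the 6, forming the new sequence: TGTGC*ATACGAGC*C*GAACC*C*C*ATC. Extract, in order, the two-base groups that stand for the numbers in Step 4 and combine them into a new decoding sequence: TGAGAC.
Step 5: Delete the numbers from the DNA template sequence of Step 3 to obtain an information sequence: TGC*ATACGC*C*GAC*C*C*ATC.
The decoding sequence and the information sequence thus obtained can be used in the subsequent DNA synthesis process.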
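Steps 2 to 5 above can be reproduced with a short greedy scan. This is a sketch under stated assumptions rather than the procedure claimed by the application: it tries repeat units of 1 to 3 bases, shortest first, at each position, accepts a run only if its repeat count lies in the table range 5 to 16, and writes the methylated cytosine as `C*`. The function name `compress` is hypothetical.

```python
# Repeat-count table from Step 4 (5=AT ... 16=GC).
GROUP_OF = {5: "AT", 6: "AC", 7: "AG", 8: "TA", 9: "TC", 10: "TG",
            11: "CA", 12: "CT", 13: "CG", 14: "GA", 15: "GT", 16: "GC"}
MARK = "C*"  # methylated cytosine, notation from the worked example


def compress(dna: str, max_unit: int = 3) -> tuple[str, str]:
    """Return (decoding sequence, information sequence) for a DNA string."""
    decoding, information, i = [], [], 0
    while i < len(dna):
        for k in range(1, max_unit + 1):
            unit = dna[i:i + k]
            reps = 1
            while dna[i + reps * k:i + (reps + 1) * k] == unit:
                reps += 1
            reps = min(reps, max(GROUP_OF))  # longer runs are split
            if len(unit) == k and reps >= min(GROUP_OF):
                decoding.append(GROUP_OF[reps])      # count goes to decoding
                information.append(MARK * k + unit)  # k markers, then unit
                i += reps * k
                break
        else:
            information.append(dna[i])  # ordinary base, copied unchanged
            i += 1
    return "".join(decoding), "".join(information)
```

Applied to the 48-base sequence of Step 1, this scan returns the decoding sequence TGAGAC and the information sequence TGC*ATACGC*C*GAC*C*C*ATC given in Steps 4 and 5.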
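(2) Decompression process, as shown in FIG. 4, including:
Step 1: Take the decoding sequence "TGAGAC" obtained by sequencing two bases at a time and insert each two-base unit, in order, into the information sequence "TGC*ATACGC*C*GAC*C*C*ATC" obtained by sequencing, at each junction where an unmodified base is followed by a modified base, forming a new sequence: TGTGC*ATACGAGC*C*GAACC*C*C*ATC
Step 2: With reference to the correspondence "5=AT; 6=AC; 7=AG; 8=TA, 9=TC, 10=TG, 11=CA, 12=CT, 13=CG, 14=GA, 15=GT, 16=GC", use 10 to represent TG, 7 to represent AG, and 6 to represent AC, forming the new sequence: TG10C*ATACG7C*C*GA6C*C*C*ATC.
Step 3: Determine each repeated base unit from the number of modified bases and, together with the repeat count obtained, restore the complete sequence. 10C*A denotes the single base A repeated 10 times and is restored to "AAAAAAAAAA"; 7C*C*GA denotes the two-base unit GA repeated 7 times and is restored to "GAGAGAGAGAGAGA"; 6C*C*C*ATC denotes ATC repeated 6 times and is restored to "ATCATCATCATCATCATC". The complete sequence TGAAAAAAAAAATACGGAGAGAGAGAGAGAATCATCATCATCATCATC is finally obtained and is then translated into the 0/1 binary sequence according to the rule "A=00; T=01; C=10; G=11".
In the compression process of this embodiment, 96 bits of binary 0/1 information are first stored in a 48-base DNA sequence; the compression method disclosed in the present application then compresses the 48 bases into 24 bases (a 6-base decoding sequence plus an 18-base information sequence). This is a 50% reduction in DNA sequence length and doubles the data storage density from 2 bits/nt to 4 bits/nt. In addition, the compressed sequence contains no single-base, two-base, or three-base repeats, which facilitates subsequent synthesis and sequencing.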
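The figures quoted for this embodiment can be checked with a short calculation. It assumes that each methylated cytosine C* counts as one nucleotide and that the 6-base decoding sequence and the 18-base information sequence are both counted toward the stored length.

```latex
\begin{align*}
\rho_{\text{before}} &= \frac{96\ \text{bits}}{48\ \text{nt}} = 2\ \text{bits/nt},\\
L_{\text{after}} &= 6\ \text{nt (decoding)} + 18\ \text{nt (information)} = 24\ \text{nt},\\
\rho_{\text{after}} &= \frac{96\ \text{bits}}{24\ \text{nt}} = 4\ \text{bits/nt},
\qquad \frac{L_{\text{after}}}{48\ \text{nt}} = 50\%.
\end{align*}
```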
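With reference to FIG. 5, some embodiments of the present application provide a processing device 5 for a DNA sequence storing data information. The processing device 5 includes:
an acquisition module 51, configured to acquire a DNA sequence to be compressed, where the DNA sequence is obtained by converting the data information to be stored, the DNA sequence includes M base repeat segments, each base repeat segment includes consecutive, repeated base units, M ≥ 1, and M is an integer;
an encoding module 52, configured to encode the DNA sequence according to a preset correspondence between repeat counts and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding segments, the M base repeat segments correspond one-to-one to the M coding segments, and each coding segment includes the corresponding base unit, a reference base group representing the repeat count of the corresponding base unit, and at least one marker, the marker being used to mark the start position and end position of the base unit in the coding segment;
a splitting module 53, configured to split the compressed sequence to obtain a decoding sequence and an information sequence, where the decoding sequence includes the reference base groups of the M coding segments, the information sequence includes the base units, other bases, and markers of the compressed sequence other than the reference base groups, and the decoding sequence and the information sequence are used to synthesize DNA storing the data information.
In one embodiment, with reference to FIG. 6, the processing device 5 further includes:
a sequencing module 54, configured to obtain the decoding sequence and the information sequence from the synthesized DNA by sequencing;
a decoding module 55, configured to obtain the compressed sequence from the decoding sequence and the information sequence;
a decompression module 56, configured to decompress the compressed sequence according to the correspondence to obtain the DNA sequence.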
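Referring to FIG. 7, a schematic diagram of a terminal device according to one embodiment of the present application is shown. As shown in FIG. 7, the terminal device 70 provided in this embodiment includes a processor 710, a memory 720, and a computer program 721 stored in the memory 720 and executable on the processor 710. When executing the computer program 721, the processor 710 implements the steps in each embodiment of the above processing method, for example steps S10 to S30 shown in FIG. 1.
Exemplarily, the computer program 721 may be divided into one or more modules/units, which are stored in the memory 720 and executed by the processor 710 to carry out the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program 721 in the terminal device. For example, the computer program 721 may be divided into an acquisition module, an encoding module, and a splitting module, whose specific functions are as follows:
the acquisition module is configured to acquire a DNA sequence to be compressed, where the DNA sequence is obtained by converting the data information to be stored, the DNA sequence includes M base repeat segments, each base repeat segment includes consecutive, repeated base units, M ≥ 1, and M is an integer;
the encoding module is configured to encode the DNA sequence according to a preset correspondence between repeat counts and reference base groups to obtain a compressed sequence, where the compressed sequence includes M coding segments, the M base repeat segments correspond one-to-one to the M coding segments, and each coding segment includes the corresponding base unit, a reference base group representing the repeat count of the corresponding base unit, and at least one marker, the marker being used to mark the start position and end position of the base unit in the coding segment;
the splitting module is configured to split the compressed sequence to obtain a decoding sequence and an information sequence, where the decoding sequence includes the reference base groups of the M coding segments, the information sequence includes the base units, other bases, and markers of the compressed sequence other than the reference base groups, and the decoding sequence and the information sequence are used to synthesize DNA storing the data information.
In some embodiments, the computer program 721 may further be divided into a sequencing module, a decoding module, and a decompression module, whose specific functions are as follows:
the sequencing module is configured to obtain the decoding sequence and the information sequence from the synthesized DNA by sequencing;
the decoding module is configured to obtain the compressed sequence from the decoding sequence and the information sequence;
the decompression module is configured to decompress the compressed sequence according to the correspondence to obtain the DNA sequence.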
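The terminal device 70 may include, but is not limited to, the processor 710 and the memory 720. Those skilled in the art will understand that FIG. 7 is merely an example of the terminal device 70 and does not limit it; the terminal device 70 may include more or fewer components than shown, may combine certain components, or may use different components.
The processor 710 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 720 may be an internal storage unit of the terminal device 70, for example a hard disk or main memory of the terminal device 70. The memory 720 may also be an external storage device of the terminal device 70, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal device 70. Further, the memory 720 may include both an internal storage unit and an external storage device of the terminal device 70. The memory 720 is used to store the computer program 721 and other programs and data required by the terminal device 70. The memory 720 may also be used to temporarily store data that has been output or is about to be output.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the processing method of each of the foregoing embodiments.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to execute the processing method of each of the foregoing embodiments.
The above are merely optional embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may be modified and varied in many ways. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.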

Claims (14)

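  1. A method for processing a DNA sequence storing data information, characterized in that the method comprises:
    acquiring a DNA sequence to be compressed, wherein the DNA sequence is obtained by converting the data information to be stored, the DNA sequence comprises M base repeat segments, each base repeat segment comprises consecutive, repeated base units, M ≥ 1, and M is an integer;
    encoding the DNA sequence according to a preset correspondence between repeat counts and reference base groups to obtain a compressed sequence, wherein the compressed sequence comprises M coding segments, the M base repeat segments correspond one-to-one to the M coding segments, and each coding segment comprises the corresponding base unit, a reference base group representing the repeat count of the corresponding base unit, and at least one marker, the marker being used to mark the start position and end position of the base unit in the coding segment;
    splitting the compressed sequence to obtain a decoding sequence and an information sequence, wherein the decoding sequence comprises the reference base groups of the M coding segments, the information sequence comprises the remainder of the compressed sequence other than the reference base groups, and the decoding sequence and the information sequence are used to synthesize DNA storing the data information.
  2. The method according to claim 1, characterized in that the marker is a modified base.
  3. The method according to claim 1, characterized in that the marker is methylated base C.
  4. The method according to claim 1, characterized in that the method further comprises:
    dividing the information sequence into J first sub-fragments and dividing the decoding sequence into K second sub-fragments, wherein J and K are both positive integers greater than 0 and less than 200 nt.
  5. The method according to claim 4, characterized in that each first sub-fragment is provided with a first index tag for marking the position of the first sub-fragment in the information sequence, and each second sub-fragment is provided with a second index tag for marking the position of the second sub-fragment in the decoding sequence.
  6. The method according to claim 5, characterized in that the first index tag and the second index tag are tag units formed from one or more of the four bases.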
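  7. The method according to any one of claims 1 to 6, characterized in that the method further comprises:
    obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing;
    obtaining the compressed sequence from the decoding sequence and the information sequence;
    decompressing the compressed sequence according to the correspondence to obtain the DNA sequence.
  8. The method according to claim 7, characterized in that, if the information sequence comprises J first sub-fragments and the decoding sequence comprises K second sub-fragments, wherein J and K are both positive integers greater than 0 and less than 200 nt,
    then obtaining the decoding sequence and the information sequence from the synthesized DNA by sequencing comprises:
    obtaining the J first sub-fragments and the K second sub-fragments separately from the synthesized DNA by sequencing;
    splicing the J first sub-fragments into the information sequence according to the positional correspondence among the J first sub-fragments; and splicing the K second sub-fragments into the decoding sequence according to the positional correspondence among the K second sub-fragments.
  9. The method according to claim 7, characterized in that obtaining the compressed sequence from the decoding sequence and the information sequence comprises:
    merging the decoding sequence and the information sequence according to the order of the reference base groups in the decoding sequence and the positions of the markers in the information sequence, to obtain the compressed sequence.
  10. The method according to claim 9, characterized in that decompressing the compressed sequence according to the correspondence to obtain the DNA sequence comprises:
    decoding the coding segments in the compressed sequence into the base repeat segments according to the correspondence, to obtain the DNA sequence.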
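  11. A processing device for a DNA sequence storing data information, characterized in that the processing device comprises:
    an acquisition module, configured to acquire a DNA sequence to be compressed, wherein the DNA sequence is obtained by converting the data information to be stored, the DNA sequence comprises M base repeat segments, each base repeat segment comprises consecutive, repeated base units, M ≥ 1, and M is an integer;
    an encoding module, configured to encode the DNA sequence according to a preset correspondence between repeat counts and reference base groups to obtain a compressed sequence, wherein the compressed sequence comprises M coding segments, the M base repeat segments correspond one-to-one to the M coding segments, and each coding segment comprises the corresponding base unit, a reference base group representing the repeat count of the corresponding base unit, and at least one marker, the marker being used to mark the start position and end position of the base unit in the coding segment;
    a splitting module, configured to split the compressed sequence to obtain a decoding sequence and an information sequence, wherein the decoding sequence comprises the reference base groups of the M coding segments, the information sequence comprises the other base units and the markers of the compressed sequence other than the reference base groups, and the decoding sequence and the information sequence are used to synthesize DNA storing the data information.
  12. The processing device according to claim 11, characterized in that the processing device further comprises:
    a sequencing module, configured to obtain the decoding sequence and the information sequence from the synthesized DNA by sequencing;
    a decoding module, configured to obtain the compressed sequence from the decoding sequence and the information sequence;
    a decompression module, configured to decompress the compressed sequence according to the correspondence to obtain the DNA sequence.
  13. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 10.
  14. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 10.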
PCT/CN2020/122721 2020-10-22 2020-10-22 Method and device for processing DNA sequences containing data information WO2022082573A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/122721 WO2022082573A1 (zh) 2020-10-22 2020-10-22 Method and device for processing DNA sequences containing data information

Publications (1)

Publication Number Publication Date
WO2022082573A1 (zh) 2022-04-28

Family

ID=81289576

Country Status (1)

Country Link
WO (1) WO2022082573A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114721601A (zh) * 2022-05-26 2022-07-08 昆仑智汇数据科技(北京)有限公司 Method and device for storing industrial equipment data

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130282677A1 (en) * 2011-01-07 2013-10-24 Zhen Ji Data compression system for dna sequence
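CN106100641A (zh) * 2016-06-12 2016-11-09 深圳大学 Multithreaded fast-storage lossless compression method and system for FASTQ data
CN110111852A (zh) * 2018-01-11 2019-08-09 广州明领基因科技有限公司 Platform for fast lossless compression of massive DNA sequencing data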

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JI ZHEN, ZHOU JIA-RUI,ZHU ZE-XUAN,QH WU: "Bioinformatics Features Based DNA Sequence Data Compression Algorithm", ACTA ELECTRONICA SINICA, ZHONGGUO DIANZI XUEHUI, CN, vol. 39, no. 5, 31 May 2011 (2011-05-31), CN , pages 991 - 995, XP055923249, ISSN: 0372-2112 *
XIONG WENPING, SUN JI-FENG: "A New Compression Scheme for DNA Sequences Based on Statistical Analysis and Segmented Codebook", SCIENCE TECHNOLOGY AND ENGINEERING, ZHONGGUO JISHU JINGJI YANJIUHUI, CN, vol. 12, no. 29, 31 October 2012 (2012-10-31), CN , XP055923243, ISSN: 1671-1815 *
ZHANG LIXIA, ZHANG YI-QING, LIN PI-YUAN, LIU JI-PING: "DNA Compressed Pattern Matching Algorithms Based on Character and 0/1 Coding", APPLICATION RESEARCH OF COMPUTERS, CHENGDU, CN, vol. 24, no. 9, 30 September 2007 (2007-09-30), CN , pages 22 - 24, XP055923242, ISSN: 1001-3695 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20958156

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20958156

Country of ref document: EP

Kind code of ref document: A1
