WO2019080653A1 - Encoding/decoding method, encoder/decoder, and storage method and apparatus - Google Patents

Encoding/decoding method, encoder/decoder, and storage method and apparatus

Info

Publication number
WO2019080653A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
bit
binary code
symbol group
code sequence
Prior art date
Application number
PCT/CN2018/103795
Other languages
English (en)
Chinese (zh)
Inventor
黄小罗
陈世宏
林涛
陈泰
沈玥
徐讯
尹烨
杨焕明
Original Assignee
深圳华大生命科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大生命科学研究院 filed Critical 深圳华大生命科学研究院
Priority to CN201880068914.8A priority Critical patent/CN111279422B/zh
Publication of WO2019080653A1 publication Critical patent/WO2019080653A1/fr
Priority to US16/858,295 priority patent/US20200321079A1/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/02 Conversion to or from weighted codes, i.e. the weight given to a digit depending on the position of the digit within the block or code word
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code

Definitions

  • the present disclosure relates to the field of data storage technologies, and in particular, to an encoding method, a decoding method, a storage method, an encoder, a decoder, a storage device, and a computer readable storage medium.
  • DNA storage technology has provided a new way to solve these problems.
  • the use of DNA as a storage medium for information storage offers a long storage time (more than a thousand years, over 100 times that of existing tape and optical media), a high storage density (up to 10^9 Gb/mm^3, more than 10 million times that of tape and optical media), and good storage security.
  • the present disclosure proposes a coding, decoding, and storage scheme that has a high storage density and is capable of avoiding high GC or high AT repetition.
  • an encoding method comprising: encoding a first binary code sequence and a second binary code sequence into a code sequence, the first binary code sequence and the second binary code sequence having the same number of bits, and the code sequence being composed of four different symbols, wherein the code sequence is obtained by the following steps: determining the first bit of the code sequence according to the first bit of the first binary code sequence, the first bit of the second binary code sequence, and a reference symbol, the reference symbol being any one of the four different symbols; and determining the current bit of the code sequence according to the current bit of the first binary code sequence, the current bit of the second binary code sequence, and the previous bit of the code sequence, the current bit of the code sequence being any bit other than the first bit of the code sequence.
  • determining, according to a preset first mapping relationship and the current bit of the first binary code sequence, a first candidate symbol group for the current bit of the coding sequence, the first candidate symbol group including two of the four different symbols; determining, according to a preset second mapping relationship, the current bit of the second binary code sequence, and the previous bit of the coding sequence, a second candidate symbol group for the current bit of the coding sequence, the second candidate symbol group including two of the four different symbols, the first candidate symbol group and the second candidate symbol group sharing one symbol; and determining the shared symbol as the current bit of the coding sequence.
  • the information to be encoded is transcoded into a binary code; the first binary code sequence and the second binary code sequence are extracted from the binary code.
  • the four different symbols are the four deoxyribonucleotides adenine A, cytosine C, guanine G, and thymine T, and the coding sequence is a nucleic acid sequence comprising the four deoxyribonucleotides.
  • the first mapping relationship is a correspondence between the first bit or the current bit of the first binary code sequence and the symbols in the first candidate symbol group, the symbols in the first candidate symbol group being two of A, C, G, and T.
  • the second mapping relationship is a correspondence between the first bit of the second binary code sequence and the symbols in the second candidate symbol group, or a correspondence between the current bit of the second binary code sequence together with the previous bit of the coding sequence and the symbols in the second candidate symbol group, the symbols in the second candidate symbol group being two of A, C, G, and T.
  • the second candidate symbol group has the same symbol as the first candidate symbol group.
  • a storage method comprising: splitting a nucleic acid sequence obtained by the encoding method according to any of the above embodiments into a plurality of sequence fragments; adding an index identifier to each sequence fragment, the index identifier including positional order information of the sequence fragment; and synthesizing each sequence fragment into a nucleic acid fragment.
  • each of the nucleic acid fragments is stored in a medium, which is a storage tube or cell.
  • the index identifier is a deoxyribonucleic acid sequence.
  • each nucleic acid fragment is assembled prior to storing the nucleic acid fragments in a medium.
  • each nucleic acid fragment is ligated into a vector prior to storing the nucleic acid fragments in a medium.
  • a decoding method including:
  • the encoded sequence generated by the encoding method according to any of the above embodiments is decoded into a first binary code sequence and a second binary code sequence, wherein the first binary code sequence is obtained by the following step: according to the first mapping relationship in the encoding method according to any of the above embodiments, two of the four different symbols included in the coding sequence are decoded to 0, and the other two of the four different symbols are decoded to 1.
  • the nucleic acid fragments synthesized by the storage method described in any of the above embodiments are sequenced to obtain each sequence segment; the positional order information of each sequence segment is obtained according to the index identifier of each sequence segment; and the sequence segments are combined into the code sequence according to the positional order information.
  • the four different symbols are adenine A, cytosine C, guanine G, and thymine T, four deoxyribonucleotides.
  • the decoded binary code sequence is combined into a binary code; the binary code is transcoded into corresponding information.
  • an encoder comprising: a memory configured to store a first binary code sequence and a second binary code sequence to be encoded, the first binary code sequence having the same number of bits as the second binary code sequence; and a processor coupled to the memory and configured to encode the first binary code sequence and the second binary code sequence into a code sequence consisting of four different symbols, wherein the code sequence is obtained by the following steps: determining the first bit of the code sequence according to the first bit of the first binary code sequence, the first bit of the second binary code sequence, and a reference symbol, the reference symbol being any one of the four different symbols; and determining the current bit of the code sequence according to the current bit of the first binary code sequence, the current bit of the second binary code sequence, and the previous bit of the code sequence, the current bit of the code sequence being any bit other than the first bit of the code sequence.
  • the processor is configured to determine the current bit of the coding sequence by: determining, according to a preset first mapping relationship and the current bit of the first binary code sequence, a first candidate symbol group for the current bit of the coding sequence, the first candidate symbol group comprising two of the four different symbols; determining, according to a preset second mapping relationship, the current bit of the second binary code sequence, and the previous bit of the coding sequence, a second candidate symbol group for the current bit of the coding sequence, the second candidate symbol group comprising two of the four different symbols, the first candidate symbol group and the second candidate symbol group sharing one symbol; and determining the shared symbol as the current bit of the coding sequence.
  • the processor is configured to transcode the information to be encoded into a binary code from which the first binary code sequence and the second binary code sequence are extracted.
  • the four different symbols are the four deoxyribonucleotides adenine A, cytosine C, guanine G, and thymine T, and the coding sequence is a nucleic acid sequence comprising the four deoxyribonucleotides.
  • the first mapping relationship is a correspondence between the first bit or the current bit of the first binary code sequence and the symbols in the first candidate symbol group, the symbols in the first candidate symbol group being two of A, C, G, and T.
  • the second mapping relationship is a correspondence between the first bit of the second binary code sequence and the symbols in the second candidate symbol group, or a correspondence between the current bit of the second binary code sequence together with the previous bit of the coding sequence and the symbols in the second candidate symbol group, the symbols in the second candidate symbol group being two of A, C, G, and T.
  • the second candidate symbol group has the same symbol as the first candidate symbol group.
  • a storage device comprising: a sequence splitting module configured to split a nucleic acid sequence obtained by the encoding method according to any of the above embodiments into a plurality of sequence fragments; an index adding module connected to the sequence splitting module and configured to add an index identifier to each sequence fragment, the index identifier including positional order information of the sequence fragment; and a nucleic acid synthesis module coupled to the index adding module and configured to synthesize the sequence fragments into individual nucleic acid fragments.
  • the index identifier is a deoxyribonucleic acid sequence.
  • a nucleic acid assembly module is further included, the nucleic acid assembly module being coupled to the nucleic acid synthesis module and configured to assemble the nucleic acid fragments.
  • a vector linkage module is further included, the vector linkage module being coupled to the nucleic acid synthesis module and configured to link the nucleic acid fragments into a vector.
  • a media storage module is further included, the media storage module being coupled to the nucleic acid synthesis module, configured to store the nucleic acid fragments in a medium, the medium being a storage tube or a cell.
  • a decoder comprising: a memory configured to store an encoded sequence generated by an encoder according to any of the above embodiments; and
  • a processor coupled to the memory and configured to: decode, according to the first mapping relationship in the encoder according to any of the above embodiments, two of the four different symbols included in the encoded sequence to 0 and the other two of the four different symbols to 1, to obtain the first binary code sequence; and obtain the second binary code sequence by the following steps: determining the first bit of the second binary code sequence according to the first bit of the coding sequence, the reference symbol, and the second mapping relationship in the encoder according to any of the foregoing embodiments, the reference symbol being any one of the four different symbols; and determining the current bit of the second binary code sequence according to the current bit and the previous bit of the coding sequence and the second mapping relationship, the current bit of the coding sequence being any bit other than the first bit of the coding sequence.
  • each nucleic acid fragment synthesized according to the storage method according to any of the above embodiments is sequenced; positional order information of each nucleic acid fragment is obtained according to the index identifier of each nucleic acid fragment; and the nucleic acid fragments are assembled into the coding sequence according to the positional order information.
  • the four different symbols are adenine A, cytosine C, guanine G, and thymine T, four deoxyribonucleotides.
  • the processor is configured to combine the decoded binary code sequences into a binary code and to transcode the binary code into the corresponding information.
  • a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements at least one of the following methods: the encoding method described in any of the above embodiments, and the decoding method described in any of the above embodiments.
  • the encoding method is set with the previous bit in the encoding sequence as a constraint.
  • the encoding method encodes two different binary code sequences into one code sequence consisting of four different symbols, thereby increasing the storage density.
  • the coding method has multiple joint coding implementations, so the mapping relationship between the binary code and the coded symbols can be set more flexibly, thereby avoiding high-GC and high-AT repeats in the coding sequence and the resulting problem of low accuracy in subsequent decoding.
  • FIG. 1 shows a flow chart of an encoding method in accordance with some embodiments of the present disclosure.
  • FIG. 2 shows a flow chart of an encoding method in accordance with further embodiments of the present disclosure.
  • FIG. 3 shows a schematic diagram of an encoding method in accordance with further embodiments of the present disclosure.
  • FIG. 4 illustrates a flow chart of a storage method in accordance with some embodiments of the present disclosure.
  • FIG. 5 illustrates a flow chart of a decoding method in accordance with some embodiments of the present disclosure.
  • FIG. 6 shows a schematic diagram of information to be encoded, in accordance with some embodiments of the present disclosure.
  • Figures 7a-7c show sequenced peaks of sequence fragments 1-3.
  • FIG. 8 shows a block diagram of an encoder of some embodiments of the present disclosure.
  • FIG. 9 shows a block diagram of a storage device of some embodiments of the present disclosure.
  • FIG. 10 shows a block diagram of a storage device of further embodiments of the present disclosure.
  • Figure 11 shows a block diagram of a decoder of some embodiments of the present disclosure.
  • the encoding method of the present disclosure is capable of encoding a first binary code sequence (e.g., sequence a) and a second binary code sequence (e.g., sequence b) having the same number of bits into one code sequence.
  • information to be encoded (such as pictures, videos, voice recordings, or documents) is first transcoded into a binary code, and sequence a and sequence b can then be extracted from the binary code.
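The preparation step can be sketched as follows. This is only an illustration under stated assumptions: the information is taken to be UTF-8 text, and the binary code is simply cut into two halves, because the text does not say how sequence a and sequence b are extracted. The function names `text_to_bits` and `split_into_two_sequences` are hypothetical.

```python
from typing import Tuple


def text_to_bits(text: str) -> str:
    """Transcode text into a binary code string (assumption: UTF-8, 8 bits per byte)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def split_into_two_sequences(bits: str) -> Tuple[str, str]:
    """Cut the binary code into two sequences with the same number of bits.

    Assumption: the first half becomes sequence a, the second half sequence b.
    """
    if len(bits) % 2 != 0:
        raise ValueError("binary code must contain an even number of bits")
    half = len(bits) // 2
    return bits[:half], bits[half:]


bits = text_to_bits("DNA")
seq_a, seq_b = split_into_two_sequences(bits)
print(seq_a, seq_b)
```

Any other extraction rule (for example, taking alternating bits) would work equally well, as long as both sequences end up with the same number of bits and the decoder applies the inverse rule.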
  • FIG. 1 shows a flow chart of an encoding method in accordance with some embodiments of the present disclosure.
  • the encoding method may specifically include: step 110, determining a first bit of the coding sequence; and step 120, determining a current bit of the coding sequence.
  • a first bit of the encoded sequence can be determined based on the first bit of the first binary code sequence, the first bit of the second binary code sequence, and the reference symbol.
  • the code sequence can be composed of four different symbols, and the reference symbol can be any one of the four different symbols.
  • the current bit of the coding sequence is determined according to the current bit of the first binary code sequence, the current bit of the second binary code sequence, and the previous bit of the coding sequence, and the current bit of the coding sequence is any bit other than the first bit of the coding sequence.
  • the coding principles for the "first bit" and the "current bit" of the code sequence are the same, except that the "first bit" has no "previous bit" in the code sequence, so a reference symbol can be specified as the "previous bit" of the "first bit". For the sake of brevity and convenience of expression, the "current bit" and "previous bit" are used below to describe embodiments of the present disclosure.
  • the current bit of the encoded sequence can be determined by the method of Figure 2.
  • FIG. 2 shows a flow chart of an encoding method in accordance with further embodiments of the present disclosure.
  • the encoding method includes: step 1201, determining a first candidate symbol group of a current bit of a coding sequence; step 1202, determining a second candidate symbol group of a current bit of the coding sequence; and step 1203, determining a coding sequence Current bit.
  • a first candidate symbol group of a current bit of the coding sequence may be determined according to a preset first mapping relationship according to a current bit in the first binary code sequence, where the first candidate symbol group includes four different types. Two of the symbols.
  • the first mapping relationship may be set to map 0 to symbol 1 or symbol 2, and 1 to symbol 3 or symbol 4.
  • the first candidate symbol group corresponding to 0 includes symbol 1 and symbol 2
  • the first candidate symbol group corresponding to 1 includes symbol 3 and symbol 4.
  • a second candidate symbol group of a current bit of the coding sequence may be determined according to a preset second mapping relationship according to a current bit of the second binary code sequence and a previous bit of the coding sequence.
  • the second candidate symbol group contains two of four different symbols, the first candidate symbol group having the same symbol as the second candidate symbol group.
  • the second mapping relationship can be set according to Table 1.
  • for example, when the previous bit of the coding sequence is symbol 1, the second candidate symbol group corresponding to the current bit X of the second binary code sequence includes symbol 1 and symbol 3; when the previous bit of the coding sequence is symbol 2, the second candidate symbol group corresponding to the current bit Y of the second binary code sequence includes symbol 2 and symbol 4.
  • the setting of the first mapping relationship and the second mapping relationship can ensure that the first candidate symbol group and the second candidate symbol group have one and the same symbol.
  • FIG. 3 shows a schematic diagram of an encoding method in accordance with further embodiments of the present disclosure.
  • the binary code sequence 31 is 001100111100110011001100110011
  • the binary code sequence 32 is 010101111101111111111111111111.
  • the four symbols of the coding method in the embodiments of Figures 1 and 2 may be adenine A, cytosine C, guanine G and thymine T, and coding sequence 33 is a deoxyribonucleotide sequence comprising A, C, G and T.
  • the first mapping relationship can be set to:
  • the first candidate symbol group of the current bit of the code sequence 33 corresponding to the current bit 0 in the binary code sequence 31 includes A and T, and the first candidate symbol group of the current bit of the code sequence 33 corresponding to the current bit 1 in the binary code sequence 31 includes G and C.
  • the second mapping relationship can be set to:
  • for example, when the previous bit of the coding sequence 33 is A or G, the second candidate symbol group of the current bit of the coding sequence 33 corresponding to the current bit 0 in the binary code sequence 32 contains A and G.
  • the second mapping relationship may also be established according to the correspondence in the above table.
  • the following can determine all "current bits" of the code sequence 33 according to the first and second mapping relationships described above.
  • If the previous bit of the code sequence 33 is G or A, the current bit in the binary code sequence 31 is 0, and the current bit in the binary code sequence 32 is 0, the current bit of the code sequence 33 is A. If the previous bit of the code sequence 33 is G or A, the current bit in the binary code sequence 31 is 0, and the current bit in the binary code sequence 32 is 1, the current bit of the code sequence 33 is T. If the previous bit of the code sequence 33 is G or A, the current bit in the binary code sequence 31 is 1, and the current bit in the binary code sequence 32 is 0, the current bit of the code sequence 33 is G. If the previous bit of the code sequence 33 is G or A, the current bit in the binary code sequence 31 is 1, and the current bit in the binary code sequence 32 is 1, the current bit of the code sequence 33 is C.
  • If the previous bit of the code sequence 33 is C or T, the current bit in the binary code sequence 31 is 0, and the current bit in the binary code sequence 32 is 0, the current bit of the code sequence 33 is T. If the previous bit of the code sequence 33 is C or T, the current bit in the binary code sequence 31 is 0, and the current bit in the binary code sequence 32 is 1, the current bit of the code sequence 33 is A. If the previous bit of the code sequence 33 is C or T, the current bit in the binary code sequence 31 is 1, and the current bit in the binary code sequence 32 is 0, the current bit of the code sequence 33 is C. If the previous bit of the code sequence 33 is C or T, the current bit in the binary code sequence 31 is 1, and the current bit in the binary code sequence 32 is 1, the current bit of the code sequence 33 is G.
  • the coding sequence 33 is encoded bit by bit according to the above method, and the coding sequence 33 can be obtained as ATCGATGCGCTACGTACGTACG.
  • for the first bit of the coding sequence 33, a reference bit can be set as its "previous bit".
  • any one of A, C, G, or T may be placed in front of the coding sequence 33 as the reference bit and used as the "previous bit" in the above coding method; the remaining steps are the same and are not described again here.
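The eight rules above can be sketched in a few lines of Python. This is an illustrative reading, not the applicant's implementation: the first mapping (0 to {A, T}, 1 to {G, C}) is taken from the text, the second mapping is expressed as "bit 0 keeps the previous symbol's group ({A, G} or {C, T}), bit 1 switches it", which is what the eight rules amount to, and the reference symbol `A` is an assumption. The inputs are the first 22 bits of binary code sequences 31 and 32, the portion matching the 22-symbol coding sequence 33 quoted in the text.

```python
PURINES = {"A", "G"}

# First mapping (from the text): bit 0 -> {A, T}, bit 1 -> {G, C}.
FIRST_MAPPING = {"0": {"A", "T"}, "1": {"G", "C"}}


def second_candidates(bit_b: str, previous: str) -> set:
    """Second mapping: bit 0 keeps the previous symbol's group, bit 1 switches it."""
    prev_is_purine = previous in PURINES
    keep_group = bit_b == "0"
    return {"A", "G"} if prev_is_purine == keep_group else {"C", "T"}


def encode(seq_a: str, seq_b: str, reference: str = "A") -> str:
    """Jointly encode two equal-length bit strings into one A/C/G/T sequence.

    `reference` stands in for the "previous bit" of the first position
    (assumption: "A"; the text allows any of the four symbols).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("both binary code sequences must have the same length")
    previous = reference
    symbols = []
    for bit_a, bit_b in zip(seq_a, seq_b):
        # The two candidate groups always share exactly one symbol.
        (symbol,) = FIRST_MAPPING[bit_a] & second_candidates(bit_b, previous)
        symbols.append(symbol)
        previous = symbol
    return "".join(symbols)


code = encode("0011001111001100110011", "0101011111011111111111")
print(code)
```

With a reference symbol from the {A, G} group, the sketch reproduces the sequence ATCGATGCGCTACGTACGTACG given above; a reference symbol from the {C, T} group would change the first symbol and, through the dependence on the previous bit, the symbols that follow.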
  • the above encoding process considers the contents of the code in the binary code sequences 31 and 32 to determine the content in the final encoded sequence 33. That is, the information of two different binary code sequences can be fused into one and the same code sequence, thereby increasing the storage density of the code.
  • first mapping relationship can also be set to:
  • the second mapping relationship can be correspondingly set to:
  • the first mapping relationship and the second mapping relationship may ensure that one identical symbol exists in the first candidate symbol group and the second candidate symbol group, that is, two different binary code sequences may be encoded into the same coding sequence.
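Whether a given pair of mapping relationships has this property can be checked exhaustively, since there are only 2 x 2 x 4 combinations of current bits and previous symbol. The sketch below is illustrative only; `is_valid_pair` and `group_switch_rule` are hypothetical names, and the second mapping is written as the "keep or switch the {A, G} / {C, T} group" rule used in the worked example.

```python
from itertools import product

SYMBOLS = "ACGT"


def is_valid_pair(first_mapping, second_mapping) -> bool:
    """Return True if every combination yields exactly one shared symbol."""
    for bit_a, bit_b, previous in product("01", "01", SYMBOLS):
        shared = first_mapping[bit_a] & second_mapping(bit_b, previous)
        if len(shared) != 1:
            return False
    return True


def group_switch_rule(bit_b: str, previous: str) -> set:
    """Second mapping of the worked example: 0 keeps the group, 1 switches it."""
    same_group = {"A", "G"} if previous in "AG" else {"C", "T"}
    other_group = set(SYMBOLS) - same_group
    return same_group if bit_b == "0" else other_group


good_first = {"0": {"A", "T"}, "1": {"G", "C"}}  # first mapping from the text
bad_first = {"0": {"A", "G"}, "1": {"C", "T"}}   # same grouping as the second mapping

print(is_valid_pair(good_first, group_switch_rule))
print(is_valid_pair(bad_first, group_switch_rule))
```

The second call fails the check because the first and second candidate groups then coincide or are disjoint, so they share two symbols or none. The two mappings must partition the four symbols differently for the joint encoding to be unambiguous.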
  • the first mapping relationship and the second mapping relationship of the present disclosure may have multiple joint setting manners, which may be specifically set by the following steps.
  • the setting of the first mapping relationship can be expressed as:
  • Symbol 1, symbol 2, symbol 3, and symbol 4 may correspond to one of bases A, T, G, and C, respectively.
  • symbol 1 and symbol 2 in the first mapping relationship are unordered and correspond to two of the bases, while symbol 3 and symbol 4 correspond to the other two bases, so the first mapping relationship has multiple possible settings.
  • the present disclosure can transform a plurality of mapping relationships to encode a binary code sequence, thereby maximally avoiding high GC, high AT repetition problems occurring in the coding sequence.
  • the encoding method is set with the former bit in the encoding sequence as a constraint.
  • the encoding method encodes two different binary code sequences into one code sequence consisting of four different symbols, thereby increasing the storage density.
  • the coding method has multiple joint coding implementations, so the mapping relationship between the binary code and the coded symbols can be set more flexibly, thereby avoiding high-GC and high-AT repeats in the coding sequence and the resulting problem of low accuracy in subsequent decoding.
  • the binary code sequence can be encoded as a nucleic acid sequence comprising A, C, G, T.
  • the nucleic acid sequence can be further synthesized into nucleic acid fragments by the storage method in FIG. 4.
  • FIG. 4 illustrates a flow chart of a storage method in accordance with some embodiments of the present disclosure.
  • the storage method includes steps 410-430.
  • In step 410, the nucleic acid sequence obtained according to the encoding method in some of the above embodiments is split into a plurality of sequence fragments. These sequence fragments are relatively short and are therefore convenient to synthesize.
  • In step 420, an index identifier is added to each sequence fragment; the index identifier contains the positional order information of the sequence fragment.
  • the index identifier can be a deoxyribonucleic acid sequence.
  • In step 430, each sequence fragment is synthesized into a nucleic acid fragment.
  • Each nucleic acid fragment can be assembled directly into larger fragments, or each nucleic acid fragment can be ligated into a vector.
  • Each nucleic acid fragment can be stored in a medium, which can be a storage tube or a cell, for example, can be stored in an ex vivo chemical medium or preserved in living cells.
  • the nucleic acid sequences are synthesized into nucleic acid fragments for storage, thereby improving the storage time and storage density of the data storage.
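The in-silico part of the storage method (splitting and indexing, before any synthesis) can be sketched as below. The fixed fragment length, the placement of the index identifier at the start of each fragment, and the function names are assumptions made for illustration; the three identifiers are the ones quoted in the worked example of this document.

```python
from typing import List


def split_into_fragments(sequence: str, fragment_length: int) -> List[str]:
    """Split a nucleic acid sequence into consecutive fragments of a fixed length."""
    if fragment_length <= 0:
        raise ValueError("fragment_length must be positive")
    return [
        sequence[start:start + fragment_length]
        for start in range(0, len(sequence), fragment_length)
    ]


def add_index_identifiers(fragments: List[str], identifiers: List[str]) -> List[str]:
    """Prefix each fragment with the identifier that encodes its position.

    Assumption: identifiers[i] marks the fragment at position i.
    """
    if len(identifiers) < len(fragments):
        raise ValueError("not enough index identifiers for the fragments")
    return [identifiers[i] + fragment for i, fragment in enumerate(fragments)]


fragments = split_into_fragments("ATCGATGCGCTACGTACGTACG", 8)
indexed = add_index_identifiers(fragments, ["AGTCG", "ACGCT", "CAATG"])
print(indexed)
```

The indexed strings are what would be handed to synthesis; the last fragment may be shorter than the others when the sequence length is not a multiple of the fragment length.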
  • the coding sequence can be decoded by the decoding method corresponding to the encoding method, using the steps in FIG. 5.
  • FIG. 5 illustrates a flow chart of a decoding method in accordance with some embodiments of the present disclosure.
  • the decoding method may decode the encoded sequence generated according to the foregoing encoding method into a first binary code sequence and a second binary code sequence, and specifically includes steps 510-530.
  • In step 510, two of the four different symbols included in the coding sequence are decoded to 0 according to the first mapping relationship in the foregoing encoding method, and the other two of the four different symbols are decoded to 1, to obtain the first binary code sequence.
  • In step 520, the first bit of the second binary code sequence is determined according to the first bit of the coding sequence, the reference symbol, and the second mapping relationship in the foregoing encoding method, the reference symbol being any one of the four different symbols.
  • In step 530, the current bit of the second binary code sequence is determined according to the current bit and the previous bit of the coding sequence and the second mapping relationship, the current bit of the coding sequence being any bit other than the first bit of the coding sequence.
  • the decoded binary code sequence can then be combined into a binary code, and the binary code can be transcoded into corresponding information, such as audio, video, document, and the like.
  • This step can be implemented by a program that comes with any operating system or a program that is specially written to convert binary code into corresponding information.
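A minimal sketch of such a conversion program, assuming the original information was UTF-8 text and that the binary code is recovered by concatenating sequence a and sequence b (both are assumptions; the text leaves the recombination rule open). The function name `bits_to_text` is hypothetical.

```python
def bits_to_text(bits: str) -> str:
    """Transcode a binary code string back into text (assumption: UTF-8 bytes)."""
    if len(bits) % 8 != 0:
        raise ValueError("binary code length must be a multiple of 8")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


# Two decoded sequences of equal length, recombined by simple concatenation.
seq_a = "010001000100"
seq_b = "111001000001"
text = bits_to_text(seq_a + seq_b)
print(text)
```

For other kinds of information (audio, video, images) the recovered bytes would be written to a file in the appropriate format instead of being decoded as text.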
  • each nucleic acid fragment synthesized according to the above storage method can be sequenced to obtain each sequence fragment.
  • the sequencing method can be Sanger sequencing or high throughput sequencing.
  • the position order information of each sequence segment is obtained, that is, the sequence segments are sorted.
  • each sequence segment is combined into a coding sequence based on the positional order information.
  • the coding sequence is a sequence comprising four deoxyribonucleotides of A, C, G, T.
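The sorting and recombination of sequenced fragments can be sketched as follows. The sketch assumes that each read starts with a 5-nt index identifier and that the identifiers AGTCG, ACGCT, and CAATG (quoted in the worked example of this document) mark positions 1, 2, and 3 in that order; `INDEX_ORDER`, `INDEX_LENGTH`, and `assemble` are hypothetical names.

```python
from typing import List

# Assumed lookup from index identifier to fragment position.
INDEX_ORDER = {"AGTCG": 0, "ACGCT": 1, "CAATG": 2}
INDEX_LENGTH = 5


def assemble(reads: List[str]) -> str:
    """Sort reads by their index identifier, strip it, and join the payloads."""
    ordered = sorted(reads, key=lambda read: INDEX_ORDER[read[:INDEX_LENGTH]])
    return "".join(read[INDEX_LENGTH:] for read in ordered)


# Reads arrive from sequencing in arbitrary order.
reads = ["CAATGCGTACG", "AGTCGATCGATGC", "ACGCTGCTACGTA"]
assembled = assemble(reads)
print(assembled)
```

A read whose prefix is not a known identifier raises a `KeyError` here; a practical pipeline would need to handle sequencing errors in the index region, which is outside the scope of this sketch.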
  • the code sequence 33 can be decoded into a binary code sequence 31 according to the first mapping relationship: 001100111100110011001100110011.
  • the code sequence 33 can be decoded into a binary code sequence 32 according to the second mapping relationship: 010101111101111111111111111.
  • the binary code sequence 32 can be decoded by the following steps.
  • If the previous bit of the code sequence 33 is A or G and the current bit is A or G, the current bit of the binary code sequence 32 is decoded as 0. If the previous bit of the code sequence 33 is A or G and the current bit is T or C, it is decoded as 1. If the previous bit of the code sequence 33 is T or C and the current bit is A or G, it is decoded as 1. If the previous bit of the code sequence 33 is T or C and the current bit is T or C, it is decoded as 0.
  • the reference bit set in advance at the time of encoding can be used as the "previous bit" of the first bit, and the remaining steps are the same.
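Both decoding passes can be sketched together. The sketch follows the rules stated above (A or T decodes to 0 and G or C to 1 for the first sequence; staying in the same {A, G} or {C, T} group decodes to 0 and switching groups to 1 for the second sequence). The reference symbol `A` and the function name `decode` are assumptions for illustration.

```python
from typing import Tuple

ZERO_SYMBOLS = {"A", "T"}   # first mapping: A/T -> 0, G/C -> 1
PURINES = {"A", "G"}        # grouping used by the second mapping


def decode(code: str, reference: str = "A") -> Tuple[str, str]:
    """Recover both binary code sequences from one A/C/G/T coding sequence.

    `reference` must be the symbol that was used as the "previous bit" of the
    first position during encoding (assumption: "A").
    """
    seq_a, seq_b = [], []
    previous = reference
    for symbol in code:
        if symbol not in "ACGT":
            raise ValueError(f"unexpected symbol: {symbol!r}")
        seq_a.append("0" if symbol in ZERO_SYMBOLS else "1")
        same_group = (symbol in PURINES) == (previous in PURINES)
        seq_b.append("0" if same_group else "1")
        previous = symbol
    return "".join(seq_a), "".join(seq_b)


first, second = decode("ATCGATGCGCTACGTACGTACG")
print(first)
print(second)
```

Applied to the 22-symbol coding sequence 33, the sketch returns two 22-bit strings that match the leading 22 bits of binary code sequences 31 and 32. Note that the first sequence does not depend on the reference symbol at all, while the second depends on it only in its first bit.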
  • two different binary code sequences can be decoded from a code sequence composed of four different symbols according to different mapping relationships, thereby improving the code storage density.
  • the information to be encoded is transcoded into a binary code.
  • Figure 6 shows a schematic diagram of information to be encoded, in accordance with some embodiments of the present disclosure. Encode the text in Figure 6 to get the corresponding binary code:
  • the binary code is divided into two parts, sequence a and sequence b.
  • Sequence a is:
  • Sequence b is:
  • sequence a and the sequence b are jointly encoded into the same code sequence by using the first mapping relationship and the second mapping relationship.
  • the first mapping relationship is set to:
  • the second mapping relationship is set to:
  • the coding sequence (SEQ ID NO: 1) obtained after the joint coding is:
  • the above coding sequence is a nucleic acid sequence comprising the four deoxyribonucleotides.
  • the nucleic acid sequence is synthesized and stored in accordance with the storage method of the present disclosure.
  • the above nucleic acid sequence was split into three sequence fragments of 173 bp in length.
  • index identifiers of the three sequence fragments are: AGTCG, ACGCT, and CAATG.
  • Table 7 shows the sequence fragments after the index identifiers are added.
  • the nucleic acid fragments stored in the centrifuge tube can be decoded as necessary to obtain corresponding document information.
  • the stored nucleic acid fragments can be subjected to Sanger sequencing to obtain sequence fragments 1-3.
  • Figures 7a-7c show sequenced peaks of sequence fragments 1-3.
  • each sequence segment is obtained according to the index identifier, and the sequence segments are sorted and assembled into a complete coding sequence.
  • the coding sequence (SEQ ID NO: 1) obtained after assembly is:
  • sequence a and sequence b can be converted to the text in Figure 6 using software.
  • the document information is stored in nucleic acid fragments by the technical solution of the present disclosure, and 100% of the stored document information can be decoded.
  • excluding the index identifiers, the obtained coding sequence stores the document information at a density of 2 bits/nt, which is significantly higher than that of storage methods in the related art.
  • the content of consecutive GC and consecutive AT in the coding sequence is uniform, and there are no long single-base repeats; that is, high-GC and high-AT repeats are avoided, so that the subsequent decoding of the sequence fragments is more accurate.
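The splitting, indexing, and reassembly steps of this example can be sketched as follows. This is an illustrative sketch under stated assumptions: the index identifier is assumed to be prepended to each fragment, the fragments are assumed to be of equal length, and the helper names `split_and_index` and `reassemble` are hypothetical. The three identifiers are those given in the example; the input sequence is a placeholder and is not SEQ ID NO: 1.

```python
import random

# Index identifiers from the example above, in positional order.
INDEX_IDENTIFIERS = ["AGTCG", "ACGCT", "CAATG"]


def split_and_index(sequence: str, identifiers: list[str]) -> list[str]:
    """Split a sequence into len(identifiers) pieces and prefix each with its identifier."""
    count = len(identifiers)
    size = -(-len(sequence) // count)  # ceiling division
    pieces = [sequence[i * size:(i + 1) * size] for i in range(count)]
    return [ident + piece for ident, piece in zip(identifiers, pieces)]


def reassemble(fragments: list[str], identifiers: list[str]) -> str:
    """Sort fragments by the position of their identifier and join the payloads."""
    width = len(identifiers[0])
    order = {ident: pos for pos, ident in enumerate(identifiers)}
    ordered = sorted(fragments, key=lambda frag: order[frag[:width]])
    return "".join(frag[width:] for frag in ordered)


# Placeholder sequence for illustration only; it is not SEQ ID NO: 1.
original = "ACGTTGCAAGCTTAGGCA"
stored = split_and_index(original, INDEX_IDENTIFIERS)
random.shuffle(stored)  # fragments are read back in no particular order
assert reassemble(stored, INDEX_IDENTIFIERS) == original
```

The identifier carries only positional order, so the payload can be recovered regardless of the order in which the fragments are sequenced.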
  • FIG. 8 shows a block diagram of an encoder of some embodiments of the present disclosure.
  • the encoder 8 includes a memory 81 and a processor 82.
  • the memory 81 stores a first binary code sequence and a second binary code sequence to be encoded, the first binary code sequence having the same number of bits as the second binary code sequence.
  • the processor 82 is coupled to the memory 81 and is configured to encode the first binary code sequence and the second binary code sequence into a single coding sequence. For example, the processor 82 transcodes the information to be encoded into a binary code, from which the first binary code sequence and the second binary code sequence are extracted.
  • the coding sequence may be composed of four different symbols, for example the four deoxyribonucleotides A, C, G, and T, in which case the coding sequence is a nucleic acid sequence comprising the four deoxyribonucleotides.
  • the coding sequence can be obtained by the following steps.
  • a first bit of the coding sequence is determined based on the first bit of the first binary code sequence, the first bit of the second binary code sequence, and the reference symbol, the reference symbol being any one of the four different symbols. For example, a first candidate symbol group for the first bit of the coding sequence is determined from the first bit of the first binary code sequence according to a preset first mapping relationship, where the first candidate symbol group includes two of the four different symbols. A second candidate symbol group for the first bit of the coding sequence is determined from the first bit of the second binary code sequence and the reference symbol according to a preset second mapping relationship, where the second candidate symbol group includes two of the four different symbols, and the first candidate symbol group and the second candidate symbol group have exactly one symbol in common. That common symbol is determined as the first bit of the coding sequence.
  • the first mapping relationship is a correspondence between the first bit or the current bit of the first binary code sequence and the symbols in the first candidate symbol group, and the symbols in the first candidate symbol group are two of A, C, G, and T.
  • the second mapping relationship is a correspondence between the first bit of the second binary code sequence together with the reference symbol and the symbols in the second candidate symbol group, or a correspondence between the current bit of the second binary code sequence together with the previous bit of the coding sequence and the symbols in the second candidate symbol group; the symbols in the second candidate symbol group are two of A, C, G, and T.
  • the encoding method uses the previous bit of the coding sequence as a constraint.
  • the encoding method encodes two different binary code sequences into one code sequence consisting of four different symbols, thereby increasing the storage density.
  • the coding method has multiple joint coding implementations, which can more flexibly set the mapping relationship between the binary code and the coded symbols, thereby avoiding the problems of high GC and high AT repetition occurring in the coding sequence.
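The joint encoding performed by the processor 82 can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation. The first mapping shown here (0 to {A, T}, 1 to {G, C}) is an assumption chosen for illustration, and the second mapping is taken to mirror the A/G versus T/C decoding rule given earlier in this description. The names `FIRST_MAPPING`, `second_group`, and `encode` are hypothetical.

```python
# First mapping (assumed for illustration): bit -> candidate symbol group.
FIRST_MAPPING = {"0": {"A", "T"}, "1": {"G", "C"}}

# Symbol classes behind the second mapping: A/G versus T/C.
PURINES = {"A", "G"}
PYRIMIDINES = {"T", "C"}


def second_group(bit: str, previous: str) -> set[str]:
    """Second mapping: 0 keeps the class of the previous symbol, 1 switches it."""
    same = PURINES if previous in PURINES else PYRIMIDINES
    other = PYRIMIDINES if previous in PURINES else PURINES
    return same if bit == "0" else other


def encode(first_bits: str, second_bits: str, reference: str) -> str:
    """Jointly encode two equal-length bit strings into one symbol sequence."""
    if len(first_bits) != len(second_bits):
        raise ValueError("both binary sequences must have the same number of bits")
    symbols = []
    previous = reference  # the reference symbol constrains the first position
    for bit_a, bit_b in zip(first_bits, second_bits):
        common = FIRST_MAPPING[bit_a] & second_group(bit_b, previous)
        (symbol,) = common  # the two candidate groups share exactly one symbol
        symbols.append(symbol)
        previous = symbol
    return "".join(symbols)


print(encode("0011", "0101", reference="A"))  # prints ATCG
```

With these assumed mappings, each group of the first mapping contains one symbol from each class of the second mapping, which is why the two candidate groups always intersect in exactly one symbol.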
  • FIG. 9 shows a block diagram of a storage device of some embodiments of the present disclosure.
  • the storage device 9 includes a sequence splitting module 91, an index adding module 92, and a nucleic acid synthesizing module 93.
  • the sequence splitting module 91 splits the nucleic acid sequence obtained according to the above encoding method into a plurality of sequence fragments.
  • the index adding module 92 is connected to the sequence splitting module 91, and adds an index identifier to each sequence segment, where the index identifier includes position sequence information of the sequence segment.
  • the index identifier can be a deoxyribonucleic acid sequence.
  • the nucleic acid synthesis module 93 is coupled to the index addition module 92 to synthesize each sequence fragment into each nucleic acid fragment.
  • FIG. 10 shows a block diagram of a storage device of further embodiments of the present disclosure.
  • the storage device 10 includes a sequence splitting module 91, an index adding module 92, a nucleic acid synthesizing module 93, a nucleic acid assembly module 104, a vector ligation module 105, and a media storage module 106.
  • the functions of the sequence splitting module 91, the index adding module 92, and the nucleic acid synthesizing module 93 are the same as those of the above embodiment, and are not described again.
  • the nucleic acid assembly module 104 is coupled to the nucleic acid synthesis module 93 for assembling each nucleic acid fragment.
  • a vector ligation module 105 is coupled to the nucleic acid synthesis module 93 for ligating each nucleic acid fragment into a vector.
  • the media storage module 106 is coupled to the nucleic acid synthesis module 93 for storing each nucleic acid segment in a medium, which is a storage tube or cell.
  • the nucleic acid sequences are synthesized into nucleic acid fragments for storage, thereby improving the storage time and storage density of the data storage.
  • Figure 11 shows a block diagram of a decoder of some embodiments of the present disclosure.
  • the decoder 11 includes a memory 111 and a processor 112.
  • the memory 111 stores a code sequence generated by the above encoder.
  • the processor 112 is coupled to the memory 111 and configured as follows.
  • two of the four different symbols included in the coding sequence are decoded to 0, and the other two of the four different symbols are decoded to 1 to obtain a first binary code sequence.
  • the second binary code sequence can be obtained by the following steps.
  • the first bit of the second binary code sequence is determined from the first bit of the coding sequence and the reference symbol according to the second mapping relationship used in the encoder, the reference symbol being any one of the four different symbols.
  • the processor 112 obtains the coding sequence by performing the following steps. Each nucleic acid fragment synthesized according to the above storage method is sequenced. The positional order information of each nucleic acid fragment is obtained from its index identifier. The nucleic acid fragments are assembled into the coding sequence based on the positional order information.
  • processor 112 combines the decoded binary code sequences into binary code and transcodes the binary code into corresponding information.
  • two different binary code sequences can be decoded from a code sequence composed of four different symbols according to different mapping relationships, thereby improving the code storage density.
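The first-sequence decoding and the final transcoding step of the decoder can be sketched as follows. This is an illustrative sketch only. Which two symbols decode to 0 is an assumption (A and T here), as are the concatenation of the first sequence followed by the second and the use of UTF-8 for transcoding; none of these choices is specified in the text above. The names `FIRST_DECODE`, `decode_first`, and `bits_to_text` are hypothetical.

```python
# Assumed first mapping, inverted: which two symbols decode to 0 and which to 1.
FIRST_DECODE = {"A": "0", "T": "0", "G": "1", "C": "1"}


def decode_first(code_sequence: str) -> str:
    """Decode the first binary sequence symbol by symbol."""
    return "".join(FIRST_DECODE[symbol] for symbol in code_sequence)


def bits_to_text(first_bits: str, second_bits: str) -> str:
    """Concatenate the two bit strings and transcode them as UTF-8 text."""
    bits = first_bits + second_bits
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


print(decode_first("ATCG"))                  # prints 0011
print(bits_to_text("01001000", "01101001"))  # prints Hi
```

Unlike the second sequence, the first sequence needs no reference symbol, because each symbol is decoded independently of its neighbours.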

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present invention relates to an encoding/decoding method, an encoder/decoder, and a storage method and apparatus, pertaining to the technical field of data storage. The encoding method comprises: determining a first bit of a coded sequence according to a first bit of a first binary code sequence, a first bit of a second binary code sequence, and a reference symbol, the reference symbol being any one of four different symbols; and determining a current bit of the coded sequence according to a current bit of the first binary code sequence, a current bit of the second binary code sequence, and a previous bit of the coded sequence, the current bit of the coded sequence being any bit other than the first bit of the coded sequence. The present invention makes it possible to improve storage density and to avoid the problem of high GC and high AT repetition occurring in a coded sequence.
PCT/CN2018/103795 2017-10-25 2018-09-03 Encoding/decoding method, encoder/decoder, and storage method and apparatus WO2019080653A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880068914.8A CN111279422B (zh) 2017-10-25 2018-09-03 Encoding/decoding method, encoder/decoder, and storage method and device
US16/858,295 US20200321079A1 (en) 2017-10-25 2020-04-24 Encoding/decoding method, encoder/decoder, storage method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711009900.2 2017-10-25
CN201711009900 2017-10-25

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/858,295 Continuation-In-Part US20200321079A1 (en) 2017-10-25 2020-04-24 Encoding/decoding method, encoder/decoder, storage method and device

Publications (1)

Publication Number Publication Date
WO2019080653A1 (fr) 2019-05-02

Family

ID=66247716

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/103795 WO2019080653A1 (fr) 2017-10-25 2018-09-03 Encoding/decoding method, encoder/decoder, and storage method and apparatus

Country Status (3)

Country Link
US (1) US20200321079A1 (fr)
CN (1) CN111279422B (fr)
WO (1) WO2019080653A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
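CN113539371B (zh) * 2021-07-05 2023-06-23 南方科技大学 Sequence encoding method and apparatus, and readable storage medium
CN114758703B (zh) * 2022-06-14 2022-09-13 深圳先进技术研究院 Data information storage method based on recombinant plasmid DNA molecules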

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104662544A (zh) * 2012-07-19 2015-05-27 哈佛大学校长及研究员协会 Method of storing information using nucleic acids
CN105022935A (zh) * 2014-04-22 2015-11-04 中国科学院青岛生物能源与过程研究所 Encoding method and decoding method for information storage using DNA
CN106845158A (zh) * 2017-02-17 2017-06-13 苏州泓迅生物科技股份有限公司 Method for information storage using DNA

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
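JPWO2002101935A1 (ja) * 2001-06-06 2004-09-30 セイコーエプソン株式会社 Decoding device, decoding method, lookup table, and decoding program
US20060147946A1 (en) * 2004-01-27 2006-07-06 Pinchas Akiva Novel calcium channel variants and methods of use thereof
CN101565746B (zh) * 2009-06-03 2012-08-08 东南大学 Signal-combination-encoded DNA sequencing-by-ligation method with parity check
CN104850760B (zh) * 2015-03-27 2016-12-21 苏州泓迅生物科技有限公司 Information storage and reading method for an artificially synthesized DNA storage medium
CN105550570A (zh) * 2015-12-02 2016-05-04 深圳市同创国芯电子有限公司 Encryption and decryption method and device applied to a programmable device
CN106022006B (zh) * 2016-06-02 2018-08-10 广州麦仑信息科技有限公司 Storage method for binary representation of genetic information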

Also Published As

Publication number Publication date
CN111279422B (zh) 2023-12-22
US20200321079A1 (en) 2020-10-08
CN111279422A (zh) 2020-06-12

Similar Documents

Publication Publication Date Title
CN110945595B (zh) DNA-based data storage and retrieval
CN110603595B (zh) Methods and systems for reconstruction of genomic reference sequences from compressed genomic sequence reads
US20210050074A1 (en) Systems and methods for sequence encoding, storage, and compression
CN109830263B (zh) DNA storage method based on oligonucleotide sequence encoding and storage
Pan et al. Rewritable two-dimensional DNA-based data storage with machine learning reconstruction
WO2020132935A1 (fr) Method and device for fixed-point editing of a nucleotide sequence storing data
US20210074380A1 (en) Reverse concatenation of error-correcting codes in dna data storage
CN112527736B (zh) DNA-based data storage method, data recovery method, and terminal device
JP2017528796A (ja) Code generation method, code generation apparatus, and computer-readable storage medium
WO2019080653A1 (fr) Encoding/decoding method, encoder/decoder, and storage method and apparatus
KR20120137235A (ko) Method and apparatus for compressing genetic data
US20090045987A1 (en) Method and apparatus for encoding/decoding metadata
US11527307B2 (en) Quality score compression
US20210304841A1 (en) Efficient data structures for bioinformatics information representation
CN110867213A (zh) Method and apparatus for storing DNA data
Zhang et al. A high storage density strategy for digital information based on synthetic DNA
EP3583249A1 (fr) Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads
CN111095423B (zh) Encoding/decoding method and apparatus, and data processing apparatus
JP2020503580A (ja) Method and apparatus for compact representation of bioinformatics data
WO2022120626A1 (fr) DNA-based data storage method and apparatus, DNA-based data recovery method and apparatus, and terminal device
US20230032409A1 (en) Method for Information Encoding and Decoding, and Method for Information Storage and Interpretation
CN110915140A (zh) Method for encoding and decoding quality values of a data structure
WO2018071078A1 (fr) Method and apparatus for accessing bioinformatics data structured in access units
WO2023206023A1 (fr) Encoding method and encoding device for DNA storage
JP2003101485A (ja) Information communication method, information recording method, encoder, and decoder using a biopolymer as a communication medium or recording medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18871409

Country of ref document: EP

Kind code of ref document: A1