WO2023018157A1 - Method for encoding and decoding DNA data using a low density parity check code, program, and device - Google Patents

Publication number
WO2023018157A1
Authority
WO
WIPO (PCT)
Prior art keywords
decoding
ldpc
length
base sequences
ldpc decoding
Prior art date
Application number
PCT/KR2022/011804
Other languages
English (en)
Korean (ko)
Inventor
박성준
정재호
노종선
박호성
김성환
노알버트
Original Assignee
서울대학교 산학협력단
전남대학교산학협력단
울산대학교 산학협력단
홍익대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 서울대학교 산학협력단, 전남대학교산학협력단, 울산대학교 산학협력단, 홍익대학교 산학협력단
Publication of WO2023018157A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102 Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H03M13/1105 Decoding
    • H03M13/13 Linear codes
    • H03M13/15 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
    • H03M13/151 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
    • H03M13/1515 Reed-Solomon codes
    • H03M13/63 Joint error correction and other techniques
    • H03M13/635 Error control coding in combination with rate matching
    • H03M13/6362 Error control coding in combination with rate matching by puncturing
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/001 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits characterised by the elements used

Definitions

  • The present invention relates to a method, program, and device for encoding and decoding DNA data using a low density parity check (LDPC) code.
  • Digital data is growing exponentially. It is estimated that 175 zettabytes of data will be created in 2025. Processing and storing this growing volume of data calls for a new kind of storage device.
  • A DNA storage device stores data by converting it into sequences of the four bases A, C, G, and T.
  • DNA storage is currently a very active research field. Three types of errors occur in DNA storage devices: substitution errors, insertion errors, and deletion errors.
  • A substitution error is an error in which the base that should have been transmitted is replaced by one of the other three bases.
  • An insertion error is an error in which an extra base appears between the bases that should have been transmitted, and a deletion error is an error in which a base that should have been transmitted is missing.
  • A substitution error leaves the length of the sequence unchanged and affects only one base. An insertion or deletion error, by contrast, changes the length of the DNA and shifts the position of every subsequent base by one. Insertion and deletion errors are therefore far more damaging than substitution errors when storing and reading data in DNA storage devices. To protect a DNA storage device against these errors and to store and read data reliably, an error correction code must be applied to the DNA storage device.
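  • The difference between the error types can be seen on a short strand. The following is a minimal illustrative sketch only; the example strand and the helper name `mismatches` are hypothetical and do not come from the application.

```python
def mismatches(sent: str, received: str) -> int:
    """Count positions that differ, compared over the shared length."""
    return sum(a != b for a, b in zip(sent, received))

sent = "ACGTTGCAAGCT"                       # hypothetical stored strand
substituted = sent[:4] + "A" + sent[5:]     # one base replaced at index 4
deleted = sent[:4] + sent[5:]               # base at index 4 is missing
inserted = sent[:4] + "G" + sent[4:]        # extra base before index 4

# A substitution disturbs exactly one position; an insertion or deletion
# shifts every later base, so many positions no longer line up.
print("substitution:", mismatches(sent, substituted))
print("deletion:    ", mismatches(sent, deleted))
print("insertion:   ", mismatches(sent, inserted))
```

  • Compared position by position, the substituted strand differs in a single place, while the shifted strands differ in most positions after the error.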
  • The error rate of DNA storage devices is strongly affected by the biochemical structure of DNA.
  • Two constraints dominate the process of synthesizing, storing, and reading DNA. Many errors are known to occur when the proportion of G and C in the DNA is far from 50%, and many errors also occur when the same base is repeated several times in succession. Satisfying these two biochemical constraints therefore helps reduce errors in DNA storage devices.
  • An object of the present invention is to provide a method, program, and apparatus for encoding and decoding DNA data using a low density parity check code.
  • A DNA data encoding and decoding method using a low density parity check code, performed by an apparatus, includes the steps of: converting specific data into a first binary vector having a preset first length; dividing the first binary vector into a preset first number of second binary vectors each having a preset second length; aligning the first number of second binary vectors in the vertical direction; performing LDPC (low density parity check) encoding in the horizontal direction using parity of a preset third length; assigning an address value of a preset fourth length to each of a second number of vertically aligned second binary vectors, the second number corresponding to the sum of the first number and the third length; replacing each of the vertically aligned second binary vectors, each having a fifth length corresponding to the sum of the second length and the fourth length, with DNA bases at one base per 2 bits; and performing puncturing on the second number of base sequences obtained by the substitution, each of which has half the fifth length.
  • The method may further include randomizing the first binary vector through an XOR operation with a random number before dividing the first binary vector into the first number of second binary vectors.
  • The address value assignment step may include calculating the number of address values necessary for encoding, encoding each of the calculated address values by applying a Reed-Solomon (RS) code, and assigning the encoded address values, in order, to the vertically aligned second binary vectors.
  • In calculating the number of address values, the number of cases in which the bits at specific positions map to different bases, out of the total number of cases given by the number of bits of the address value, may be taken as the number of address values.
  • The address value may include even parity.
  • In the replacement step, each 2 bits may be replaced with one of adenine (A), guanine (G), cytosine (C), and thymine (T) according to the 0s and 1s they contain.
  • In the puncturing step, puncturing may be performed on those of the second number of base sequences in which the ratio of guanine (G) and cytosine (C) differs from a reference ratio by more than a preset value, or in which the same base is repeated a preset number of times or more.
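  • The puncturing criterion can be sketched as a simple filter. This is an illustration under stated assumptions, not the claimed implementation: the function names, the reference ratio of 0.5, the tolerance of 0.1, and the run limit of 4 are all hypothetical values chosen for the example, and sequences are assumed to be non-empty.

```python
def gc_ratio(seq: str) -> float:
    """Fraction of bases in seq that are G or C (seq must be non-empty)."""
    return sum(base in "GC" for base in seq) / len(seq)

def longest_run(seq: str) -> int:
    """Length of the longest run of one repeated base (seq must be non-empty)."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def puncture(sequences, gc_ref=0.5, gc_tol=0.1, run_limit=4):
    """Split sequences into (kept, punctured).

    A sequence is punctured when its GC ratio differs from gc_ref by more
    than gc_tol, or when some base repeats run_limit times or more.
    All three thresholds are assumed values for illustration.
    """
    kept, punctured = [], []
    for seq in sequences:
        bad = abs(gc_ratio(seq) - gc_ref) > gc_tol or longest_run(seq) >= run_limit
        (punctured if bad else kept).append(seq)
    return kept, punctured

kept, punctured = puncture(["ACGTACGT", "GGGGCCCC", "AAAACGCG", "ATATATAT"])
print(kept, punctured)
```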
  • The method may further include extracting N base sequences (where N is a natural number) from among the encoded second number of base sequences and performing first LDPC decoding and second LDPC decoding, and, based on the decoded result, integrating the first number of second binary vectors into the first binary vector of the first length according to the order of the assigned address values. The N base sequences are all different from each other or at least two of them are identical; when the N base sequences are all different, N is less than or equal to the second number, and when at least two of the N base sequences are identical, N may be greater than the second number.
  • The first and second LDPC decoding steps may include: performing RS decoding and even parity decoding on the address value of each of the extracted N base sequences; grouping base sequences having the same address value according to the decoded result; removing the address values from the grouped base sequences and replacing the sequences with binary vectors; calculating a log-likelihood ratio for each of the replaced binary vectors according to whether a base sequence exists for the address value; performing first LDPC decoding; performing second LDPC decoding based on the first LDPC decoding result; and determining whether the final decoding succeeds based on the second LDPC decoding result.
  • The first LDPC decoding step may include calculating, for the base sequence corresponding to each address value, the number of bits changed by the first LDPC decoding, and resetting the log-likelihood ratio to 0 for any base sequence in which the number of changed bits is equal to or greater than a preset number. The second LDPC decoding is performed only on the horizontal LDPC codewords for which the first LDPC decoding failed. In the step of determining whether the final decoding succeeds, success can be determined through the parity check matrix for all base sequences that were first LDPC decoded and all base sequences that were both first and second LDPC decoded.
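  • The reset of unreliable base sequences between the two decoding passes can be sketched as follows. This is a hedged illustration: the dictionary layout, the function name `erase_unreliable`, and the threshold value used in the example are assumptions, and the LDPC decoder itself is not shown.

```python
def erase_unreliable(before, after, llrs, max_changed):
    """Zero the LLRs of every address whose bits changed too much.

    before, after: dict mapping address -> list of bits seen before and
                   after the first LDPC decoding pass.
    llrs:          dict mapping address -> list of LLR values (modified in place).
    max_changed:   assumed threshold; a sequence with this many changed
                   bits or more is treated as unknown (LLR 0).
    Returns the list of addresses that were reset.
    """
    erased = []
    for address, old_bits in before.items():
        changed = sum(a != b for a, b in zip(old_bits, after[address]))
        if changed >= max_changed:
            llrs[address] = [0.0] * len(llrs[address])
            erased.append(address)
    return erased

before = {0: [0, 1, 1, 0], 1: [1, 1, 1, 1]}
after = {0: [0, 1, 1, 1], 1: [0, 0, 0, 1]}
llrs = {0: [2.0, -2.0, -2.0, 2.0], 1: [-1.0, -1.0, -1.0, -1.0]}
print(erase_unreliable(before, after, llrs, max_changed=3), llrs)
```

  • An LLR of 0 tells the second decoding pass that nothing is known about those bits, so a sequence that was probably shifted by an insertion or deletion no longer misleads the decoder.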
  • When it is determined that the final decoding has failed, the process of additionally extracting N base sequences from among the encoded second number of base sequences and performing first LDPC decoding and second LDPC decoding may be repeated.
  • The log-likelihood ratio (LLR) can be calculated using the following equation.
  • K_0 is the number of 0s at the same position within the same cluster
  • K_1 is the number of 1s at the same position within the same cluster
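  • The two counts that enter the LLR can be gathered per cluster as sketched below. The sketch only computes the counts of 0s and 1s at each position; it deliberately does not reproduce the LLR equation, which is not shown in this text. The function names and the representation of reads as (address, bit string) pairs are assumptions.

```python
from collections import defaultdict

def cluster_by_address(reads):
    """Group payload bit strings by their decoded address value.

    reads: iterable of (address, payload_bits) pairs, where payload_bits
           is a string of '0' and '1' characters.
    """
    clusters = defaultdict(list)
    for address, bits in reads:
        clusters[address].append(bits)
    return clusters

def position_counts(cluster):
    """Return [(count of 0s, count of 1s), ...] for each bit position.

    Assumes all reads in the cluster have the same length.
    """
    return [(column.count("0"), column.count("1")) for column in zip(*cluster)]

reads = [(7, "0101"), (7, "0111"), (7, "0101"), (9, "1100")]
clusters = cluster_by_address(reads)
print(position_counts(clusters[7]))
```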
  • A DNA data encoding and decoding apparatus using a low density parity check code includes a communication unit, a memory storing at least one process for encoding and decoding DNA data using the low density parity check code, and a processor that operates according to the process. Based on the process, the processor converts specific data into a first binary vector having a preset first length; divides the first binary vector into a preset first number of second binary vectors each having a preset second length; aligns the first number of second binary vectors in the vertical direction; performs LDPC (low density parity check) encoding in the horizontal direction using parity of a preset third length; assigns an address value of a preset fourth length to each of a second number of vertically aligned second binary vectors, the second number corresponding to the sum of the first number and the third length; replaces each of the vertically aligned second binary vectors, each having a fifth length corresponding to the sum of the second length and the fourth length, with DNA bases at one base per 2 bits; and performs puncturing on the second number of base sequences obtained by the substitution.
  • The processor may randomize the first binary vector through an XOR operation with a random number before dividing the first binary vector into the first number of second binary vectors.
  • When assigning the address values, the processor may calculate the number of address values necessary for encoding, encode each of the calculated address values by applying a Reed-Solomon (RS) code, and assign the encoded address values, in order, to the vertically aligned second binary vectors.
  • When calculating the number of address values, the processor may take as the number of address values the number of cases in which the bits at specific positions map to different bases, out of the total number of cases given by the number of bits of the address value.
  • The address value may include even parity.
  • During the substitution, the processor may replace each 2 bits with one of adenine (A), guanine (G), cytosine (C), and thymine (T).
  • When performing the puncturing, the processor may puncture those of the second number of base sequences in which the ratio of guanine (G) and cytosine (C) differs from a reference ratio by more than a preset value, or in which the same base is repeated a preset number of times or more.
  • The processor extracts N base sequences (where N is a natural number) from among the encoded second number of base sequences, performs first LDPC decoding and second LDPC decoding, and, based on the decoded result, integrates the first number of second binary vectors into the first binary vector of the first length according to the order of the assigned address values. The N base sequences are all different from each other or at least two of them are identical; when the N base sequences are all different, N may be less than or equal to the second number, and when at least two of the N base sequences are identical, N may be greater than the second number.
  • When performing the first LDPC decoding and the second LDPC decoding, the processor performs RS decoding and even parity decoding on the address value of each of the extracted N base sequences, groups base sequences having the same address value according to the decoded result, removes the address values from the grouped base sequences and replaces the sequences with binary vectors, and calculates the log-likelihood ratio for each of the replaced binary vectors according to whether a base sequence exists for the address value.
  • When performing the first LDPC decoding, the processor calculates, for the base sequence corresponding to each address value, the number of bits changed by the first LDPC decoding, and resets the log-likelihood ratio to 0 for any base sequence in which the number of changed bits is greater than or equal to a preset number. The second LDPC decoding is performed only on the horizontal LDPC codewords for which the first LDPC decoding failed, and the success of the final decoding can be determined through the parity check matrix for all base sequences that were first LDPC decoded and all base sequences that were both first and second LDPC decoded.
  • When it is determined that the final decoding has failed, the processor may repeat the process of additionally extracting N base sequences from among the encoded second number of base sequences and performing first LDPC decoding and second LDPC decoding.
  • The log-likelihood ratio (LLR) can be calculated using the following equation.
  • K_0 is the number of 0s at the same position within the same cluster
  • K_1 is the number of 1s at the same position within the same cluster
  • According to the present invention, the ratio of G and C and the number of consecutive identical bases are checked for the finally generated base sequences, so that base sequences whose biochemical characteristics cause the three types of errors in the DNA storage device can be excluded.
  • That is, base sequences that do not satisfy the reference conditions can be excluded. Accordingly, it is possible to reduce the possibility of errors occurring during data decoding and to restore data with higher accuracy.
  • FIG. 1 is a schematic block diagram of a DNA data encoding and decoding apparatus using a low density parity check code according to the present invention.
  • FIG. 2 is a flowchart of a DNA data encoding method using a low density parity check code according to the present invention.
  • FIG. 3 is a diagram for explaining DNA data encoding according to the present invention.
  • FIG. 4 is a diagram for explaining an address value according to the present invention.
  • FIG. 5 is a diagram for explaining the substitution of 2 bits with a base according to the present invention.
  • FIG. 6 is a flowchart of a DNA data decoding method using a low density parity check code according to the present invention.
  • step S210 of FIG. 6 is a flowchart of a specific method of step S210 of FIG. 6 .
  • The spatially relative terms “below”, “beneath”, “lower”, “above”, “upper”, etc. can be used to easily describe the relationship of one component to other components. Spatially relative terms should be understood as including different orientations of elements in use or operation in addition to the orientations shown in the drawings. For example, if a component shown in a drawing is flipped over, a component described as “below” or “beneath” another component will be placed “above” the other component. Thus, the exemplary term “below” may include both the below and above directions. Components may also be oriented in other directions, and thus spatially relative terms may be interpreted according to orientation.
  • A 'device' includes any of various devices capable of providing results to users by performing computation.
  • The device may be in the form of a computer or a mobile terminal.
  • The computer may be in the form of a server that receives a request from a client and processes information.
  • A computer may also correspond to a sequencing device that performs sequencing.
  • The mobile terminal may include a mobile phone, a smartphone, a personal digital assistant (PDA), a portable multimedia player (PMP), a navigation device, a notebook PC, a slate PC, a tablet PC, an ultrabook, a watch type terminal (smartwatch), a glass type terminal (smart glasses), and a head mounted display (HMD).
  • An 'oligo' or 'oligo read' refers to a polymer synthesized from a plurality of nucleotide units each including a specific base (adenine, guanine, cytosine, or thymine).
  • A 'sequence' refers to the base sequence obtained by sequencing a specific oligo read.
  • 'Sequence' and 'oligo read' may be used interchangeably.
  • A 'stitched sequence' may specifically mean a 'sequence of stitched oligo reads'.
  • When a specific data file is input from an external device (not shown), the device 10 according to the present invention can express it as one long binary vector, encode it by replacing it with the four bases, and store it.
  • When restoration of specific encoded data is requested from an external device (not shown), the device 10 according to the present invention can decode the data stored in the base sequences into a binary vector, convert it back into the form of the input data, and provide it.
  • The device 10 may increase the accuracy of the restored data by using a low density parity check (LDPC) code.
  • The device 10 according to the present invention may be a DNA storage device that stores data by replacing it with DNA.
  • The device 10 may include any of various devices capable of providing results to users by performing computation.
  • The device 10 may be in the form of a computer. More specifically, the computer may include any of various devices capable of providing results to users by performing computation.
  • For example, the computer may be not only a desktop PC or a notebook but also a smartphone, a tablet PC, a cellular phone, a PCS (Personal Communication Service) phone, a synchronous/asynchronous IMT-2000 (International Mobile Telecommunication-2000) mobile terminal, a Palm Personal Computer (Palm PC), or a Personal Digital Assistant (PDA).
  • If a Head Mounted Display (HMD) device includes a computing function, the HMD device may also be a computer.
  • The computer may correspond to a server that receives a request from a client and performs information processing.
  • The device 10 may include a communication unit 12, a memory 14, and a processor 16.
  • The device 10 may include fewer or more components than those shown in FIG. 1.
  • The communication unit 12 may include one or more modules that enable wireless communication between the device 10 and an external device (not shown), between the device 10 and an external server (not shown), or between the device 10 and a communication network (not shown).
  • The communication network (not shown) may transmit and receive various information between the device 10, an external device (not shown), and an external server (not shown).
  • Various types of communication networks may be used as the communication network (not shown). For example, a wireless communication method such as WLAN (Wireless LAN), Wi-Fi, Wibro, Wimax, or HSDPA (High Speed Downlink Packet Access), or a wired communication method such as Ethernet, xDSL (ADSL, VDSL), HFC (Hybrid Fiber Coax), FTTC (Fiber To The Curb), or FTTH (Fiber To The Home), may be used.
  • The communication network (not shown) is not limited to the communication methods presented above, and may include any other communication method that is widely known or will be developed in the future.
  • The communication unit 12 may include one or more modules that connect the device 10 to one or more networks.
  • The memory 14 may store data supporting various functions of the device 10.
  • The memory 14 may store a plurality of application programs (applications) running on the device 10, as well as data and commands for the operation of the device 10. At least some of these applications may exist for basic functions of the device 10. An application program may be stored in the memory 14, installed on the device 10, and driven by the processor 16 to perform an operation (or function) of the device 10.
  • The processor 16 may control the general operation of the device 10 in addition to operations related to the application program.
  • The processor 16 may provide or process appropriate information or functions for a user by processing signals, data, and information input or output through the components described above, or by driving an application program stored in the memory 14.
  • The processor 16 may control at least some of the components discussed in conjunction with FIG. 1 in order to drive an application program stored in the memory 14. Furthermore, the processor 16 may operate at least two of the components included in the device 10 in combination to drive the application program.
  • The processor 16 may convert specific data into a first binary vector having a preset first length (S110).
  • The processor 16 may divide the first binary vector into a preset first number of second binary vectors each having a preset second length (S120).
  • The processor 16 may vertically align the first number of second binary vectors (S130).
  • The processor 16 may perform low density parity check (LDPC) encoding in the horizontal direction using parity of a preset third length (S140).
  • The processor 16 may assign an address value of a preset fourth length to each of a second number of vertically aligned second binary vectors, the second number corresponding to the sum of the first number and the third length (S150).
  • The processor 16 may replace each of the vertically aligned second binary vectors, each having a fifth length corresponding to the sum of the second length and the fourth length, with DNA bases at one base per 2 bits (S160).
  • The processor 16 may perform puncturing on the second number of base sequences obtained by the substitution, each of which has half the fifth length (S170).
  • In step S110, the specific data may be a data file requested to be encoded.
  • The processor 16 expresses the specific data file as one long binary vector of the first length, and then divides it into several short binary vectors in step S120.
  • Before the division, a random number may be generated and used to randomize the data through an XOR operation. More specifically, the present invention may further include randomizing the first binary vector through an XOR operation with a random number before dividing the first binary vector into the first number of second binary vectors.
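  • XOR randomization of this kind can be sketched as follows. This is an assumption-laden illustration: the text does not say how the random number is generated, so a seeded `random.Random` stream is used here purely as a stand-in, and the function name `randomize` is hypothetical.

```python
import random

def randomize(bits, seed):
    """XOR each data bit with one bit of a seeded pseudo-random stream.

    The seeded generator is an assumed stand-in for the random number
    mentioned in the text. Because XOR is its own inverse, calling this
    function again with the same seed restores the original bits.
    """
    rng = random.Random(seed)
    mask = [rng.getrandbits(1) for _ in bits]
    return [b ^ m for b, m in zip(bits, mask)]

data = [0, 0, 0, 0, 1, 1, 1, 1] * 4      # long runs, as raw files often have
scrambled = randomize(data, seed=42)
restored = randomize(scrambled, seed=42)
print(restored == data)
```

  • The decoder only needs the same random stream to undo the operation, which is why randomization can be applied before the vector is divided without losing information.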
  • In step S120, the processor 16 may divide the long binary vector of the first length into the first number (K_LDPC) of short binary vectors of the second length (n_payload). That is, the first length may correspond to the product of the second length (n_payload) and the first number (K_LDPC).
  • For example, the second length (n_payload) may be set to 272 bits, and the first number (K_LDPC) may be set to 16,572.
  • In step S130, the 16,572 binary vectors, each 272 bits long, may be vertically aligned like the lattice pattern block shown in FIG. 3.
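  • With the example values above, the first length works out to 272 × 16,572 = 4,507,584 bits. The division itself can be sketched as below; the function name `split_blocks` and the use of a `bytes` object with one byte per bit are illustrative assumptions.

```python
N_PAYLOAD = 272        # second length in bits (example value from the text)
K_LDPC = 16_572        # first number of vectors (example value from the text)
FIRST_LENGTH = N_PAYLOAD * K_LDPC   # 4,507,584 bits

def split_blocks(bits, block_len):
    """Cut a bit sequence into consecutive blocks of block_len bits each."""
    if len(bits) % block_len:
        raise ValueError("length must be a multiple of block_len")
    return [bits[i:i + block_len] for i in range(0, len(bits), block_len)]

# One byte per bit here, only to keep the sketch short.
blocks = split_blocks(bytes(FIRST_LENGTH), N_PAYLOAD)
print(FIRST_LENGTH, len(blocks), len(blocks[0]))
```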
  • Several binary vectors of the same length are vertically aligned because synthesizing DNA becomes very expensive if a strand is too long.
  • For the vertically aligned data, the processor 16 can protect against errors during encoding and decoding by using a low density parity check (LDPC) code, which is currently among the most widely used error correction codes and performs well.
  • Encoding with the LDPC code may be performed in the horizontal direction.
  • Parity of the third length (M_LDPC), for example 1,860 bits, is formed, and each codeword may be laid out in the horizontal direction.
  • As described above, insertion/deletion errors are more damaging than substitution errors. Even a single insertion or deletion can damage the data to be restored as a whole, because the positions of the data shift by one. In the present invention, the data arrangement direction and the encoding direction are therefore made perpendicular to each other. Even if an insertion/deletion error occurs in vertically aligned data, it is seen as a single bit error from the point of view of each horizontal codeword, so the encoding is very robust against insertion/deletion errors.
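  • The effect of making the two directions perpendicular can be checked on a toy layout. The sketch below is illustrative only: it uses small, hypothetical dimensions, random bits instead of real LDPC codewords, and a hypothetical function name. It shows only how a deletion in one vertical vector spreads across the horizontal codewords.

```python
import random

def errors_per_codeword(n_rows=8, n_oligos=12, bad_oligo=5, seed=0):
    """Count bit errors seen by each horizontal codeword after one deletion.

    oligos[j] is one vertical vector of n_rows bits; horizontal codeword i
    is made of bit i of every vertical vector. All sizes are toy values.
    """
    rng = random.Random(seed)
    oligos = [[rng.getrandbits(1) for _ in range(n_rows)] for _ in range(n_oligos)]
    received = [column[:] for column in oligos]
    # Deletion at position 2 of one vertical vector: the later bits shift
    # up by one, and the end is padded with 0 to keep the length.
    column = received[bad_oligo]
    received[bad_oligo] = column[:2] + column[3:] + [0]
    return [
        sum(oligos[j][i] != received[j][i] for j in range(n_oligos))
        for i in range(n_rows)
    ]

print(errors_per_codeword())
```

  • However badly the damaged vertical vector is shifted, each horizontal codeword touches it in only one position, so no codeword sees more than one error from that vector.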
  • Addresses are assigned to the data in order.
  • Each of the N_LDPC vertically aligned data vectors needs its own address, so the address value must be long enough to distinguish all of them. If the numbers from 0 to 18,432 are converted into binary numbers, they can be expressed as shown in FIG.
  • In step S150, the processor 16 calculates the number of address values needed for encoding and encodes each of the calculated address values with a Reed-Solomon (RS) code.
  • Each address value may be sequentially assigned to each of the second binary vectors aligned in the vertical direction.
  • Out of all the cases possible for the number of bits of the address value, the processor 16 may count the cases in which the bases at specific positions differ from each other, and use that count as the number of address values.
  • For example, N_LDPC is 18,432 and k_index is 15.
  • The 15 bits are expressed as 8 nt (x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8).
  • Here, x_3 and x_4 must be different bases, and x_6 and x_7 must be different bases.
  • These cases make up 9/16 of all cases. Converted into binary numbers, 2^15 × 9/16 = 18,432 address values can be obtained out of a total of 2^15 possible address value candidates.
  • The 18,432 address values calculated in this way contain runs of at most 3 identical consecutive bases. Limiting the number of consecutive bases in this way reduces the possibility of errors.
  • After Reed-Solomon (RS) encoding, the address value is 32 bits in total.
  • The address values created in this way are sequentially appended to the randomized data, generating the second number (N_LDPC) of vectors, each (n_index + n_payload) bits long, that is, the fourth length plus the second length.
  • N_LDPC nucleotide sequences are generated. For example, when n_index is 32 and n_payload is 272, each vector is 304 bits long, which at 2 bits per base corresponds to a nucleotide sequence of 152 nt.
  • Each 2-bit pair may be replaced with one of adenine (A), guanine (G), cytosine (C), and thymine (T) according to its combination of 0s and 1s.
  • In step S170, among the second number of base sequences, the processor 16 may perform puncturing on (that is, exclude) any base sequence whose ratio of guanine (G) and cytosine (C) differs from a reference ratio by more than a preset value, or in which the same base is repeated a predetermined number of times or more.
  • For example, if the GC ratio of one of the finally generated N_LDPC base sequences differs from the reference ratio by a preset value, for example 30 (that is, the GC ratio is 80% or more, or 20% or less), puncturing may be performed on that base sequence.
  • Likewise, if a specific base is repeated a predetermined number of times, for example 4 or more times, in one of the finally generated N_LDPC base sequences, puncturing may be performed on that base sequence.
  • the operation of the processor 16 may be performed by the device 10 .
  • The processor 16 may extract N base sequences from among the encoded second number of base sequences and perform first LDPC decoding and second LDPC decoding on them (S210).
  • N may be a natural number.
  • The N base sequences may all be different from each other, or at least two of them may be the same. For example, if 4 base sequences are extracted, all 4 may differ, as in AT, AC, TG, and CG, or two of the four may coincide, as in AT, AC, CG, and CG.
  • When the N base sequences are all different from each other, N may be less than or equal to the second number.
  • When at least two of the N base sequences are the same, N may be smaller than, larger than, or equal to the second number. That is, when N base sequences are extracted from among the second number of base sequences and a particular base sequence is extracted more than once, the number N of extracted base sequences may, depending on the number of duplicates, be smaller or larger than the second number, or the two may be equal.
  • In step S210, the processor 16 may perform RS decoding and even-parity decoding on the address value of each of the extracted N base sequences (S211).
  • The processor 16 may group base sequences having the same address value according to the decoding result (S212).
  • If a decoded address value is not one of the N_LDPC valid address values, the corresponding data is discarded because it is not encoded data. If it is one of the N_LDPC address values, even parity is then checked, and the data is discarded if the check is not satisfied. The remaining DNA sequences are then grouped by address value, and address values for which no sequence remains are left empty.
  • The processor 16 may remove the address value from the grouped base sequences and convert them into binary vectors (S213).
  • The processor 16 may calculate a log-likelihood ratio (LLR) for each of the converted binary vectors according to whether a base sequence exists for the address value (S214).
  • the processor 16 may vertically arrange the calculated log-likelihood ratios according to the order of the address values (S215).
  • The processor 16 may perform first LDPC decoding in the horizontal direction on the log-likelihood ratios aligned in the vertical direction (S216).
  • the processor 16 may perform second LDPC decoding based on the result of the first LDPC decoding (S217).
  • the processor 16 may determine whether the final decoding is successful based on the result of the second LDPC decoding (S218).
  • Both hard-decision decoding and soft-decision decoding of LDPC codes may be used.
  • For hard-decision decoding, base sequences having the same address value are collected using the decoded address values. Starting from the first address value, the address values are removed from the collected base sequences, which can then be converted into binary numbers using the table shown in FIG. 5. If no base sequence was collected for an address value, that address is left empty. If at least one base sequence exists for an address value, the more frequent value, 0 or 1, is chosen as the representative value at each position, so that a single sequence is selected for every address value.
  • a log-likelihood ratio (LLR) can be calculated for each bit of a base sequence for all address values.
  • step S214 the log-likelihood ratio (LLR) may be calculated using Equation 1 below.
  • k_0 is the number of 0s at the same position within the same cluster.
  • k_1 is the number of 1s at the same position within the same cluster.
  • Base sequences having the same address value are grouped using the decoded address values. Starting from the first address value, the address values are removed from the grouped base sequences, which can then be converted into binary numbers using the table shown in FIG. 5.
  • If no base sequence exists for an address value, all of its log-likelihood ratios may be set to 0.
  • If base sequences exist, the numbers of 0s and 1s at each position are counted across the converted binary vectors and denoted k_0 and k_1, respectively.
  • The lengths of the binary vectors are all the same.
  • Using Equation 1, the log-likelihood ratio can be obtained for each position of the base sequence for every address value.
  • The log-likelihood ratios are arranged in the vertical direction in the order of the address values, so that the horizontal direction corresponds to the LDPC code. Accordingly, n_payload LDPC codewords of length N_LDPC are finally obtained, and if all n_payload codewords are decoded without errors, the entire decoding can be said to have succeeded.
  • Soft-decision decoding may be performed on all n_payload codewords of length N_LDPC.
  • In step S216, the processor 16 may count, for the base sequence corresponding to each address value, the number of bits changed by the first LDPC decoding, and may reset the log-likelihood ratios to 0 for any base sequence in which the number of changed bits is equal to or greater than a preset number.
  • In step S217, the processor 16 may perform second LDPC decoding only on the horizontal LDPC codewords for which the first LDPC decoding failed.
  • In step S218, the processor 16 may determine whether final decoding is successful by applying the parity-check matrix to all base sequences that underwent the first LDPC decoding and to all base sequences that underwent both the first and second LDPC decoding.
  • The processor 16 may obtain the edit distance between the base sequences before and after decoding for the same address value in the first LDPC decoding.
  • The edit distance is a value indicating the similarity between two base sequences, and may indicate the number of substitution, insertion, and deletion errors. Therefore, if the Hamming distance is greater than the edit distance, it can be inferred that an insertion/deletion error existed in the original base sequence, and the insertion/deletion error can be identified and located using the edit distance.
  • Based on the decoded result, the processor 16 may combine the first number of second binary vectors into a first binary vector of the first length according to the order of the designated address values (S220).
  • If even one codeword fails the parity check after both the first LDPC decoding and the second LDPC decoding are completed, decoding may be determined to have failed.
  • The vertically aligned second binary vectors can be merged into one long vector (the first binary vector). Since this long vector is the original data XORed with a random number, the original data can be restored by XORing the decoding result with the same random number sequence again. If the restored data matches the original data, the data storage and restoration using the device 10 may be judged successful.
  • When final decoding is determined to have failed in step S218, the processor 16 may additionally extract N base sequences from among the encoded second number of base sequences and repeat the first LDPC decoding and the second LDPC decoding.
  • Although FIGS. 2, 6, and 7 describe steps S110 to S218 as being executed sequentially, this merely illustrates the technical idea of the present embodiment. A person of ordinary skill in the art to which this embodiment belongs could apply various modifications and variations, such as changing the order described in FIGS. 2, 6, and 7 or executing one or more of steps S110 to S218 in parallel, without departing from the essential characteristics of the present embodiment. FIGS. 2, 6, and 7 are therefore not limited to a time-series order.
  • the method according to an embodiment of the present invention described above may be implemented as a program (or application) to be executed in combination with a computer, which is hardware, and stored in a medium.
  • the computer may be the device 10 described above.
  • The aforementioned program may include code written in a computer language such as C, C++, JAVA, or machine language. This code may include functional code that defines the functions needed to execute the methods, and control code related to the execution procedure needed for the processor of the computer to execute those functions in a predetermined order. The code may further include memory-reference code indicating which location (address) of the computer's internal or external memory should be referenced for additional information or media that the processor needs to execute the functions. In addition, when the processor of the computer needs to communicate with a remote computer or server to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the computer's communication module and what kind of information or media to transmit or receive during communication.
  • Steps of a method or algorithm described in connection with an embodiment of the present invention may be implemented directly in hardware, implemented in a software module executed by hardware, or implemented by a combination thereof.
  • A software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art to which the present invention pertains.
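
The randomization and splitting steps described above (XOR with a random number, division into K_LDPC vectors of n_payload bits, and restoration by XORing again with the same random number) can be illustrated with a minimal Python sketch. The seeded pseudo-random generator, the function names, and the toy sizes are assumptions made for illustration; the description does not specify how the random number is produced.

```python
import random


def xor_with_stream(bits, seed):
    """XOR each bit with a pseudo-random bit from a generator seeded with `seed`.

    Assumption: a seeded PRNG stands in for the "random number" of the text.
    Applying the function twice with the same seed restores the input.
    """
    rng = random.Random(seed)
    return [b ^ rng.getrandbits(1) for b in bits]


def split_vector(bits, n_payload):
    """Cut the long vector into consecutive rows of n_payload bits each."""
    if len(bits) % n_payload != 0:
        raise ValueError("length must be a multiple of n_payload")
    return [bits[i:i + n_payload] for i in range(0, len(bits), n_payload)]


def merge_rows(rows):
    """Concatenate the rows back into one long vector."""
    return [b for row in rows for b in row]


# Toy sizes; the text uses n_payload = 272 and K_LDPC = 16,572.
N_PAYLOAD, K_LDPC, SEED = 8, 4, 2023
source = random.Random(1)
data = [source.getrandbits(1) for _ in range(N_PAYLOAD * K_LDPC)]

rows = split_vector(xor_with_stream(data, SEED), N_PAYLOAD)
restored = xor_with_stream(merge_rows(rows), SEED)
print(len(rows), restored == data)
```

Because XOR is its own inverse, the same function serves for both randomization and restoration, which mirrors the statement that the original data is recovered by XORing the decoding result with the same random number again.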
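
The screening rule of step S170 (exclude a base sequence when its GC ratio deviates too far from the reference ratio or when one base repeats too many times) can be sketched as follows. The 50% reference ratio is an assumption inferred from the example thresholds of 80% and 20% with a preset value of 30; the function names and the sample sequences are hypothetical.

```python
def gc_percent(seq):
    """Percentage of G and C bases in a non-empty sequence."""
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq):
    """Length of the longest run of one repeated base."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def should_puncture(seq, reference=50.0, max_deviation=30.0, run_limit=4):
    """Return True when the sequence fails either screening condition.

    Assumed defaults: reference GC ratio 50%, deviation 30 points
    (so GC >= 80% or <= 20% fails), and a run of 4 or more fails.
    """
    if not seq:
        raise ValueError("empty sequence")
    gc_out_of_range = abs(gc_percent(seq) - reference) >= max_deviation
    too_repetitive = longest_run(seq) >= run_limit
    return gc_out_of_range or too_repetitive


candidates = ["ACGTACGTAC", "GGCCGGCCGA", "ACGTTTTACG", "ACGTTTACGA"]
kept = [s for s in candidates if not should_puncture(s)]
print(kept)
```

In this sketch the second candidate is rejected for its GC content and the third for its run of four T bases; the other two pass both checks.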
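
The hard-decision step described above (count the 0s and 1s at each position within a cluster of reads sharing one address value, then keep the more frequent value as the representative) can be sketched as a per-position majority vote. Breaking ties in favor of 0 and returning `None` for an empty cluster are assumptions; the text only says that an address with no reads is left empty. The same counts k_0 and k_1 feed Equation 1 in the soft-decision path, which is not reproduced here.

```python
def representative(cluster):
    """Per-position majority vote over equal-length binary vectors.

    `cluster` is a list of bit lists that share one address value.
    Returns None for an empty cluster (address left empty).
    Assumption: ties are resolved as 0.
    """
    if not cluster:
        return None
    length = len(cluster[0])
    if any(len(vec) != length for vec in cluster):
        raise ValueError("all vectors in a cluster must have the same length")
    result = []
    for pos in range(length):
        k1 = sum(vec[pos] for vec in cluster)   # number of 1s at this position
        k0 = len(cluster) - k1                  # number of 0s at this position
        result.append(1 if k1 > k0 else 0)
    return result


reads = [[0, 1, 1, 0],
         [0, 1, 0, 0],
         [1, 1, 0, 0]]
print(representative(reads))
```

With three reads per address, a single erroneous read at any position is outvoted by the other two, which is the effect the representative-value selection relies on.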

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Algebra (AREA)
  • General Physics & Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Error Detection And Correction (AREA)

Abstract

The present invention relates to a method, program, and device for encoding and decoding DNA data using a low-density parity-check code. According to the present invention, base sequences having biochemical characteristics that induce the three types of errors occurring in a DNA storage device are excluded, and the ratio of G and C and the number of consecutive identical bases in the finally generated base sequences are checked, so that base sequences that do not satisfy a reference condition can be excluded. Consequently, the possibility of errors during data decoding is lowered, and data can be restored with greater accuracy.
PCT/KR2022/011804 2021-08-09 2022-08-08 Method, program, and device for encoding and decoding DNA data using a low-density parity-check code WO2023018157A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020210104372A KR102574250B1 (ko) 2021-08-09 2021-08-09 Method, program, and device for encoding and decoding DNA data using a low-density parity-check code
KR10-2021-0104372 2021-08-09

Publications (1)

Publication Number Publication Date
WO2023018157A1 (fr) 2023-02-16

Family

ID=85200058

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/011804 WO2023018157A1 (fr) 2021-08-09 2022-08-08 Method, program, and device for encoding and decoding DNA data using a low-density parity-check code

Country Status (2)

Country Link
KR (1) KR102574250B1 (fr)
WO (1) WO2023018157A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170187390A1 (en) * 2014-03-28 2017-06-29 Thomson Licensing Methods for storing and reading digital data on a set of dna strands
WO2018148260A1 (fr) * 2017-02-13 2018-08-16 Thomson Licensing Apparatus, method, and system for storing digital information in deoxyribonucleic acid (DNA)
KR20210023674A (ko) * 2019-08-21 2021-03-04 울산대학교 산학협력단 Method, program, and apparatus for soft-information-based decoding of a DNA storage device
JP2021071966A (ja) * 2019-10-31 2021-05-06 株式会社リコー Data storage method and data storage device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4563476B2 (ja) 2008-07-09 2010-10-13 パナソニック株式会社 Encoder, decoder, and encoding method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FEI PENG; WANG ZHIYING: "LDPC Codes for Portable DNA Storage", 2019 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY (ISIT), IEEE, 7 July 2019 (2019-07-07), pages 76 - 80, XP033620453, DOI: 10.1109/ISIT.2019.8849814 *

Also Published As

Publication number Publication date
KR20230022510A (ko) 2023-02-16
KR102574250B1 (ko) 2023-09-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22856154

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE