WO2024076044A1

WO2024076044A1 - DNA encoding and decoding method and device

Info

Publication number: WO2024076044A1
Application number: PCT/KR2023/014125
Authority: WO
Inventors: 이근우
Original assignee: 이근우
Priority date: 2022-10-06
Filing date: 2023-09-19
Publication date: 2024-04-11

Abstract

Disclosed herein are a method and device for DNA encoding and decoding, which is designed to reduce the likelihood of errors during DNA synthesis and restoration. A method for DNA encoding according to the specification includes converting binary data into a nucleotide sequence and then adding dummy nucleotides at points in the sequence where the same nucleotide repeats. Another method for DNA encoding according to the specification includes temporarily converting binary data into a nucleotide sequence and, if the same nucleotide repeats more than a predetermined number of times, reverting the temporarily converted nucleotide sequence back into binary data, shuffling the reverted binary data, and then converting same back into a nucleotide sequence. In this case, the nucleotide sequence information can be reverted back into binary data, and the reverted binary data can be restored back to the original binary data based on the number of times the data was shuffled. Another method for DNA encoding according to the specification includes converting binary data into nucleotide sequence information and adding dummy nucleotide information at predetermined intervals within the converted nucleotide sequence information.

Description

DNA encoding and decoding method and device

The present invention relates to information storage technology based on DNA molecules, and more specifically, to encoding technology for DNA synthesis and decoding technology for data recovery.

This application claims priority based on Korean Patent Application No. 10-2022-0127750 filed on October 6, 2022, Korean Patent Application No. 10-2022-0129385 filed on October 11, 2022, Korean Patent Application No. 10-2022-0137061 filed on October 24, 2022, and Korean Patent Application No. 10-2022-0138893 filed on October 26, 2022, and all contents disclosed in the specifications and drawings of those applications are incorporated into this application.

The content described in this section simply provides background information on the embodiments described in this specification and does not necessarily constitute prior art.

Magnetic tape, which is widely used as a long-term recording medium, has a data storage lifespan of about 10 years, so maintenance and management costs are incurred continuously. Among other storage devices, HDDs and SSDs are representative examples. The lifespan of an HDD is usually about 5 years, or about 10 years if the data is accessed less than once per quarter, but HDDs are very vulnerable to shock and have a limited maximum capacity. SSDs are resistant to shock, but their lifespan is shorter than that of HDDs. Recently, the explosive growth in the amount of data produced has outpaced the capacity of storage media, and existing information storage media have reached their data storage density limits, creating a need for new types of storage devices.

Among the efforts to develop new storage media, attempts are being made to use DNA as a storage medium. When DNA is used as a storage medium, it is possible to overcome the limited data storage density that is a disadvantage of existing storage media, and information can be stored stably for a long period of time even when subjected to physical shock.

As is well known, DNA is contained in cells, the smallest unit of living organisms, and contains all genetic information. All living things grow and move as if programmed, according to the information contained in DNA. In the case of humans, the DNA contained in a single cell consists of 3 billion base pairs, and the size of the decoded genetic information is approximately 1 TB. A single cell contains two strands of DNA that are 2 nm wide and 3 m long. Therefore, DNA, which can theoretically store more than an exabyte (10^18 bytes), is very suitable as a next-generation biomaterial for ultra-high-density information storage. In addition, its storage life is more than 1,000 years, and low-cost storage is expected to be possible.

Figure 1 is a conceptual diagram of information storage using DNA base sequences.

Referring to Figure 1, binary data to be stored is encoded with nucleotides A (adenine), T (thymine), G (guanine), and C (cytosine). DNA is synthesized according to the encoded base sequence, and the synthesized DNA molecule is stored. Afterwards, the stored DNA molecules are selected through retrieval, the nucleotide sequence of the selected DNA molecule is analyzed (Sequencing), and binary data is decoded according to the analyzed nucleotide sequence.

The purpose of this specification is to provide a DNA encoding method and device that reduces the possibility of errors when synthesizing and restoring DNA.

The purpose of this specification is to provide a method and device for shuffling or restoring binary data with a reduced possibility of error when synthesizing and restoring DNA.

The purpose of this specification is to provide a DNA encoding method and device that can improve the structural stability of synthesized DNA.

This specification is not limited to the above-mentioned tasks, and other tasks not mentioned will be clearly understood by those skilled in the art from the description below.

The DNA encoding method according to the present specification to solve the above-described problem includes the steps of: (a) primary conversion, by a processor, of binary data into a base sequence; (b) finding, by the processor, a point where the same base is repeated in the primary-converted base sequence; and (c) secondary conversion, by the processor, into a base sequence in which a dummy base is added at the point where the same base is repeated.

According to an embodiment of the present specification, step (b) may be a step in which the processor searches for a point where a predetermined number of identical bases are repeated.

According to one embodiment of the present specification, step (b) may be a step in which the processor finds a point where the same base is repeated according to the number set for each base.

According to one embodiment of the present specification, step (b) includes: (b-1) generating a frequency table for repeated bases in the primary converted base sequence by a processor; (b-2) a step of the processor determining the number of repetitions of the same base requiring addition of a dummy base using the generated frequency table; and (b-3) the processor searches for points where the same base is repeated in the primary converted base sequence according to the number determined in step (b-2).

According to an embodiment of the present specification, the step (b-2) may be a step in which the processor determines the average value of the frequency as the number of repetitions of the same base that requires the addition of a dummy base.

According to an embodiment of the present specification, step (b-2) may be a step in which the processor determines the average frequency of each base as the number of repetitions of the same base that requires the addition of a dummy base for each base.

According to an embodiment of the present specification, step (c) may be a step of secondary conversion by the processor adding a dummy base having at least one predetermined sequence.

According to one embodiment of the present specification, step (c) may be a step in which the processor performs secondary conversion by adding bases different from the adjacent bases on both sides of the point where the same base is repeated as dummy bases.

According to one embodiment of the present specification, step (c) may be a step in which the processor performs secondary conversion by adding dummy bases with two or more sequences.

The DNA encoding method according to the present specification may be implemented in the form of a computer program written to perform each step of the DNA encoding method on a computer and recorded on a computer-readable recording medium.

The DNA encoding device according to the present specification for solving the above-mentioned problems may include a processor that performs primary conversion of binary data into a base sequence, finds a point where the same base is repeated in the primary-converted base sequence, and performs secondary conversion into a base sequence in which dummy bases are added at the points where the same base is repeated.

According to an embodiment of the present specification, the processor can find a point where a predetermined number of identical bases are repeated.

According to an embodiment of the present specification, the processor can find a point where the same base is repeated according to the number set for each base.

According to one embodiment of the present specification, the processor may generate a frequency table for repeated bases in the primary-converted base sequence, use the generated frequency table to determine the number of repetitions of the same base that requires addition of a dummy base, and find the points where the same base is repeated in the primary-converted base sequence according to the determined number.

According to an embodiment of the present specification, the processor may determine the average value of the frequency as the number of repetitions of the same base that require the addition of a dummy base.

According to an embodiment of the present specification, the processor may determine the average frequency for each base as the number of repetitions of the same base that requires the addition of a dummy base for each base.

According to an embodiment of the present specification, the processor may perform secondary conversion by adding a dummy base having at least one predetermined sequence.

According to an embodiment of the present specification, the processor may perform secondary conversion by adding bases different from the adjacent bases on both sides of the point where the same base is repeated as dummy bases.

According to one embodiment of the present specification, the processor can perform secondary conversion by adding dummy bases with two or more sequences.

An apparatus according to the present specification includes: the DNA encoding device described above; and a DNA synthesis device that synthesizes DNA according to the base sequence output from the DNA encoding device.

The binary data shuffling method according to the present specification to solve the above-described problem may include the steps of: (a) temporarily converting, by a processor, binary data into a base sequence; (b) determining, by the processor, whether the temporarily converted base sequence contains a point where the same base is repeated more than a preset number of times (hereinafter referred to as an 'identical base repeat sequence'); (c) when an identical base repeat sequence exists in step (b), reverse-converting, by the processor, the temporarily converted base sequence back into binary data, shuffling the reverse-converted binary data, and then returning to step (a); and (d) when no identical base repeat sequence exists in step (b), reverse-converting, by the processor, the temporarily converted base sequence back into binary data and storing the reverse-converted binary data as the binary data to be converted into a base sequence.

According to one embodiment of the present specification, step (b) may be a step in which the processor searches for a point where the same base is repeated according to the number set for each base.

According to an embodiment of the present specification, step (c) may be a step in which the processor shuffles binary data using a linear feedback shift register (LFSR) method.

According to an embodiment of the present specification, step (c) includes: (c-1) storing the number of shuffles each time the processor shuffles binary data; and (c-2) when the processor reaches a preset maximum number of shuffles, moving to step (d).
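The loop formed by steps (a) to (d), together with the LFSR shuffling and the maximum shuffle count, can be sketched as follows. This is a minimal illustration and not the implementation of the specification: the 2-bit mapping follows the example given later in the description, while the 16-bit LFSR taps, the seed `0xACE1`, the use of a different seed per round, the XOR-based shuffle, and all function names are assumptions made for the sketch. The sketch keeps the binary data in memory instead of reverse-converting the base sequence on every pass, which is equivalent for a lossless mapping.

```python
# Mapping from the description's example: A=00, T=01, G=10, C=11.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}


def to_bases(bits):
    """Step (a): temporary conversion of an even-length bit string to bases."""
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def has_long_run(seq, limit):
    """Step (b): True if any base occurs `limit` or more times in a row."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run >= limit:
            return True
    return False


def lfsr_keystream(seed, nbits):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (hypothetical choice)."""
    state = seed & 0xFFFF
    out = []
    for _ in range(nbits):
        out.append(state & 1)
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
    return out


def shuffle_once(bits, round_index):
    """Step (c): XOR the data with an LFSR keystream seeded per round (assumed)."""
    stream = lfsr_keystream(0xACE1 + round_index, len(bits))
    return "".join(str(int(b) ^ k) for b, k in zip(bits, stream))


def shuffle_until_clean(bits, limit=4, max_shuffles=16):
    """Repeat (a)-(c) until no long run remains or the maximum count is hit."""
    count = 0
    while has_long_run(to_bases(bits), limit) and count < max_shuffles:
        bits = shuffle_once(bits, count)
        count += 1  # (c-1): keep the number of shuffles for later restoration
    return bits, count  # step (d): data to be converted, plus shuffle count
```

Because each round is an XOR with a reproducible keystream, applying `shuffle_once` again for the same round indices returns the original data, which is why storing the shuffle count is sufficient for restoration in this sketch.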

According to an embodiment of the present specification, step (a) may further include the processor storing the temporarily converted base sequence, and step (c) may further include the processor storing the number of identically repeated bases in the temporarily converted base sequence. In this case, step (c) may include: (c-1) when the reverse-converted binary data is not the same as the original binary data, shuffling the reverse-converted binary data and then returning to step (a); and (c-2) when the reverse-converted binary data is the same as the original binary data, passing to step (d) the base sequence with the smallest number of identically repeated bases among the temporarily converted base sequences.

The binary data shuffling method according to the present specification may be implemented in the form of a computer program written to perform each step of the binary data shuffling method on a computer and recorded on a computer-readable recording medium.

A binary data shuffling device according to the present specification for solving the above-mentioned problems includes: a base sequence conversion unit that converts binary data into a base sequence or reverse-converts a base sequence into binary data; a repetitive base analysis unit that determines whether the base sequence temporarily converted by the base sequence conversion unit contains a point where the same base is repeated more than a preset number of times (hereinafter referred to as an 'identical base repeat sequence'); a binary data scrambling unit that shuffles binary data and outputs it; and a control unit that controls the base sequence conversion unit, the repetitive base analysis unit, and the binary data scrambling unit. When the repetitive base analysis unit determines that an identical base repeat sequence exists, the control unit may perform control so that the temporarily converted base sequence is reverse-converted into binary data, the reverse-converted binary data is shuffled, and the shuffled binary data is converted into a base sequence again. When the repetitive base analysis unit determines that no identical base repeat sequence exists, the control unit may perform control so that the temporarily converted base sequence is reverse-converted into binary data and the reverse-converted binary data is stored as the binary data subject to base sequence conversion.

According to an embodiment of the present specification, the repetitive base analysis unit can find a point where a predetermined number of identical bases are repeated.

According to an embodiment of the present specification, the repetitive base analysis unit can find a point where the same base is repeated according to the number set for each base.

According to an embodiment of the present specification, the binary data scrambling unit can shuffle binary data using a linear feedback shift register (LFSR) method.

According to an embodiment of the present specification, the control unit stores the number of shuffles each time the binary data scrambling unit shuffles the binary data and, when the preset maximum number of shuffles is reached, may perform control so that the temporarily converted base sequence is reverse-converted into binary data and the reverse-converted binary data is stored as the binary data subject to base sequence conversion.

According to one embodiment of the present specification, the control unit stores the base sequence each time it is temporarily converted by the base sequence conversion unit, together with the number of identically repeated bases found in that sequence by the repetitive base analysis unit. When the reverse-converted binary data is not the same as the original binary data, the control unit may perform control so that the reverse-converted binary data is shuffled. When the reverse-converted binary data is the same as the original binary data, the control unit may perform control so that the temporarily converted base sequence with the smallest number of identically repeated bases is reverse-converted and the resulting binary data is stored as the binary data subject to base sequence conversion.

An apparatus according to the present specification includes: the binary data shuffling device described above; and a DNA synthesis device that converts the binary data output from the binary data shuffling device into a base sequence and then synthesizes DNA.

The binary data restoration method according to the present specification to solve the above-described problem includes the steps of: (a) receiving and storing, by a processor, analyzed DNA base sequence information; (b) reverse-converting, by the processor, the base sequence information into binary data; (c) separating, by the processor, the shuffle count information included in the reverse-converted binary data; and (d) restoring, by the processor, the binary data from which the shuffle count information has been separated, according to the shuffle count information.

Another binary data restoration method according to the present specification to solve the above-described problem includes the steps of: (a) receiving and storing, by a processor, analyzed DNA base sequence information; (b) reverse-converting, by the processor, the base sequence information into binary data; (c) reading, by the processor, shuffle count information from a storage unit; and (d) restoring, by the processor, the reverse-converted binary data according to the shuffle count information.

The binary data restoration method according to the present specification may further include removing an error correction code of the inversely converted binary data after step (b).

Step (d) of the binary data restoration method according to the present specification may be a step in which the processor restores binary data using a linear feedback shift register (LFSR) method.
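Restoration according to the shuffle count can be illustrated with the short sketch below. It is only an illustration under stated assumptions: the shuffle is taken to be an XOR with an LFSR keystream that uses a different seed in each round, the shuffle count is assumed to be carried in a fixed 8-bit header in front of the payload, and the names `COUNT_BITS`, `xor_round`, `attach_count`, and `restore` are hypothetical. The specification itself does not fix any of these details in this passage.

```python
COUNT_BITS = 8  # hypothetical width of the shuffle-count header


def lfsr_keystream(seed, nbits):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (hypothetical choice)."""
    state = seed & 0xFFFF
    out = []
    for _ in range(nbits):
        out.append(state & 1)
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
    return out


def xor_round(bits, round_index):
    """One shuffle round; applying the same round twice cancels it out."""
    stream = lfsr_keystream(0xACE1 + round_index, len(bits))
    return "".join(str(int(b) ^ k) for b, k in zip(bits, stream))


def attach_count(bits, count):
    """Encoder side: prepend the shuffle count as a fixed-width header."""
    return format(count, f"0{COUNT_BITS}b") + bits


def restore(received_bits):
    """Steps (c) and (d): separate the count, then undo each shuffle round."""
    count = int(received_bits[:COUNT_BITS], 2)
    payload = received_bits[COUNT_BITS:]
    for round_index in reversed(range(count)):
        payload = xor_round(payload, round_index)
    return payload
```

In the variant where the shuffle count is read from a storage unit instead of the data itself, only the first two lines of `restore` would change; the undo loop stays the same.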

The binary data restoration method according to the present specification may be implemented in the form of a computer program written to perform each step of the binary data restoration method on a computer and recorded on a computer-readable recording medium.

The binary data restoration device according to the present specification for solving the above-described problems includes: a base sequence conversion unit that reverse-converts the analyzed DNA base sequence information into binary data; a binary data processing unit that separates shuffle count information included in the reverse-converted binary data; and a binary data descrambling unit that restores the binary data from which the shuffle count information has been separated, according to the shuffle count information.

Another binary data restoration device according to the present specification for solving the above-described problems includes: a base sequence conversion unit that reverse-converts the analyzed DNA base sequence information into binary data; a storage unit that stores information on the number of times each DNA has been shuffled; and a binary data descrambling unit that restores the reverse-converted binary data according to the shuffle count information stored in the storage unit.

The base sequence conversion unit of the binary data restoration device according to the present specification can remove the error correction code of the inversely converted binary data.

The binary data descrambling unit of the binary data restoration device according to the present specification can restore binary data using the linear feedback shift register (LFSR) method.

An apparatus according to the present specification includes: the binary data restoration device described above; and a base sequence analysis unit that analyzes DNA and outputs base sequence information.

The DNA encoding method according to the present specification to solve the above-described problem includes (a) a processor first converting binary data into nucleotide sequence information; and (b) a step of secondary conversion, by the processor, into base sequence information to which dummy base information is added at every preset cycle in the primary converted base sequence information.

According to an embodiment of the present specification, the dummy base information added in step (b) may be base information that is different from adjacent base information in the first converted base sequence information.

According to an embodiment of the present specification, the dummy base information added in step (b) may be at least two or more base information.

According to an embodiment of the present specification, the dummy base information added in step (b) may be such that the base information located at both ends of the added dummy base information differs from the adjacent base information in the primary-converted base sequence information.

The DNA encoding method according to the present specification may further include the step of (c) the processor thirdly converting the secondarily converted base sequence information into base sequence information with protective dummy base information added to both ends.
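Adding dummy base information at preset intervals, with each dummy differing from its neighbours, can be sketched as follows. This is a simplified illustration: it inserts a single dummy base per interval, picks the first base in the fixed order `ATGC` that differs from both neighbours, and omits the protective dummy bases at both ends. The function names and the selection order are assumptions made for the sketch.

```python
BASES = "ATGC"


def pick_dummy(left, right):
    """Return a base that differs from both adjacent bases."""
    for base in BASES:
        if base != left and base != right:
            return base


def add_periodic_dummies(seq, period):
    """Insert one dummy base after every `period` data bases."""
    out = []
    for i, base in enumerate(seq):
        out.append(base)
        at_boundary = (i + 1) % period == 0 and i + 1 < len(seq)
        if at_boundary:
            out.append(pick_dummy(base, seq[i + 1]))
    return "".join(out)


def remove_periodic_dummies(encoded, period):
    """Decoding side: drop every (period + 1)-th base, which is a dummy."""
    return "".join(
        base for i, base in enumerate(encoded) if (i + 1) % (period + 1) != 0
    )


print(add_periodic_dummies("AAAAAAAA", 3))       # AAATAAATAA
print(remove_periodic_dummies("AAATAAATAA", 3))  # AAAAAAAA
```

Because the dummy positions depend only on the interval, the decoder in this sketch needs no side information other than `period`, and no run of identical bases in the encoded sequence can exceed `period`.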

The DNA encoding device according to the present specification for solving the above-mentioned problems may include a processor that performs primary conversion of binary data into base sequence information and performs secondary conversion by adding dummy base information to the primary-converted base sequence information at preset intervals.

According to an embodiment of the present specification, the processor may add dummy base information having base information different from adjacent base information in the primary converted base sequence information.

According to an embodiment of the present specification, the processor may add dummy base information consisting of at least two or more base information.

According to an embodiment of the present specification, the processor may add dummy base information in which base information located at both ends of the dummy base information and adjacent base information in the first converted base sequence information have different base information.

According to an embodiment of the present specification, the processor may further convert the secondary converted base sequence information a third time into base sequence information with protective dummy base information added to both ends.

An apparatus according to the present specification includes: the DNA encoding device described above; and a DNA synthesis device that synthesizes DNA according to the base sequence output from the DNA encoding device.

Other specific details of the invention are included in the detailed description and drawings.

According to the present specification, the possibility of errors when synthesizing and restoring DNA can be reduced. Additionally, the structural stability of synthesized DNA can be improved.

The effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

Figure 2 is a reference diagram for the DNA information storage system according to the present specification.

Figure 3 is a flow chart of the DNA encoding method according to the present specification.

Figure 4 is an exemplary diagram of a DNA encoding method according to the present specification.

Figure 5 is an example of a dummy base that can be added depending on the point where the same base is repeated and the next adjacent base.

Figure 6 is a schematic configuration diagram of a binary data shuffling device according to the present specification.

Figure 7 is a flowchart of a binary data shuffling method according to an embodiment of the present specification.

Figure 8 is a reference diagram for nucleotide sequence conversion of binary data.

Figure 9 is a reference diagram to help understand the linear feedback shift register.

Figure 10 is a flowchart of a binary data shuffling method according to another embodiment of the present specification.

Figure 11 is a flowchart of a binary data shuffling method according to another embodiment of the present specification.

Figure 12 is a block diagram schematically showing the configuration of a binary data recovery device according to an embodiment of the present specification.

Figure 13 is a block diagram schematically showing the configuration of a binary data recovery device according to another embodiment of the present specification.

Figure 14 is a flowchart of a binary data restoration method according to an embodiment of the present specification.

Figure 15 is a flowchart of a binary data restoration method according to another embodiment of the present specification.

Figure 16 is a flowchart of the DNA encoding method according to the present specification.

Figure 17 is an exemplary diagram of a DNA encoding method according to an embodiment of the present specification.

Figure 18 is a table of types of dummy bases that can be added according to adjacent bases.

Figure 19 is an exemplary diagram of a DNA encoding method according to another embodiment of the present specification.

Figure 20 is an exemplary diagram of a DNA encoding method according to another embodiment of the present specification.

The advantages and features of the invention disclosed in this specification, and methods for achieving them, will become clear by referring to the embodiments described in detail below along with the accompanying drawings. However, the present specification is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided merely to make the disclosure of the present specification complete and to fully inform those skilled in the art to which the present specification pertains of its scope, and the scope of rights of this specification is defined only by the claims.

The terms used in this specification are for describing embodiments and are not intended to limit the scope of this specification. As used herein, singular forms also include plural forms, unless specifically stated otherwise in the context. As used in the specification, “comprises” and/or “comprising” does not exclude the presence or addition of one or more other elements in addition to the mentioned elements.

Like reference numerals refer to like elements throughout the specification, and “and/or” includes each and every combination of one or more of the referenced elements. Although “first”, “second”, etc. are used to describe various components, these components are of course not limited by these terms. These terms are merely used to distinguish one component from another. Therefore, it goes without saying that the first component mentioned below may also be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification may be used with meanings commonly understood by those skilled in the art to which this specification pertains. Additionally, terms defined in commonly used dictionaries are not interpreted ideally or excessively unless clearly specifically defined.

Referring to FIG. 2, the DNA information storage system is largely divided into a controller and a DNA molecular unit (DNA Molecular). When the controller receives a request to store information (Write) from the host, it can compress the binary data and shuffle the compressed data (Scrambler). The reason for shuffling the data is to prevent the same base from being repeated when the binary data is later converted into a base sequence. An error correction code (ECC) may be added to the data that has completed the shuffling process. Next, the binary data can be converted into the corresponding base sequence information (DNA Library). The DNA molecular unit can then synthesize actual DNA molecules according to the data converted into a base sequence.

Afterwards, when a request to read information stored in a DNA molecule occurs, the DNA molecular unit analyzes the base sequence of the DNA molecule. The controller converts the analyzed base sequence back into binary data and corrects data errors using the error correction code (ECC) (Encoder). The error-corrected binary data can then be descrambled and decompressed to provide the original binary data (information).

In the technical field to which this specification pertains, it is known that defects may occur if the same base is synthesized repeatedly during the process of synthesizing DNA. According to Poon and MacGregor (198) Biopolymers 45:427-434, when G (guanine) is synthesized repeatedly (consecutively) more than 4 times, there is a problem of aggregation in the form of a guanine tetraplex. Although the above reference mentions the problem only for G (guanine), it cannot be ruled out that the same or similar problems may occur with A (adenine), T (thymine), and C (cytosine). Accordingly, the present applicant has recognized the need for a method to reduce the possibility of defects occurring during the DNA synthesis process when storing information using DNA base sequences.

Hereinafter, the DNA encoding method of the invention according to the present specification will be described with reference to the attached drawings. The DNA encoding method according to the present specification refers to the process of converting binary data into a base sequence within the procedure shown in FIG. 1. An actual DNA molecule can later be synthesized according to the encoded base sequence. DNA encoding is therefore the step that determines the order in which the bases will be arranged. In this specification, the base order is defined in the forward reading direction, from 5' to 3'. Additionally, a repeated base sequence means that the same base is arranged repeatedly (consecutively). Therefore, a point where the same base is repeated means a point where the same base is arranged consecutively a predetermined number of times in the forward direction.

Meanwhile, the DNA encoding method according to the present specification may be implemented in the form of a computer program written to perform each step described below on a computer and recorded on a computer-readable recording medium. When the DNA encoding method according to the present specification is implemented in the form of a computer program, each step can be executed by a processor.

Referring to FIG. 3, in step S100, the processor may perform primary conversion of binary data into a base sequence. In the next step S200, the processor can find a point where the same base is repeated in the primary-converted base sequence. In the next step S300, the processor can perform secondary conversion into a base sequence in which a dummy base is added at the point where the same base is repeated.

First, we will look in more detail at the primary conversion of binary data into a base sequence in step S100. According to one example, the processor can convert each 2 bits of data into one base by matching A (adenine) = 00, T (thymine) = 01, G (guanine) = 10, and C (cytosine) = 11. In the example shown in FIG. 4, 2 bits of data are matched 1:1 with bases, but the DNA encoding method according to the present specification is not limited to the example shown. One base can correspond to 1 bit, 2 bits, 3 bits, or 4 bits, and the method for matching binary data to a base sequence can vary.
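The 2-bit matching described above can be written as a small lookup, shown here as a minimal sketch. The mapping itself (A=00, T=01, G=10, C=11) is taken from the text; the function names and the choice to represent binary data as a string of '0' and '1' characters are assumptions made for readability.

```python
# Mapping from the text: A=00, T=01, G=10, C=11.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "G", "11": "C"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_bases(bits):
    """Primary conversion (S100): map each 2-bit pair to one base."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bits(seq):
    """Reverse conversion: map each base back to its 2-bit pair."""
    return "".join(BASE_TO_BITS[base] for base in seq)


print(bits_to_bases("0001101110101010"))  # ATGCGGGG
print(bases_to_bits("ATGCGGGG"))          # 0001101110101010
```

The example input ends in four consecutive `10` pairs, which produces the run `GGGG`, the kind of repeat that the following steps are meant to detect.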

Let us take a closer look at the process of finding a point where the same base is repeated in the first converted base sequence in step S200. As explained earlier, if a specific base is placed repeatedly (consecutively), there is a high possibility that defects will occur during the synthesis process. In other words, step S200 is a process of finding a point where defects are likely to occur during the synthesis process.

According to one embodiment of the present specification, the processor can find a point where a predetermined number of identical bases are repeated. The predetermined number can be set in various ways, for example, to a value from 4 to 70. The same standard may be applied to all four bases, or a different standard may be applied to each base. In the latter case, the processor may find points where the same base is repeated according to the number set for each base. For example, the number may be set to 5 for A, 6 for T, 7 for G, and 5 for C. Some bases may share the same number.
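Searching with a separate number per base can be sketched as below. The limits are the example values from the text (A=5, T=6, G=7, C=5); the function name, the returned tuple layout, and the decision to flag a run once its length reaches the limit are assumptions of this sketch.

```python
from itertools import groupby

# Example per-base limits from the text: A=5, T=6, G=7, C=5.
RUN_LIMITS = {"A": 5, "T": 6, "G": 7, "C": 5}


def find_repeat_points(seq, limits=RUN_LIMITS):
    """Return (start_index, base, run_length) for each run reaching its limit."""
    points = []
    index = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= limits[base]:
            points.append((index, base, length))
        index += length
    return points


print(find_repeat_points("AAAAATTGGGGGGGC"))
# [(0, 'A', 5), (7, 'G', 7)]
```

Passing a dictionary with the same value for all four bases reproduces the single-standard case.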

According to another embodiment of the present specification, the standard can be determined based on the number of repetitions in the primary converted base sequence. To this end, the processor may generate a frequency table (see example frequency table in FIG. 4) for repeated bases in the primary converted base sequence. Additionally, the processor can use the generated frequency table to determine the number of repetitions of the same base that require addition of a dummy base. And the processor can find the point where the same base is repeated in the primary converted base sequence according to the determined number of repetitions of the same base. For reference, the Frequency Count Table can be added to the device (DNA Library) that converts binary information into base sequence information in the controller in the DNA information storage system according to the present specification.

There can be various ways to determine, from the frequency table, the number of repetitions of the same base that requires the addition of a dummy base. As an example, the processor may use the average value of the frequency as that number. In the example shown in FIG. 4, the average value of the frequency is 1.125. Rounding this value up gives 2, so a dummy base is added whenever the same base is repeated twice. Alternatively, the processor may determine the average value of the frequency for each base and use it as the number of repetitions that requires the addition of a dummy base for that base. In the example shown in FIG. 4, the average is 0.5 for A, 2 for G, 1.25 for C, and 0.75 for T. In this case, averages below 2, the smallest value that represents an actual repetition, can be excluded, so that no dummy base is added when A, C, or T is repeated and a dummy base is added only when G is repeated two or more times.

In step S210, the processor may determine whether the number of repetitions of the same base exceeds the standard set according to the various embodiments described above. If the number of repeats of the same base does not exceed the standard ('NO' in step S210), there is no need to add a dummy base. On the other hand, if the number of repeats of the same base exceeds the standard ('YES' in step S210), the process can proceed to step S300 because the point requires the addition of a dummy base.

The process of secondary conversion in step S300 into a base sequence in which a dummy base is added at the point where the same base is repeated will now be described in more detail. According to an embodiment of the present specification, the processor may perform secondary conversion by adding a dummy base having at least one predetermined sequence. FIG. 4 shows an example in which A is added as the dummy base when four Gs are repeated. However, the DNA encoding method according to the present specification is not limited to the example shown in FIG. 4. Although FIG. 4 shows a single dummy base 'A', the dummy base may instead be 'G', 'T', or 'C'. The dummy may also consist of two or more bases, such as 'AA', 'GG', 'TT', 'CC', 'AG', 'GT', 'TC', or 'CA', and the combinations may vary.

Meanwhile, the reason a dummy base is added is to prevent the same base from being placed repeatedly, as explained earlier. Preferably, the processor can perform secondary conversion by adding bases different from the adjacent bases on both sides of the point where the same base is repeated as dummy bases. Figure 5 is an example of a dummy base that can be added depending on the point where the same base is repeated and the next adjacent base. In the example shown in FIG. 5, an example with one dummy base is shown for simplicity of drawing and convenience of explanation. However, as explained previously, the processor can perform secondary conversion by adding dummy bases with two or more sequences. In this case, it is also possible to add a dummy base with two or more base sequences combined with different bases from the adjacent bases on both sides of the point where the same base is repeated.
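The secondary conversion with a neighbor-aware dummy base can be sketched as below. This is only one possible reading: it assumes a single dummy base is inserted after every `max_run` consecutive identical bases, and that the dummy is the first base in the order A, T, G, C that differs from both the repeated base and the next base. The function name `add_dummy_bases` and the selection order are assumptions, not the table of FIG. 5. How a decoder locates and removes the dummy bases is not covered here.

```python
BASES = "ATGC"


def add_dummy_bases(bases: str, max_run: int = 4) -> str:
    """Secondary conversion (step S300): break up runs of identical bases."""
    out = []
    run = 0
    for i, base in enumerate(bases):
        # Count the current run in the output, so a dummy base restarts the count.
        run = run + 1 if out and out[-1] == base else 1
        out.append(base)
        if run == max_run and i + 1 < len(bases):
            next_base = bases[i + 1]
            # Pick a dummy that differs from the bases on both sides.
            dummy = next(b for b in BASES if b != base and b != next_base)
            out.append(dummy)
    return "".join(out)
```

With `max_run=4`, the input `GGGGT` becomes `GGGGAT`, matching the example in which A is added after four Gs.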

In this specification, “binary data” means data consisting of 1 and 0. In this specification, “base sequence” refers to information consisting of A (adenine), T (thymine), G (guanine), and C (cytosine). In this specification, the base order determines the forward reading direction from 5' to 3'. Additionally, a repeated base sequence means that the same bases are arranged repeatedly (consecutively). Therefore, a point where the same base sequence is repeated means a point where the same base sequence is repeatedly arranged a predetermined number of times in the forward direction.

Referring to FIG. 6, the binary data mixing device 100 according to the present specification may include a base sequence conversion unit 110, a repetitive base analysis unit 120, a binary data scram unit 130, and a control unit 140. The base sequence conversion unit 110 can convert binary data into a base sequence or inversely convert a base sequence into binary data. The repetitive base analysis unit 120 can determine whether the base sequence temporarily converted by the base sequence conversion unit 110 contains a point where the same base is repeated more than a preset number of times (hereinafter referred to as 'same base repeat sequence'). The binary data scram unit 130 can mix binary data and then output it. The control unit 140 can control the base sequence conversion unit 110, the repetitive base analysis unit 120, and the binary data scram unit 130. When the repetitive base analysis unit 120 determines that a same base repeat sequence exists, the control unit 140 can perform control so that the temporarily converted base sequence is inversely converted back into binary data, the inversely converted binary data is mixed, and the mixed binary data is converted into a base sequence again. In addition, when the repetitive base analysis unit 120 determines that no same base repeat sequence exists, the control unit 140 can perform control so that the temporarily converted base sequence is inversely converted back into binary data and the inversely converted binary data is stored as the binary data subject to base sequence conversion. The operation of the control unit 140 will be explained through the binary data shuffling method according to the present specification.

Meanwhile, to execute the binary data shuffling method described below, the base sequence conversion unit 110, the repetitive base analysis unit 120, the binary data scram unit 130, and the control unit 140 may include processors, application-specific integrated circuits (ASICs), other chipsets, logic circuits, registers, communication modems, data processing devices, and the like known in the technical field to which the present invention belongs. In addition, when the control logic described below is implemented in software, the base sequence conversion unit 110, the repetitive base analysis unit 120, the binary data scram unit 130, and the control unit 140 may be implemented as a set of program modules. At this time, the program module may be stored in the memory device and executed by the processor. Therefore, the binary data shuffling method according to the present specification can be implemented in the form of a computer program written to perform each step described below on a computer and recorded on a computer-readable recording medium. Hereinafter, the binary data shuffling method according to the present specification will be described on the assumption that it is executed by a processor.

Referring to FIG. 7, first, in step S100, the processor may receive and store initial data. In this specification, 'initial data' refers to the original binary data that has not been mixed. In the next step S110, the processor may temporarily convert the initial data (binary data) into a base sequence. The base sequence converted in step S110 is called a 'temporary base sequence' to distinguish it from the 'binary data subject to base sequence conversion', which will be explained later. The 'temporary base sequence' may not be the base sequence that will later be synthesized into actual DNA. On the other hand, the 'binary data subject to base sequence conversion' is the data corresponding to the base sequence to be synthesized into actual DNA.

Referring to FIG. 8, the processor may convert each 2 bits of data into a base by matching A (adenine) = 00, T (thymine) = 01, G (guanine) = 10, and C (cytosine) = 11. In the example shown in FIG. 8, 2 bits of data are matched 1:1 with bases, but the method of mixing binary data according to the present specification is not limited to the example shown. One base can correspond to 1 bit, 2 bits, 3 bits, or 4 bits, and methods for matching binary data and base sequence can vary.

Referring again to FIG. 7, in step S120, the processor may determine whether there is a point in the temporarily converted base sequence where the same base is repeated more than a preset number (hereinafter referred to as 'same base repeat sequence'). As explained earlier, if a specific base is placed repeatedly (consecutively), there is a high possibility that defects will occur during the synthesis process. In other words, step S120 is a process of finding a point where defects are likely to occur during the DNA synthesis process.

If an identical base repeat sequence exists, that is, if the number of repeats of the same base is more than a preset number ('YES' in step S120), the process moves to step S130. In step S130, the processor may reversely convert the temporarily converted base sequence back into binary data. And in step S140, the processor may mix the inversely converted binary data and then transfer the process to step S110. That is, if the number of repeats of the same base in the converted base sequence is more than a preset number, steps S110 to S140 may be repeatedly performed.

On the other hand, when the same base repeat sequence does not exist, that is, if the number of repeats of the same base is less than the preset number ('NO' in step S120), the process moves to step S150. In step S150, the processor may inversely convert the temporarily converted nucleotide sequence back into binary data and store the inversely converted binary data as binary data to which the nucleotide sequence is to be converted. This is because converting the base sequence back into binary data requires additional processing of the binary data in the encoder shown in FIG. 2.

There may be various methods for mixing binary data in step S140. According to an embodiment of the present specification, the processor may shuffle binary data using a linear feedback shift register (LFSR) method. A linear feedback shift register (LFSR) is a type of shift register in which the value entered into the register is calculated as a linear function of the previous state values. The linear function used is mainly exclusive OR (XOR). The initial bit value of an LFSR is called the seed. LFSRs are used in fields such as pseudorandom number generation, pseudorandom noise (PRN), fast digital counters, and whitening sequences. In this specification, the LFSR, which is conventionally used for pseudorandom numbers, is used as an element to solve the problem of repeated identical bases.

The tap sequence of an LFSR can be expressed as a polynomial modulo 2. This means that the coefficients of the polynomial must be 1 or 0. This polynomial is called the feedback polynomial or characteristic polynomial. For example, if the taps are the 16th, 14th, 13th, and 11th bits, the LFSR polynomial is:

x¹⁶ + x¹⁴ + x¹³ + x¹¹ + 1

In the polynomial, the term '1' does not correspond to a tap. The length of the LFSR can be designed according to the length of the DNA to be synthesized. For example, an LFSR of degree n has a maximum sequence length of 2ⁿ - 1, so if the DNA strand to be synthesized has a length of 150, n = 8 gives a length of up to 255, and a polynomial of degree 8 is enough to process a DNA strand of length 150. If the DNA strand to be synthesized is longer than 150, the degree of the LFSR polynomial can be increased in proportion to the length of the DNA strand, for example n = 10 for lengths up to 1023. Depending on the location of the XOR gates, an LFSR may be an 'external LFSR' or an 'internal LFSR', and various methods known to those skilled in the art, such as the 'Galois LFSR', can be applied. As the LFSR is known to those skilled in the art, further detailed description is omitted.
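For illustration, a 16-bit LFSR with taps at the 16th, 14th, 13th, and 11th bits can be sketched as follows. The specification does not say how the LFSR output is combined with the data, so this sketch assumes the data is XORed with the LFSR output bit by bit. The function names `lfsr16` and `mix`, the external (Fibonacci) layout, and the seed value 0xACE1 are all assumptions chosen for the example.

```python
from itertools import islice


def lfsr16(seed: int = 0xACE1):
    """Yield the output bits of a 16-bit external (Fibonacci) LFSR.

    Taps at bits 16, 14, 13 and 11 correspond to x^16 + x^14 + x^13 + x^11 + 1.
    The seed 0xACE1 is an arbitrary non-zero value used only for illustration.
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero, otherwise the register stays at 0")
    while True:
        yield state & 1  # output the lowest bit of the register
        # XOR of the tapped bits (bit 16 is the lowest bit in this layout).
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (feedback << 15)


def mix(bits: list[int], seed: int = 0xACE1) -> list[int]:
    """XOR the data with the LFSR output; applying mix twice restores the data."""
    return [b ^ k for b, k in zip(bits, lfsr16(seed))]
```

Under this XOR assumption, mixing is its own inverse for a given seed, so a restoration step only needs the same seed and the same number of output bits.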

Meanwhile, the operation of an LFSR is deterministic. Therefore, the sequence of values generated by an LFSR is determined by the previous value. Additionally, because the number of values a register can have is finite, this sequence repeats with a specific period. Of course, if a good linear function is chosen, a long-period, seemingly random sequence can be created. However, if the values output from the LFSR are continuously input back into the LFSR, there is a possibility that the sequence may repeat. That is, in some cases, when steps S110 to S140 are repeatedly executed, the initial data may be output again. Therefore, it is necessary to prevent infinite repetition of steps S110 to S140.

One way to prevent repeated execution of the LFSR is to set the number of executions in advance.

Referring to FIG. 10, steps S100 to S150 are the same as described above, and steps S141 and S142 have been added. According to another embodiment of the present specification, in step S140, the processor may shuffle the binary data and then proceed to step S141. In step S141, the processor may store the number of shuffles each time the binary data is shuffled. In step S142, the processor may determine whether the number of shuffles has reached the preset maximum number of shuffles (K). If the number of shuffles is less than K ("NO" in step S142), the process can proceed to step S110. That is, steps S110 to S142 can be repeatedly executed until the number of shuffles reaches K. On the other hand, if the number of shuffles has reached K ("YES" in step S142), the process can proceed to step S150. In this case, no additional shuffling is performed, and the most recently shuffled binary data is stored as the binary data subject to base sequence conversion.
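The bounded loop of FIG. 10 can be sketched as follows. This is an illustrative reading rather than the specified implementation: it assumes the 2-bit mapping A=00, T=01, G=10, C=11, assumes that each shuffle XORs the data with the next stretch of output from a 16-bit LFSR, and uses hypothetical names (`to_bases`, `longest_run`, `keystream`, `shuffle_until_ok`) and hypothetical default values for the run limit and for K.

```python
from itertools import groupby, islice

BASES = "ATGC"  # index equals the 2-bit value: A=00, T=01, G=10, C=11


def to_bases(bits: list[int]) -> str:
    """Temporary conversion (step S110) of an even-length bit list."""
    return "".join(BASES[2 * bits[i] + bits[i + 1]] for i in range(0, len(bits), 2))


def longest_run(bases: str) -> int:
    """Length of the longest run of identical bases (used in step S120)."""
    return max((len(list(g)) for _, g in groupby(bases)), default=0)


def keystream(seed: int = 0xACE1):
    """Assumed 16-bit LFSR output (taps 16, 14, 13, 11); seed is arbitrary."""
    state = seed
    while True:
        yield state & 1
        fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (fb << 15)


def shuffle_until_ok(bits: list[int], max_run: int = 3, max_shuffles: int = 8):
    """Return (binary data subject to conversion, number of shuffles)."""
    stream = keystream()
    count = 0
    # S120: is there a run longer than max_run?  S142: has K been reached?
    while longest_run(to_bases(bits)) > max_run and count < max_shuffles:
        chunk = list(islice(stream, len(bits)))
        bits = [b ^ k for b, k in zip(bits, chunk)]  # S140: shuffle
        count += 1                                   # S141: store the count
    return bits, count
```

The returned count is the value that the restoration side would need in order to undo the same number of shuffles.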

Another way to prevent repeated execution of the LFSR is to find the binary data with the lowest number of base sequence repetitions among the converted binary data.

Referring to FIG. 11, it can be seen that steps S143 and S144 have been added. First, in step S110, the processor may store the temporarily converted base sequence. That is, each time step S110 is executed after binary data shuffling, the converted temporary data can be stored. And in step S120, the processor may further store the number of identically repeated bases in the temporarily converted base sequence. In other words, it is possible to store more information about the actual number of identically repeated bases in the temporarily converted base sequence. The next steps S130 and S140 are the same as previously described. In step S143 following step S140, the processor may determine whether the inversely converted binary data is the same as the original binary data, that is, the original data.

If the inversely converted binary data is not identical to the original binary data (“NO” in step S143), the process may proceed to step S110. Thereafter, the processor may repeatedly execute steps S110 to S143. The repeated execution of steps S110 to S143 may be performed until the number of repeated sequences of the same base in step S120 is less than or equal to the standard number or until the shuffled binary data is identical to the original binary data.

On the other hand, if the inversely converted binary data is the same as the original binary data ("YES" in step S143), the process can proceed to step S144. In step S144, the processor may select the base sequence with the smallest number of identically repeated bases among the stored temporarily converted base sequences. Then, in step S150, the processor may inversely convert the selected base sequence back into binary data and store the inversely converted binary data as the binary data subject to base sequence conversion.
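The selection logic of FIG. 11 can be sketched as follows. To keep the example short, the shuffle is passed in as a function, and a one-bit rotation is used as a stand-in shuffle because it visibly cycles back to the original data; the specification itself describes an LFSR-based shuffle. The names `best_candidate` and `rotate`, and the choice to store bit lists rather than base strings, are assumptions.

```python
from itertools import groupby

BASES = "ATGC"  # A=00, T=01, G=10, C=11


def to_bases(bits: list[int]) -> str:
    return "".join(BASES[2 * bits[i] + bits[i + 1]] for i in range(0, len(bits), 2))


def longest_run(bases: str) -> int:
    return max((len(list(g)) for _, g in groupby(bases)), default=0)


def rotate(bits: list[int]) -> list[int]:
    """Stand-in shuffle for illustration only: rotate left by one bit."""
    return bits[1:] + bits[:1]


def best_candidate(original: list[int], shuffle=rotate, max_run: int = 3):
    """Return (binary data subject to conversion, number of shuffles)."""
    candidates = []
    bits, count = original, 0
    while True:
        run = longest_run(to_bases(bits))       # S110 and S120
        if run <= max_run:
            return bits, count                  # no repeat sequence: go to S150
        candidates.append((run, count, bits))   # remember this attempt
        bits, count = shuffle(bits), count + 1  # S130 and S140
        if bits == original:                    # S143: the data has cycled back
            run, count, bits = min(candidates)  # S144: fewest repeated bases
            return bits, count
```

Because every stored tuple has a distinct count, `min` orders candidates by run length first and by shuffle count second.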

Meanwhile, according to one embodiment, the processor may further add information on the number of shuffles to the binary data subject to base sequence conversion. For example, if there is no same base repeat sequence in the initially temporarily converted base sequence, the shuffle count information may be '0'. In addition, if steps S120 to S140 are repeated twice before no same base repeat sequence exists, the shuffle count information may be '2'. In this way, information on how many times the data has been mixed can be added to the binary data subject to base sequence conversion.

According to another embodiment, the processor may store information on the number of shuffles of the binary data subject to nucleotide sequence conversion in a separate storage device. The previous example is an embodiment in which the information on the number of times of mixing is converted into a base sequence and recorded in the DNA itself, and this other embodiment is an embodiment in which the information on the number of times of mixing is stored in a separate storage device.

Below, a method of reading information stored in the DNA storage system will be described. In the DNA storage system according to the present specification, the binary data to be converted to base sequence, finally determined according to the binary data mixing method and device described above, is synthesized into an actual DNA molecule according to the base sequence. Therefore, when a read request occurs for information stored as a DNA molecule, a process is required to unmix it and restore it to the original binary data. The role of the binary data restoration method and device according to the present specification is to unravel this mixing and restore the original binary data.

Meanwhile, the binary data recovery device according to the present specification may be a component of a DNA storage system. The DNA storage system may include a base sequence analysis unit that analyzes (sequences) DNA and outputs base sequence information. Therefore, the binary data restoration device and method according to the present specification assume that the base sequence of the actual DNA molecule has already been analyzed. What matters in the subsequent restoration process is how many times the data stored in the DNA molecule was mixed, and the restoration must be carried out according to this mixing count information. As previously explained, the mixing count information may be stored within the DNA molecule or may be stored in a separate storage device.

Referring to FIG. 12, the binary data recovery device 200 according to an embodiment of the present specification may include a base sequence conversion unit 210, a binary data processing unit 220, and a binary data disk drive unit 230. The base sequence conversion unit 210 can inversely convert the analyzed DNA base sequence information into binary data. The base sequence conversion unit 210 may be the same as or similar to the base sequence conversion unit 110 of the binary data shuffling device 100 described above. The binary data processing unit 220 may separate the shuffle count information included in the inversely converted binary data. The binary data recovery device 200 according to an embodiment of the present specification is a device corresponding to an embodiment in which shuffle count information is included in a DNA molecule. Therefore, the inversely converted binary data includes shuffle count information, and the shuffle count information must be removed so that unshuffling returns the original binary data. The binary data disk drive unit 230 can restore the binary data from which the shuffle count information has been separated, according to the shuffle count information.

Referring to FIG. 13, the binary data recovery device 300 according to another embodiment of the present specification may include a base sequence conversion unit 310, a storage unit 320, and a binary data disk drive unit 330. The base sequence conversion unit 310 can inversely convert the analyzed DNA base sequence information into binary data. The base sequence conversion unit 310 may be the same as or similar to the base sequence conversion unit 110 of the binary data shuffling device 100 described above. The storage unit 320 can store information on the number of times each DNA has been mixed. The binary data recovery device 300 according to another embodiment of the present specification is a device corresponding to an embodiment in which shuffle count information is not included in the DNA molecule. Although the storage unit 320 is shown in this specification as being included in the binary data recovery device 300, the storage unit 320 may exist outside the binary data recovery device 300. The binary data disk drive unit 330 can restore the inversely converted binary data according to the shuffle count information stored in the storage unit 320.

Meanwhile, the base sequence conversion units 210 and 310 can remove the error correction code of the inversely converted binary data. When storing binary data, it can be converted to a base sequence by adding a code for error correction. Therefore, the inversely converted binary data also needs to have its error correction codes removed.

Additionally, the binary data disk drive units 230 and 330 can restore binary data using a linear feedback shift register (LFSR) method. This corresponds to the case where the binary data was shuffled using the LFSR method. Data shuffled with an LFSR can be restored to its original state by executing the LFSR in reverse, in the same manner as when shuffling. Since the LFSR is known to those skilled in the art, a detailed description of the algorithm is omitted.

Meanwhile, to execute the binary data restoration method described below, the base sequence conversion units 210 and 310, the binary data processing unit 220, the binary data disk drive units 230 and 330, and the storage unit 320 may include processors, application-specific integrated circuits (ASICs), other chipsets, logic circuits, registers, communication modems, data processing devices, and the like known in the technical field to which the invention belongs. In addition, when the control logic described below is implemented in software, the base sequence conversion units 210 and 310, the binary data processing unit 220, and the binary data disk drive units 230 and 330 may be implemented as a set of program modules. At this time, the program module may be stored in the memory device and executed by the processor. Therefore, the binary data restoration method according to the present specification can be implemented in the form of a computer program written to perform each step described below on a computer and recorded on a computer-readable recording medium. Hereinafter, the binary data restoration method according to the present specification will be described on the assumption that it is executed by a processor.

The binary data restoration method according to an embodiment of the present specification, like the embodiment shown in FIG. 12, is a method corresponding to an embodiment in which information on the number of shuffles is included in a DNA molecule.

Referring to FIG. 14, first, in step S200, the processor may receive and store the analyzed DNA sequence information. In the next step S210, the processor may inversely convert the base sequence information into binary data. In the next step S220, the processor may separate the shuffle count information included in the inversely converted binary data. Then, in step S230, the processor may restore the binary data from which the shuffle count information has been separated, according to the shuffle count information.
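The restoration flow of FIG. 14 can be sketched as below. The layout is entirely assumed for illustration: the shuffle count is taken to be an unshuffled 8-bit header placed in front of the payload, each shuffle is taken to be an XOR with the next stretch of output from a 16-bit LFSR with a fixed seed, and the names `COUNT_BITS`, `keystream`, and `restore` are hypothetical. Error correction code removal is not shown.

```python
from itertools import islice

BASE_TO_BITS = {"A": (0, 0), "T": (0, 1), "G": (1, 0), "C": (1, 1)}
COUNT_BITS = 8  # hypothetical width of the shuffle count header


def keystream(seed: int = 0xACE1):
    """Assumed 16-bit LFSR output (taps 16, 14, 13, 11); must match the encoder."""
    state = seed
    while True:
        yield state & 1
        fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (fb << 15)


def restore(bases: str) -> list[int]:
    """Recover the original bits from an analyzed base sequence."""
    # Inverse conversion of the base sequence into binary data.
    bits = [bit for base in bases for bit in BASE_TO_BITS[base]]
    # Separate the shuffle count information from the payload.
    header, payload = bits[:COUNT_BITS], bits[COUNT_BITS:]
    count = int("".join(map(str, header)), 2)
    # Undo the shuffles: XOR with the same LFSR output the encoder used.
    stream = keystream()
    for _ in range(count):
        chunk = list(islice(stream, len(payload)))
        payload = [b ^ k for b, k in zip(payload, chunk)]
    return payload
```

Because XOR is its own inverse, replaying the same LFSR output for the recorded number of rounds cancels the shuffling, provided the encoder used the same seed and header layout.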

The binary data restoration method according to another embodiment of the present specification, like the embodiment shown in FIG. 13, is a method corresponding to an embodiment in which information on the number of shuffles is not included in the DNA molecule.

Referring to FIG. 15, first, in step S300, the processor may receive and store the analyzed DNA sequence information. In the next step S310, the processor may inversely convert the base sequence information into binary data. In the next step S320, the processor can read the shuffle count information from the storage unit. Then, in step S330, the processor may restore the inversely converted binary data according to the shuffle count information.

Meanwhile, after step S210 or S310, the processor may remove the error correction code of the inversely converted binary data.

Additionally, in step S230 or S330, the processor may restore binary data using a linear feedback shift register (LFSR) method.

Referring to FIG. 16, in step S100, the processor may first convert binary data into nucleotide sequence information. In the next step S200, the processor may secondary convert the primary converted nucleotide sequence information into nucleotide sequence information to which dummy nucleotide information is added at preset cycles. For convenience of understanding, the DNA encoding method according to the present specification will be explained through examples of binary data and base sequence information.

Referring to FIG. 17, the process of primary conversion of binary data into nucleotide sequence information in step S100 will first be described in more detail. According to one example, the processor may convert each 2 bits of data into nucleotide sequence information by matching A (adenine) = 00, T (thymine) = 01, G (guanine) = 10, and C (cytosine) = 11. In the example shown in FIG. 17, 2 bits of data are matched 1:1 with bases, but the DNA encoding method according to the present specification is not limited to the example shown. One base can correspond to 1 bit, 2 bits, 3 bits, or 4 bits, and methods for matching binary data and base information can vary. The base sequence information converted in this way is called 'primary converted base sequence information'.

Referring to FIG. 17, the process of adding a dummy base in the next step S200 will be described in more detail. In the example shown in FIG. 17, it can be seen that "G (guanine)" is added as a dummy base after every five bases. The cycle at which the dummy base is added can be set in various ways. The shorter the cycle, the more dummy bases are added, which lowers the possibility that a part carrying actual information will be destroyed by ultraviolet rays and thereby improves structural stability. On the other hand, a shorter cycle has the disadvantage that the proportion of the DNA molecule carrying actual information decreases. Therefore, the cycle for adding dummy bases can be set appropriately by considering the environment in which the synthesized DNA molecule will be stored and the storage period.
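Adding a dummy base at a preset cycle, and removing it again by position, can be sketched as follows. The sketch assumes a fixed dummy base placed between consecutive groups of `period` bases, with no dummy after the final group; the function names are hypothetical.

```python
def add_periodic_dummy(bases: str, period: int = 5, dummy: str = "G") -> str:
    """Secondary conversion (step S200): insert a dummy after every `period` bases."""
    chunks = [bases[i:i + period] for i in range(0, len(bases), period)]
    return dummy.join(chunks)


def remove_periodic_dummy(encoded: str, period: int = 5, dummy_len: int = 1) -> str:
    """Drop the dummy positions, which recur every `period + dummy_len` bases."""
    step = period + dummy_len
    return "".join(encoded[i:i + period] for i in range(0, len(encoded), step))
```

Because removal depends only on position, it works unchanged when the dummy base is varied according to its neighbors or when `dummy_len` is larger than one, as long as the cycle and dummy length are known to the decoder.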

Meanwhile, according to an embodiment of the present specification, the dummy base information added in step S200 may be base information that is different from the adjacent base information in the primary converted base sequence information. It is known that defects may occur if the same base is synthesized repeatedly during the process of synthesizing DNA molecules. According to Poon and MacGregor (198) Biopolymers 45:427-434, when G (guanine) is synthesized repeatedly (consecutively) more than 4 times, there is a problem of aggregation in the form of a guanine tetraplex. Although the above academic data mentions a problem with G (guanine), it does not rule out that the same or similar problems may occur with A (adenine), T (thymine), and C (cytosine). In the example shown in FIG. 17, it can be seen that four or more G (guanine) bases end up placed consecutively in the DNA molecule because of the first, third, and fourth dummy bases added from the left. Therefore, in order to reduce the possibility of defects occurring during the DNA synthesis process, the dummy base can be changed according to the adjacent bases. FIG. 18 is a table of the types of dummy bases that can be added according to the adjacent bases.

Meanwhile, in the example shown in FIG. 17, a dummy base consisting of one base is mentioned, but according to an embodiment of the present specification, the added dummy base information may be information of at least two or more bases.

Referring to Figure 19, it can be seen that a dummy base consisting of three bases "AGT", "GTA", "TAC", "AGT", and "ACC" was added. The number of bases constituting the dummy base information can be set in various ways. Additionally, as for the added dummy base information, the base information located at both ends of the added dummy base information may be base information that is different from the adjacent base information in the first converted base sequence information. This is also to prevent the same base from being synthesized repeatedly.

On the other hand, the examples shown in Figures 17 and 19 are examples of secondary conversion by adding a dummy base inside the primary converted base sequence. In this way, the base located inside the DNA molecule is bonded to other bases on both sides, but the base located at the very end of the DNA molecule is connected to only one side and there is no molecular bond to the other side. In this case, there is a possibility that the bond at the end may be broken and lost. As there is a possibility that the base corresponding to the actual information at the end may be damaged, this also needs to be protected.

Referring again to FIG. 16, in step S300 after step S200, the processor may thirdly convert the secondarily converted base sequence information into base sequence information with protective dummy base information added to both ends.

Referring to Figure 20, it can be seen that a protective dummy base consisting of "AAAAA" has been added to both ends of the secondary converted base sequence information shown in Figure 17. Since the protective dummy bases added to both ends are bases unrelated to the actual information, even if they are lost, the actual information can be prevented from being damaged.
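The tertiary conversion with protective dummy bases can be sketched as below, assuming the five-base "AAAAA" guard from the example and assuming the decoder knows the guard length; the function names are hypothetical. The sketch only covers an undamaged molecule and does not attempt to detect how many end bases were lost.

```python
GUARD = "AAAAA"  # protective dummy bases from the example in FIG. 20


def add_guards(bases: str, guard: str = GUARD) -> str:
    """Tertiary conversion (step S300): protect both ends of the sequence."""
    return guard + bases + guard


def strip_guards(bases: str, guard_len: int = len(GUARD)) -> str:
    """Remove the protective bases from both ends of an intact sequence."""
    return bases[guard_len:len(bases) - guard_len]
```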

Meanwhile, to execute the above-described calculations and various control logic, the processor may include a microprocessor, an application-specific integrated circuit (ASIC), other chipsets, logic circuits, registers, communication modems, data processing devices, and the like known in the technical field to which the present invention pertains. Additionally, when the above-described control logic is implemented as software, the processor may be implemented as a set of program modules. At this time, the program module may be stored in the memory device and executed by the processor.

The above-mentioned computer program may include code written in a computer language such as C/C++, C#, JAVA, Python, or machine language that the processor (CPU) of the computer can read through the device interface of the computer, so that the computer reads the program and executes the methods implemented in the program. This code may include functional code that defines the functions necessary for executing the methods, and control code related to the execution procedure necessary for the computer's processor to execute those functions according to a predetermined procedure. In addition, the code may further include memory reference code indicating the location (address) in the computer's internal or external memory at which additional information or media required for the processor to execute the functions should be referenced. In addition, if the computer's processor needs to communicate with any other remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the computer's communication module and what information or media should be transmitted and received during communication.

The storage medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data for a short period of time, such as a register, cache, or memory. Specifically, examples of the storage medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, but are not limited thereto. That is, the program may be stored in various recording media on various servers that the computer can access, or in various recording media on the user's computer. Additionally, the medium may be distributed across computer systems connected via a network, so that the computer-readable code is stored in a distributed manner.

Although the embodiments of the present specification have been described above with reference to the attached drawings, those skilled in the art will understand that the present invention can be implemented in other specific forms without changing its technical idea or essential features. Therefore, the embodiments described above should be understood in all respects as illustrative and not restrictive.

Claims

A DNA encoding method comprising:

(a) a step of a processor performing primary conversion of binary data into a base sequence;

(b) a step of the processor finding points where the same base is repeated in the primary converted base sequence; and

(c) a step of the processor performing secondary conversion into a base sequence in which dummy bases are added at the points where the same base is repeated.
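
For illustration only, steps (a) to (c) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the claimed implementation: the 2-bit-to-base mapping, the function names, the choice of "C" (with "G" as a fallback) as the dummy base, and the rule of inserting a dummy immediately after every `run_length` identical bases are all hypothetical choices made for this example.

```python
# Assumed mapping; this section of the source does not fix a particular one.
BIT_PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def primary_convert(bits: str) -> str:
    """Step (a): map each pair of bits to one base."""
    if len(bits) % 2:
        raise ValueError("expected an even number of bits")
    return "".join(BIT_PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def find_repeat_points(seq: str, run_length: int) -> list:
    """Step (b): return each index that directly follows `run_length` identical bases."""
    points = []
    run = 0
    for i, base in enumerate(seq):
        run = run + 1 if i > 0 and base == seq[i - 1] else 1
        if run == run_length:
            points.append(i + 1)
            run = 0  # restart the count after the insertion point
    return points


def secondary_convert(seq: str, points: list, dummy: str = "C") -> str:
    """Step (c): insert a dummy base at every point found in step (b)."""
    out = []
    prev = 0
    for p in points:
        out.append(seq[prev:p])
        # Fall back to "G" so the dummy never extends the run it interrupts.
        out.append(dummy if seq[p - 1] != dummy else "G")
        prev = p
    out.append(seq[prev:])
    return "".join(out)


primary = primary_convert("0000000001")      # "AAAAC"
points = find_repeat_points(primary, 3)      # [3]
secondary = secondary_convert(primary, points)
print(secondary)                             # AAACAC
```

A fixed dummy base can still coincide with the base that follows the insertion point, which is one reason a dummy may instead be chosen with regard to its neighbors.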

The DNA encoding method of claim 1, wherein, in step (b), the processor searches for points where the same base is repeated a predetermined number of times.

The DNA encoding method of claim 2, wherein, in step (b), the processor searches for points where the same base is repeated according to a number set for each base.

The DNA encoding method of claim 1, wherein step (b) comprises:

(b-1) a step of the processor generating a frequency table for repeated bases in the primary converted base sequence;

(b-2) a step of the processor determining, using the generated frequency table, the number of repetitions of the same base that requires addition of a dummy base; and

(b-3) a step of the processor searching for points where the same base is repeated in the primary converted base sequence according to the number determined in step (b-2).
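
Steps (b-1) to (b-3) can be illustrated with the following Python sketch. It rests on assumptions that this section does not confirm: the frequency table is taken to count how often each run of two or more identical bases occurs, keyed by base and run length, and the threshold is taken to be the occurrence-weighted mean run length, rounded up. All function names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from math import ceil


def run_length_table(seq: str) -> Counter:
    """Step (b-1): count occurrences of each (base, run length) for runs of 2 or more."""
    table = Counter()
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= 2:
            table[(base, length)] += 1
    return table


def threshold_from_table(table: Counter):
    """Step (b-2): assumed rule, the mean run length rounded up (None if no runs)."""
    total_runs = sum(table.values())
    if total_runs == 0:
        return None
    total_length = sum(length * count for (_, length), count in table.items())
    return ceil(total_length / total_runs)


def find_runs(seq: str, threshold: int) -> list:
    """Step (b-3): return (start index, base, length) for runs of at least `threshold`."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= threshold:
            runs.append((pos, base, length))
        pos += length
    return runs


seq = "AACCCCGTTT"                         # runs: AA, CCCC, G, TTT
table = run_length_table(seq)
threshold = threshold_from_table(table)    # (2 + 4 + 3) / 3 = 3
print(find_runs(seq, threshold))           # [(2, 'C', 4), (7, 'T', 3)]
```

Keeping the base in the table key means the same table could also yield a separate threshold per base by averaging only the entries for that base.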

The DNA encoding method of claim 4, wherein, in step (b-2), the processor determines the average value of the frequency as the number of repetitions of the same base that requires the addition of a dummy base.

The DNA encoding method of claim 5, wherein, in step (b-2), the processor determines, for each base, the average value of the frequency for that base as the number of repetitions of the same base that requires the addition of a dummy base.

The DNA encoding method of claim 1, wherein, in step (c), the processor performs the secondary conversion by adding dummy bases having at least one predetermined sequence.

The DNA encoding method of claim 1, wherein, in step (c), the processor performs the secondary conversion by adding, as dummy bases, bases different from the bases adjacent on both sides of the point where the same base is repeated.
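
As a hedged illustration of choosing a dummy base that differs from both neighbors of the insertion point, consider the Python sketch below. The function names, the fixed preference order "ACGT", and the convention that insertion points are given as indices into the primary converted sequence are assumptions made for this example.

```python
from typing import List, Optional

BASES = "ACGT"  # assumed preference order when several bases qualify


def pick_dummy(left: str, right: Optional[str]) -> str:
    """Return the first base that differs from both neighbors of the insertion point."""
    for base in BASES:
        if base != left and base != right:
            return base
    raise ValueError("no usable dummy base")  # unreachable with four bases


def insert_dummies(seq: str, points: List[int]) -> str:
    """Insert one neighbor-aware dummy base at each index in `points` (each >= 1)."""
    out = []
    prev = 0
    for p in points:
        out.append(seq[prev:p])
        right = seq[p] if p < len(seq) else None  # no right neighbor at the very end
        out.append(pick_dummy(seq[p - 1], right))
        prev = p
    out.append(seq[prev:])
    return "".join(out)


# "AAA" is followed by "CCC": the dummy must be neither A nor C.
print(insert_dummies("AAACCC", [3, 6]))  # AAAGCCCA
```

Because the dummy differs from the base on each side, the insertion cannot lengthen either the run it interrupts or the run that follows it.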

The DNA encoding method of claim 8, wherein, in step (c), the processor performs the secondary conversion by adding dummy bases having two or more sequences.

A computer program recorded on a computer-readable recording medium and written to cause a computer to perform each step of the DNA encoding method according to any one of claims 1 to 9.