WO2022080816A1

WO2022080816A1 - Method, program, and apparatus for decoding based on sequence clustering of dna storage device

Info

Publication number: WO2022080816A1
Application number: PCT/KR2021/014015
Authority: WO
Inventors: 정재호; 박호성; 박성준; 노종선; 김성환; 노알버트
Original assignee: 서울대학교 산학협력단; 전남대학교산학협력단; 울산대학교 산학협력단; 홍익대학교 산학협력단
Priority date: 2020-10-12
Filing date: 2021-10-12
Publication date: 2022-04-21
Also published as: KR102418616B1; KR20220048362A

Abstract

The present invention relates to a method, program, and apparatus for decoding based on sequence clustering of a DNA storage device. According to the present invention, when base sequences with certain lengths are extracted from nucleotide sequences formed through a stitch algorithm, clustering is conducted on the basis of the degree of identity, wherein nucleotide sequences with small errors as well as completely identical nucleotide sequences are contained within the cluster, whereby the error data can be used to improve decoding efficiency and reduce the final error rate.

Description

Decoding method, program and apparatus based on sequence aggregation method of DNA storage device

The present invention relates to a decoding method, program and apparatus based on a sequence aggregation method of a DNA storage device.

In modern society, data is overflowing due to the spread of smart devices, the increase in SNS usage, and the spread of IoT systems, and by 2020, more than 44 trillion GB of data is expected to exist on the Internet.

As such, a large amount of data is generally stored in various places using a number of storage devices such as hardware devices themselves, external hard drives, and web hard drives.

However, this data storage method is inconvenient and has many limitations in capacity and cost, and thus a great deal of research on DNA storage technology has been conducted recently.

DNA storage technology is a method of storing digital data consisting of 0s and 1s by replacing them with the bases of DNA. It offers about 215,000 times the data storage capacity of a 1 TB hard disk.

Therefore, recently, research on a method of encoding each digital data into DNA data and a method of decoding DNA data into digital data has been continuously conducted.

In the DNA sequence aggregation method according to a previous study by Erlich, Y., and Zielinski, sequences are assigned to the same cluster only when the nucleotide sequences merged through the stitch algorithm are completely identical. In this method, even clusters with very few errors, which could otherwise be corrected, cannot be used for decoding, so decoding efficiency is lowered and the error rate increases.

To solve the above-described problems, an object of the present invention is to provide a decoding method, program and apparatus based on a sequence aggregation method of a DNA storage device in which, when nucleotide sequences having a specific length are extracted from the nucleotide sequences formed through a stitch algorithm, clustering is performed based on the degree of identity so that not only completely identical nucleotide sequences but also nucleotide sequences with small errors are included in a cluster, thereby utilizing error data to improve decoding efficiency and reduce the final error rate.

In addition, an object of the present invention is to provide a decoding method, program and apparatus based on a sequence aggregation method of a DNA storage device in which, when RS decoding is performed, not only successful clusters but also failed clusters are subjected to RS correction and utilized for LT decoding (Luby Transform decoding), thereby recovering and using information that was previously discarded.

In addition, the present invention provides a decoding method, program and apparatus based on a sequence aggregation method of a DNA storage device, which can reduce the error rate of decoding by determining the priority of LT decoding based on a Q-score value.

The problems to be solved by the present invention are not limited to the problems mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

A decoding method based on a sequence aggregation method of a DNA storage device according to an embodiment of the present invention for solving the above-mentioned problems includes: performing, by an electronic device, DNA sequencing to form a DNA oligo base sequence; randomly sampling, by the electronic device, a predetermined number of bases from each of a forward read and a reverse read of the generated nucleotide sequence; merging, by the electronic device, the randomly sampled bases using a stitch algorithm; extracting, by the electronic device, only nucleotide sequences having a predetermined length from among the nucleotide sequences merged through the stitch algorithm; forming, by the electronic device, a first data group by clustering the extracted nucleotide sequences based on the degree of identity, forming a representative value from the bases collected by a majority, and simply storing the bases that fall short of a majority; forming, by the electronic device, a second data group by removing nucleotide sequences that are a certain distance from the clusters in the first data group; forming, by the electronic device, a third data group by sorting the clusters of the second data group in descending order of the number of nucleotide sequences; forming, by the electronic device, a fourth data group by performing RS decoding on the third data group starting from the cluster with the largest number of nucleotide sequences, storing nucleotide sequences that succeed, and storing nucleotide sequences that fail in a separate heap after performing RS correction; and sequentially performing, by the electronic device, LT decoding (Luby Transform decoding) on the fourth data group.

In the step of forming the DNA oligo base sequence, the sequences are formed such that their Hamming distances from one another are at least a minimum length x.

In the forming of the first data group, among nucleotide sequences having the predetermined length, sequences that are identical over the entire length, sequences that differ at one base with the remaining bases identical, and sequences that differ at two bases with the remaining bases identical are judged to have identity and are assigned to the same cluster.

In the forming of the second data group, nucleotide sequences that differ from the clusters by a distance of 3 nt to (x-2) nt, where x is the minimum length, are removed from the first data group.

In the step of forming the third data group, the order among clusters that contain a single nucleotide sequence is determined in descending order of the probability value P, which is calculated from the predicted quality score (Q-score) that expresses the probability of a base being called incorrectly. The probability value P is a value satisfying the following equation.

Here, y means the length of a predetermined base sequence from which the base sequence is extracted.

In the step in which the electronic device sequentially performs LT decoding (Luby Transform decoding) on the fourth data group, LT decoding is first performed on nucleotide sequences that succeeded in RS decoding. Nucleotide sequences that failed RS decoding are subjected to RS correction and stored in a separate heap, and LT decoding is performed afterwards on the nucleotide sequences stored in the heap.

A sequence aggregation method-based decoding program of a DNA storage device according to another embodiment of the present invention for solving the above-described problems is combined with hardware that is a computer, and is stored in a medium to execute any one of the methods.

A decoding computing device based on a sequence aggregation method of a DNA storage device according to another embodiment of the present invention for solving the above problems performs DNA sequencing to form a DNA oligo base sequence; randomly samples a predetermined number of bases from each of the forward read and the reverse read of the generated base sequence; merges the randomly sampled bases using a stitch algorithm; extracts only nucleotide sequences having a predetermined length from among the nucleotide sequences merged through the stitch algorithm; clusters the extracted nucleotide sequences based on the degree of identity, forms a representative value from the bases collected by a majority, and simply stores the bases that fall short of a majority, to form a first data group; removes nucleotide sequences that are a certain distance from the clusters in the first data group to form a second data group; sorts the clusters of the second data group in descending order of the number of nucleotide sequences to form a third data group; performs RS decoding on the third data group starting from the cluster with the largest number of sequences, stores nucleotide sequences that succeed, and stores nucleotide sequences that fail in a separate heap after RS correction, to form a fourth data group; and sequentially performs LT decoding (Luby Transform decoding) on the fourth data group.

In forming the DNA oligo base sequence, the sequences are formed such that their Hamming distances from one another are at least a minimum length x.

In forming the first data group, among nucleotide sequences having the predetermined length, sequences that are identical over the entire length, sequences that differ at one base with the remaining bases identical, and sequences that differ at two bases with the remaining bases identical are judged to have identity and are assigned to the same cluster.

In forming the second data group, nucleotide sequences that differ from the clusters by 3 nt to (x-2) nt, where x is the minimum length, are removed from the first data group.

In forming the third data group, the order among clusters that contain a single nucleotide sequence is determined in descending order of the probability value P, which is calculated from the Q-score that expresses the probability of a base being called incorrectly.

The probability value P is a value satisfying the following equation.

When the electronic device sequentially performs LT decoding (Luby Transform decoding) on the fourth data group, LT decoding is first performed on nucleotide sequences that succeeded in RS decoding. Nucleotide sequences that failed RS decoding are subjected to RS correction and stored in a separate heap, and LT decoding is performed afterwards on the nucleotide sequences stored in the heap.

In addition to this, another method for implementing the present invention, another system, and a computer-readable recording medium for recording a computer program for executing the method may be further provided.

According to the present invention as described above, when nucleotide sequences having a specific length are extracted from the nucleotide sequences formed through the stitch algorithm, clustering is performed based on the degree of identity, and not only completely identical nucleotide sequences but also nucleotide sequences with small errors are included in the cluster. This makes it possible to utilize the error data to improve decoding efficiency and reduce the final error rate.

In the present invention, when RS decoding is performed, not only successful clusters but also failed clusters are subjected to RS correction and used for LT decoding, thereby recovering and using information that was previously discarded, thereby improving the efficiency of decoding.

In the present invention, it is possible to reduce the decoding error rate by determining the priority of LT decoding based on a Q-score value for a cluster having a size of 1, which is one of the major causes of the decoding error rate.

Effects of the present invention are not limited to the effects mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a flowchart illustrating a sequence aggregation method-based decoding method of a DNA storage device according to an embodiment of the present invention.

FIGS. 2A and 2B are diagrams comparing performance of a decoding method based on a sequence aggregation method of a DNA storage device according to an embodiment of the present invention and a decoding method according to a comparative example.

FIG. 3 is a graph illustrating a correlation in which an error rate increases as the probability value of a Q-score decreases.

Advantages and features of the present invention and methods of achieving them will become apparent with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The present embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains can fully understand the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless specifically stated otherwise in the phrase. As used herein, “comprises” and/or “comprising” does not exclude the presence or addition of one or more other components in addition to the stated components. Like reference numerals refer to like elements throughout, and "and/or" includes each and every combination of one or more of the recited elements. Although "first", "second", etc. are used to describe various elements, these elements are not limited by these terms, of course. These terms are only used to distinguish one component from another. Accordingly, it goes without saying that the first component mentioned below may be the second component within the spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used herein will have the meaning commonly understood by those of ordinary skill in the art to which this invention belongs. In addition, terms defined in a commonly used dictionary are not to be interpreted ideally or excessively unless specifically defined explicitly.

Before the description, the meaning of the terms used in this specification will be briefly described. However, it should be noted that, since the description of the term is for the purpose of helping the understanding of the present specification, it is not used in the meaning of limiting the technical idea of the present invention unless explicitly described as limiting the present invention.

In the present specification, the 'decoded nucleotide sequence file' is a text-based standard nucleotide data format file indicating a DNA nucleotide sequence, for example, a FASTQ file.

The present invention relates to a process of decoding data encoded as DNA data again through a sequence aggregation method.

Accordingly, in this specification, a process of decoding encoded data is described, and data encoded by various methods may be included.

In particular, the present invention may be suitable for decoding data encoded with a Luby Transform (LT) code, a Fountain code, a Turbo code, a Polar code, or an LDPC (Low Density Parity Check) code, but it is not limited to decoding data encoded in a specific manner, and may be applied to data encoded in any manner.

In order to help the description of the decoding method according to the present invention, a method for encoding specific data into DNA data will be briefly described.

The following is an example of a general encoding method using a Luby transform code.

In the process of encoding a specific image file into DNA data, one image is divided into K_LT packets of length L. The divided K_LT packets are generated as one encoded packet through Luby transform encoding. Thereafter, a seed bit obtained by a Linear-Feedback Shift Register (LFSR) is appended, and a Reed Solomon (RS) encoding bit corresponding to or greater than a specific value is appended.

After the final encoded bits are converted to bases, it is checked whether the maximum homopolymer length of the base sequence is less than or equal to a predetermined length, and whether the guanine and cytosine content ratio falls within a predetermined range.

If the above condition is satisfied, the sequence is selected to be encoded, and in the case of a sequence where the above condition is not met, it is discarded. By repeating the above procedure for all data until sequences are synthesized, it is encoded as DNA data.
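The screening step above can be sketched as follows. This is only an illustration: the function names, the reading of the guanine and cytosine ratio as the fraction of bases that are G or C, and the thresholds `max_run`, `gc_min` and `gc_max` are all assumptions, since the source gives no concrete values.

```python
def max_homopolymer_run(seq: str) -> int:
    """Length of the longest run of one repeated base in seq."""
    longest = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        prev = base
        longest = max(longest, run)
    return longest


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def passes_screen(seq: str, max_run: int = 3,
                  gc_min: float = 0.45, gc_max: float = 0.55) -> bool:
    # Thresholds are illustrative placeholders, not values from the source.
    return (max_homopolymer_run(seq) <= max_run
            and gc_min <= gc_fraction(seq) <= gc_max)
```

A candidate sequence that fails `passes_screen` would be discarded and a new candidate generated, which matches the select-or-discard loop described above.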

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Referring to FIG. 1, the decoding method based on the sequence aggregation method of the DNA storage device according to an embodiment of the present invention includes: performing DNA sequencing to form a DNA oligo base sequence (S100); randomly sampling a certain number of bases from the forward read and the reverse read of the base sequence (S110); merging the randomly sampled bases using a stitch algorithm (S120); extracting only nucleotide sequences having a predetermined length from among the merged bases (S130); forming a first data group by clustering the extracted nucleotide sequences based on the degree of identity (S140); forming a second data group by removing nucleotide sequences having a certain distance (S150); forming a third data group by sorting in order of the number of nucleotide sequences in the cluster (S160); forming a fourth data group by performing RS decoding, storing nucleotide sequences that succeed, and storing nucleotide sequences that fail in a separate heap after RS correction (S170); and sequentially performing LT decoding (Luby Transform decoding) on the fourth data group (S180).

In the step (S100) of forming a DNA oligo base sequence by performing DNA sequencing, the electronic device forms a DNA oligo base sequence comprising a sequence identifier, the individual bases adenine (A), guanine (G), cytosine (C) and thymine (T), and stiffness information including the sequence order.

In the step of merging the randomly sampled bases using the stitch algorithm ( S120 ), the electronic device generates a stitched sequence of a predetermined target length with respect to the base sequence file by using the stitch algorithm.

The stitch algorithm (PEAR) is an algorithm that evaluates the overlap of paired-end reads, and is characterized by parallel processing.

The target length may be a length allowed in the stitch algorithm, and may be a predetermined length, for example a length of 152 nt.

In the step (S130) of extracting only nucleotide sequences having a predetermined length from among the merged bases, sequences having the predetermined length, for example 152 nt, are extracted from the stitched sequences, the sequences having been encoded based on the Hamming distance.

The Hamming distance is based on a predetermined minimum editing distance, and an initial sequence of stitched oligo reads is extracted based on the corresponding editing distance. The Hamming distance may be, for example, 80 nt.

In the step (S140) of forming a first data group by clustering the extracted nucleotide sequences based on the degree of identity, when the predetermined length is y, sequences that match at y-2 or more positions, that is, sequences that differ at up to two positions, are grouped into the same cluster. For example, when the predetermined length is 152 nt, sequences that are identical at 150 nt or more, that is, sequences with up to two differing bases, may be collected into the same cluster. In the existing technology, sequences are collected into a cluster only when the entire predetermined nucleotide sequence is completely identical, and a sequence that differs at even one base is not included in the cluster. The present invention is characterized by including sequences with up to two differences in the cluster. Thus, a cluster may contain sequences that are exactly identical, sequences that differ at one base, and sequences that differ at up to two bases. By collecting DNA oligos with 1 or 2 errors in the same cluster, the errors of the DNA oligos can be identified by position-wise majority vote. The reason for including sequences that differ by up to two errors in the same cluster is that, at a position where majority vote cannot identify the error, the error can be checked using the RS code. The method of checking the error is described later.
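The two-mismatch clustering rule can be illustrated with a short sketch. The source does not say how reads are assigned to clusters, so the greedy "compare against the first read of each cluster" strategy below, along with the names `hamming` and `cluster_reads`, is an assumption made purely for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_diff=2):
    """Greedily group reads that differ from a cluster's first read
    by at most max_diff bases (assumed strategy, not from the source)."""
    clusters = []  # each cluster is a list of reads; cluster[0] is its seed
    for read in reads:
        for cluster in clusters:
            if hamming(read, cluster[0]) <= max_diff:
                cluster.append(read)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([read])
    return clusters
```

With `max_diff=0` the same function reduces to the exact-match clustering attributed to the existing technology, which makes the difference between the two approaches easy to compare on the same reads.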

When clustering is completed for all the extracted sequences, if at a given base position in a cluster one base is collected by a majority, a representative value unified to that base is formed. If no base holds a majority at a position, all possible base values are left at that position.
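A minimal sketch of this position-wise majority vote follows. The name `consensus` and the choice to return one string per position (a single base, or all observed bases when there is no strict majority) are illustrative assumptions about the data layout.

```python
from collections import Counter


def consensus(cluster):
    """Per-position majority vote over equal-length reads.

    Positions without a strict majority keep every observed base.
    """
    result = []
    for column in zip(*cluster):
        counts = Counter(column)
        base, count = counts.most_common(1)[0]
        if count * 2 > len(column):
            result.append(base)  # a strict majority fixes the representative base
        else:
            result.append("".join(sorted(counts)))  # no majority: keep all candidates
    return result
```

For a two-read cluster that disagrees at one position, the tied position keeps both candidates, which is the situation later resolved with the RS code.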

In this step ( S140 ), the clustered sequence data is formed into a first data group.

In the step of forming the second data group by removing nucleotide sequences having a predetermined distance (S150), nucleotide sequences having a predetermined distance from the clusters are removed based on the first data group.

For example, when the minimum Hamming distance is x, the predetermined distance may include 3 to x-2. That is, DNA oligos that differ from the clusters by 3 to x-2 are removed from the first data group. For example, when the editing distance is 80 nt, the DNA oligo to be removed is a DNA oligo having a distance of 3 nt to 78 nt from the adjacent cluster.

The reason for removing such nucleotide sequences is that, although the editing distance serving as the minimum interval is set when the DNA oligos are synthesized, DNA oligos whose Hamming distance is smaller than that are highly likely to contain many errors. However, DNA oligos with differences of 1 nt, 2 nt, x-1 and x-2 are left without deletion because they can be corrected through RS correction.
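The removal step can be sketched as a distance filter. This sketch follows the band of 3 to x-2 given in the example above and measures each read against the nearest cluster representative. Both that reference point and the names `in_removal_band` and `filter_reads` are assumptions, as the source does not spell out how the distance to "adjacent clusters" is computed.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def in_removal_band(distance: int, x: int) -> bool:
    """True when a distance falls in the removed range 3 .. x-2 (inclusive)."""
    return 3 <= distance <= x - 2


def filter_reads(reads, representatives, x):
    """Keep reads whose distance to the nearest representative is outside the band.

    representatives must be a non-empty list of equal-length sequences.
    """
    kept = []
    for read in reads:
        nearest = min(hamming(read, rep) for rep in representatives)
        if not in_removal_band(nearest, x):
            kept.append(read)
    return kept
```

With x = 80 this removes reads at distances 3 through 78 and keeps those at 0, 1, 2 and 79 or more.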

In the step (S160) of forming a third data group by sorting in order of the number of nucleotide sequences in the cluster, the clusters of the second data group are sorted from the largest to the smallest number of nucleotide sequences to form the third data group.

Since nucleotide sequences in clusters of size 1 have a large influence on errors in the decoding step, the priority of LT decoding for these nucleotide sequences may be determined by forming the third data group in descending order of the probability value P, which is calculated from the predicted quality score (Q-score) that expresses the probability of a base being called incorrectly.

The probability value P may be a value satisfying Equation 1 below.

For example, when the length of the predetermined base sequence is 152 nt, it may be in the form of Equation 2 below.
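The equations themselves are not reproduced in this text. A form consistent with the surrounding description (a per-position reliability multiplied over all y positions) would be the following, assuming the standard Phred relationship in which a quality score Q_i corresponds to a base-call error probability of 10^(-Q_i/10). The symbol Q_i and the use of the Phred convention are assumptions, not taken from the source.

```latex
% Assumed form of Equation 1 (general length y) and Equation 2 (y = 152)
P = \prod_{i=1}^{y} \left(1 - 10^{-Q_i/10}\right)
\qquad\text{and}\qquad
P = \prod_{i=1}^{152} \left(1 - 10^{-Q_i/10}\right)
```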

Equation 2 is a numerical value indicating the probability of reliability when each position of the base sequence is sequenced with the corresponding base. By multiplying this with respect to the entire nucleotide sequence, the P value can be used as an index indicating the reliability of the entire DNA oligo.
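As a rough illustration, a per-read reliability of this kind could be computed from a FASTQ quality string as below. The sketch assumes the common Phred+33 ASCII encoding and the Phred relationship between Q-score and error probability. The function name `read_reliability` and the `offset` parameter are hypothetical.

```python
def read_reliability(quality: str, offset: int = 33) -> float:
    """Product over all positions of (1 - per-base error probability).

    quality is a FASTQ quality string; offset=33 assumes Phred+33 encoding.
    """
    p = 1.0
    for ch in quality:
        q = ord(ch) - offset          # Phred score of this position
        p *= 1.0 - 10.0 ** (-q / 10)  # probability that this base call is correct
    return p
```

Reads in clusters of size 1 could then be ranked by this value, so that a read of 152 bases at Q40 is ranked ahead of one at Q20.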

In the step (S170) of forming a fourth data group, RS decoding is performed, nucleotide sequences that succeed are stored, and nucleotide sequences that fail are stored in a separate heap after RS correction.

In the prior art, clusters for which RS decoding fails are not used at all. In the sequence aggregation method-based decoding method of the DNA storage device according to the present invention, however, clusters containing only one error can be corrected to the correct value through RS correction and used for LT decoding. Therefore, in the present invention, information that was previously discarded can be recovered and used for LT decoding, and the clusters recovered through RS correction are stored in a separate heap to form the fourth data group.

For example, if oligos X and Y differ at two positions and are grouped into the same cluster, it is first determined by majority vote which is the correct oligo. If a tie makes it impossible to determine which of X and Y is the correct oligo, the decision can be made using the RS code.

[Case 1] X: RS code passed, Y: RS code passed

[Case 2] X: RS code passed, Y: RS code failed

[Case 3] X: RS code failed, Y: RS code failed

In the first case, since both X and Y passed the RS code, X and Y are identical oligos, so either sequence can be used for LT decoding.

In the second case, since X passed through the RS code is a complete oligo without errors, it is used for LT decoding.

In the third case, since both X and Y fail the RS code, RS correction is performed on both. If both are corrected to the same code, that code is recognized as the correct code, stored in the heap, and used for LT decoding with low priority. If they are not corrected to the same code, one of the two sequences contains a fatal error, so both sequences are discarded.
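The three cases can be summarized in a small decision function. The RS routines are passed in as callables because the source does not name an implementation: `rs_check` and `rs_correct` are hypothetical placeholders, and the "high"/"low" labels simply stand for direct use versus placement in the low-priority heap.

```python
def resolve_tie(x, y, rs_check, rs_correct):
    """Return (sequence, priority) for a tied pair, or None if both are discarded.

    rs_check(seq) -> bool and rs_correct(seq) -> corrected sequence or None
    are hypothetical callables supplied by the caller.
    """
    x_ok, y_ok = rs_check(x), rs_check(y)
    if x_ok and y_ok:
        return x, "high"                   # Case 1: either sequence can be used
    if x_ok or y_ok:
        return (x if x_ok else y), "high"  # Case 2: use the one that passed
    fixed_x, fixed_y = rs_correct(x), rs_correct(y)
    if fixed_x is not None and fixed_x == fixed_y:
        return fixed_x, "low"              # Case 3: agreeing corrections go to the heap
    return None                            # corrections disagree or fail: discard both
```

Injecting the RS routines keeps the case logic testable with a toy code, independent of whichever Reed-Solomon parameters an actual system would use.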

Finally, in the step (S180) of sequentially performing LT decoding (Luby Transform decoding) on the fourth data group, efficient decoding is possible by performing LT decoding in order from nucleotide sequences with high reliability to nucleotide sequences with low reliability.
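One way to picture this ordering is a generator that first yields the sequences that passed RS decoding and then drains the heap of RS-corrected sequences. Ordering the heap by a reliability value is an assumption made for this sketch, as are the name `lt_decoding_order` and the input layout. The LT decoder itself is outside the sketch.

```python
import heapq


def lt_decoding_order(rs_passed, rs_corrected):
    """Yield sequences in the order they would be fed to LT decoding.

    rs_passed: sequences that passed RS decoding, already sorted by cluster size.
    rs_corrected: (reliability, sequence) pairs recovered by RS correction.
    """
    yield from rs_passed
    # heapq is a min-heap, so negate the reliability to pop the highest first;
    # the index i breaks ties in insertion order.
    heap = [(-p, i, seq) for i, (p, seq) in enumerate(rs_corrected)]
    heapq.heapify(heap)
    while heap:
        _, _, seq = heapq.heappop(heap)
        yield seq
```

Because it is a generator, a caller can stop consuming sequences as soon as LT decoding has recovered all packets, without touching the remaining low-priority entries.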

FIG. 2 is a diagram comparing the performance of a decoding method based on a sequence aggregation method of a DNA storage device according to an embodiment of the present invention and a decoding method according to a comparative example.

Referring to FIGS. 2A and 2B , it can be confirmed that the decoding method based on the sequence aggregation method according to the present invention has superior performance in all random sampling frequency intervals compared to the comparative example.

Referring to Table 1, in the case of the present invention, it can be seen that the number of successes exceeds 190 at the random sampling number of 78000 for 200 experiments in the limited pool, and it can be confirmed that the decoding of all 200 experiments succeeds from 86000.

On the other hand, in the comparative example, it can be confirmed that the number of successes in the limited pool exceeds 190 out of 200 experiments only at a random sampling number of 82000.

In the case of the present invention, it can be seen that the number of successes exceeds 190 at the random sampling number of 80000 for 200 experiments in the unrestricted pool, and it can be confirmed that the decoding of all 200 experiments succeeds from 86000.

On the other hand, in the comparative example, the number of successes in the unrestricted pool exceeds 190 out of 200 experiments only at a random sampling number of 86000, and decoding of all 200 experiments succeeds only from 90000, so it can be confirmed that the comparative example requires a larger number of random samplings than the present invention.

Referring to Table 2, a detailed comparison of the present invention and the comparative example at the point where all 200 of the 200 experiments succeed shows that the ratio of clusters of size 1 decreases from 38.3 to 40.5% in the comparative example to 11.6 to 14.0% in the present invention. In addition, the share of the total sequence reads belonging to clusters of size 1 decreases from 12.8 to 14.5% to 2.8 to 3.5%.

As described above, since the ratio of clusters of size 1 has a large influence on the error rate during decoding, the reduction in this ratio is data demonstrating the superior decoding performance of the present invention.

In addition, the proportion of sequences stored in the heap through RS correction and reused for LT decoding is 0% in the comparative example, whereas in the present invention it is about 0.26 to 0.32%, so it can be confirmed that the efficiency of information utilization is improved.

Referring to FIG. 3, the correlation between the P value of Equation 2 and the errors contained in the sequence reads relative to the original DNA oligos is shown. The x-axis represents the probability corresponding to the Q-score of the corresponding oligo read, and the y-axis represents the difference in the minimum editing distance between the oligo read and the encoded oligo sequence. It can be confirmed that the error is small when the P value is high, since the difference is small, and that the error generally increases as the P value decreases.

The method according to an embodiment of the present invention described above may be implemented as a program (or application) to be executed in combination with a server, which is hardware, and stored in a medium.

The above-described program may include code written in a computer language such as C, C++, JAVA, or machine language that the processor (CPU) of the computer can read through a device interface of the computer, so that the computer can read the program and execute the methods implemented as the program. Such code may include functional code related to functions that define the operations necessary for executing the methods, and may include control code related to the execution procedure necessary for the processor of the computer to execute those functions according to a predetermined procedure. In addition, the code may further include memory-reference code indicating which location (address) in the internal or external memory of the computer should be referenced for additional information or media necessary for the processor to execute the functions. In addition, when the processor of the computer needs to communicate with any other remote computer or server in order to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer, and what information or media to transmit and receive during communication.

The storage medium is not a medium that stores data for a short moment, such as a register, a cache, a memory, etc., but a medium that stores data semi-permanently and can be read by a device. Specifically, examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disk, and an optical data storage device. That is, the program may be stored in various recording media on various servers accessible by the computer or in various recording media on the computer of the user. In addition, the medium may be distributed in a computer system connected to a network, and a computer-readable code may be stored in a distributed manner.

The steps of a method or algorithm described in relation to an embodiment of the present invention may be implemented directly in hardware, as a software module executed by hardware, or by a combination thereof. A software module may reside in random access memory (RAM), read only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, a hard disk, a removable disk, a CD-ROM, or any other type of computer-readable recording medium well known in the art to which the present invention pertains.

As mentioned above, although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, it should be understood that the embodiments described above are illustrative in all respects and not restrictive.

Claims

A decoding method based on a sequence aggregation method of a DNA storage device, the method comprising:

performing, by an electronic device, DNA sequencing to form a DNA oligo base sequence;

randomly sampling, by the electronic device, a predetermined number of bases from each of a forward read and a reverse read of the generated nucleotide sequence;

combining, by the electronic device, the randomly sampled bases using a stitch algorithm;

extracting, by the electronic device, only nucleotide sequences having a predetermined length from among the nucleotide sequences merged through the stitch algorithm;

forming, by the electronic device, a first data group by clustering the extracted nucleotide sequences based on their degree of identity, forming a representative value from the DNA sequence bases collected by a majority or more, and simply storing the DNA sequence bases collected by less than a majority;

forming, by the electronic device, a second data group by removing nucleotide sequences having a predetermined distance from clusters in the first data group;

forming, by the electronic device, a third data group by sorting the second data group according to the number of nucleotide sequences in each cluster;

forming, by the electronic device, a fourth data group by performing RS decoding starting from a cluster having a large number of nucleotide sequences in the third data group, storing each nucleotide sequence for which RS decoding succeeds, and performing RS correction on each nucleotide sequence for which RS decoding fails and then storing it in a separate heap; and

sequentially performing, by the electronic device, LT decoding (Luby Transform decoding) on the fourth data group.
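
The final step of claim 1 relies on LT (Luby Transform) decoding. As a rough illustration only, a generic peeling decoder can be sketched in Python as follows. The function name `lt_peel_decode`, the representation of each encoded symbol as a pair of (source indices, XOR value), and the use of integers as data blocks are assumptions made for this sketch and are not details taken from this document.

```python
from typing import Dict, List, Optional, Set, Tuple


def lt_peel_decode(
    symbols: List[Tuple[Set[int], int]], k: int
) -> Optional[List[int]]:
    """Recover k source blocks from LT-encoded symbols by peeling.

    Each symbol is (indices, value), where value is the XOR of the source
    blocks at those indices. Returns None if decoding stalls.
    """
    # Work on copies so the caller's data is not modified.
    pending = [(set(idx), val) for idx, val in symbols]
    recovered: Dict[int, int] = {}

    progress = True
    while progress and len(recovered) < k:
        progress = False
        next_pending = []
        for idx, val in pending:
            # Remove the contribution of blocks that are already known.
            for i in [i for i in idx if i in recovered]:
                idx.discard(i)
                val ^= recovered[i]
            if len(idx) == 1:
                # Degree-one symbol: its value is the remaining source block.
                (i,) = idx
                recovered[i] = val
                progress = True
            elif idx:
                # Still covers several unknown blocks; retry in the next pass.
                next_pending.append((idx, val))
        pending = next_pending

    if len(recovered) < k:
        return None
    return [recovered[i] for i in range(k)]
```

The decoder repeatedly looks for symbols that cover exactly one unknown block, which is why symbols that decode cleanly are useful to process early: each recovered block can reduce the degree of the remaining symbols.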
The method of claim 1,

wherein, in the step of forming the DNA oligo base sequence,

the DNA oligo base sequences differ from one another by a Hamming distance of a minimum length x or more.
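
The minimum Hamming distance condition of claim 2 can be checked with a few lines of Python. This is a minimal sketch: the helper names `hamming` and `satisfies_min_distance` are hypothetical, and the sketch assumes that all oligo sequences have the same length.

```python
from itertools import combinations
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def satisfies_min_distance(oligos: Sequence[str], x: int) -> bool:
    """Return True if every pair of oligos differs in at least x positions."""
    return all(hamming(a, b) >= x for a, b in combinations(oligos, 2))
```

For example, `satisfies_min_distance(["AAAA", "CCCC", "AACC"], 2)` holds because the closest pair differs in two positions, while the same set fails the check for x = 3.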
The method of claim 1,

wherein forming the first data group includes:

determining that, among the nucleotide sequences having the predetermined length, nucleotide sequences that are identical over the entire predetermined length, nucleotide sequences that differ in one base and are otherwise identical, and nucleotide sequences that differ in two bases and are otherwise identical belong to the same cluster.
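
The clustering rule of claim 3 (identical sequences, or sequences differing in one or two bases, fall into the same cluster) can be illustrated with the following Python sketch. It is not the implementation of the invention: the greedy comparison against the first member of each cluster, the per-position most-common-base consensus used as the representative value, and the names `cluster_reads` and `consensus` are all assumptions made for illustration.

```python
from collections import Counter
from typing import List


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads: List[str], max_mismatch: int = 2) -> List[List[str]]:
    """Greedily group reads that differ from a cluster's first member
    in at most max_mismatch positions."""
    clusters: List[List[str]] = []
    for read in reads:
        for members in clusters:
            if hamming(read, members[0]) <= max_mismatch:
                members.append(read)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([read])
    return clusters


def consensus(members: List[str]) -> str:
    """Build a representative by taking the most common base per position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))
```

Keeping the slightly erroneous reads inside the cluster, rather than discarding them, is what lets the per-position vote outweigh isolated errors.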
The method of claim 1,

wherein forming the second data group comprises:

removing, from the first data group, nucleotide sequences that differ from the clusters by a distance ranging from 3 nt to x-2 nt, where x is the minimum length.
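
The removal step of claim 4 can be sketched in Python as a simple filter. Several points here are assumptions rather than statements from this document: the distance is taken to be the Hamming distance to the nearest cluster representative, both ends of the range (3 and x-2) are treated as inclusive, and the names `hamming` and `remove_ambiguous` are hypothetical.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def remove_ambiguous(reads: List[str], representatives: List[str], x: int) -> List[str]:
    """Drop reads whose distance to the nearest representative lies in
    the range 3 .. x-2 (inclusive, by assumption)."""
    kept = []
    for read in reads:
        nearest = min(hamming(read, rep) for rep in representatives)
        if 3 <= nearest <= x - 2:
            # Too far to join a cluster, too close to be a distinct oligo.
            continue
        kept.append(read)
    return kept
```

Under these assumptions, reads within two bases of a representative stay with their cluster, and reads far from every representative are kept as candidates for a separate sequence.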
The method of claim 1,

wherein forming the third data group includes:

determining the order among clusters that contain a single nucleotide sequence in order from the highest probability value P, the probability value P being a probability of erroneous retrieval of bases calculated based on a predicted quality score (Q-score).
The method of claim 5,

wherein the probability value P is a value satisfying the following equation.

Here, y denotes the predetermined length of the nucleotide sequences that are extracted.
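
The equation referenced in claim 6 does not appear in this text, so the following Python sketch is not that equation. It only illustrates the standard Phred relationship between a quality score Q and a per-base error probability, p = 10^(-Q/10), together with one possible read-level value over y bases. Treating base errors as independent, and the names `base_error_prob` and `read_error_prob`, are assumptions made for this sketch.

```python
import math
from typing import Sequence


def base_error_prob(q: float) -> float:
    """Phred definition: probability that a base call with score q is wrong."""
    return 10 ** (-q / 10)


def read_error_prob(q_scores: Sequence[float]) -> float:
    """Probability that at least one of the y bases is wrong, assuming
    independent errors (an assumption of this sketch)."""
    p_all_correct = math.prod(1 - base_error_prob(q) for q in q_scores)
    return 1 - p_all_correct
```

A value of this kind could be used as a sort key for clusters that contain only one read, where no majority vote is available to judge reliability.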
The method of claim 1,

wherein the step of sequentially performing, by the electronic device, LT decoding (Luby Transform decoding) on the fourth data group comprises:

first performing LT decoding on the nucleotide sequences for which RS decoding succeeded; and subsequently performing LT decoding on the nucleotide sequences stored in the heap, which are the nucleotide sequences for which RS decoding failed and which were RS corrected and stored in the separate heap.
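
The ordering described in claim 7 can be sketched in Python as follows. This is only an illustration under stated assumptions: `rs_decode` is a hypothetical callable supplied by the caller that returns a success flag and the decoded or corrected sequence, each cluster is represented as a (representative sequence, cluster size) pair, and the heap priority (larger clusters first) is a choice made for this sketch, since this text does not specify how the heap is ordered.

```python
import heapq
from typing import Callable, Iterator, List, Tuple


def lt_input_order(
    clusters: List[Tuple[str, int]],
    rs_decode: Callable[[str], Tuple[bool, str]],
) -> Iterator[str]:
    """Yield sequences in the order they would be passed to LT decoding:
    RS-decoding successes first, then RS-corrected failures from a heap."""
    ok_first: List[str] = []
    heap: List[Tuple[int, int, str]] = []
    # Visit clusters from the largest to the smallest.
    ordered = sorted(clusters, key=lambda c: c[1], reverse=True)
    for rank, (seq, size) in enumerate(ordered):
        success, result = rs_decode(seq)
        if success:
            ok_first.append(result)
        else:
            # Assumed priority: larger clusters leave the heap first;
            # rank breaks ties deterministically.
            heapq.heappush(heap, (-size, rank, result))
    yield from ok_first
    while heap:
        yield heapq.heappop(heap)[2]
```

Deferring the failed sequences means that LT decoding starts from the data most likely to be correct and falls back on the corrected sequences only afterwards.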
A program, stored in a computer-readable recording medium in combination with computer hardware, for executing the decoding method based on a sequence aggregation method of a DNA storage device according to any one of claims 1 to 7.
A decoding computing device based on a sequence aggregation method of a DNA storage device, wherein:

DNA sequencing is performed to form a DNA oligo base sequence;

a predetermined number of bases are randomly sampled from each of a forward read and a reverse read of the generated nucleotide sequence;

the randomly sampled bases are combined using a stitch algorithm;

only nucleotide sequences having a predetermined length are extracted from among the nucleotide sequences merged through the stitch algorithm;

the extracted nucleotide sequences are clustered based on their degree of identity, a representative value is formed from the DNA sequence bases collected by a majority or more, and the DNA sequence bases collected by less than a majority are simply stored, thereby forming a first data group;

the electronic device forms a second data group by removing nucleotide sequences having a predetermined distance from clusters in the first data group;

the electronic device forms a third data group by sorting the second data group according to the number of nucleotide sequences in each cluster;

the electronic device forms a fourth data group by performing RS decoding starting from a cluster having a large number of nucleotide sequences in the third data group, storing each nucleotide sequence for which RS decoding succeeds, and performing RS correction on each nucleotide sequence for which RS decoding fails and then storing it in a separate heap; and

the electronic device sequentially performs LT decoding (Luby Transform decoding) on the fourth data group.
The device of claim 9,

wherein, in forming the DNA oligo base sequence,

the DNA oligo base sequences differ from one another by a Hamming distance of a minimum length x or more.
The device of claim 9,

wherein forming the first data group comprises:

determining that, among the nucleotide sequences having the predetermined length, nucleotide sequences that are identical over the entire predetermined length, nucleotide sequences that differ in one base and are otherwise identical, and nucleotide sequences that differ in two bases and are otherwise identical belong to the same cluster.
The device of claim 9,

wherein forming the second data group comprises:

removing, from the first data group, nucleotide sequences that differ from the clusters by a distance ranging from 3 nt to x-2 nt, where x is the minimum length.
The device of claim 9,

wherein forming the third data group comprises:

determining the order among clusters that contain a single nucleotide sequence in order from the highest probability value P, the probability value P being a probability of erroneous retrieval of bases calculated based on a predicted quality score (Q-score).
The device of claim 13,

wherein the probability value P is a value satisfying the following equation.

Here, y denotes the predetermined length of the nucleotide sequences that are extracted.
The device of claim 9,

wherein, when the electronic device sequentially performs LT decoding (Luby Transform decoding) on the fourth data group,

LT decoding is first performed on the nucleotide sequences for which RS decoding succeeded, and LT decoding is subsequently performed on the nucleotide sequences stored in the heap, which are the nucleotide sequences for which RS decoding failed and which were RS corrected and stored in the separate heap.