CN113345521A - Coding and recovering method using large fragment DNA storage - Google Patents

Coding and recovering method using large fragment DNA storage Download PDF

Info

Publication number
CN113345521A
CN113345521A CN202110601792.8A CN202110601792A CN113345521A CN 113345521 A CN113345521 A CN 113345521A CN 202110601792 A CN202110601792 A CN 202110601792A CN 113345521 A CN113345521 A CN 113345521A
Authority
CN
China
Prior art keywords
data
sequence
dna
coding
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110601792.8A
Other languages
Chinese (zh)
Inventor
陈为刚
郭健
葛奇
韩昌彩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110601792.8A priority Critical patent/CN113345521A/en
Publication of CN113345521A publication Critical patent/CN113345521A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

The invention discloses a coding and recovery method adopting large-segment DNA storage, which belongs to the field of DNA data storage, wherein binary data is divided into a plurality of sub-blocks after being scrambled, each sub-block is coded by a grouped error correcting code with a very high code rate, different sub-blocks are subjected to symbol interleaving and are transcoded into data DNA units through the mapping process of 'symbol/bit-base sequence', and the data DNA units and a replication start site are alternately embedded to form an artificial chromosome sequence; during data reading, the second-generation high-throughput sequencing is utilized, read segment de novo assembly, known biological sequences such as contig merging guided by a promoter sequence, and efficient error correction erasure correction decoding are combined, so that reliable data recovery is realized. The invention has the advantages of realizing the multi-element matching of an interleaved grouped error correcting code encoding method, a large-fragment DNA logic structure, second-generation high-throughput sequencing and a de novo assembly method, having very low overhead, fully utilizing the error correcting capability of the high-code-rate error correcting code and realizing very high base logic density.

Description

Coding and recovering method using large fragment DNA storage
Technical Field
The invention relates to the field of DNA data storage, in particular to a coding and recovery method adopting large-fragment DNA storage.
Background
With the rapid development of contemporary information technology, the global data volume exhibits explosive growth. According to the forecast of international data corporation, the global data volume will increase from 44ZB in 2020 to 175ZB in 2025, and the existing storage methods have not been able to satisfy the human demand for storing massive digital information. Currently, storage media are used which mainly include: optical disc, magnetic tape, flash memory, etc., the storage mode mainly includes: computer-based stand-alone storage, network storage, distributed storage, cloud storage, and the like. However, there are many challenges with existing storage media and storage methods. First, the density of storage media is approaching a limit, making it difficult to meet the rapidly increasing demand for data. Second, the currently established data storage centers face many challenges of size, cost, power consumption, security, and the like. Third, the effective storage time of the conventional storage medium is short, generally within several decades, for example, the effective storage time of a magnetic tape is 30-50 years, which cannot meet the long-term storage and backup requirements of important data, and the storage medium such as a magnetic tape and an optical disc is easily damaged, resulting in the loss of stored data.
Artificially synthesized deoxyribonucleic acid (DNA) is used as a potential data storage medium, has high storage density, long service life and low storage energy consumption, and is expected to become one of important choices for storing mass offline data in the future. The american society for semiconductor industry and semiconductor research corporation promulgated "semiconductor decade program" in 2021, listed DNA data storage as one of the major storage modes for large amounts of data in parallel with hard disks, solid state disks, and magnetic tapes, and became an important direction for global storage industry competition in the future. The DNA storage mainly uses 4 different deoxyribonucleotides of oligonucleotide, namely A (adenine), T (thymine), G (guanine) and C (cytosine) to represent data, and the storage of mass data through artificially synthesized DNA sequences or oligonucleotide molecules is an emerging data storage mode. With the increasing data volume generated in human production and life, the importance of data storage is also highlighted, and DNA data storage becomes a large-scale mass data storage mode with great potential in the future.
DNA data storage data was stored using { A, T, G, C } of oligonucleotide molecules. The current modes of DNA data storage mainly include: short fragment oligonucleotide pool (Oligo pool) storage, intracellular DNA storage, and the like. The oligonucleotide pool storage of short fragments develops rapidly by means of high-throughput chip synthesis and sequencing technology of DNA, but still has great challenges in large-scale equilibrium amplification and replication cost. The method has the advantages that intracellular DNA data storage, particularly intracellular large-fragment DNA storage, short DNA fragments are assembled into long fragments by means of an in-vivo assembly method, efficient amplification is achieved by means of in-vivo replication, replication cost is low, and the method has potential application value in scenes such as large-scale data distribution and the like.
The logical structure design of large-fragment DNA data storage is different from the data storage based on oligonucleotide pool (Oligo pool), the ratio of index and primer (or similar units, such as skeleton in yeast artificial chromosome) is relatively low, and the logical structure design has certain advantages in base utilization rate. The data reading phase requires de novo (de novo) assembly of sequencing reads, similar to de novo (de novo) sequencing of the genome of new species. The reading of large-fragment DNA data storage has high requirement on algorithm real-time performance, and the traditional bioinformatics processing flow is not suitable. Aiming at the characteristic, when the data is recovered, a completely error-free DNA sequence is not required to be obtained in the step of reading assembly, and residual errors after assembly are corrected by using an error correcting code, so that a perfect DNA sequence without any base errors is obtained. Therefore, how to design a proper error correction coding scheme to ensure the reliability of large-fragment DNA data storage becomes an important issue.
The application of error correction coding technology in the field of DNA data storage is an important means for reducing writing cost and ensuring data reliability, and is a very potential technology. On one hand, due to the limitation of biochemical conditions, the error rate of oligonucleotide molecule synthesis is high, and generally, one synthesis error occurs in hundreds of bases; if the accuracy of oligonucleotide synthesis is further improved, the yield of oligonucleotide molecules is reduced, and the cost is greatly increased. Therefore, the adoption of an efficient error correction coding technique is an important method for reducing the writing cost and ensuring the data reliability.
However, according to the fundamental theory of channel capacity and channel coding of shannon information theory, the error correcting code needs to be matched with the error characteristics of writing/reading, and only reliable and efficient data storage can be realized. Several important error correcting codes in the field of digital communication have been tried in vitro DNA data storage. For example, digital fountain codes are used to correct deletion errors caused by oligonucleotide molecule loss, reed-solomon (RS) codes to correct base deletions and random errors, and product codes composed of Low Density Parity Check (LDPC) codes and RS codes to correct deletions and random errors, but these error correction coding methods do not fully match the error characteristics of writing/reading, because they do not use the characteristic that the error characteristics of a plurality of bits corresponding to each base symbol are correlated. The coding method for storing the in-vivo large-fragment DNA adopts the watermark code consisting of the LDPC code and the pseudorandom sequence, and aims at the high error rate of the third-generation nanopore sequencing, and the base insertion/deletion error which is difficult to process is mainly considered. The coding efficiency of the method is low, 1.19bit/bp, and the theoretical limit density of 2bit/bp of information represented by 4 bases { A, T, G, C } is still large in difference.
Disclosure of Invention
The invention provides a coding and recovery method by using large-segment DNA storage, which realizes the multi-element matching of interleaved grouped error correcting code coding, large-segment DNA logic structure, next-generation high-throughput sequencing and de novo assembly method, has very high base logic density, and is described in detail as follows:
a method of encoding and recovering using large fragment DNA storage, the method comprising:
(1) scrambling binary data, and uniformly processing P sub-blocks as a processing unit, wherein the size of each sub-block is k bits; each sub-block is grouped, error correcting code coding is carried out, different sub-blocks are interleaved, transcoding is carried out through the mapping process of 'symbol/bit-base sequence' to obtain P data DNA sequences, and the P data DNA sequences and P-1 replication starting site sequences are alternately embedded to form an artificial chromosome sequence;
(2) transferring the artificial chromosome into cells, and performing second-generation high-throughput sequencing after cell proliferation and nucleic acid extraction processes;
(3) assembling the sequencing read from the head to obtain a group of contigs, determining the positions of the data segments on the contigs according to the known replication start site, intercepting partial sequencing read corresponding to the data segments, then carrying out majority merging to obtain P data segments, deinterleaving the data segments, and sending the data segments to a decoder for decoding to obtain an information sequence.
The step (1) is specifically as follows:
(1.1) superimposing binary data with the length of N with a known pseudo-random sequence to realize scrambling, dividing the binary data into P sub-blocks, wherein the length of each sub-block is k, k is N/P, P is a positive integer and is a factor of N, and N/P is an integer;
(1.2) coding each sub-block by adopting grouped error correcting code coding, wherein the information bit length of the error correcting code is k bits, and the length of each sub-block after coding is n bits;
(1.3) sequencing the encoded subblocks according to a P row mode, wherein each row comprises n bits, decomposing the n bits into a plurality of groups of P-P bit or symbol units, and interweaving each unit respectively according to a shift cycle mode; if the bit cells are, the number of the bit cells is n; if the processing is carried out according to the symbols, the number of the symbols corresponding to n bits is n/s, and each symbol comprises s bits;
(1.4) obtaining P bit groups by the interleaved subblocks, wherein the size of each group is also n bits or n/s symbols, and the symbol groups are also converted into bit groups with the length of n;
(1.5) bit grouping according to the mapping rule: 00 → A, 01 → T, 10 → G, 11 → C to be transcoded into P data DNA sequences of length n/2.
The technical scheme provided by the invention has the beneficial effects that:
1. the coding method provided by the invention realizes that the data file is coded into a plurality of DNA data units, the DNA data units are further alternately embedded and combined with the replication starting site, a flexible intracellular DNA data storage structure is constructed, and the storage of data in large-fragment DNA is realized;
2. the invention is based on the second generation high-throughput sequencing, adopts the proposed data recovery method, and can realize reliable and high-efficiency DNA data reading;
3. the invention realizes the multi-element matching of the interleaved grouped error correcting code coding, the large-fragment DNA logic structure, the next generation high-throughput sequencing and the de novo assembly method, has very low overhead, fully utilizes the error correcting capability of the high-code-rate error correcting code and can realize very high base logic density.
Drawings
FIG. 1 is a block diagram of a system for implementing the solution of the present invention;
FIG. 2 is a flow chart of the encoding of a packet error correction code in the present invention;
FIG. 3 is a flowchart of recovery in data read-out according to the present invention;
FIG. 4 is a schematic diagram of a computer-based long DNA fragment data storage simulation platform according to the present invention;
FIG. 5 is a graph of the distribution of the number of base errors (edit distance) of simulated reads in accordance with the present invention;
FIG. 6 is a graph showing a distribution of the number of base errors (edit distance) of real reads according to the present invention;
FIG. 7 is a graph of the base error location distribution of simulated reads according to the present invention;
FIG. 8 is a base error location distribution diagram of real reads according to the present invention;
FIG. 9 is a schematic diagram showing the analysis of the base deletion result in each experiment in the present invention;
FIG. 10 is a schematic diagram showing analysis of a base error result in each experiment in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Aiming at the error characteristics that the large-fragment DNA data storage in cells is adopted and the information coding method is suitable for sequencing and reading assembly, the embodiment of the invention provides a coding and recovery method adopting the large-fragment DNA storage. The binary data is divided into a plurality of sub-blocks after being scrambled, after each sub-block is coded by adopting a grouped error correcting code, different sub-blocks are subjected to symbol interleaving and are transcoded into data DNA units through the mapping process of 'symbol-bit-base sequence', and the data DNA units and the replication start sites are alternately inlaid to form an artificial chromosome sequence; when data is read out, the second-generation high-throughput sequencing is utilized, and read segment de novo (de novo) assembly, copy initial site guided contig merging and efficient error correction erasure decoding are combined, so that reliable data recovery is realized.
The method provided by the invention not only can realize reliable storage of data, but also realizes multi-element matching of an interleaved grouped error correction code encoding method, a large-fragment DNA logic structure, next-generation high-throughput sequencing and a de novo assembly method, has very high base logic density, and fully explains the advantages of large-fragment DNA storage. In addition, the method provided by the invention can construct a flexible intracellular DNA data structure and meet the requirement of designing a data file with the size of hundreds of Kb to Mb.
The invention provides a coding and recovery method adopting large-segment DNA storage, which aims at the field of DNA data storage and is basically characterized in that binary data are scrambled, subjected to grouped error correction code coding and symbol interleaving, mapped to obtain a data DNA unit, the data DNA unit and an ARS sequence are alternately embedded to obtain an artificial chromosome, and after second-generation high-throughput sequencing, read segment de novo assembly, ARS guided contig combination and efficient error correction erasure correction decoding are combined to realize reliable data recovery. The invention can ensure the reliable storage of data and has very high base logic density.
The specific embodiment of the invention is given below, the embodiment designs and simulates an 2,489,847bp circular artificial chromosome, a 589KB picture and a 2.65KB txt text are stored in the circular artificial chromosome by using a plurality of RS code symbol interleaving coding methods with extremely high code rates, and the total logic density reaches 1.947bit/bp, which is higher than the current main design method and is very close to the theoretical 2 bit/bp. Wherein the selected RS code is defined in the finite field GF (2)12) The RS code above (4095, 4040, t ═ 27), the length of the information bit portion is 4040 symbols, the code rate is 0.987, and each symbol comprises 12 bits.
The coding method for interleaving the RS codes with a plurality of extremely high code rates specifically comprises the following steps:
(1.1) superposing binary data with a known pseudo-random sequence to realize scrambling, and dividing the scrambled binary data into 100 subblocks with the same size;
(1.2) encoding each subblock by adopting RS code encoding, wherein the information bit length is 4040 symbols, and the length of each subblock after encoding is 4095 symbols;
(1.3) sorting the coded sub-blocks in a column mode, then decomposing the coded sub-blocks into a plurality of groups of 100 x 100 units, and interweaving each unit in a diagonal cycle mode;
(1.4) 4095 symbols per data segment, converted to 49,140 bits, as a basic packet;
(1.5) adjacent two bits in each packet are transcoded to 1 base according to the mapping rule of 00 → A, 01 → T, 10 → G, 11 → C. Each data segment is converted into 24,570bp DNA data sequence;
(1.6) alternately inlaying 100 DNA data segments with the length of 24,570bp and 99 ARSs with shorter lengths, and further adding a vector skeleton sequence to obtain a circular artificial chromosome with the length of 2,489,847 bp.
In the earlier stage, the embodiment of the invention designs and assembles an 254,886bp yeast artificial chromosome, adopts second-generation high-throughput sequencing to obtain a group of real sequencing data, and takes the group of real sequencing data as reference to carry out sequencing simulation on the artificial chromosome, and the specific steps are as follows:
(2.1) sending the real sequencing data into second-generation sequencing simulation software ART for parameter training to obtain a configuration simulation file;
and (2.2) inputting the configuration simulation file, the artificial chromosome sequence and other parameters into second-generation sequencing simulation software for sequencing simulation. The coverage selected in this embodiment is 20x, 25x, 30x, and each coverage generates 10 sets of independent test data with ART to obtain the corresponding sequencing read of the double-end read PE 150.
The specific recovery method during data reading is as follows:
(3.1) assembling the sequencing reads by using second-generation sequence assembling software Velvet and ABySS based on a Debrelain map to obtain a group of contigs;
(3.2) identifying ARS sequences in each contig, determining the position of a data read according to the ARS sequences and intercepting a corresponding part of sequencing reads;
(3.3) carrying out large number combination on the part of the sequencing reads corresponding to each intercepted data read to obtain a combined sequence of 100 data reads, and further converting the combined sequence into a symbol sequence;
(3.4) de-interleaving the 100 data segments according to the packet interleaving sequence to obtain 100 error-existing and deleted code words;
and (3.5) sending the 100 code words obtained by de-interleaving into a decoder for block error correction erasure decoding to obtain an information sequence.
The assembly of the second-generation sequencing reads is realized by using second-generation sequence assembly software such as Velvet or ABySS based on a Debrelain atlas under a plurality of k-mer values with different lengths, and a group of contigs is obtained specifically as follows:
(1) assembling the second-generation sequencing reads by Velvet or AbySS under the k-mer value with fixed length to obtain a group of contigs;
(2) assembling second-generation sequencing reads by Velvet or AbySS under a plurality of k-mer values with different lengths, and combining to obtain a plurality of groups of contigs;
(3) and assembling the second-generation sequencing reads by Velvet and AbySS under a plurality of k-mer values with different lengths, and combining to obtain a plurality of groups of contigs.
The specific steps of identifying the ARS sequence in each contig, determining the position of the data read according to the ARS sequence and intercepting the corresponding partial sequencing read are as follows:
(4.1) respectively comparing the ARS sequences subjected to forward and reverse complementation with the contig for an editing distance, wherein the editing distance is less than 20% of the length of the ARS sequence, judging that the ARS exists, and recording the state of the ARS at the moment and the position of the ARS on the contig;
(4.2) if the ARS is positive, placing sequencing reads on two sides of the ARS into a cache region corresponding to the data segment; if the ARS is reversed and complemented, the sequencing reads on the two sides of the ARS are reversed and complemented and then are placed into a cache region corresponding to the data segment;
(4.3) repeating steps (4.1) and (4.2) until all reads containing ARS sequences (or partial ARS sequences) are fully marked and assigned.
The method comprises the following specific steps of carrying out majority merging on partial sequencing reads corresponding to each intercepted data read to obtain a merged sequence of P data reads, and further converting the merged sequence into a symbol sequence:
and (5.1) the length of the data segment is n, and an x multiplied by n data block is constructed by assuming that the number of partial sequencing reads in the cache region corresponding to the data segment is x. Putting the intercepted sequencing read into a data block, and if the length of the intercepted data segment is less than n, completing by using 'X';
(5.2) performing majority judgment on the base at each position i of the filling block, namely taking one base with the largest number in { A, T, G, C } at each position in the filling block as the base at the position of the splicing sequence, marking the base as 'X' if the base is not filled in the position, obtaining a data segment with the length of n, and further converting the data segment into a symbol sequence.
The specific steps of sending the data segments into a decoder for decoding after deinterleaving to obtain an information sequence are as follows:
(6.1) de-interleaving 100 data segments according to the packet interleaving sequence to obtain 100 error-existing and deleted code words;
and (6.2) performing error correction and erasure correction decoding on 100 code words of the RS code obtained by de-interleaving to obtain data segments.
First, as shown in FIG. 4, the present invention establishes a computer-based long DNA fragment data storage simulation platform. In the simulation experiment, ART software was chosen to generate sequencing reads to simulate the sequencing process. Although direct synthesis and sequencing experiments are not carried out, the method is calibrated and verified by utilizing early wet experimental data, so that a simulation result has better reliability, the fusion of the wet experiments and simulation design is realized to a certain extent, and the simulation process is more reasonable.
Further, the present invention analyzes the simulated sequencing data and the end-to-end storage recovery performance. FIGS. 5, 6, 7 and 8 show the comparison of the generated simulated sequencing data with the 254Kb artificial chromosome sequencing data obtained in the previous experiment. FIGS. 5 and 6 compare the number of sequencing errors in reads. As can be seen from this figure, it is,the number of erroneous reads is around 20%. The quality of the simulated reads is slightly worse than the measured data, which better explains the error correction capability of the method. FIGS. 7 and 8 compare sequencing in reads versus position in reads. As can be seen, the second generation sequencing reads contain insertion, deletion and substitution errors, and the error rate is 10-4~10-3Left and right. The error rate of insertions and deletions is significantly lower than the substitution error probability. It can be seen from the figure that the generated sequencing data features are very consistent with the actual second generation sequencing data features, which indicates that the method of simulation for generating sequencing data has better feasibility.
Then, the present embodiment performs simulation test and analysis on the encoding scheme and the data recovery scheme under different coverage degrees. In the test, coverage degrees of 20x, 25x and 30x are selected, each coverage degree generates 10 groups of independent test data by ART, and 10 independent parallel assembling and decoding experiments are carried out. The assembly software selected was Velvet and ABySS, several different k-mer values were selected for each assembly method, and the specific simulation results are given in Table 1. Fig. 9 and 10 give an illustration of the assembled error behavior at different sequencing coverage, confirming the rationality of the proposed multiple RS code interleaving and performing erasure correction schemes. As can be seen from FIGS. 9 and 10, as sequencing coverage increases, the performance of the multi-software multi-parameter assembly method improves, the number of residual errors and deletions decreases rapidly, and reliable recovery can be achieved. At a sequencing coverage of 20x, the error rates of the Velvet and ABySS assemblies alone are both at the edge of the error correction capability of the scheme (correcting 1.34% of symbol erasures, or correcting 0.66% of symbol errors), and there are cases where data recovery fails, as detailed in table 1. Given the large variability of the assembly and contig merging strategy, fig. 9 and 10 show the specific results of each experiment (e.g., "x" and "□" in the figure).
TABLE 1 specific data recovery analysis with interleaving of multiple RS codes
Figure BDA0003093176260000081
Figure BDA0003093176260000091
From the simulation results in table 1, it can be seen that when the coverage is 25x and 30x, 10 parallel experiments of all schemes are successfully decoded, and the robustness of the encoding method and the data recovery method is verified. Meanwhile, compared with the results of adopting a single k-mer, the multi-method and multi-k-mer have the advantages that the performance of the read assembly is obviously improved, and the fragment deletion and the errors are obviously reduced. Table 1 lists only the error conditions of large segments, and these errors after assembling and ARS identification are interleaved, so that the error correction capability of multiple high-rate RS codes can be fully utilized to obtain a very high success rate, and data recovery is successful under 25x and 30x in experimental tests. Table 1 also lists all cases where each data segment cannot be successfully decoded by independently encoding it with RS codes of the same parameters without using the interleaving scheme. If multiple RS code interleaving is not adopted, each data block is encoded by using a separate RS code, and since the RS code with a high code rate has limited error correction and erasure correction capability (correcting Nerasure 55 erasures, or Nerror 27 errors, or 2 Nerror + Nerasure 55), there may be a case where some segments cannot be decoded correctly, and data cannot be completely recovered.
Under 20 × sequencing data, contigs assembled with single k-mers had high error rates and could not be decoded correctly, and the data size in this section was large and not listed in table 1. However, the case of combining multiple k-mer values still achieves better performance. Firstly, Velvet software multi-k-mer assembly is adopted, 10 independent experiments are carried out, decoding failure after de-interleaving occurs only for the third time, and decoding is correct under other conditions. Furthermore, in the mixed assembly of Velvet and ABySS, all 10 groups of independent experiments obtain gains, and all decoding succeeds after de-interleaving. Namely, the scheme of the invention is verified to realize reliable data storage and have very high base logic density.
In the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A method for encoding and recovering using large fragment DNA storage, the method comprising:
(1) scrambling binary data, and uniformly processing P sub-blocks as a processing unit, wherein the size of each sub-block is k bits; each sub-block is grouped, error correcting code coding is carried out, different sub-blocks are interleaved, transcoding is carried out through the mapping process of 'symbol/bit-base sequence' to obtain P data DNA sequences, and the P data DNA sequences and P-1 replication starting site sequences are alternately embedded to form an artificial chromosome sequence;
(2) transferring the artificial chromosome into cells, and performing second-generation high-throughput sequencing after cell proliferation and nucleic acid extraction processes;
(3) assembling the sequencing read from the head to obtain a group of contigs, determining the positions of the data segments on the contigs according to the known replication start site, intercepting partial sequencing read corresponding to the data segments, then carrying out majority merging to obtain P data segments, deinterleaving the data segments, and sending the data segments to a decoder for decoding to obtain an information sequence.
2. The coding and recovery method using large-fragment DNA storage according to claim 1, wherein the step (1) is specifically:
(1.1) superimposing binary data with the length of N with a known pseudo-random sequence to realize scrambling, dividing the binary data into P sub-blocks, wherein the length of each sub-block is k, k is N/P, P is a positive integer and is a factor of N, and N/P is an integer;
(1.2) coding each sub-block by adopting grouped error correcting code coding, wherein the information bit length of the error correcting code is k bits, and the length of each sub-block after coding is n bits;
(1.3) sequencing the encoded subblocks according to a P row mode, wherein each row comprises n bits, decomposing the n bits into a plurality of groups of P-P bit or symbol units, and interweaving each unit respectively according to a shift cycle mode; if the bit cells are, the number of the bit cells is n; if the processing is carried out according to the symbols, the number of the symbols corresponding to n bits is n/s, and each symbol comprises s bits;
(1.4) obtaining P bit groups by the interleaved subblocks, wherein the size of each group is also n bits or n/s symbols, and the symbol groups are also converted into bit groups with the length of n;
(1.5) bit grouping according to the mapping rule: 00 → A, 01 → T, 10 → G, 11 → C to be transcoded into P data DNA sequences of length n/2.
3. The coding and recovery method using large-fragment DNA storage according to claim 1, wherein the step (3) is specifically:
(2.1) assembling the sequencing reads by utilizing a second-generation sequence assembling program based on a Debrelain map to obtain a group of contigs;
(2.2) identifying the replication start site sequence in each contig, determining the position of a data read according to the position of the replication start site sequence and intercepting a corresponding sequencing data read;
(2.3) carrying out large number merging on the part of the sequencing reads corresponding to each intercepted data read to obtain a merging sequence of P data reads;
(2.4) de-interleaving the P data segments according to the packet interleaving sequence to obtain P code words, wherein each code word has errors and deletion of bits or symbols;
and (2.5) sending the P code words obtained by de-interleaving into a decoder for block error correction erasure decoding to obtain an information sequence.
4. The method for coding and recovering by large-fragment DNA storage according to claim 3, wherein the step (2.1) is implemented by three assembly methods, respectively:
(1) the assembly of the second-generation sequencing reads is realized by using the same second-generation sequence assembly software Velvet under a k-mer value with a certain fixed length, and a group of contigs is obtained;
(2) the assembly of second-generation sequencing reads is realized under a plurality of k-mer values with different lengths by using the same second-generation assembly software, and a plurality of groups of contigs are obtained by combination;
(3) and (3) assembling the second-generation sequencing reads under a plurality of k-mer values with different lengths by using different second-generation assembling software, and combining the obtained multiple groups of contigs.
5. The coding and recovery method using large-fragment DNA storage according to claim 3, wherein the step (2.2) is specifically:
(3.1) comparing the copy starting sites subjected to forward and reverse complementation with the contigs respectively for editing distances, and judging whether the copy starting sites exist or not according to the size of the editing distances; if so, recording the state of the replication origin at that time and the position on the contig;
(3.2) if the replication starting site is positive, placing sequencing reads at two sides of the replication starting site into a cache region corresponding to the data segment; if the replication starting site is reversed and complemented, the sequencing reads at two sides of the replication starting site are reversed and complemented and then are placed into a cache region corresponding to the data segment;
(3.3) repeating steps (3.1) and (3.2) until all reads containing the replication origin or part of the replication origin are completely marked and assigned.
6. The coding and recovery method using large-fragment DNA storage according to claim 3, wherein the step (2.3) is specifically:
(4.1) the length of the DNA data segment is n/2, assuming that the number of partial sequencing reads in a cache region corresponding to the data segment is q, constructing a qn data block, putting the intercepted sequencing reads into the data block, if the length of the intercepted data segment is less than n/2, expressing the data block by 'X', and expressing the data block by X, and converting the data block into bit deletion in the subsequent process;
(4.2) performing majority judgment on the base at each position i of the filling block, namely taking one base with the largest number in { A, T, G, C } at each position in the filling block as the base at the position of the splicing sequence, marking the base as 'X' if the base is not filled in the position, obtaining a DNA data sequence with the length of n/2, and further converting the DNA data sequence into a bit sequence, wherein the 'X' is converted into a bit deletion.
CN202110601792.8A 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage Pending CN113345521A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110601792.8A CN113345521A (en) 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110601792.8A CN113345521A (en) 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage

Publications (1)

Publication Number Publication Date
CN113345521A true CN113345521A (en) 2021-09-03

Family

ID=77472965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110601792.8A Pending CN113345521A (en) 2021-05-31 2021-05-31 Coding and recovering method using large fragment DNA storage

Country Status (1)

Country Link
CN (1) CN113345521A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115514375A (en) * 2022-11-18 2022-12-23 江苏网进科技股份有限公司 Cache data compression method
CN116257146A (en) * 2023-05-15 2023-06-13 北京一起教育科技发展有限公司 Encoding and decoding method and device, electronic equipment and storage medium
WO2023130676A1 (en) * 2022-01-10 2023-07-13 天津大学 Dna storage cascade encoding and decoding methods for type-1 and type-2 segmented error correction internal codes

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110060734A (en) * 2019-03-29 2019-07-26 天津大学 A kind of high robust DNA sequencing bar code generating at and read method
CN110442472A (en) * 2019-07-03 2019-11-12 天津大学 A kind of DNA data storage mixing error correcting and data reconstruction method
CN110569974A (en) * 2018-06-06 2019-12-13 天津大学 DNA storage layered representation and interweaving coding method capable of containing artificial base

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569974A (en) * 2018-06-06 2019-12-13 天津大学 DNA storage layered representation and interweaving coding method capable of containing artificial base
CN110060734A (en) * 2019-03-29 2019-07-26 天津大学 A kind of high robust DNA sequencing bar code generating at and read method
CN110442472A (en) * 2019-07-03 2019-11-12 天津大学 A kind of DNA data storage mixing error correcting and data reconstruction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈为刚 等: ""细胞内大片段DNA数据存储的多RS码交织编码"", 《合成生物学》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023130676A1 (en) * 2022-01-10 2023-07-13 天津大学 Dna storage cascade encoding and decoding methods for type-1 and type-2 segmented error correction internal codes
CN115514375A (en) * 2022-11-18 2022-12-23 江苏网进科技股份有限公司 Cache data compression method
CN116257146A (en) * 2023-05-15 2023-06-13 北京一起教育科技发展有限公司 Encoding and decoding method and device, electronic equipment and storage medium
CN116257146B (en) * 2023-05-15 2023-07-14 北京一起教育科技发展有限公司 Encoding and decoding method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN113345521A (en) Coding and recovering method using large fragment DNA storage
Heckel et al. Fundamental limits of DNA storage systems
Song et al. Codes with run-length and GC-content constraints for DNA-based data storage
CN111600609B (en) DNA storage coding method for optimizing Chinese storage
CN107798219B (en) Method for biologically storing and restoring data
WO2018148260A1 (en) Apparatus, method and system for digital information storage in deoxyribonucleic acid (dna)
KR102269463B1 (en) How to code and decode information
EP3509018B1 (en) Method for biologically storing and restoring data
CN110706751A (en) DNA storage encryption coding method
CN110569974B (en) DNA storage layered representation and interweaving coding method capable of containing artificial base
CN104850760A (en) Artificially synthesized DNA storage medium with coding information, storage reading method for information, and applications
CN110442472B (en) DNA data storage mixed error correction and data recovery method
CN101958150B (en) Memory apparatus and operation method thereof
CN110708076B (en) DNA storage coding and decoding method based on hybrid model
CN111858507B (en) DNA-based data storage method, decoding method, system and device
CN106874140B (en) Data storage method and device
CN110060734A (en) A kind of high robust DNA sequencing bar code generating at and read method
CN114974429A (en) DNA storage coding method and device based on decimal system and readable storage medium
Wang et al. Hidden addressing encoding for DNA storage
CN102545914B (en) BCH (Broadcast Channel) encoding and decoding method and device
CN115459781A (en) Long sequence DNA storage coding method based on static interleaving coding
CN108712232A (en) A kind of multi-code word parallel decoding method in continuous variable quantum key distribution system
Haughton et al. A modified watermark synchronisation code for robust embedding of data in DNA
CN111243670A (en) DNA information storage coding method meeting biological constraint
CN114927169A (en) Distributed array storage and high-capacity error-correction DNA storage technology (Bio-RAID) based on microorganisms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210903