CN116564424A - DNA data storage method, reading method and terminal based on erasure codes and assembly technology - Google Patents

Publication number
CN116564424A
CN116564424A CN202210114165.6A CN202210114165A CN116564424A CN 116564424 A CN116564424 A CN 116564424A CN 202210114165 A CN202210114165 A CN 202210114165A CN 116564424 A CN116564424 A CN 116564424A
Authority
CN
China
Prior art keywords
module
information
data
dna
modules
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210114165.6A
Other languages
Chinese (zh)
Inventor
姜朔
张璐帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Codon Hangzhou Technology Co ltd
Original Assignee
Codon Hangzhou Technology Co ltd
Application filed by Codon Hangzhou Technology Co ltd
Priority to CN202210114165.6A
Publication of CN116564424A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

A DNA data storage method, reading method and terminal based on erasure codes and assembly technology, wherein the DNA data storage method comprises the following steps: S1, obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding; S2, selecting adapted modules from a preset DNA module library through module mapping coding, and determining the order of the modules to obtain a corresponding module combination; and S3, connecting the modules in the module combination into corresponding DNA molecular chains so as to complete the data storage. The method improves the usable capacity of DNA by introducing erasure codes. By introducing module mapping codes, DNA data storage based on an assembly technology is realized: specific module combinations are selected from a preset DNA module library and connected by reaction into DNA molecular chains, which improves speed and cost compared with obtaining DNA molecular chains by synthesis.

Description

DNA data storage method, reading method and terminal based on erasure codes and assembly technology
Technical Field
The application belongs to the technical field of data storage, and particularly relates to a data storage method, a data reading method, a data storage device, a data reading device, terminal equipment and a computer readable storage medium based on DNA, in particular to a DNA data storage method, a data reading method and a corresponding terminal based on erasure codes and an assembly technology.
Background
In the age of explosive digital data growth, DNA is being explored as a next-generation molecular storage medium. Data can be written in vitro by synthesizing DNA molecules from the four natural nucleotides (A, T, G, and C), and read back from the DNA by sequencing. DNA data storage achieves extremely high data densities because the molecules are manipulated at the atomic level. DNA is very stable in liquids at relatively high temperatures, providing high durability (long retention) compared with many existing storage media. Data in DNA can easily be copied hundreds of millions of times by simple PCR reactions at low energy cost. DNA data retrieval benefits greatly from the revolution in sequencing technologies, including Illumina next-generation sequencing (NGS) and third-generation nanopore sequencing, which can rapidly sequence human and other genomes at ever-decreasing prices.
However, current strategies for DNA data storage also face challenges. Storing any data set requires template-free synthesis of specific long DNA molecules by chemical or enzymatic methods, which is still very expensive, time-consuming, labor-intensive and error-prone. The synthesized DNA cannot be reused to store other data sets, which further increases storage costs. For data retrieval, Illumina sequencing-by-synthesis must cut long data DNA molecules into short fragments (< 300 bases) to maintain low error rates (< 0.1%), and complex post-sequencing bioinformatics analysis is required to assemble the fragmented data.
For these reasons, efforts have been made to write data into universal DNA sequences, including natural DNA. Many problems remain to be solved before practical application. A great advantage of DNA as a storage medium is the stability of DNA molecules, which can be preserved for up to a hundred years without human intervention. However, most data is preprocessed by some algorithm before being stored. If the data is to be read after several decades or hundreds of years, both the data and the preprocessing algorithm applied to it must be known, and there is no guarantee that the preprocessing algorithm will still exist in complete form.
Disclosure of Invention
The invention aims to provide a DNA data storage method, a reading method, a data storage device, a data recovery device, a terminal device and a computer readable storage medium based on erasure codes and an assembly technology. These address two problems: stored data cannot be read quickly or cannot be recovered because the continued availability of the preprocessing algorithm cannot be guaranteed, and writing is slow and costly when the data is stored.
In a first aspect, the present embodiment provides a DNA data storage method based on erasure coding and assembly technology, including:
S1, obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
S2, selecting adapted modules from a preset DNA module library through module mapping coding, and determining the order of the modules to obtain a corresponding module combination;
S3, connecting the modules in the module combination into corresponding DNA molecular chains so as to complete the data storage.
In a possible implementation manner of the first aspect, step S1 further includes:
obtaining binary data of the data to be stored, dividing the binary data into fragments of a fixed bit length to obtain k fragments, and obtaining n fragments from the k fragments through erasure code encoding, wherein n is greater than or equal to k.
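The division into k fragments and the expansion to n ≥ k fragments can be illustrated with a minimal sketch. The single XOR parity fragment used here (n = k + 1) is only the simplest possible erasure code, chosen for illustration; it is an assumption and not the code specified by the application. All function names are hypothetical.

```python
from functools import reduce


def split_into_fragments(bits, frag_len):
    """Split a bit string into fragments of frag_len bits, zero-padding the last one."""
    padded = bits + "0" * (-len(bits) % frag_len)
    return [padded[i:i + frag_len] for i in range(0, len(padded), frag_len)]


def xor_bits(a, b):
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def encode_with_parity(fragments):
    """Illustrative erasure code (assumption): append one XOR parity fragment, so n = k + 1."""
    return fragments + [reduce(xor_bits, fragments)]


def recover_data(received):
    """Recover the k data fragments from n entries of which at most one is None (lost)."""
    missing = [i for i, frag in enumerate(received) if frag is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment can repair only one loss")
    repaired = list(received)
    if missing:
        present = [frag for frag in received if frag is not None]
        # XOR of all surviving fragments equals the lost fragment.
        repaired[missing[0]] = reduce(xor_bits, present)
    return repaired[:-1]  # drop the parity fragment


fragments = split_into_fragments("1011001110", 4)   # k = 3 fragments
encoded = encode_with_parity(fragments)             # n = 4 fragments
damaged = [encoded[0], None, encoded[2], encoded[3]]
restored = recover_data(damaged)
```

Any one of the four fragments may be lost and the three data fragments are still recoverable, which is the property the erasure code step relies on.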
In a possible implementation manner of the first aspect, the selecting the adapted module combination from the preset DNA module library in step S2 through module mapping encoding further includes:
each segment is respectively provided with meta information and content information;
and mapping and encoding the meta information and the content information using a meta information DNA module library and a content information DNA module library respectively, wherein the size of each library corresponds to the rule used to recode the address: if recoding uses a digits in base b, the library contains a×b modules, each group of modules corresponds to one digit of the recoded information, and the different modules in each group represent the value of that digit.
In a possible implementation manner of the first aspect, "mapping encoding the content information using a content information DNA module library" further includes:
dividing the binary information into a plurality of short messages of m bits each, wherein each short message has corresponding address information;
recoding the address information as a digits in base b;
obtaining a data-address pair corresponding to each recoded short message, wherein when m = 1, only the data-address pairs whose data is 1, or only those whose data is 0, are stored; and when m > 1, the data-address pairs for any $2^m - 1$ of the $2^m$ possible data values are stored;
and carrying out module mapping on the data-address pair information by using the content information DNA module library, so as to obtain the module combination adapted to the data-address pair of each short message and the corresponding coding information of each module.
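The data-address pair construction above can be sketched as follows. This is an illustrative reading of the text, not the applicant's implementation: it assumes that exactly one data value (here `omitted`, a hypothetical parameter) is never stored, so that a missing address implies that value on read-back. Function names are hypothetical.

```python
def to_base_digits(value, base, width):
    """Recode a non-negative integer as `width` digits in `base`, most significant first."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, base)
        digits.append(remainder)
    if value:
        raise ValueError("address does not fit in the given number of digits")
    return digits[::-1]


def data_address_pairs(bits, m, a, b, omitted=0):
    """Split `bits` into m-bit short messages and keep (data, address digits) pairs.

    Pairs whose data equals `omitted` are not stored (assumption), so only
    2**m - 1 of the 2**m possible data values ever appear in a stored pair.
    """
    if len(bits) % m:
        raise ValueError("bit length must be a multiple of m")
    pairs = []
    for address in range(len(bits) // m):
        data = int(bits[address * m:(address + 1) * m], 2)
        if data != omitted:
            pairs.append((data, tuple(to_base_digits(address, b, a))))
    return pairs


def pairs_to_bits(pairs, count, m, b, omitted=0):
    """Rebuild the bit string; addresses with no stored pair take the omitted value."""
    values = [omitted] * count
    for data, digits in pairs:
        address = 0
        for digit in digits:
            address = address * b + digit
        values[address] = data
    return "".join(format(v, "0{}b".format(m)) for v in values)


pairs = data_address_pairs("10010011", m=2, a=2, b=3)
rebuilt = pairs_to_bits(pairs, count=4, m=2, b=3)
```

With m = 2, the short message `00` at address 2 produces no pair, yet the original bit string is still recovered because the absent address is filled with the omitted value.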
In a possible implementation manner of the first aspect, "encoding the meta information map using a meta information DNA module library" further includes:
each meta-information corresponds to a module combination, the meta-information DNA module library is used for carrying out module mapping on each bit of information of each meta-information, each group of modules corresponds to one bit of recoded information, and different modules in each group represent the content of the bit so as to obtain the meta-information adaptive module combination of each short message and the corresponding coding information of each module.
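A module library with a groups of b modules, and the digit-by-digit mapping it supports, might look like the sketch below. The module names (`M0-1` and so on) are placeholders invented for illustration; in practice each name would stand for a pre-synthesized DNA module with a specific base sequence. Reading "a×b modules" as a groups of b modules is an interpretation of the text.

```python
def build_module_library(a, b, prefix="M"):
    """Hypothetical library: one module per (digit position, digit value), a*b in total."""
    return {(pos, digit): "{}{}-{}".format(prefix, pos, digit)
            for pos in range(a) for digit in range(b)}


def encode_value(value, a, b, library):
    """Map an integer (e.g. a piece of meta information) to an ordered list of a modules."""
    digits = []
    for _ in range(a):
        value, remainder = divmod(value, b)
        digits.append(remainder)
    if value:
        raise ValueError("value does not fit in a digits of base b")
    digits.reverse()  # most significant digit first
    return [library[(pos, digit)] for pos, digit in enumerate(digits)]


def decode_modules(modules, b, library):
    """Inverse mapping: recover the integer from an ordered list of module names."""
    reverse = {name: key for key, name in library.items()}
    value = 0
    for expected_pos, name in enumerate(modules):
        pos, digit = reverse[name]
        if pos != expected_pos:
            raise ValueError("module {} is out of order".format(name))
        value = value * b + digit
    return value


library = build_module_library(a=3, b=4)
modules = encode_value(27, a=3, b=4, library=library)
```

Because each module name encodes its own digit position, a module found in the wrong place can be detected during decoding.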
In a possible implementation manner of the first aspect, the determining, in step S2, "determining the order of the modules to obtain the corresponding module combination" further includes: the meta information and the content information of each fragment are mapped and encoded into a current fragment combination unit respectively, the sequence of the corresponding fragment combination units is obtained according to the sequence of the fragments, and one or more module combinations of the binary data and the sequence among each module in the module combinations are obtained.
In a possible implementation manner of the first aspect, when the erasure code encoding uses a linear erasure code, the meta information records information about the corresponding fragment, including the address information of the fragment in the original data; when the erasure code encoding uses a fountain code, the meta information records the degree and the combination of fragments included.
In a possible implementation manner of the first aspect, the content information includes one of the following: the data information of the fragment; or the data information of the fragment together with the redundancy information of the fragment. The redundancy information is generated by any error correction code from the meta information and the data information, and is used for information recovery when a read error occurs in the fragment.
Step S3 further comprises:
Finding out the module combination of the meta information and the data-address pair of the content information of each segment of the binary data and the corresponding coding information of each module, and forming a segment combination unit corresponding to the segment;
each module of the fragment combination unit is a segment of DNA module containing a specific base sequence, and the corresponding DNA modules are connected into corresponding DNA molecule chains through enzyme catalysis reaction according to the preset module sequence.
In a possible implementation manner of the first aspect, "ligating the corresponding DNA modules into the corresponding DNA molecular chains by enzyme-catalyzed reaction according to the predetermined module order" further comprises:
the terminal module of the meta information of the segment combination unit is connected with the first module of the content information, and the modules of the content information in the same segment combination unit are sequentially connected;
the fragment combination units are connected to one another according to a preset order between modules: either the first module of the next fragment combination unit is connected to the end module of the current fragment combination unit, or the end module of the current fragment combination unit is the tail DNA module of the current DNA molecular chain and the first module of the next fragment combination unit becomes the first DNA module of a new DNA molecular chain, so that one or more DNA molecular chains are formed and stored together.
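The connection order described above can be modelled in software as plain string concatenation, as in the following sketch. It only illustrates the ordering rules (meta modules, then content modules, then either the next unit or a new chain); the enzyme-catalyzed ligation itself is a wet-lab step that code cannot represent. The limit `units_per_strand`, the module names and the two-base sequences are all hypothetical.

```python
def assemble_strands(units, library, units_per_strand):
    """Concatenate fragment combination units into one or more strands.

    units: list of (meta_modules, content_modules), each a list of module names.
    library: maps module name -> base sequence (placeholder sequences here).
    units_per_strand: assumed limit after which a new strand is started.
    """
    strands = []
    for start in range(0, len(units), units_per_strand):
        group = units[start:start + units_per_strand]
        # Within a unit: meta modules first, then content modules, in order.
        ordered = [name for meta, content in group for name in meta + content]
        strands.append("".join(library[name] for name in ordered))
    return strands


library = {"A": "AC", "B": "GT", "C": "TT", "D": "GG"}
units = [(["A"], ["B", "C"]), (["D"], ["B"]), (["A"], ["C"])]
strands = assemble_strands(units, library, units_per_strand=2)
```

The third unit does not fit on the first strand, so its first module starts a second strand, matching the second branch of the rule.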
In a possible implementation manner of the first aspect, "dividing into k segments according to a bit with a certain length" and obtaining n segments from the divided k segments by erasure code coding further includes:
the original binary sequence data is partitioned into k data packets, i.e. $S = s_1, s_2, s_3, \ldots, s_k$;
a plurality of new fragments $t_n$ are formed through LT coding, each new fragment being the exclusive-or of d ($d \le k$) randomly selected data packets $s_i$, wherein the degree d is drawn from a degree probability distribution function;
for each fragment obtained after LT coding, the degree and the combination of data packets are obtained through a pseudo-random number generator whose input is the meta information, so that the meta information determines which data packets were combined by exclusive-or to obtain the fragment; the meta information is placed at the front of the fragment information, the data information in the middle is the result of the exclusive-or of the selected data packets, and the redundancy information at the back is an error correction code computed over the meta information and the data information.
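A minimal sketch of the LT encoding step follows. It assumes that the meta information is an integer seed for Python's `random.Random`, and it uses the ideal soliton distribution as the degree distribution; both choices are illustrative assumptions, since the text does not fix the generator or the distribution. The trailing error correction code is omitted.

```python
import random


def ideal_soliton_weights(k):
    """Weights for degrees 1..k under the ideal soliton distribution (assumed choice)."""
    return [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]


def choose_packets(seed, k):
    """Derive the degree d and the packet indices deterministically from the seed."""
    rng = random.Random(seed)
    degree = rng.choices(range(1, k + 1), weights=ideal_soliton_weights(k))[0]
    return sorted(rng.sample(range(k), degree))


def lt_encode(packets, seed):
    """Return (meta information, data information) for one LT-coded fragment.

    packets: the k original data packets, represented as integers.
    seed: the meta information; a reader re-runs choose_packets(seed, k)
          to learn which packets were XORed together.
    """
    value = 0
    for index in choose_packets(seed, len(packets)):
        value ^= packets[index]
    return seed, value


packets = [0b1010, 0b0110, 0b1111, 0b0001]
meta, data = lt_encode(packets, seed=7)
```

Storing only the seed keeps the meta information short: the list of combined packets is regenerated on the reading side instead of being written into the DNA.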
In a possible implementation manner of the first aspect, "dividing into k segments according to a bit with a certain length" and obtaining n segments from the divided k segments by erasure code coding further includes:
The original binary sequence data is partitioned into k data packets, i.e. $S = s_1, s_2, s_3, \ldots, s_k$;
using a linear block code as the erasure code, n data packets are computed from the original k data packets according to a generator matrix G, wherein each newly generated data packet is a linear combination of the original data packets;
in the newly generated fragment information: the meta information is placed at the front of the fragment information and is the index of each new data packet and the corresponding column number of the generator matrix; the data information in the middle is the new data packet produced by the generator matrix; the redundancy information at the back is an error correction code computed over the meta information and the data information inside the fragment.
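The generator-matrix construction can be sketched over GF(2), where a linear combination is simply an XOR of the selected packets. The binary field and the small example matrix are simplifying assumptions: a binary code like this one is generally not recoverable from every set of k columns, and a code with that property (for example Reed-Solomon over a larger field) would be needed to guarantee m = k.

```python
def encode(packets, G):
    """Packet j of the output is the XOR of the original packets i with G[i][j] == 1."""
    k, n = len(G), len(G[0])
    encoded = []
    for j in range(n):
        value = 0
        for i in range(k):
            if G[i][j]:
                value ^= packets[i]
        encoded.append(value)
    return encoded


def decode(received, G):
    """Solve for the k original packets by Gaussian elimination over GF(2).

    received: dict mapping column number (the meta information) -> packet value.
    """
    k = len(G)
    rows = [([G[i][j] for i in range(k)], value) for j, value in received.items()]
    for col in range(k):
        pivot = next((r for r in range(col, len(rows)) if rows[r][0][col]), None)
        if pivot is None:
            raise ValueError("received packets are not linearly independent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(len(rows)):
            if r != col and rows[r][0][col]:
                coeffs = [x ^ y for x, y in zip(rows[r][0], rows[col][0])]
                rows[r] = (coeffs, rows[r][1] ^ rows[col][1])
    return [rows[i][1] for i in range(k)]


# Systematic example with k = 3, n = 5: the first three columns are the identity.
G = [[1, 0, 0, 1, 1],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1]]
original = [5, 9, 12]
encoded = encode(original, G)
recovered = decode({1: encoded[1], 3: encoded[3], 4: encoded[4]}, G)
```

Here two of the three original packets are lost, and they are rebuilt from one surviving original packet and the two newly constructed packets.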
In a second aspect, a DNA data reading method based on erasure coding and assembly techniques, comprising:
carrying out module recognition on the DNA molecular chain;
decoding module mapping information according to the module information to obtain corresponding binary data;
and reconstructing the original data content of the binary data according to erasure code decoding operation, and realizing data reading.
In a possible implementation manner of the second aspect, the module identification includes obtaining the corresponding base sequence by DNA sequencing, and further obtaining the module information by base sequence matching.
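Module identification by base sequence matching might be sketched as below. The sketch assumes that all modules have the same length, that a read contains only whole modules, and that a small number of sequencing mismatches per module is tolerated; none of these conditions is stated in the text, and the sequences and module names are placeholders.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def identify_modules(read, library, max_mismatches=1):
    """Cut a sequenced read into module-sized chunks and match each to the library.

    library: maps module name -> base sequence (all of equal length, by assumption).
    """
    length = len(next(iter(library.values())))
    if any(len(seq) != length for seq in library.values()):
        raise ValueError("this sketch assumes equal-length modules")
    if len(read) % length:
        raise ValueError("read length is not a whole number of modules")
    names = []
    for start in range(0, len(read), length):
        chunk = read[start:start + length]
        best = min(library, key=lambda name: hamming(chunk, library[name]))
        if hamming(chunk, library[best]) > max_mismatches:
            raise ValueError("no module matches chunk {}".format(chunk))
        names.append(best)
    return names


library = {"M0-0": "ACGTAC", "M0-1": "TTGCAA", "M1-0": "GGATCC"}
exact = identify_modules("TTGCAAGGATCCACGTAC", library)
noisy = identify_modules("TTGCATGGATCCACGTAC", library)  # one substitution error
```

Because the library is fixed in advance, a single substitution in a chunk still resolves to the intended module as long as the library sequences are sufficiently far apart.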
In a possible implementation manner of the second aspect, "the module map encoding information is decoded and converted into the binary data" further includes:
the meta information and the content information are obtained by decoding the module coding information; when the content information contains redundancy information, the redundancy information is used to perform error correction on the meta information and the data information, so that the information of each fragment can be obtained independently, and enough fragments (m fragments, with m greater than or equal to k) can be obtained through reading.
In a third aspect, a DNA data storage device based on erasure coding and assembly techniques, comprising:
erasure code encoding unit: obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
library of DNA modules: the module library contains a×b modules, each group of modules corresponds to one digit of the recoded information, and the different modules in each group represent the value of that digit;
a module mapping encoding unit: the module mapping codes are used for selecting an adaptive module combination from the DNA module library and converting the binary data into module coding information required by DNA assembly;
DNA information writing unit: the modules in the module combination are linked into corresponding DNA molecular chains to complete the data storage.
In a possible implementation manner of the third aspect, the module mapping encoding unit further includes a meta information module mapping unit, a content information module mapping unit and a fragment combining unit,
the content information module mapping unit is used for dividing binary information into a plurality of short messages of m bits each, wherein each short message has corresponding address information, and recoding the address information as a digits in base b; obtaining a data-address pair corresponding to each recoded short message, and carrying out module mapping on the data-address pair information by using the content information DNA module library, so as to obtain the module combination adapted to the data-address pair of each short message and the corresponding coding information of each module;
and the meta information module mapping unit is used for carrying out module mapping on each digit of each piece of meta information by using the meta information DNA module library, wherein each group of modules corresponds to one digit of the recoded information and the different modules in each group represent the value of that digit, so as to obtain the module combination adapted to the meta information of each short message and the corresponding coding information of each module.
Fragment combining unit: used for finding, for each fragment of the binary data, the module combinations of the meta information and of the data-address pairs of the content information, together with the corresponding coding information of each module, and forming the fragment combination unit corresponding to the fragment; each module of the fragment combination unit is a segment of DNA containing a specific base sequence; the modules of the meta information are connected in sequence, the end module of the meta information is connected with the first module of the content information, and the modules of the content information are connected in sequence; between fragment combination units, either the first module of the next fragment combination unit is connected to the end module of the current fragment combination unit, or the end module of the current fragment combination unit is the tail DNA module of the current DNA molecular chain and the first module of the next fragment combination unit becomes the first DNA module of a new DNA molecular chain, so that one or more DNA molecular chains are formed and stored together.
In a possible implementation manner of the third aspect, the erasure code coding unit further includes: dividing original binary data according to a bit with a certain length to obtain k fragments, obtaining n fragments (n is more than or equal to k) from the divided k fragments, and when linear erasure codes are used, keeping the original k fragments by new n fragments and newly constructing (n-k) fragments based on the k fragments; when fountain codes are used as erasure codes, each newly generated segment is obtained by performing exclusive or operation on one or more of the k original segments.
In a fourth aspect, a DNA data reading apparatus based on erasure coding and assembly techniques, comprising:
module identification unit: carrying out module recognition on the DNA molecular chain;
decoding unit: used for decoding the module mapping information according to the module information to obtain the corresponding binary data;
erasure code decoding unit: used for reconstructing the original data content from the binary data by an erasure code decoding operation, thereby realizing data reading.
In one possible implementation manner of the fourth aspect, the module identification unit is used for sequencing a DNA molecular chain to obtain a base sequence; the module mapping coding unit performs module matching against a preset module library according to the base sequence result.
A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method of any of the data storage of the first aspect or the method of any of the data reading of the second aspect when the computer program is executed.
A computer readable storage medium storing a computer program which when executed by a processor implements the method of data storage of any of the first aspects or the method of data reading of any of the second aspects.
A computer program product which, when run on a terminal device, causes the terminal device to perform the method of data storage as claimed in any one of the first aspects or the method of data reading as claimed in the second aspect.
In a fifth aspect, there is provided a molecular module data access method based on erasure codes and assembly techniques, the storing process further comprising:
S1, obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
S2, selecting adapted modules from a preset molecular module library through module mapping coding, and determining the order of the modules to obtain a corresponding module combination;
S3, connecting the modules in the module combination into corresponding molecular modules so as to complete the data storage;
the reading process further includes:
S4, carrying out module identification on the molecular module;
S5, decoding the module mapping information according to the module information to obtain the corresponding binary data;
S6, reconstructing the original data content from the binary data by an erasure code decoding operation, thereby realizing data reading.
According to the invention, erasure codes are concatenated with module mapping codes, and electronic data is physically stored through DNA assembly based on a pre-synthesized module library. The erasure code converts the original k pieces of information into more than k pieces, each piece then guides DNA assembly through module mapping coding to realize DNA data storage, and at read time the information can be reconstructed from a subset of the new pieces. That is, compared with traditional multi-copy DNA data storage, the method of the present invention increases the usable capacity of DNA by introducing erasure codes. The original data can be reconstructed by reading any m of the newly generated fragments (the value of m depends on the specific erasure code used; for example, m = k for a linear erasure code, and m is slightly larger than k for a fountain code). By introducing the module mapping codes, DNA data storage based on an assembly technology is realized: specific module combinations are selected from a preset DNA module library and connected by reaction into DNA molecular chains, which improves speed and cost compared with obtaining DNA molecular chains by synthesis.
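Reconstruction from a subset of fragments can be illustrated for the fountain-code case with a peeling decoder, a standard way to decode LT-style codes. The text does not name a decoding algorithm, so this choice is an assumption, and the fragments are written out by hand rather than produced by a particular encoder.

```python
def peel_decode(fragments, k):
    """Recover k original packets from XOR-combined fragments by peeling.

    fragments: list of (set of packet indices, XOR of those packets).
    Returns the list of k packets, or None if the fragments are insufficient.
    """
    known = {}
    pending = [(set(indices), value) for indices, value in fragments]
    progress = True
    while progress and len(known) < k:
        progress = False
        remaining = []
        for indices, value in pending:
            # Remove every packet that is already known from this fragment.
            for i in [i for i in indices if i in known]:
                indices.discard(i)
                value ^= known[i]
            if len(indices) == 1:
                (i,) = indices
                known[i] = value  # a degree-1 fragment reveals one packet
                progress = True
            elif indices:
                remaining.append((indices, value))
        pending = remaining
    if len(known) < k:
        return None
    return [known[i] for i in range(k)]


original = [5, 9, 12]
fragments = [({0}, 5), ({0, 1}, 5 ^ 9), ({1, 2}, 9 ^ 12), ({0, 1, 2}, 5 ^ 9 ^ 12)]
decoded = peel_decode(fragments, k=3)
partial = peel_decode(fragments[1:3], k=3)
```

Decoding succeeds once at least one fragment of degree one is available to start the peeling, which is why a fountain code typically needs slightly more than k fragments.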
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the embodiments or for the description of the prior art are briefly described below. The drawings in the following description are obviously only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a schematic system architecture diagram of an application scenario provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method for storing DNA data based on erasure coding and assembly techniques according to an embodiment of the present application;
FIG. 3 is a flow chart of DNA data reading based on erasure coding and assembly techniques;
FIG. 4 is a diagram of an example of a complete flow of data writing and reading;
FIG. 5 is an exemplary diagram of the overall process of concatenated coding of erasure codes and modular map coding;
Fig. 6 is a second exemplary diagram of the overall process of concatenated coding of erasure codes and modular map coding;
Fig. 7 is a representation of clip information.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted as "when" or "once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted, depending on the context, as meaning "upon determining" or "in response to determining" or "upon detecting a [described condition or event]" or "in response to detecting a [described condition or event]".
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
With the wide application of digital information and the rapid development of big data science, information data generated by people every day is exponentially increased, and the existing traditional storage medium can not meet the demands gradually. The DNA molecular chain is taken as a novel storage medium, and has the advantages of high storage density, long storage time, low maintenance cost, strong stability and the like, so that the DNA molecular chain is widely paid attention to.
At present, research on DNA storage focuses mainly on improving storage efficiency and fault tolerance through efficient encoding and decoding, and on reducing synthesis and sequencing costs; research on DNA-based storage media focuses more on theory and material structure.
The general procedure for DNA storage is that digital data is encoded into a DNA base sequence, DNA is synthesized according to the encoded base sequence, and the DNA is stored in an in-vitro or in-vivo storage medium. The DNA can be synthesized by writing the nucleotide base sequence with a synthesizer, after which the pooled liquid is used as the storage medium. Data is read with a sequencer and then restored by a subsequent decoding process.
Existing work on DNA storage is mainly aimed at efficient coding and decoding methods, related data modeling, and the materials and structures of DNA storage media. In DNA-based data storage processes, the digital data is preprocessed before storage, for example by compression, redundancy removal, encryption or encoding. If the data is still to be restored after decades or hundreds of years of storage, both the data and its corresponding preprocessing algorithm must be known; however, after such a long time there is no guarantee that the preprocessing algorithm still exists in complete form. If the external preprocessing algorithm is lost in a practical scenario of long-term, large-scale data storage in an uncertain environment, data preprocessed by encryption and the like cannot be completely and reliably recovered.
For this reason, the present application introduces not only erasure code encoding but also module mapping coding. Erasure coding (EC) is a data protection method that divides data into fragments, expands and encodes them with redundant data blocks, and stores them in different locations. An erasure code constructs a mathematical function describing a set of numbers, so that the set can be recovered if one or some of the numbers are lost. The module mapping coding enables DNA data storage based on assembly techniques.
Fig. 1 is a schematic system architecture diagram of an application scenario according to an embodiment of the present application. Fig. 1 illustrates an end-to-end full-flow DNA storage system architecture for implementing data self-storage and self-recovery, which is provided in an embodiment of the present application, and as shown in fig. 1, before data is stored, erasure code encoding is performed on digital data (original data) in different formats to obtain binary data to be encoded; selecting an adaptive module combination from a preset DNA module library through module mapping coding, and converting the binary data into module coding information required by DNA assembly; the module coding information is connected into DNA molecular chains in a reaction way and stored in-vivo and in-vitro storage media. In the storage process, according to the characteristics of different storage media of DNA, the traditional silicon-based storage media are combined, and the layout of data in the DNA storage media is optimized. When analysis is needed, module identification is carried out on the DNA molecular chain, module matching is carried out according to a preset module library, and module mapping coding information is obtained; the module maps the decoding and conversion of the coded information into the binary data; and reconstructing original binary data content of the binary data according to erasure code decoding operation, so as to realize data reading. The module identification can adopt sequencing to obtain a base sequence; and then carrying out module matching according to a preset module library according to a base sequence result.
Complete recovery of all data can thus be achieved even if part of the DNA data is lost during storage, since the complete original data can be recovered from a subset of the DNA information. In application scenarios involving large-scale, complex data storage, the data stored in DNA is self-contained and self-describing, which ensures the reliability of large-scale data storage in a long-term uncertain environment and the integrity of data recovery. More importantly, an adapted module combination is selected from a preset DNA module library through module mapping coding, and the binary data is converted into the module coding information required for DNA module assembly; the modules are joined by reaction into a DNA molecular chain, so that chains of the required length are built entirely from modules and can be joined quickly. This addresses the low writing speed and high cost encountered when DNA storage is combined with biosynthesis technology for data storage. Writing is fast, modular and low in cost, and recovery of the data stored in DNA is reliable and fast.
One complete data manipulation process includes writing data into DNA molecular chains and reading or recovering data from them. Within the scope of the present invention, however, any reading or writing of data in a DNA molecular chain that is realized using erasure codes and assembly techniques is an example of the present invention.
First example
Please refer to fig. 2, which is a flowchart of a DNA data storage method based on erasure codes and assembly technique. It comprises the following steps:
S110, obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
S120, selecting adapted modules from a preset DNA module library through module mapping coding, and determining the order of the modules to obtain a corresponding module combination;
S130, connecting the modules in the module combination into corresponding DNA molecular chains, thereby completing the data storage.
Please refer to fig. 3, which is a flowchart of DNA data reading based on erasure coding and assembly technique. It comprises the following steps:
S210, performing module identification on a DNA molecular chain (for example, sequencing the DNA and matching the base sequence obtained by sequencing to obtain module information);
S220, decoding the module mapping information according to the module information to obtain corresponding binary data;
S230, reconstructing the original data content from the binary data by the erasure code decoding operation, thereby realizing data reading.
Module identification by DNA sequencing is only one means of identification.
First embodiment
The complete flow of data writing and reading mainly comprises the following steps: (1) erasure code encoding; (2) module mapping encoding; (3) DNA assembly and storage. The schematic diagram is shown in fig. 4.
S1: Erasure code encoding
If the obtained data to be stored is non-binary data, it is first converted into binary data. The binary data is divided into k fragments of a fixed bit length, and n fragments (n is greater than or equal to k) are then obtained from the k fragments through erasure code encoding. Different erasure codes produce the new fragments in different ways. For example, with a linear erasure code, the n new fragments retain the original k fragments and (n-k) fragments are newly constructed from them; with a fountain code, each newly generated fragment is obtained by performing an exclusive-or (XOR) operation on one or more of the k original fragments.
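The splitting and systematic encoding described in S1 can be sketched as follows. This is a minimal illustration, assuming zero padding of the last fragment and a single XOR parity fragment (n = k + 1) as the simplest linear erasure code; the function names are illustrative and do not come from the original text, and a practical system would likely use a stronger code.

```python
from functools import reduce

def split_bits(bits: str, frag_len: int) -> list[str]:
    """Split a bit string into fragments of frag_len bits, zero-padding the last one."""
    padded_len = -(-len(bits) // frag_len) * frag_len  # round up to a multiple
    bits = bits.ljust(padded_len, "0")
    return [bits[i:i + frag_len] for i in range(0, len(bits), frag_len)]

def xor_bits(a: str, b: str) -> str:
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))

def encode_single_parity(fragments: list[str]) -> list[str]:
    """Systematic encoding: keep the k originals and append one XOR parity fragment."""
    parity = reduce(xor_bits, fragments)
    return fragments + [parity]

k_fragments = split_bits("1001101001", 4)        # k = 3 fragments after padding
n_fragments = encode_single_parity(k_fragments)  # n = 4 fragments
print(n_fragments)
```

The first k entries of the output are the unchanged original fragments, which is the "retain the original k fragments" property mentioned above.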
Erasure codes are commonly encountered in distributed storage. When the redundancy level is n+m (here n denotes the number of source data blocks, unlike the n ≥ k notation above), m check blocks are calculated from n source data blocks, and the n+m blocks are stored in n+m storage locations, so that the failure of any m storage locations can be tolerated; when storage locations fail, all source data can be recovered by computation from any n intact blocks. If the n+m data blocks are spread across different storage nodes, m node failures can be tolerated. If the storage nodes were simply replaced by DNA synthesized by nucleotide-by-nucleotide extension, the strong data recovery capability of erasure codes in distributed storage could not be realized. Therefore, the address and/or content of the information is recoded, and the fragment combination units in the prefabricated module library are reused to store information in a massively parallel manner, so that this data recovery capability can be fully realized. Many erasure code algorithms exist, and the invention is not limited to any specific one.
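The n+m redundancy scheme can be illustrated for the smallest case m = 1, where the single check block is the XOR of the n source blocks. This is only a sketch under that assumption; tolerating m > 1 failures requires a stronger code (for example Reed-Solomon), which is not shown, and the helper names are illustrative.

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

def make_stripe(source: list[bytes]) -> list[bytes]:
    """n source blocks plus m = 1 check block (their XOR)."""
    return source + [xor_blocks(source)]

def recover(stripe: list) -> list[bytes]:
    """Rebuild the n source blocks when at most one block (marked None) is missing."""
    missing = [i for i, b in enumerate(stripe) if b is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost block")
    if missing:
        survivors = [b for b in stripe if b is not None]
        stripe = list(stripe)
        stripe[missing[0]] = xor_blocks(survivors)  # XOR of the n intact blocks
    return stripe[:-1]  # drop the check block

source = [b"DNA!", b"data", b"bits"]
stripe = make_stripe(source)
stripe[1] = None  # simulate one failed storage location
print(recover(stripe) == source)
```

Any one of the n+1 blocks may be lost, including the check block itself, and the n source blocks are still obtained from the n blocks that remain.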
S2: Module mapping coding
After n new data fragments are obtained, binary data information in each fragment is converted into module coding information required by DNA assembly through module mapping coding.
Specifically, each new fragment contains two parts: meta information and content information.
The meta information records information about the fragment. For example, when a linear erasure code is used, the meta information records the address of the fragment in the original data; when a fountain code is used, the meta information records the degree and the combination of original fragments. The content information includes the data information and the redundant information of the fragment. The redundant information may be generated by any error correction code from the meta information and the content information, and is used to restore the information when a read error occurs inside the fragment.
For the meta information and the content information, two different module libraries may be used for mapping encoding, or a single module library may be used. The meta information module library and the content information module library are logical divisions and can be combined into one library; they are distinguished here only for convenience of expression, so that the modules representing meta information and the modules representing content information can be described separately. In a specific implementation, a single module library is sufficient to complete the mapping encoding.
That is, meta information and content information are set for each fragment, and are mapped and encoded respectively using a prefabricated meta information DNA module library and a prefabricated content information DNA module library. The size of each library corresponds to the recoding rule: if a-bit b-ary recoding is adopted, the module library contains a×b modules, where each group of modules corresponds to one bit of the recoded information and the different modules within a group represent the value of that bit.
"Mapping and encoding the content information using the content information DNA module library" further comprises: dividing the binary information into a plurality of short messages of m bits each, each short message having corresponding address information; recoding the address information as a-bit b-ary; obtaining the data-address pair corresponding to each recoded short message, where, when m = 1, only the pairs with data 1 or only the pairs with data 0 are saved and, when m > 1, the pairs for any 2^m - 1 of the possible data values are saved; and performing module mapping on the data-address pair information using the content information DNA module library to obtain the module combination adapted to the data-address pair of each short message and the corresponding coding information of each module.
For the above content information, the processing and encoding are performed through the following three steps, and a specific example process is as follows:
1) Information splitting: the binary information is divided every m bits to obtain a plurality of short messages, each of which has its own address information. For example, assuming the data is 10011010, when m = 1:
Data                1      0      0      1      1      0      1      0
Address             7      6      5      4      3      2      1      0
2) Information reconstruction: the address is recoded, assuming that the recoding rule is 3-bit 2-ary:
Data                1      0      0      1      1      0      1      0
Address             7      6      5      4      3      2      1      0
Address recoding    111    110    101    100    011    010    001    000
3) Information mapping: from the two steps above, the data-address pairs are obtained as follows:
Data                1      0      0      1      1      0      1      0
Address             7      6      5      4      3      2      1      0
Address recoding    111    110    101    100    011    010    001    000
Data-address pair   1-111  0-110  0-101  1-100  1-011  0-010  1-001  0-000
Note that when m = 1, only the data-address pairs for one data value need to be saved, either those with data 1 or those with data 0; when m > 1, the pairs for any 2^m - 1 of the 2^m possible data values should be saved. The unsaved value is implied for every address that has no saved pair. For this example, assume that the data-address pairs with data 1 are saved:
Saved data-address pairs   1-111  1-100  1-011  1-001
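The splitting and saving rule can be sketched for general m as follows. The sketch assumes, as in the example above, that the leftmost short message receives the highest address; the function names and the `omitted` parameter (the one data value whose pairs are not saved) are illustrative only.

```python
def split_with_addresses(bits: str, m: int) -> list[tuple[str, int]]:
    """Split into m-bit short messages; the leftmost message gets the highest address."""
    if len(bits) % m:
        raise ValueError("length must be a multiple of m")
    chunks = [bits[i:i + m] for i in range(0, len(bits), m)]
    top = len(chunks) - 1
    return [(chunk, top - idx) for idx, chunk in enumerate(chunks)]

def pairs_to_save(bits: str, m: int, omitted: str) -> list[tuple[str, int]]:
    """Keep the pairs for every data value except `omitted` (2**m - 1 values)."""
    return [(d, a) for d, a in split_with_addresses(bits, m) if d != omitted]

def restore(saved: list[tuple[str, int]], count: int, omitted: str) -> str:
    """Addresses with no saved pair implicitly hold the omitted value."""
    by_addr = {a: d for d, a in saved}
    return "".join(by_addr.get(a, omitted) for a in range(count - 1, -1, -1))

saved = pairs_to_save("10011010", m=1, omitted="0")
print(saved)
print(restore(saved, count=8, omitted="0"))
```

For m = 1 this keeps only the four pairs with data 1, and the eight-bit content is still fully recoverable because every address without a saved pair is known to hold 0.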
Further, module mapping is performed on the data-address pair information that actually needs to be stored. A DNA module library is prepared whose size corresponds to the address recoding rule, i.e. a-bit b-ary recoding requires a×b modules. In the above example, the module library contains 6 modules in total (3 groups of 2 modules each). Each group of modules corresponds to one bit of the recoded information, and the different modules in each group represent the value of that bit, i.e. M_{a,b}, where a = 0, 1, 2 is the bit position and b = 0, 1 is the bit value. For example, for the stored address information "011", the corresponding module combination is [M_{2,0}, M_{1,1}, M_{0,1}]. For the data in the above example, the module mapping code converts the saved pairs into:
Data-address pair   Module combination
1-111               [M_{2,1}, M_{1,1}, M_{0,1}]
1-100               [M_{2,1}, M_{1,0}, M_{0,0}]
1-011               [M_{2,0}, M_{1,1}, M_{0,1}]
1-001               [M_{2,0}, M_{1,0}, M_{0,1}]
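The a-bit b-ary recoding and the lookup into the a×b module library can be sketched as below. The string form `M_{position,value}` is only a textual stand-in for a physical DNA module, and the function names are assumptions made for illustration.

```python
def recode_address(address: int, a: int, b: int) -> list[int]:
    """Write the address as a digits in base b, most significant digit first."""
    if not 0 <= address < b ** a:
        raise ValueError("address does not fit in a digits of base b")
    digits = []
    for _ in range(a):
        address, digit = divmod(address, b)
        digits.append(digit)
    return digits[::-1]

def module_combination(address: int, a: int = 3, b: int = 2) -> list[str]:
    """Map each recoded digit to module M_{position,value}; position a-1 is leftmost."""
    digits = recode_address(address, a, b)
    return [f"M_{{{a - 1 - i},{d}}}" for i, d in enumerate(digits)]

def module_library(a: int = 3, b: int = 2) -> list[str]:
    """All a*b modules: one group per digit position, one module per digit value."""
    return [f"M_{{{pos},{val}}}" for pos in range(a - 1, -1, -1) for val in range(b)]

for addr in (7, 4, 3, 1):  # addresses whose data bit is 1 in the example
    print(addr, module_combination(addr))
```

With a = 3 and b = 2 the library has 6 modules and can address 2^3 = 8 positions, which is why a small prefabricated library can be reused across many addresses.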
For the meta information, the information mapping of step 3) is performed directly to obtain the module combination result. "Mapping and encoding the meta information using the meta information DNA module library" further comprises: each piece of meta information corresponds to one module combination; module mapping is performed on each bit of each piece of meta information using the meta information DNA module library, in which each group of modules corresponds to one bit of the recoded information and the different modules in each group represent the value of that bit, so as to obtain the module combination adapted to the meta information and the corresponding coding information of each module. That is, each piece of meta information corresponds to one module combination, while each piece of content information corresponds to a plurality of module combinations. After the module mapping coding of the above steps, the meta information and the content information in each fragment are converted into two groups of module combination information. For example, when the meta information is "110" and the content information is that of the above example, the meta information is mapped to one module combination from the meta information module library and the content information to the four module combinations of the saved data-address pairs.
That is, the module combinations of the meta information and of the data-address pairs of the content information of each fragment of the binary data, together with the corresponding coding information of each module, are found and composed into the fragment combination unit corresponding to that fragment. In other words, the meta information and the content information of each fragment are mapped and encoded to form the current fragment combination unit, and the order of the fragment combination units follows the order of the fragments, giving one or more module combinations for the binary data and the order of the modules within each combination (some of the module combinations are shown in the example above). The DNA modules may form one or more DNA molecular chains. The meta information carried in each DNA molecular chain identifies the corresponding fragment, so that during data reading the position of the fragment can be obtained by parsing the meta information and the corresponding original binary data can be read.
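A fragment combination unit for the running example can be sketched as an ordered list of module names: the meta information modules first, followed by the module combinations of the saved data-address pairs. The prefix `A` for meta information modules is a hypothetical name chosen for this sketch, since the original symbols for those modules are not available here; `M` follows the content module notation used above.

```python
def map_bits(bits: str, prefix: str) -> list[str]:
    """One module per bit: <prefix>_{position,value}; the leftmost bit has the highest position."""
    top = len(bits) - 1
    return [f"{prefix}_{{{top - i},{bit}}}" for i, bit in enumerate(bits)]

def fragment_unit(meta: str, content: str, addr_bits: int = 3) -> list[str]:
    """Meta modules first, then the combination of every saved data = 1 address."""
    unit = map_bits(meta, "A")  # "A" is a hypothetical name for meta information modules
    top = len(content) - 1
    for i, bit in enumerate(content):
        if bit == "1":  # only data-address pairs with data 1 are saved
            address = format(top - i, f"0{addr_bits}b")
            unit += map_bits(address, "M")
    return unit

unit = fragment_unit("110", "10011010")
print(unit)
```

The list order is the order in which the modules would be joined, so the same structure can be handed to the assembly step.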
S3: DNA assembly and storage
The module combinations of the meta information and of the data-address pairs of the content information of each fragment of the binary data, together with the corresponding coding information of each module, are found and composed into the fragment combination unit corresponding to that fragment. Each module of a fragment combination unit is a DNA module containing a specific base sequence, and the corresponding DNA modules are connected into the corresponding DNA molecular chains through an enzyme-catalyzed reaction according to the preset module order.
"Connecting the corresponding DNA modules into the corresponding DNA molecular chains through an enzyme-catalyzed reaction according to the preset module order" further comprises:
the end module of the meta information of a fragment combination unit is connected to the first module of the content information, and the modules of the content information within the same fragment combination unit are connected in sequence;
the modules of different fragment combination units are matched and connected according to the preset order between the modules: either the first module of the next fragment combination unit is connected to the end module of the current fragment combination unit, or the end module of the current fragment combination unit is the last DNA module of the current DNA molecular chain and the first module of the next fragment combination unit is the first DNA module of a separate DNA molecular chain, so that one or more DNA molecular chains are formed and stored together. Here, connection means that different DNA modules, through the design of their sticky ends, react chemically in an enzyme-catalyzed environment to realize DNA assembly. For example, when a DNA molecular chain is generated, the end module of the meta information and the first module of the content information are joined by covalent bonds, thereby realizing assembly. The DNA modules within each fragment combination unit may be connected into a DNA molecular chain and preserved individually, giving a plurality of DNA molecular chains, or several DNA molecular chains may be further connected into longer molecular chains for preservation.
To illustrate: in the DNA module library, each module is a DNA module containing a specific base sequence, the DNA modules are distinguished from one another by their base sequences, and modules of specific groups can be assembled with one another through chemical reaction in an enzyme-catalyzed environment by designing their sticky ends. Specifically, in the above example, the modules representing adjacent bits of the meta information can be connected to one another, the end module of the meta information can be connected to the first module M_{2,i} of the content information, and the other modules are connected similarly. Based on the result of the concatenated erasure code and module mapping coding, the DNA assembly reaction is carried out to form a plurality of DNA molecular chains for centralized storage.
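How sticky-end design can enforce the module order may be sketched with a simple model in which two modules join only if the downstream overhang of one is the reverse complement of the upstream overhang of the next. Every sequence, the 4-nt overhang length, and the `Module` structure below are hypothetical illustrations; the original text does not specify overhang sequences or a particular enzyme system.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

@dataclass(frozen=True)
class Module:
    name: str
    left: str   # overhang on the upstream side (hypothetical 4-nt sequence)
    body: str   # base sequence that identifies the module
    right: str  # overhang on the downstream side

def can_ligate(up: Module, down: Module) -> bool:
    """Two sticky ends anneal when one is the reverse complement of the other."""
    return up.right == reverse_complement(down.left)

def assemble(modules: list[Module]) -> str:
    """Join modules in order, refusing any pair whose sticky ends do not match."""
    for up, down in zip(modules, modules[1:]):
        if not can_ligate(up, down):
            raise ValueError(f"{up.name} cannot be joined to {down.name}")
    return "".join(m.body + m.right for m in modules)

m2 = Module("M_{2,0}", "", "ACGTAC", "AATG")
m1 = Module("M_{1,1}", "CATT", "GGATCC", "GCAA")
m0 = Module("M_{0,1}", "TTGC", "TTAGGC", "")
print(assemble([m2, m1, m0]))
```

In this model all modules of one group would share the same pair of overhangs, so any module of group 2 can be followed by any module of group 1 but never directly by a module of group 0.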
When a fragment combination unit is a DNA module of 10 bases, the edit distance between the fragment combination units "ATCGTAGCCA" and "TTCGTAGCCA" is 1, and the edit distance between "ATCGTAGCCA" and "TAGCATCGGT" is 10 (counted here as the number of base positions that differ). To facilitate reading, the fragment combination units may be designed so that only units whose pairwise edit distance is greater than or equal to a preset distance threshold are selected. In this way, even if an individual base or another part of a fragment combination unit is read incorrectly, the read unit can still be mapped to a specific code as long as its edit distance to that code's unit is determined to be smaller than the preset distance threshold (a unique match is guaranteed when the number of errors is less than half the threshold), which improves the fault tolerance of storage.
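The distance-based design rule can be sketched as follows. The sketch assumes substitution errors only and therefore uses the Hamming distance, which reproduces the two distances quoted above; handling insertions and deletions would need a true edit (Levenshtein) distance instead. The greedy selection and the function names are illustrative choices, not the method of the original text.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def select_units(candidates: list[str], threshold: int) -> list[str]:
    """Greedily keep candidates whose distance to every kept unit is >= threshold."""
    kept: list[str] = []
    for cand in candidates:
        if all(hamming(cand, k) >= threshold for k in kept):
            kept.append(cand)
    return kept

def match_unit(read: str, library: list[str], threshold: int):
    """Return the unique library unit within (threshold - 1) // 2 errors, else None."""
    radius = (threshold - 1) // 2
    hits = [u for u in library if hamming(read, u) <= radius]
    return hits[0] if len(hits) == 1 else None

library = select_units(["ATCGTAGCCA", "TTCGTAGCCA", "TAGCATCGGT"], threshold=3)
print(library)
print(match_unit("ATCGTAGCCT", library, threshold=3))  # read with one wrong base
```

With a threshold of 3, the second candidate is rejected because it is only one substitution away from the first, and a read containing a single wrong base is still assigned to the correct unit.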
In this method of storing information in molecules, the information is stored as content-address pairs: the address and/or content of the information is recoded, and the fragment combination units in the prefabricated module library are reused for massively parallel assembly.
Second example
S4: data decoding
Data decoding is the inverse of encoding, so only the flow is described here. The DNA molecular chain is sequenced to obtain its base sequence. Module matching is performed on the base sequence result to obtain the module mapping coding information, which is then decoded to obtain the meta information and the content information. When the content information includes redundant information, the redundant information may be used to perform error correction on the meta information and the data information, so that each individual fragment is obtained. After enough fragments have been read (m is greater than or equal to k, where m here denotes the number of fragments read), the original binary data content can be reconstructed through the erasure code decoding operation, realizing data reading.
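The module matching and inverse mapping steps of decoding can be sketched as below for the content information of the earlier example. The 4-nt base sequences assigned to the six content modules are hypothetical, as is the assumption that each read holds exactly one address; error correction and erasure code decoding are not shown.

```python
# Hypothetical 4-nt base sequences for the six content modules M_{position,value}.
LIBRARY = {
    "ACGT": (2, 0), "TGCA": (2, 1),
    "AGCT": (1, 0), "TCGA": (1, 1),
    "CATG": (0, 0), "GTAC": (0, 1),
}
MODULE_LEN = 4

def identify_modules(read: str) -> list[tuple[int, int]]:
    """Cut a sequenced read into module-sized pieces and look each one up."""
    pieces = [read[i:i + MODULE_LEN] for i in range(0, len(read), MODULE_LEN)]
    return [LIBRARY[p] for p in pieces]  # KeyError signals an unreadable module

def decode_address(modules: list[tuple[int, int]]) -> int:
    """Rebuild a binary address from (position, value) modules."""
    return sum(value << position for position, value in modules)

def decode_content(reads: list[str], length: int) -> str:
    """Each read stores one address whose data bit is 1; all other bits are 0."""
    ones = {decode_address(identify_modules(r)) for r in reads}
    return "".join("1" if a in ones else "0" for a in range(length - 1, -1, -1))

reads = ["TGCATCGAGTAC", "TGCAAGCTCATG", "ACGTTCGAGTAC", "ACGTAGCTGTAC"]
print(decode_content(reads, length=8))
```

Because each module carries its own bit position, the address is recovered correctly regardless of the order in which the chains are sequenced.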
Application example 1
LT fountain codes are used here as the erasure code to illustrate the DNA data storage method of the present invention. Erasure codes can take various forms, such as linear erasure codes and fountain codes. This section illustrates the encoding process (S1, S2) of the method using LT fountain codes as an example. The overall process of concatenating erasure code encoding and module mapping coding is illustrated in fig. 5:
(1) S1: erasure code encoding
The original binary sequence data is partitioned into k data packets, i.e. S = {s_1, s_2, s_3, ..., s_k}. A plurality of new fragments t_n are formed through LT encoding, each new fragment being formed by randomly selecting d (d ≤ k) data packets s_i and performing an exclusive-or operation on them. The selection of d (the degree) is controlled by a degree probability distribution function. The degree distribution should ensure both that every data packet can participate in encoding and that many fragments have a very low degree, so that the decoding process can be carried out efficiently. The ideal soliton and robust soliton distribution functions are as follows:
where i is the value of the selected degree, δ is the acceptable probability that the information cannot be recovered, and c is a proportionality constant; p(i) is the ideal soliton distribution function and μ(i) is the robust soliton distribution function.
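As a reference point, the two distributions can be computed with the definitions commonly given in the LT code literature: p(1) = 1/k and p(i) = 1/(i(i-1)) for i = 2..k; R = c·ln(k/δ)·√k; τ(i) = R/(i·k) for i below k/R, τ(k/R) = R·ln(R/δ)/k, and 0 above; μ(i) = (p(i) + τ(i))/Z, where Z normalizes the sum to 1. The sketch below assumes that these standard forms are the ones intended here, and rounds k/R to the nearest integer for the position of the spike.

```python
import math

def ideal_soliton(k: int) -> list[float]:
    """p(i) for i = 1..k (index 0 unused): p(1) = 1/k, p(i) = 1/(i(i-1))."""
    p = [0.0] * (k + 1)
    p[1] = 1.0 / k
    for i in range(2, k + 1):
        p[i] = 1.0 / (i * (i - 1))
    return p

def robust_soliton(k: int, c: float, delta: float) -> list[float]:
    """mu(i) = (p(i) + tau(i)) / Z, with the spike of tau placed at i = k/R."""
    R = c * math.log(k / delta) * math.sqrt(k)
    spike = min(k, max(1, round(k / R)))
    tau = [0.0] * (k + 1)
    for i in range(1, spike):
        tau[i] = R / (i * k)
    tau[spike] = R * math.log(R / delta) / k
    p = ideal_soliton(k)
    z = sum(p) + sum(tau)
    return [(p[i] + tau[i]) / z for i in range(k + 1)]

mu = robust_soliton(k=100, c=0.1, delta=0.5)
print(sum(ideal_soliton(100)), sum(mu))
```

Compared with the ideal distribution, the robust one puts extra weight on low degrees and adds a spike near k/R, which is what keeps decoding from stalling in practice.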
During encoding, for each fragment obtained by LT encoding, the degree and the combination of data packets are selected by a pseudo-random number generator. The input of the pseudo-random number generator is the meta information, so the meta information determines which data packets were XORed to produce the fragment. The meta information is placed at the front of the fragment information, the data information in the middle is the result of the exclusive-or operation on the selected data packets, and the redundant information at the end is an error correction code computed over the meta information and the data information. The error correction code can be implemented in various ways, for example as a forward error correction code. Through these steps an unlimited number of new fragments can be generated, and during decoding the whole data can be decoded from only slightly more than k received fragments. The representation of each new fragment is shown in fig. 7.
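Generating one LT fragment from its meta information can be sketched as below, treating the meta information as an integer seed. The use of Python's `random.Random` as the pseudo-random number generator, the packet values and the function names are assumptions for illustration; the error correction code at the end of the fragment is omitted.

```python
import random
from functools import reduce

def choose_packets(seed: int, k: int, degree_weights: list[float]) -> list[int]:
    """Derive the degree d and the d packet indices from the seed (meta information)."""
    rng = random.Random(seed)
    d = rng.choices(range(1, k + 1), weights=degree_weights)[0]
    return sorted(rng.sample(range(k), d))

def lt_fragment(seed: int, packets: list[int], degree_weights: list[float]):
    """Return (meta, data): the seed and the XOR of the selected packets."""
    chosen = choose_packets(seed, len(packets), degree_weights)
    data = reduce(lambda x, y: x ^ y, (packets[i] for i in chosen))
    return seed, data

packets = [0b1001, 0b1010, 0b0110, 0b1111]  # k = 4 original data packets
weights = [1 / 4, 1 / 2, 1 / 6, 1 / 12]     # ideal soliton distribution for k = 4
meta, data = lt_fragment(seed=6, packets=packets, degree_weights=weights)

# A reader regenerates the same selection from the meta information alone.
chosen = choose_packets(meta, len(packets), weights)
print(meta, chosen, format(data, "04b"))
```

Because the selection is a deterministic function of the seed, only the seed needs to be stored as meta information rather than the full list of packet indices.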
(2) S2: module mapping coding
Assume that there is a fragment whose meta information is "110" and whose content information is "10011010" (the same as in the foregoing example). For the content information, the split interval is m = 1, the recoding is 3-bit 2-ary, and the data-address pairs with data 1 are saved: 1-111, 1-100, 1-011, 1-001.
The module mapping coding uses two module libraries: (1) for the module library of meta information, in this example it is assumed that the maximum length of the meta information is 3, and the corresponding module library contains a group of modules for each of the 3 bit positions; (2) for the module library of content information, in this example the recoding rule is 3-bit 2-ary, and the corresponding module library contains the modules M_{2,1}, M_{2,0}, M_{1,1}, M_{1,0}, M_{0,1}, M_{0,0}.
Application example 2
This section illustrates the encoding process (S1, S2) in the method of the invention using linear block codes as an example. The overall process of concatenated coding of erasure codes and modular map coding is illustrated in fig. 6 below:
(1) S1: erasure code encoding
The original binary sequence data is partitioned into k data packets, i.e. S = {s_1, s_2, s_3, ..., s_k}. Using a linear block code as the erasure code, n data packets are calculated from the original k data packets according to the generator matrix G, i.e. each newly generated data packet is a linear combination of the original data packets. In each newly generated fragment: the meta information is placed at the front and is the index of the new data packet, namely the index of the corresponding column of the generator matrix; the data information in the middle is the new data packet produced by the generator matrix; and the redundant information at the end is an error correction code computed over the meta information and the data information within the fragment, which can be implemented in various ways, for example as a forward error correction code. Through these steps, the n newly generated data packets correspond to n new fragments. During decoding, the fault tolerance depends on the generator matrix of the linear block code; at most (n-k) fragments may be lost or unreadable while the whole data can still be decoded. The representation of each new fragment is shown in fig. 7.
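Encoding with a generator matrix over GF(2) can be sketched as below, where each output packet is the XOR of the original packets selected by one column of G. The (5, 3) matrix and the packet values are hypothetical examples chosen for this sketch, and the error correction code at the end of each fragment is omitted.

```python
def encode_linear(packets: list[int], G: list[list[int]]) -> list[int]:
    """Column j of the k x n generator matrix G selects the packets XORed into output j."""
    k, n = len(G), len(G[0])
    if len(packets) != k:
        raise ValueError("need exactly k packets")
    out = []
    for j in range(n):
        value = 0
        for i in range(k):
            if G[i][j]:
                value ^= packets[i]
        out.append(value)
    return out

# Hypothetical systematic (5, 3) code: the identity columns keep the originals,
# the last two columns add parity packets.
G = [
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
]
packets = [0b1001, 0b1010, 0b0110]
coded = encode_linear(packets, G)
fragments = list(enumerate(coded))  # (meta = column index, data) for each fragment
print(fragments)
```

If, for instance, the first original packet is lost, it can be recomputed as coded[3] XOR coded[1]; which combinations of losses are recoverable depends on the chosen G, and this small matrix does not tolerate every possible pair of losses.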
(2) S2: module mapping coding
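Assume that there is a fragment whose meta information is "101" and whose content information is "01110". For the content information, the split interval is m = 1, the recoding is 3-bit 2-ary, and the data-address pairs with data 1 are saved. With addresses assigned as in the earlier example (the rightmost bit at address 0), these are 1-011, 1-010 and 1-001.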
The module mapping coding uses two module libraries: (1) for the module library of meta information, in this example it is assumed that the maximum length of the meta information is 3, and the corresponding module library contains a group of modules for each of the 3 bit positions; (2) for the module library of content information, in this example the recoding rule is 3-bit 2-ary, and the corresponding module library contains the modules M_{2,1}, M_{2,0}, M_{1,1}, M_{1,0}, M_{0,1}, M_{0,0}.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be found in the method embodiment section, and will not be described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, and will not be described herein again.
Third example
A DNA data storage device based on erasure coding and assembly techniques, comprising:
erasure code encoding unit: obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
DNA module library: the module library contains a×b modules, each group of modules corresponds to one bit of the recoded information, and the different modules in each group represent the value of that bit;
a module mapping encoding unit: the module mapping codes are used for selecting an adaptive module combination from the DNA module library and converting the binary data into module coding information required by DNA assembly;
DNA information writing unit: and determining corresponding module combinations through the module coding step, and connecting the corresponding module combinations into corresponding DNA molecular chains through enzyme catalysis reaction so as to finish the data storage.
The module mapping encoding unit further includes a content information module mapping unit, a meta information module mapping unit and a fragment combining unit:
the content information module mapping unit is used for dividing the binary information into a plurality of short messages of m bits each, each short message having corresponding address information, and recoding the address information as a-bit b-ary; obtaining the data-address pair corresponding to each recoded short message; and performing module mapping on the data-address pair information using the content information DNA module library, so as to obtain the module combination adapted to the data-address pair of each short message and the corresponding coding information of each module;
the meta information module mapping unit is used for performing module mapping on each bit of each piece of meta information using the meta information DNA module library, in which each group of modules corresponds to one bit of the recoded information and the different modules in each group represent the value of that bit, so as to obtain the module combination adapted to the meta information and the corresponding coding information of each module;
the fragment combining unit is used for finding the module combinations of the meta information and of the data-address pairs of the content information of each fragment of the binary data, together with the corresponding coding information of each module, and composing them into the fragment combination unit corresponding to that fragment. Each module of a fragment combination unit is a DNA module containing a specific base sequence; the modules of the meta information are connected in sequence, the end module of the meta information is connected to the first module of the content information, and the modules of the content information are connected in sequence. The modules of different fragment combination units are matched and connected according to the preset order between the modules: either the first module of the next fragment combination unit is connected to the end module of the current fragment combination unit, or the end module of the current fragment combination unit is the last DNA module of the current DNA molecular chain and the first module of the next fragment combination unit is the first DNA module of a separate DNA molecular chain, so that one or more DNA molecular chains are formed and stored together.
The erasure code encoding unit is further configured to: divide the original binary data into k fragments of a fixed bit length and obtain n fragments (n is greater than or equal to k) from the k fragments, where, when a linear erasure code is used, the n new fragments retain the original k fragments and (n-k) fragments are newly constructed from them, and, when a fountain code is used, each newly generated fragment is obtained by performing an exclusive-or operation on one or more of the k original fragments.
Fourth example
A DNA data reading apparatus based on erasure coding and assembly techniques, comprising:
module identification unit: carrying out module recognition on the DNA molecular chain;
decoding unit: used for decoding the module mapping information according to the module information to obtain the corresponding binary data;
erasure code decoding unit: used for reconstructing the original data content from the binary data by the erasure code decoding operation, thereby realizing data reading.
The module identification unit may sequence the DNA molecular chain to obtain a base sequence; module matching is then performed against the preset module library according to the base sequence result.
An embodiment of the present application provides a terminal device. The terminal device of this embodiment includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, where the processor, when executing the computer program, performs the steps of any of the method embodiments described above.
The terminal device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that a terminal may include more or fewer components, may combine certain components, or may use different components; for example, it may also include input-output devices, network access devices, and the like. The processor may be a central processing unit (Central Processing Unit, CPU), or another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor.
The memory may in some embodiments be an internal storage unit of the terminal device, such as a hard disk or a memory of the terminal device. The memory may in other embodiments also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device. Further, the memory may also include both an internal storage unit and an external storage device of the terminal device. The memory is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs, etc., such as program code for the computer program, etc. The memory may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps that may be implemented in the various method embodiments described above. Embodiments of the present application provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform steps that may be performed in the various method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing device/terminal apparatus, a recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals, and software distribution media, such as a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, computer readable media may not include electrical carrier signals and telecommunications signals.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, refer to the related descriptions of the other embodiments.
Fifth example
Substantially similar to the first and second examples, except that the DNA module is not limited to deoxyribonucleic acid (DNA) but is extended to a molecular module comprising one of ribonucleic acid (RNA), peptides, organic polymers, small organic molecules, carbon nanomaterials, inorganic substances, and the like. When storing information, combinations of different modules represent the content encoding and the address encoding, and the modules may be joined by covalent bonds, ionic bonds, hydrogen bonds, intermolecular forces, hydrophobic forces, base complementary pairing, etc.
In a fifth aspect, there is provided a molecular module data access method based on erasure codes and assembly techniques, the storing process further comprising:
S1, obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
S2, selecting an adaptive module from a preset molecular module library through module mapping coding, and determining the sequence of the modules to obtain a corresponding module combination;
S3, connecting the modules in the module combination into corresponding molecular modules so as to finish the data storage;
The reading process further includes:
S4, carrying out module identification on the molecular module;
S5, decoding module mapping information according to the module information to obtain corresponding binary data;
S6, reconstructing the original data content from the binary data according to an erasure code decoding operation, thereby realizing data reading.
That is, the erasure codes and the assembly technology can be applied to a broader class of molecular modules and realize the same functions. Module identification can likewise be implemented in several ways; obtaining the corresponding base sequence through DNA sequencing and obtaining the module information through base sequence matching is only one implementation scheme.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative: the division into modules or units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, and may be electrical, mechanical, or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some of the technical features can be replaced equivalently; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the application, and are intended to be included within the scope of the present application.

Claims (25)

1. A DNA data storage method based on erasure coding and assembly techniques, comprising:
S1, obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
S2, selecting an adaptive module from a preset DNA module library through module mapping coding, and determining the sequence of the modules to obtain a corresponding module combination;
and S3, connecting the modules in the module combination into corresponding DNA molecular chains so as to finish the data storage.
2. The method of claim 1, wherein step S1 further comprises:
obtaining binary data of the data to be stored, dividing the binary data into k fragments of a certain bit length, and obtaining n fragments from the divided k fragments through erasure code coding, wherein n is larger than or equal to k.
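The fragment step of claim 2 can be illustrated with the simplest possible erasure code. The sketch below is not the patent's encoder: it assumes a single XOR parity fragment (so n = k + 1), zero-padding of the last fragment, and hypothetical function names, purely to show how k fragments become n fragments and how one lost fragment is rebuilt.

```python
from functools import reduce
from typing import List, Optional


def split_bits(bits: str, frag_len: int) -> List[str]:
    """Split a bit string into k fragments of frag_len bits (last one zero-padded)."""
    padded = bits + "0" * (-len(bits) % frag_len)
    return [padded[i:i + frag_len] for i in range(0, len(padded), frag_len)]


def xor_bits(a: str, b: str) -> str:
    """Bitwise XOR of two equal-length bit strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def encode_with_parity(fragments: List[str]) -> List[str]:
    """Return n = k + 1 fragments: the k originals plus one XOR parity fragment."""
    return fragments + [reduce(xor_bits, fragments)]


def recover_missing(received: List[Optional[str]]) -> List[str]:
    """Rebuild at most one erased fragment (marked None) and return the k originals."""
    missing = [i for i, f in enumerate(received) if f is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment can repair only one erasure")
    repaired = list(received)
    if missing:
        # XOR of all surviving fragments equals the erased one.
        repaired[missing[0]] = reduce(xor_bits, [f for f in received if f is not None])
    return repaired[:-1]
```

A code with more redundancy (n - k > 1), such as the fountain or linear block codes named in the later claims, tolerates more erasures at the cost of more stored fragments.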
3. The method of claim 1, wherein "picking out adapted modules from a library of pre-set DNA modules by module mapping encoding" in step S2 further comprises:
dividing the binary data into a plurality of fragments, and setting meta information and content information for each fragment;
and mapping and encoding the meta information and the content information respectively by using a meta information DNA module library and a content information DNA module library, wherein the sizes of the meta information DNA module library and the content information DNA module library correspond to the rule used for recoding the address.
4. The method of claim 3, wherein "mapping encoding the content information using a library of content information DNA modules" further comprises:
dividing the binary information into a plurality of short messages of m bits each, wherein each short message has corresponding address information;
recoding the address information as an a-digit base-b number;
obtaining a "data-address pair" corresponding to each recoded short message, and when $m = 1$, storing only the pairs whose data is 1 or only the pairs whose data is 0; when $m > 1$, storing any $2^m - 1$ of the $2^m$ possible "data-address pair" cases;
and carrying out module mapping on the data-address pair information by using the content information DNA module library so as to obtain a module combination of the data-address pair adaptation of each short message and the corresponding coding information of each module.
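A minimal sketch of the data-address pairing in claim 4, under stated assumptions: the function names are hypothetical, addresses are simple running indices, the omitted data value defaults to 0, and the reader knows how many short messages the fragment contains. Omitting one of the 2^m values is what lets the scheme store only 2^m - 1 cases; an address with no stored pair is read back as the omitted value.

```python
from typing import List, Tuple

Pair = Tuple[int, Tuple[int, ...]]  # (data value, address digits)


def to_base(n: int, base: int, width: int) -> Tuple[int, ...]:
    """Recode n as a fixed-width tuple of base-`base` digits, most significant first."""
    digits = []
    for _ in range(width):
        n, r = divmod(n, base)
        digits.append(r)
    if n:
        raise ValueError("address does not fit in the given number of digits")
    return tuple(reversed(digits))


def data_address_pairs(bits: str, m: int, a: int, b: int, omitted: int = 0) -> List[Pair]:
    """Split bits into m-bit short messages and keep every pair except the omitted value."""
    if len(bits) % m:
        raise ValueError("bit length must be a multiple of m")
    pairs = []
    for addr, start in enumerate(range(0, len(bits), m)):
        value = int(bits[start:start + m], 2)
        if value != omitted:  # absence of a pair encodes the omitted value
            pairs.append((value, to_base(addr, b, a)))
    return pairs


def restore_bits(pairs: List[Pair], m: int, b: int, count: int, omitted: int = 0) -> str:
    """Inverse of data_address_pairs, given the number of short messages."""
    values = [omitted] * count
    for value, digits in pairs:
        addr = 0
        for d in digits:
            addr = addr * b + d
        values[addr] = value
    return "".join(format(v, "0{}b".format(m)) for v in values)
```

With m = 1 and `omitted=0`, only the addresses holding a 1 are stored, which matches the first case of the claim.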
5. The method of claim 3 or 4, wherein the meta-information is map-encoded using a library of meta-information DNA modules, further comprising:
each meta-information corresponds to a module combination, the meta-information DNA module library is used for carrying out module mapping on each bit of information of each meta-information, each group of modules corresponds to one bit of recoded information, and different modules in each group represent the content of the bit so as to obtain a module combination of meta-information adaptation of each short message and corresponding coding information of each module.
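The group structure described in claim 5 can be sketched as a lookup table: one group of modules per digit position, one module per possible digit value. The module names (`M0_2` and so on) and the helper functions are hypothetical placeholders for real DNA modules; the sketch only shows the mapping logic.

```python
from typing import List, Sequence, Tuple


def build_library(a: int, b: int, prefix: str = "M") -> List[List[str]]:
    """Create a groups of b module names; group i encodes digit position i."""
    return [["{}{}_{}".format(prefix, pos, digit) for digit in range(b)]
            for pos in range(a)]


def map_digits(digits: Sequence[int], library: List[List[str]]) -> List[str]:
    """Pick, for each digit position, the module that represents the digit's value."""
    if len(digits) != len(library):
        raise ValueError("one digit is required per module group")
    return [library[pos][d] for pos, d in enumerate(digits)]


def unmap_modules(modules: Sequence[str], library: List[List[str]]) -> Tuple[int, ...]:
    """Recover the digits from an ordered module combination."""
    return tuple(library[pos].index(name) for pos, name in enumerate(modules))
```

Because position and value are both carried by the choice of module, a library of a × b modules can represent every a-digit base-b value.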
6. The method of claim 3, wherein step S2 "determining the order of the modules to obtain corresponding combinations of modules" further comprises:
the meta information and the content information of each segment are respectively mapped and encoded to form a current segment combination unit,
and obtaining the sequence of the corresponding fragment combination units according to the sequence of the fragments, thereby obtaining one or more module combinations of the binary data and the sequence among the modules in each module combination.
7. The method of claim 3, wherein the meta information records information of a corresponding clip.
8. The method of claim 7, wherein when the erasure code encoding uses a linear erasure code, the meta information records the address information of the corresponding fragment in the original data; when the erasure code encoding uses a fountain code, the meta information records the degree and the information of the fragment combination.
9. The method of claim 3, wherein the content information includes one of the following: the data information of the fragment; or the data information of the fragment together with the redundant information of the fragment; the redundant information is generated by any error correction code based on the meta information and the content information, and is used for information recovery when a reading error occurs in the fragment.
10. The method of claim 3, wherein step S3 further comprises:
finding out the module combination of the meta information and the data-address pair of the content information of each segment of the binary data and the corresponding coding information of each module, and forming a segment combination unit corresponding to the segment;
each module of the fragment combination unit is a segment of DNA module containing a specific base sequence, and the corresponding DNA modules are connected into corresponding DNA molecule chains through enzyme catalysis according to the preset module sequence, wherein the number of the DNA molecule chains is one or more.
11. The method of claim 10, wherein ligating the corresponding DNA modules into the corresponding DNA molecular strands by an enzyme-catalyzed reaction according to the predetermined module order further comprises:
the terminal module of the meta information of the segment combination unit is connected with the first module of the content information, and the modules of the content information in the same segment combination unit are sequentially connected;
the modules among the segment combination units are connected according to the preset sequence matching among the modules: the first module of the rear-end fragment combination unit is connected with the end module of the current fragment combination unit, or the end module of the current fragment combination unit is a tail DNA module of the current DNA molecular chain, and the first module of the rear-end fragment combination unit is independently a first DNA module of a DNA molecular chain, so that one or more DNA molecular chains are concentrated and stored.
12. A method according to claim 2 or 3, wherein "dividing into k segments according to a bit of a certain length, and obtaining n segments from the divided k segments by erasure coding" further comprises:
the original binary sequence data is partitioned into k data packets, i.e. $s = s_1, s_2, s_3, \ldots, s_k$;
forming a plurality of new fragments $t_n$ through LT coding, each new fragment being formed by randomly selecting $d$ ($d \le k$) data packets $s_i$ and combining them by an exclusive OR operation, wherein the selection of the degree $d$ is controlled by a degree probability distribution function;
for each fragment obtained after LT coding, the degree and the combination of data packets are obtained through a pseudo-random number generator whose input is the meta information, so that the meta information determines which data packets were combined by exclusive OR to form the fragment; the meta information is attached to the front end of the fragment information, the data information in the middle is the result of the exclusive OR of the selected data packets, and the redundant information at the back end is an error correction code computed over the meta information and the data information.
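A minimal sketch of how one LT-coded fragment could be produced from a seed, as claim 12 describes. Several points are assumptions rather than the patent's stated choices: the ideal soliton distribution is used as the degree probability distribution, Python's `random.Random` stands in for the pseudo-random number generator, packets are modeled as integers, and the trailing error correction code is omitted.

```python
import random
from functools import reduce
from operator import xor
from typing import List, Tuple


def ideal_soliton(k: int) -> List[float]:
    """Weights for degrees 1..k: P(1) = 1/k, P(d) = 1/(d(d-1)) for d >= 2."""
    return [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]


def choose_packets(seed: int, k: int) -> List[int]:
    """Derive the degree d and the d packet indices deterministically from the seed."""
    rng = random.Random(seed)
    d = rng.choices(range(1, k + 1), weights=ideal_soliton(k))[0]
    return sorted(rng.sample(range(k), d))


def lt_fragment(seed: int, packets: List[int]) -> Tuple[int, int]:
    """Return (meta information, data information) for one coded fragment."""
    indices = choose_packets(seed, len(packets))
    payload = reduce(xor, (packets[i] for i in indices))
    return seed, payload
```

Since the reader can rerun `choose_packets` with the stored seed, the fragment only has to carry the seed rather than the full list of packet indices.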
13. A method according to claim 2 or 3, wherein "dividing into k segments according to a bit of a certain length, and obtaining n segments from the divided k segments by erasure coding" further comprises:
the original binary sequence data is partitioned into k data packets, i.e. $s = s_1, s_2, s_3, \ldots, s_k$;
using a linear block code as the erasure code, obtaining n data packets from the original k data packets through calculation according to a generator matrix G, wherein each newly generated data packet is a linear combination of the original data packets;
in the newly generated fragment information: the meta information is attached to the front end of the fragment information and records, for each new data packet, the corresponding column number in the generator matrix; the data information in the middle is the new data packet produced by the generator matrix; the redundant information at the back end is an error correction code computed over the meta information and the data information inside the fragment.
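The generator-matrix construction of claim 13 can be sketched over GF(2), where a linear combination is an XOR of selected packets. The matrix layout (k rows, n columns), the integer packet model, and the function names are assumptions made for illustration; the claim itself does not fix the field or the matrix.

```python
from typing import List, Tuple

Coded = Tuple[int, int]  # (column number in G, coded packet)


def encode_linear(packets: List[int], G: List[List[int]]) -> List[Coded]:
    """Coded packet j is the XOR of the packets i with G[i][j] == 1."""
    k, n = len(G), len(G[0])
    coded = []
    for j in range(n):
        value = 0
        for i in range(k):
            if G[i][j]:
                value ^= packets[i]
        coded.append((j, value))  # the column number acts as the meta information
    return coded


def decode_linear(received: List[Coded], G: List[List[int]]) -> List[int]:
    """Recover the k packets by Gauss-Jordan elimination over GF(2)."""
    k = len(G)
    rows = [[[G[i][j] for i in range(k)], value] for j, value in received]
    for col in range(k):
        pivot = next((r for r in range(col, len(rows)) if rows[r][0][col]), None)
        if pivot is None:
            raise ValueError("received columns do not have rank k")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(len(rows)):
            if r != col and rows[r][0][col]:
                rows[r][0] = [x ^ y for x, y in zip(rows[r][0], rows[col][0])]
                rows[r][1] ^= rows[col][1]
    return [rows[i][1] for i in range(k)]
```

With a systematic G (an identity block followed by parity columns), the first k coded packets are the originals, and any k received columns of full rank suffice for decoding.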
14. A DNA data reading method based on erasure codes and assembly technology is characterized by comprising the following steps:
carrying out module recognition on the DNA molecular chain;
decoding module mapping information according to the module information to obtain corresponding binary data;
and reconstructing the original data content of the binary data according to erasure code decoding operation, and realizing data reading.
15. The method of claim 14, wherein the module identification includes at least the following identification methods: and obtaining a corresponding base sequence through DNA sequencing, and obtaining module information through base sequence matching.
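Module identification by base sequence matching, as named in claim 15, can be sketched as a nearest-match lookup. The assumptions here are significant and not from the source: all modules have the same length, the read contains no linkers, insertions, or deletions, the four example sequences are invented, and a Hamming-distance threshold decides whether a match is accepted.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def identify_modules(read: str, library: Dict[str, str], max_mismatches: int = 1) -> List[str]:
    """Cut the read into module-length windows and name the closest library module."""
    length = len(next(iter(library.values())))
    if len(read) % length:
        raise ValueError("read length is not a multiple of the module length")
    names = []
    for start in range(0, len(read), length):
        window = read[start:start + length]
        name, distance = min(
            ((n, hamming(window, seq)) for n, seq in library.items()),
            key=lambda item: item[1],
        )
        if distance > max_mismatches:
            raise ValueError("no module matches the window starting at {}".format(start))
        names.append(name)
    return names


# Hypothetical module library; real modules would be longer, designed sequences.
EXAMPLE_LIBRARY = {
    "A0": "ACGTAC",
    "A1": "TGCATG",
    "B0": "GATCCA",
    "B1": "CTAGGT",
}
```

Tolerating a small number of mismatches per window only works when the library sequences are far apart from each other, which is one reason to design the module sequences jointly.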
16. The method of claim 14, wherein the module map encoding information is decoded into the binary data further comprising:
the meta information and the content information are obtained by decoding the module coding information; when the content information contains redundant information, the redundant information is used to perform an error correction operation on the meta information and the data information, so that each independent fragment can be recovered, and a sufficient number m of fragments (m is more than or equal to k) is obtained through reading.
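Once enough fragments have been read, the original packets can be reconstructed. The sketch below assumes fountain-coded fragments whose packet indices have already been derived from the meta information, and uses the standard peeling (belief propagation) decoder, which is a common choice for LT codes but is not named in the source. Packets are modeled as integers and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def peel_decode(fragments: Iterable[Tuple[Set[int], int]], k: int) -> List[int]:
    """Recover k packets from (packet index set, XOR payload) fragments by peeling."""
    pending = [(set(indices), payload) for indices, payload in fragments]
    known: Dict[int, int] = {}
    progress = True
    while progress and len(known) < k:
        progress = False
        for pos, (indices, payload) in enumerate(pending):
            # Remove the contribution of every packet that is already known.
            for j in [j for j in indices if j in known]:
                indices.discard(j)
                payload ^= known[j]
            pending[pos] = (indices, payload)
            # A fragment reduced to a single unknown packet reveals that packet.
            if len(indices) == 1:
                (j,) = indices
                known[j] = payload
                progress = True
    if len(known) < k:
        raise ValueError("not enough fragments to recover all packets")
    return [known[j] for j in range(k)]
```

Peeling can stall even when the number of fragments m is at least k, which is why fountain-coded systems usually read somewhat more than k fragments.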
17. A DNA data storage device based on erasure coding and assembly techniques, comprising:
erasure code encoding unit: obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
library of DNA modules: the module library contains a × b modules, each group of modules corresponds to one digit of the recoded information, and different modules in each group represent the content of that digit;
a module mapping encoding unit: the module mapping code is used for selecting an adaptive module from the DNA module library and determining the sequence of the modules to obtain a corresponding module combination;
DNA information writing unit: for linking the modules in the combination of modules into corresponding DNA molecular chains to complete the data storage.
18. The DNA data storage apparatus of claim 17, wherein the module map encoding unit further comprises a meta information module mapping unit, a content information module mapping unit, and a fragment combining unit,
the meta information module mapping unit is used for dividing binary information into a plurality of short messages by m bits, wherein each short message has corresponding address information, and recoding the address information according to a bit b system; obtaining a data-address pair corresponding to each recoded short message, and carrying out module mapping on the data-address pair information by using the content information DNA module library so as to obtain a module combination adapted to the data-address pair of each short message and corresponding coding information of each module;
a content information module mapping unit for using the meta information DNA module library to perform module mapping on each bit of information of each meta information, wherein each group of modules corresponds to one bit of recoded information, and different modules in each group represent the content of the bit, so as to obtain a meta information adapted module combination of each short message and the corresponding coding information of each module;
Fragment combining unit: the module combination of the 'data-address pair' used for finding the meta information and content information of each segment of the binary data and the corresponding coding information of each module are formed into a segment combination unit corresponding to the segment; each module of the fragment combination unit is a section of DNA molecular chain containing a specific base sequence, the modules of the meta information are sequentially connected, the terminal module of the meta information is connected with the first module of the content information, and the modules of the content information are sequentially connected; the modules between the segment combining units are connected with the first module of the rear segment combining unit and the tail module of the current segment combining unit, or the tail module of the current segment combining unit is the tail DNA module of the current DNA molecular chain, and the first module of the rear segment combining unit is independently the first DNA module of a DNA molecular chain, so that one or more DNA molecular chains are formed and stored in a concentrated way.
19. The DNA data storage device of claim 17 wherein the DNA data storage device comprises a plurality of DNA storage cells,
the erasure code encoding unit further includes: dividing original binary data according to a certain bit length to obtain k fragments, and obtaining n fragments (n is more than or equal to k) from the divided k fragments; when linear erasure codes are used, the new n fragments keep the original k fragments and add (n-k) fragments newly constructed from the k fragments; when fountain codes are used as erasure codes, each newly generated fragment is obtained by performing an exclusive OR operation on one or more of the k original fragments.
20. A DNA data reading apparatus based on erasure coding and assembly technology, comprising:
module identification unit: carrying out module recognition on the DNA molecular chain;
decoding unit: the module mapping information decoding module is used for decoding module mapping information according to the module information to obtain corresponding binary data;
erasure code decoding unit: and the binary data is used for reconstructing the original data content according to erasure code decoding operation, so as to realize data reading.
21. The apparatus of claim 20, wherein the module identification comprises at least the following identification method: and obtaining a corresponding base sequence through DNA sequencing, and obtaining module information through base sequence matching.
22. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing a method of storing data of any one of claims 1 to 13 or a method of reading data of any one of claims 14 to 16 when the computer program is executed.
23. A computer readable storage medium storing a computer program which when executed by a processor performs the method of data storage of any one of claims 1 to 13 or the method of data reading of any one of claims 14 to 16.
24. A computer program product which, when run on a terminal device, causes the terminal device to perform the method of data storage of any one of claims 1 to 13 above or the method of data reading of claims 14 to 16 above.
25. A molecular module data access method based on erasure codes and assembly technology is characterized in that the storage process further comprises the following steps:
S1, obtaining data to be stored, and obtaining binary data to be encoded through erasure code encoding;
S2, selecting an adaptive module from a preset molecular module library through module mapping coding, and determining the sequence of the modules to obtain a corresponding module combination;
S3, connecting the modules in the module combination into corresponding molecular modules so as to finish the data storage;
the reading process further includes:
S4, carrying out module identification on the molecular module;
S5, decoding module mapping information according to the module information to obtain corresponding binary data;
S6, reconstructing the original data content from the binary data according to an erasure code decoding operation, thereby realizing data reading.
CN202210114165.6A 2022-01-30 2022-01-30 DNA data storage method, reading method and terminal based on erasure codes and assembly technology Pending CN116564424A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210114165.6A CN116564424A (en) 2022-01-30 2022-01-30 DNA data storage method, reading method and terminal based on erasure codes and assembly technology

Publications (1)

Publication Number Publication Date
CN116564424A true CN116564424A (en) 2023-08-08

Family

ID=87498817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210114165.6A Pending CN116564424A (en) 2022-01-30 2022-01-30 DNA data storage method, reading method and terminal based on erasure codes and assembly technology

Country Status (1)

Country Link
CN (1) CN116564424A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116954523A (en) * 2023-09-20 2023-10-27 苏州元脑智能科技有限公司 Storage system, data storage method, data reading method and storage medium
CN116954523B (en) * 2023-09-20 2024-01-26 苏州元脑智能科技有限公司 Storage system, data storage method, data reading method and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination