WO2022120626A1

WO2022120626A1 - DNA-based data storage method and apparatus, DNA-based data recovery method and apparatus, and terminal device

Info

Publication number: WO2022120626A1
Application number: PCT/CN2020/134847
Authority: WO
Inventors: 李敏; 戴俊彪; 王洋; 姜青山; 罗周卿; 姜双英
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2022-06-16

Abstract

Disclosed are a DNA-based data storage method, data recovery method, data storage apparatus and data recovery apparatus, and a terminal device and a computer-readable storage medium. The data storage method comprises: acquiring a data file to be stored, wherein the data file is a file obtained by pre-processing original data according to an algorithm file; editing the data file and the algorithm file according to a preset file format, so as to generate a binary file to be encoded, wherein the file format is used for indicating an index type between the data file and the algorithm file; and encoding the binary file to obtain a base sequence, wherein the base sequence is used for synthesizing a DNA fragment that stores the data file and the algorithm file. The present application addresses the problem that data stored after pre-processing cannot be recovered when there is no guarantee that the pre-processing algorithm used still exists in complete form, thereby ensuring the integrity of the storage and recovery of data in uncertain environments.

Description

DNA-based data storage method, data recovery method, device and terminal equipment

Technical field

The present application belongs to the technical field of data storage, and in particular, relates to a DNA-based data storage method, a data recovery method, a data storage device, a data recovery device, a terminal device, and a computer-readable storage medium.

Background technique

With the rapid development of computer technology and network technology, the volume of data is growing at a rate that will soon exceed the capacity of traditional storage media such as existing hard disks. The deoxyribonucleic acid (DNA) molecule, as a new type of storage medium, has attracted extensive attention in recent years due to its advantages of high storage density, long storage time and low maintenance cost.

At present, many problems remain to be solved in research on the practical application of DNA storage. One advantage of DNA as a storage medium is the stability of DNA molecules, which can be stored for up to a hundred years without human intervention. Most data is preprocessed by some algorithm before being stored. To restore the data after decades or hundreds of years, both the data and the corresponding preprocessing algorithm must be known. However, there is no guarantee that the preprocessing algorithm used will still exist in complete form, in which case the data stored after preprocessing cannot be recovered.

Technical problem

One of the purposes of the embodiments of the present application is to provide a DNA-based data storage method, data recovery method, data storage device, data recovery device, terminal device and computer-readable storage medium, aiming to solve the problem that data stored after preprocessing cannot be recovered because there is no guarantee that the preprocessing algorithm used still exists in complete form.

Technical solutions

In order to solve the above-mentioned technical problems, the technical solutions adopted in the embodiments of the present application are:

In a first aspect, the embodiments of the present application provide a DNA-based data storage method, including:

Acquiring a data file to be stored, where the data file is a file obtained by preprocessing original data according to an algorithm file; editing the data file and the algorithm file according to a preset file format to generate a binary file to be encoded, where the file format is used to indicate the index type between the data file and the algorithm file; and encoding the binary file to obtain a base sequence, where the base sequence is used to synthesize a DNA fragment that stores the data file and the algorithm file.

In a possible implementation manner of the first aspect, before acquiring the data file to be stored, the method further includes: compressing, removing redundancy from, or encrypting the original data according to the algorithm file to obtain the data file.

In a possible implementation manner of the first aspect, the editing of the data file and the algorithm file according to a preset file format to generate a binary file to be encoded includes:

Editing the attribute identification bits and valid data bits of the data file according to the preset file format; determining, according to the attribute identification bits and the valid data bits of the data file, the offset of the algorithm file relative to the valid data bits of the data file; and, based on the offset, editing the valid data bits of the algorithm file to obtain a binary file in which the data file and the algorithm file are located in the same file.

In a possible implementation manner of the first aspect, the encoding of the binary file to obtain a base sequence includes:

Encoding the binary file according to a preset encoding model to obtain the base sequence of the binary file; and adding primer sequences to the head and tail of the base sequence of the binary file to obtain the base sequence used for synthesizing the DNA fragment.
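Exemplarily, the primer-addition step can be sketched in Python as follows. This is only a minimal illustration: the function names and the primer sequences `HEAD_PRIMER` and `TAIL_PRIMER` are hypothetical placeholders rather than values from the present application, and practical primer design constraints (melting temperature, reverse complements, avoidance of primer-like motifs in the payload) are not modelled.

```python
# Hypothetical primer sequences, chosen only for illustration.
HEAD_PRIMER = "ACGTTGCA"
TAIL_PRIMER = "TGCAACGT"


def add_primers(payload: str, head: str = HEAD_PRIMER, tail: str = TAIL_PRIMER) -> str:
    """Return the base sequence to synthesize: head primer + payload + tail primer."""
    return head + payload + tail


def strip_primers(fragment: str, head: str = HEAD_PRIMER, tail: str = TAIL_PRIMER) -> str:
    """Recover the payload from a sequenced fragment, checking both primers."""
    if not (fragment.startswith(head) and fragment.endswith(tail)):
        raise ValueError("fragment does not carry the expected primers")
    return fragment[len(head):len(fragment) - len(tail)]


fragment = add_primers("GATTACA")
assert strip_primers(fragment) == "GATTACA"
```

On recovery, the same primers identify which fragment to sequence and are removed before decoding.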

Editing the first attribute identification bit and the first valid data bit of the data file according to the preset file format to obtain a first binary file corresponding to the data file; and editing the second attribute identification bit and the second valid data bit of the algorithm file according to the preset file format to obtain a second binary file corresponding to the algorithm file; wherein the first binary file and the second binary file are two independent files.

Encoding the first binary file and the second binary file according to a preset encoding model to obtain the first base sequence of the first binary file and the second base sequence of the second binary file; adding the first primer sequence to the head and tail of the first base sequence to obtain the base sequence for synthesizing the first fragment of the DNA fragment; and adding the second primer sequence to the head and tail of the second base sequence to obtain the base sequence for synthesizing the second fragment of the DNA fragment.

Encoding the first binary file and the second binary file according to a preset encoding model to obtain the first base sequence of the first binary file and the second base sequence of the second binary file; adding a head primer sequence and a tail primer sequence to the head and tail of the first base sequence to obtain the base sequence for synthesizing the third fragment of the DNA fragment; and adding a universal primer sequence, together with the tail primer sequence of the one or more first base sequences corresponding to the second base sequence, to the head and tail of the second base sequence to obtain the base sequence for synthesizing the fourth fragment of the DNA fragment.

In the second aspect, the embodiments of the present application provide a DNA-based data recovery method, including:

Acquiring a DNA fragment to be decoded, where the DNA fragment to be decoded is used to store a data file and an algorithm file; decoding the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, where the file format is used to indicate the index type between the data file and the algorithm file; reading the data file and the algorithm file in the binary file, and calling the algorithm file according to the index type; and parsing the data file according to the algorithm file to obtain the original data corresponding to the data file.

In a possible implementation manner of the second aspect, the decoding of the DNA fragment to be decoded to obtain a binary file conforming to a preset file format includes:

Sequencing the DNA fragment to be decoded according to the primer sequences in the DNA fragment to be decoded to obtain the base sequence of the DNA fragment; and decoding the base sequence of the DNA fragment according to a preset decoding model to obtain the binary file; wherein the binary file is a file in which the data file and the algorithm file are located in the same file.

In a possible implementation manner of the second aspect, the attribute identification bits of the data file include an index indicating the index type; the reading of the data file and the algorithm file in the binary file and the calling of the algorithm file according to the index type include:

Reading the attribute identification bits and valid data bits of the data file in the binary file, and determining the index type based on the attribute identification bits of the data file; and reading the valid data bits of the algorithm file in the binary file, and calling the valid data bits of the algorithm file according to the index type.

Sequencing the first fragment among the DNA fragments to be decoded according to the first primer sequence of the first fragment to obtain the first base sequence and the second primer sequence; sequencing the second fragment among the DNA fragments to be decoded according to the second primer sequence to obtain the second base sequence; and decoding the first base sequence and the second base sequence according to a preset decoding model to obtain a first binary file corresponding to the first base sequence and a second binary file corresponding to the second base sequence; wherein the first binary file corresponds to the data file, and the second binary file corresponds to the algorithm file.

Sequencing the third fragment among the DNA fragments to be decoded according to the head primer sequence and the tail primer sequence of the third fragment to obtain the first base sequence; sequencing the fourth fragment among the DNA fragments to be decoded according to the tail primer sequence of the third fragment and the universal primer sequence of the fourth fragment to obtain the second base sequence; and decoding the first base sequence and the second base sequence according to a preset decoding model to obtain a first binary file corresponding to the first base sequence and a second binary file corresponding to the second base sequence; wherein the first binary file corresponds to the data file, and the second binary file corresponds to the algorithm file.

In a possible implementation manner of the second aspect, the first attribute identification bit of the data file includes an index indicating the index type; the reading of the data file and the algorithm file in the binary file and the calling of the algorithm file according to the index type include:

Reading the first attribute identification bit and the first valid data bit of the data file in the first binary file, and determining the index type according to the first attribute identification bit; and reading the second attribute identification bit and the second valid data bit of the algorithm file in the second binary file, and calling the algorithm in the second valid data bit of the second binary file according to the index type.

In a third aspect, an embodiment of the present application provides a DNA-based data storage device, including:

a first acquiring unit, configured to acquire a data file to be stored, where the data file is a file obtained by preprocessing the original data according to the algorithm file;

a first processing unit, configured to edit the data file and the algorithm file according to a preset file format to generate a binary file to be encoded, where the file format is used to indicate the index type between the data file and the algorithm file;

an encoding unit, configured to encode the binary file to obtain a base sequence, where the base sequence is used to synthesize the DNA fragments that store the data file and the algorithm file.

In a fourth aspect, an embodiment of the present application provides a DNA-based data recovery device, including:

a second acquiring unit, configured to acquire DNA fragments to be decoded, and the DNA fragments to be decoded are used to store data files and algorithm files;

a decoding unit, configured to decode the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, where the file format is used to indicate an index type between the data file and the algorithm file;

a second processing unit, configured to read the data file and the algorithm file in the binary file, and call the algorithm file according to the index type;

a parsing unit, configured to parse the data file according to the algorithm file to obtain the original data corresponding to the data file.

In a fifth aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the data storage method of any one of the implementations of the first aspect or the data recovery method of any one of the implementations of the second aspect.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program which, when executed by a processor, implements the data storage method of any one of the implementations of the first aspect or the data recovery method of any one of the implementations of the second aspect.

In a seventh aspect, an embodiment of the present application provides a computer program product which, when run on a terminal device, enables the terminal device to execute the data storage method of any one of the implementations of the first aspect or the data recovery method of any one of the implementations of the second aspect.

It can be understood that, for the beneficial effects of the foregoing second aspect to the seventh aspect, reference may be made to the relevant descriptions in the foregoing first aspect, which will not be repeated here.

Beneficial effects

The beneficial effects of the embodiments of the present application are as follows: the terminal device acquires the data file to be stored, where the data file is a file obtained by preprocessing the original data according to the algorithm file; edits the data file and the algorithm file according to the preset file format to generate the binary file to be encoded, where the file format is used to indicate the index type between the data file and the algorithm file; and encodes the binary file to obtain the base sequence, which is used to synthesize the DNA fragments that store the data file and the algorithm file. By editing the data file and the algorithm file according to the preset file format, setting the index type between them, and encoding the data file and the algorithm file together into a base sequence from which DNA is synthesized and stored, the method reduces dependence on external information, reduces the risk that data becomes unrecoverable because an external algorithm is lost, and ensures the integrity and reliability of large-scale data storage in a long-term uncertain environment. The method is easy to use and practical.

Description of drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the accompanying drawings that are used in the description of the embodiments or exemplary technologies. Obviously, the drawings in the following description are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without any creative effort.

FIG. 1 is a schematic diagram of a system architecture of an application scenario provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a DNA-based data storage method provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a logical relationship between a data file and an algorithm file provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a logical relationship between a data file and an algorithm file provided by another embodiment of the present application;

FIG. 5 is a schematic diagram of a synthetic DNA fragment provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a synthetic DNA fragment provided by another embodiment of the present application;

FIG. 7 is a schematic diagram of a synthetic DNA fragment provided by another embodiment of the present application;

FIG. 8 is a schematic flowchart of a DNA-based data recovery method provided by another embodiment of the present application;

FIG. 9 is a schematic structural diagram of a binary file provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a binary file provided by another embodiment of the present application;

FIG. 11 is a schematic structural diagram of a DNA-based data storage device provided by an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a DNA-based data recovery device provided by an embodiment of the present application;

FIG. 13 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.

Embodiments of the present invention

In the following description, for the purpose of illustration rather than limitation, specific details, such as specific system structures and technologies, are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.

It is to be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described feature, integer, step, operation, element and/or component, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.

It will also be understood that, as used in this specification and the appended claims, the term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.

As used in the specification of this application and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrases "if it is determined" or "if the [described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to the determination", "once the [described condition or event] is detected" or "in response to detection of the [described condition or event]".

In addition, in the description of the specification of the present application and the appended claims, the terms "first", "second", "third", etc. are only used to distinguish the description, and should not be construed as indicating or implying relative importance.

References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," etc. in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments" unless specifically emphasized otherwise. The terms "comprising", "including", "having" and their variants mean "including but not limited to" unless specifically emphasized otherwise.

With the widespread application of digital information and the rapid development of big data science, the amount of data that people generate every day is increasing exponentially, and existing traditional storage media are gradually becoming unable to meet the demand. As a new type of storage medium, DNA molecules have attracted extensive attention due to their advantages of high storage density, long storage time, low maintenance cost, and strong stability.

At present, research on DNA storage focuses mainly on improving storage efficiency and reducing the cost of synthesis and sequencing through efficient encoding and decoding and improved fault tolerance. For DNA-based storage media, the emphasis is on exploring theory and material structure.

The general process of DNA storage is as follows: digital data is encoded into DNA base sequences, DNA fragments are synthesized according to the encoded base sequences, and the fragments are stored in in vivo or in vitro storage media. The DNA fragments can be synthesized by writing the nucleotide base sequences with a synthesizer, and the pooled liquid is then used as the storage medium. To read the data, the DNA is sequenced by a sequencer and the data is recovered by subsequent decoding.

Existing DNA storage research is mainly aimed at efficient encoding and decoding methods, related data modeling, and the materials and structures of DNA storage media. In the DNA-based data storage process, however, digital data is preprocessed before it is stored, for example by compression, redundancy removal, encryption or encoding. To recover the data after it has been stored for decades or hundreds of years, both the data and its corresponding preprocessing algorithm must be known; however, after a long period of time, there is no guarantee that the preprocessing algorithm still exists in complete form. In a practical application scenario where large-scale data is stored for a long time in an uncertain environment, if the external preprocessing algorithm is lost, preprocessed (for example, encrypted) data cannot be recovered completely and reliably.

Please refer to FIG. 1, which is a schematic diagram of a system architecture of an application scenario provided by an embodiment of the present application. FIG. 1 shows an end-to-end, full-process DNA storage system architecture that makes the stored data self-contained and self-recovering. As shown in FIG. 1, before data is stored, digital data in different formats (the raw data) is preprocessed to obtain a data file; the preprocessing methods include but are not limited to compression, redundancy removal, encryption and encoding. At the same time, the algorithm file of the preprocessing algorithm is obtained. The data file and the algorithm file are then defined according to the preset file format to obtain a digital file, so that the data to be encoded is self-contained. The digital file may be a file in binary, quaternary or octal format, which is not specifically limited. Next, computer mathematical algorithms apply different encoding techniques to the digital file to obtain base sequences; through synthetic biology, the base sequences are synthesized into DNA fragments and stored in in vivo or in vitro storage media. During storage, the layout of data in the DNA storage medium is optimized according to the characteristics of the different DNA storage media and in combination with traditional silicon-based storage media, for example by an index algorithm that maps a DNA storage file to its test tube number, to improve the search and read speed for DNA storage files.

When the data needs to be read, the DNA fragments are sequenced to obtain the base sequence, and the base sequence is decoded into a digital file by a computer mathematical algorithm (the inverse of the encoding process). The digital file is a file in binary, quaternary, octal or a similar format that represents the data file and the algorithm file. The data file and the algorithm file in the digital file are read, the data file is parsed according to the algorithm file, and the original data is restored, so that the data is self-parsing.

Through the embodiments of the present application, the algorithm used to preprocess the original data is also stored in DNA, in a defined file format, at the storage stage, so that the complete original data can be restored without the aid of an external algorithm, or with minimal external information. For application scenarios involving large-scale and complex data storage, a unified digital file format is defined so that data stored in DNA is self-contained and self-interpreting, and data and algorithms are associated and managed in a unified manner. This ensures the reliability of large-scale data storage in a long-term uncertain environment and the integrity of data recovery.

The detailed flow of the DNA storage process for data is further described below through specific embodiments.

Referring to Fig. 2, it is a schematic flowchart of a DNA-based data storage method provided by an embodiment of the present application, including the following steps:

Step S201: Obtain a data file to be stored, where the data file is a file obtained by preprocessing the original data according to the algorithm file.

In some embodiments, the data files are files obtained by preprocessing various types of raw data. Various types of raw data include text types (txt format, doc format, etc.), image types (jpg format, etc.), and video types. Different types of raw data correspond to different preprocessing algorithms.

In some embodiments, before acquiring the data file to be stored, the method further includes: compressing, removing redundancy or encrypting the original data according to the algorithm file to obtain the data file.

In some embodiments, in order to maximize the utilization of DNA storage space, data information needs to be pre-processed before being stored in DNA, and the pre-processing methods include but are not limited to processing such as compression, redundancy deletion, or encryption. In the preprocessing process, the purpose of compression can be achieved by removing redundancy. In the DNA storage process, common preprocessing algorithms include Huffman coding, fountain codes or LZMA data compression algorithms.

Huffman coding is a data compression algorithm suited to application scenarios where the characters of the input file occur with unequal probabilities. Fountain codes allow the original data to be recovered with high probability at the cost of some coding overhead, which can greatly improve storage efficiency in DNA storage. The LZMA data compression algorithm makes full use of the structural characteristics of various kinds of original data and provides simple and feasible data compression.
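Exemplarily, the compression form of preprocessing can be sketched with the LZMA implementation in the Python standard library. The function names `preprocess` and `restore` are hypothetical and chosen only for this illustration; the present application does not prescribe a particular implementation.

```python
import lzma


def preprocess(raw: bytes) -> bytes:
    """Compress the original data with LZMA to obtain the data file payload."""
    return lzma.compress(raw)


def restore(payload: bytes) -> bytes:
    """Inverse operation: recover the original data from the payload."""
    return lzma.decompress(payload)


raw = b"DNA storage demo text. " * 50
payload = preprocess(raw)
assert restore(payload) == raw
print(len(raw), "bytes before compression,", len(payload), "bytes after")
```

The inverse function `restore` stands for the part of the algorithm that must remain available at recovery time, which is why the method stores the algorithm file together with the data file.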

Step S202: Edit the data file and the algorithm file according to a preset file format to generate a binary file to be encoded, where the file format is used to indicate an index type between the data file and the algorithm file.

In some embodiments, in order to realize the DNA storage of large-scale, multi-type data in a complex environment, a standard file storage format is preset. Edit the data file and the algorithm file according to the preset file format to obtain the binary file to be encoded. The size unit of the binary file is bytes; the format in the binary file includes an identification bit used to indicate the index type corresponding to the data file and the algorithm file; the index type includes direct index and indirect index. When the index type is direct index, the data file and the algorithm file are edited according to the file format shown in Table 1, and the corresponding binary file to be encoded is obtained.

In some embodiments, editing a data file and an algorithm file according to a preset file format to generate a binary file to be encoded includes: editing the attribute identification bits and valid data bits of the data file according to the preset file format; determining, according to the attribute identification bits and the valid data bits of the data file, the offset of the algorithm file relative to the valid data bits of the data file; and, based on the offset, editing the valid data bits of the algorithm file to obtain a binary file in which the data file and the algorithm file are located in the same file.

Exemplarily, the file format specifies, for each variable name related to the data file and the algorithm file, the corresponding offset address, the number of bytes occupied, and the like. As shown in the data file format in Table 1, the file format of the data file includes the attribute identification bits of the data file and the valid data bits of the data file. The variable names of the attribute identification bits of the data file include the data file start marker DataB, the file type Type, the flag field Flag, the compression method ComS, the compressed data length ComLen, the data length before compression SouLen, and the data start marker PayloadB. The variable name of the valid data bits of the data file is the compressed or uncompressed data Payload, and the variable name of the valid data bits of the algorithm file is the (optional) compression algorithm code or logical representation Algr. Table 1 gives the address offset and size corresponding to each variable name.

The data file start marker DataB indicates the start of the compressed (or uncompressed) data file. The flag field Flag occupies one byte, that is, 8 bits, and each bit represents a binary attribute of the data file. The first bit F1 indicates whether the data is compressed: F1=0 if the data is compressed, and F1=1 if it is not. The second bit F2 indicates the index type of the algorithm that makes the data self-contained and self-parsing: F2=0 means direct index, and F2=1 means indirect index. If the index is indirect, the bits following the valid data bits of the data file are left undetermined. If F1=0, the compression method field is valid. The data start marker PayloadB marks the beginning of the valid data of the data file, and the offset of the compression algorithm can be calculated from this field and the valid data length field. If the data is compressed and the index type is direct index, that is, F1=0 and F2=0 (F1+F2=0), the binary file contains the compression algorithm.
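Exemplarily, a direct-index binary file of this kind can be sketched in Python as follows. Because Table 1 is not reproduced here, every field width, byte order and marker value in the sketch is an assumption made only for illustration, as is the choice of the most significant bit of Flag as the "first bit" F1; only the field names and the meanings of F1 and F2 come from the description above.

```python
import struct
from typing import Tuple

# Assumed layout (not Table 1): DataB(4s) Type(B) Flag(B) ComS(B)
# ComLen(I) SouLen(I) PayloadB(4s), big-endian, no padding.
HEADER = struct.Struct(">4sBBBII4s")
DATA_B, PAYLOAD_B = b"DATA", b"PAYL"  # hypothetical marker values


def make_flag(compressed: bool, direct: bool) -> int:
    """F1 = 0 if compressed else 1; F2 = 0 if direct index else 1."""
    f1 = 0 if compressed else 1
    f2 = 0 if direct else 1
    return (f1 << 7) | (f2 << 6)  # assumption: F1 is the most significant bit


def parse_flag(flag: int) -> Tuple[bool, bool]:
    """Return (compressed, direct) decoded from the Flag byte."""
    return ((flag >> 7) & 1) == 0, ((flag >> 6) & 1) == 0


def pack_direct(payload: bytes, algr: bytes, sou_len: int,
                ftype: int = 1, com_s: int = 1) -> bytes:
    """Place the data file and the algorithm (Algr) in one binary file."""
    header = HEADER.pack(DATA_B, ftype, make_flag(True, True), com_s,
                         len(payload), sou_len, PAYLOAD_B)
    return header + payload + algr


def unpack_direct(blob: bytes) -> Tuple[bytes, bytes, int]:
    """Return (Payload, Algr, SouLen) from a binary file built by pack_direct."""
    data_b, _ftype, flag, _com_s, com_len, sou_len, payload_b = HEADER.unpack_from(blob, 0)
    if data_b != DATA_B or payload_b != PAYLOAD_B:
        raise ValueError("unrecognised file markers")
    compressed, direct = parse_flag(flag)
    start = HEADER.size
    # Offset of Algr = end of the header (PayloadB) + valid data length ComLen.
    algr_offset = start + com_len
    payload = blob[start:algr_offset]
    algr = blob[algr_offset:] if (compressed and direct) else b""
    return payload, algr, sou_len


blob = pack_direct(b"compressed-bytes", b"decompressor-code", sou_len=64)
assert unpack_direct(blob) == (b"compressed-bytes", b"decompressor-code", 64)
```

The line computing `algr_offset` mirrors the statement that the offset of the compression algorithm follows from the data start marker and the valid data length.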

In some embodiments, editing the data file and the algorithm file according to the preset file format to generate the binary file to be encoded includes: editing the first attribute identification bit and the first valid data bit of the data file according to the preset file format to obtain the first binary file corresponding to the data file; and editing the second attribute identification bit and the second valid data bit of the algorithm file according to the preset file format to obtain the second binary file corresponding to the algorithm file; wherein the first binary file and the second binary file are two independent files.

In some embodiments, if the data is compressed and the index type is indirect index, that is, F1=0 and F2=1, the compression algorithm is expressed in a separate file. Table 2 shows the file format of the binary file that corresponds to the algorithm file alone: the variable names of the algorithm file, the address offset of each variable name, the number of bytes it occupies, and its function. The binary file of the algorithm file includes the attribute identification bits and the valid data bits of the algorithm file. The variable names of the attribute identification bits include the algorithm file start marker AlgrB and the compression algorithm name AlgrName; the valid data bits are the specific algorithm AlgrData (that is, the specific algorithm or logical representation of the data compression).
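Exemplarily, a standalone algorithm file for the indirect-index case can be sketched in Python as follows. Table 2 is not reproduced here, so the 4-byte marker value, the 16-byte NUL-padded name field and the lookup of the algorithm by name are all assumptions made for illustration; only the field names AlgrB, AlgrName and AlgrData come from the description.

```python
import struct
from typing import Dict, Tuple

# Assumed layout (not Table 2): AlgrB(4s) AlgrName(16s), then AlgrData.
ALGR_HEADER = struct.Struct(">4s16s")
ALGR_B = b"ALGR"  # hypothetical marker value


def pack_algorithm_file(name: str, algr_data: bytes) -> bytes:
    """Build the second binary file: AlgrB + AlgrName + AlgrData."""
    encoded = name.encode("ascii")
    if len(encoded) > 16:
        raise ValueError("algorithm name longer than the assumed 16-byte field")
    # struct pads the 16s field with NUL bytes when the name is shorter.
    return ALGR_HEADER.pack(ALGR_B, encoded) + algr_data


def unpack_algorithm_file(blob: bytes) -> Tuple[str, bytes]:
    """Return (AlgrName, AlgrData) from an algorithm file."""
    algr_b, raw_name = ALGR_HEADER.unpack_from(blob, 0)
    if algr_b != ALGR_B:
        raise ValueError("unrecognised algorithm file marker")
    return raw_name.rstrip(b"\x00").decode("ascii"), blob[ALGR_HEADER.size:]


# Assumed lookup: one algorithm file can serve several data files by name.
files = [pack_algorithm_file("lzma", b"<decompressor code>")]
registry: Dict[str, bytes] = dict(unpack_algorithm_file(f) for f in files)
assert registry["lzma"] == b"<decompressor code>"
```

Keeping the algorithm in its own file means a single algorithm file can be referenced by many data files instead of being repeated in each one.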

It can be understood that, when the data file and the algorithm file are edited according to the preset file format, the resulting digital file may also be a file of another format, such as a ternary file or a quaternary file, each corresponding to a different file format in which the identification bits are given different meanings. Identification bits expressing the same or similar concepts all fall within the protection scope of the embodiments of the present application.

Table 1

Table 2

Step S203, the binary file is encoded to obtain a base sequence, and the base sequence is used to synthesize the DNA fragments storing the data file and the algorithm file.

In some embodiments, encoding a binary file refers to converting the binary file information that needs to be stored into a DNA base sequence (that is, a sequence containing A, G, C, and T) through a certain correspondence or rule. The base sequence is used to synthesize the DNA fragments that store the data file and the algorithm file.

Among them, different coding models are suitable for different information types, for example, some coding models are suitable for text information, some are only suitable for picture information, and some can be suitable for various types of information. Synthesis of DNA fragments is the process of linking the bases in the base sequence one by one to form a DNA chain.

Exemplarily, during the code conversion process, the conversion may be performed through a conversion model based on a mathematical algorithm. A DNA fragment is composed of four bases: A, G, C, and T. Since data in a computer is in binary form (i.e., 0 and 1), storing data information in DNA means encoding the binary code stream of the data information into a base sequence for DNA storage. According to the structure of DNA, common DNA storage coding models include the binary model, the ternary model, and the quaternary model.

Among them, the binary model defines any two of the four bases A, T, C, and G as 0 and the other two as 1, that is, the base sequence has only two states, 0 and 1. The binary model can better avoid unbalanced GC content and long homopolymer runs in the DNA, which reduces the difficulty of synthesizing the DNA fragments at a later stage. The ternary model means that the entire base sequence has only three states: 0, 1, and 2. The data information to be stored is first edited into a ternary code stream, and then the 0, 1, and 2 in the code stream are encoded according to the correspondence in Table 3 to obtain the base sequence. In the ternary model, the next base is determined by the previous base, which allows more information to be stored.
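Exemplarily, the ternary model can be sketched as a rotating code in which each ternary digit selects one of the three bases that differ from the previous base. Table 3 is not reproduced in this text, so the rotation table `NEXT_BASE`, the fixed six ternary digits per byte, the starting base and the function names below are all assumptions made for illustration, not the correspondence of Table 3.

```python
# Assumed rotation table: for each previous base, the base chosen by trit 0, 1, 2.
# No row contains its own key, so the same base never appears twice in a row.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def bytes_to_trits(data: bytes) -> list:
    """Write each byte as six base-3 digits (3**6 = 729 >= 256), high digit first."""
    trits = []
    for value in data:
        digits = []
        for _ in range(6):
            value, remainder = divmod(value, 3)
            digits.append(remainder)
        trits.extend(reversed(digits))
    return trits


def encode_ternary(trits, previous: str = "A") -> str:
    """Map a ternary code stream to bases; each base depends on the previous one."""
    bases = []
    for trit in trits:
        previous = NEXT_BASE[previous][trit]
        bases.append(previous)
    return "".join(bases)
```

With this assumed table, `encode_ternary([0, 0, 0, 0])` yields `CGTA`, and no encoded sequence contains two identical adjacent bases.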

In addition, the coding model also includes a quaternary model, which maps the bases A, T, C, and G to 0, 1, 2, and 3, and converts the binary code stream to be written into the DNA into a quaternary stream. A commonly used quaternary model mapping relationship is shown in Table 4. The mapping relationship is not unique and admits different combination schemes; Table 4 shows only one of them.

Table 3

Binary data	0000	0101	1010	1111
Corresponding base	AA	TT	CC	GG

Table 4

Understandably, the quaternary model has stronger information storage capacity, and each base can encode two bits of data, which can improve storage efficiency and reduce DNA storage costs.
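Exemplarily, the quaternary model can be sketched with the mapping listed above (0000 to AA, 0101 to TT, 1010 to CC, 1111 to GG), that is, 00 to A, 01 to T, 10 to C and 11 to G. The function name `encode_quaternary` is a hypothetical name used only for illustration, and, as noted above, other mappings are equally possible.

```python
# Two bits per base, following the listed mapping: 00->A, 01->T, 10->C, 11->G.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}


def encode_quaternary(data: bytes) -> str:
    """Encode a binary file as a base sequence, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))
```

For instance, the byte 0x1B (binary 00011011) is encoded as `ATCG`, so each base carries two bits of data.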

In some embodiments, encoding the binary file to obtain the base sequence includes: encoding the binary file according to a preset encoding model to obtain the base sequence of the binary file; and adding primer sequences to the head and tail of the base sequence of the binary file to obtain the base sequence used for synthesizing the DNA fragments.
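Exemplarily, dividing an encoded base sequence into fragments and adding primer sequences to the head and tail of each fragment can be sketched as follows. The fragment length, the short primer strings used in the example and the function name `fragment_with_primers` are hypothetical; fragment addressing, which a practical scheme would need in order to reassemble the fragments, is not described in this passage and is omitted.

```python
VALID_BASES = set("ACGT")


def fragment_with_primers(sequence: str, head_primer: str, tail_primer: str,
                          payload_len: int = 100) -> list:
    """Split a base sequence into payloads and flank each one with the primers."""
    for name, seq in (("sequence", sequence), ("head primer", head_primer),
                      ("tail primer", tail_primer)):
        if not set(seq) <= VALID_BASES:
            raise ValueError(f"{name} contains characters other than A, C, G, T")
    if payload_len <= 0:
        raise ValueError("payload_len must be positive")
    return [
        head_primer + sequence[i:i + payload_len] + tail_primer
        for i in range(0, len(sequence), payload_len)
    ]
```

For example, `fragment_with_primers("ACGTACGTAC", "TT", "GG", payload_len=4)` returns `["TTACGTGG", "TTACGTGG", "TTACGG"]`.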

In some embodiments, the preset coding model includes the above-mentioned binary coding model, ternary coding model and quaternary coding model.

FIG. 3 is a schematic diagram of the logical relationship between a data file and an algorithm file provided by an embodiment of the present application. It shows the direct index type, in which the data and the algorithm are located in the same binary file. After encoding, the data is fragmented, and primer sequences are added to the head and tail of each base fragment. As shown in Figure 3, data 1 is preprocessed by algorithm 1, data 2 is preprocessed by algorithm 1, and data n is preprocessed by algorithm m. Each file x has a pair of primer identifiers, namely the head primer x-F and the tail primer x-R, and different data can correspond to the same or different algorithms. For example, data 1 in file 1 corresponds to algorithm 1, and the head and tail of file 1 carry primer sequences 1-F and 1-R; data 2 in file 2 corresponds to algorithm 1, and the head and tail of file 2 carry primer sequences 2-F and 2-R; data n in file n corresponds to algorithm m, and the head and tail of file n carry primer sequences n-F and n-R. Under this direct index relationship, each binary file x contains the data together with a backup of the algorithm corresponding to that data, so the loss or damage of one binary file does not affect the others. For example, if file 1 is lost, file 2 can still be recovered through its own copy of the algorithm. This arrangement is suitable for application scenarios with very high requirements on data security and reliability.

FIG. 5 is a schematic diagram of synthesizing DNA fragments provided by an embodiment of the present application. It corresponds to the data-to-algorithm logical relationship shown in FIG. 3: the encoded files are divided to obtain DNA fragments that can be stored. When the DNA fragments are synthesized, primer sequences 1-F and 1-R of file 1 are added to the two ends of the base sequence of data 1. For example, 1-F is added to the head of the base sequence of data 1 as the head primer of the DNA fragment corresponding to data 1, and 1-R is added to the tail as the tail primer. The same primer sequences 1-F and 1-R are also added to the two ends of the base sequence of algorithm 1; for example, 1-F is added to the head as the head primer of the DNA fragment corresponding to algorithm 1, and 1-R is added to the tail as the tail primer. 
By analogy, primer sequences n-F and n-R of file n are added to the two ends of the base sequence of data n; for example, n-F is added to the head as the head primer of the DNA fragment corresponding to data n, and n-R is added to the tail as the tail primer. The same primer sequences n-F and n-R are also added to the two ends of the base sequence of algorithm m; for example, n-F is added to the head as the head primer of the DNA fragment corresponding to algorithm m, and n-R is added to the tail as the tail primer. In this way, each encoded file x is divided to obtain DNA fragments that can be stored.

It can be understood that the primer sequence is information stored externally for DNA storage, and the DNA fragment can be amplified and sequenced by means of the primer sequence to obtain the base sequence of the DNA fragment. Adding primer sequences to the head and tail of the base sequence of the binary file is not limited to the head-to-head and tail-to-tail correspondence described above; it suffices that the primer sequences are added at the two ends of the base sequence of the binary file.
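Exemplarily, the direct index arrangement of FIG. 3 and FIG. 5 can be sketched as a pool of fragments in which the data and the algorithm of file x share the primer pair x-F and x-R. The primer and payload strings and the helper names `tag` and `select` are hypothetical, and `select` is only a stand-in for amplification and sequencing with a known primer pair.

```python
def tag(payload: str, primers: tuple) -> str:
    """Flank a base sequence with the (head, tail) primer pair of its file."""
    head, tail = primers
    return head + payload + tail


def select(pool: list, primers: tuple) -> list:
    """Return the payloads of all fragments flanked by the given primer pair."""
    head, tail = primers
    return [
        fragment[len(head):len(fragment) - len(tail)]
        for fragment in pool
        if fragment.startswith(head) and fragment.endswith(tail)
    ]


# Hypothetical primer pairs and payloads for two files that use the same algorithm.
P1, P2 = ("AACC", "GGTT"), ("ACAC", "GTGT")
DATA1, DATA2, ALGO1 = "ATATAT", "CGCGCG", "TTGGAA"

pool = [tag(DATA1, P1), tag(ALGO1, P1), tag(DATA2, P2), tag(ALGO1, P2)]

# Losing every fragment of file 1 leaves file 2 and its copy of algorithm 1 intact.
pool_without_file1 = [f for f in pool if not f.startswith(P1[0])]
```

Because each file carries its own copy of the algorithm, `select(pool_without_file1, P2)` still returns both the data of file 2 and algorithm 1.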

In some embodiments, encoding the binary file to obtain the base sequence includes: encoding the first binary file and the second binary file according to a preset encoding model to obtain the first base sequence of the first binary file and the second base sequence of the second binary file; adding the first primer sequence to the head and tail of the first base sequence to obtain the base sequence for synthesizing the first fragment of the DNA fragments; and adding the second primer sequence to the head and tail of the second base sequence to obtain the base sequence for synthesizing the second fragment of the DNA fragments.

In some embodiments, the first binary file is the binary file of a data file, and the second binary file is the binary file of the algorithm file corresponding to that data file. The first binary file and its corresponding second binary file are collectively referred to as a DNA file. When there are n DNA files, the first to nth DNA files are encoded to obtain the base sequences of the first to nth DNA files. After the base sequence of each DNA file is divided into short fragments, the first primer sequence is added to the head and tail of the first base sequence to obtain the base sequence for synthesizing the first fragment of the DNA fragments, and the second primer sequence is added to the head and tail of the second base sequence to obtain the base sequence for synthesizing the second fragment of the DNA fragments. This step is repeated until the first primer sequence and the second primer sequence have been added to the base sequence of the nth DNA file. The first base sequence is the sequence corresponding to the data file, and the second base sequence is the sequence corresponding to the algorithm file; the first primer sequences corresponding to different DNA files may be sequences containing different base pairs.

As shown in FIG. 4, in order to reduce redundancy in the data storage process and reduce the cost of synthesis and sequencing, another embodiment of the present application provides a schematic diagram of the logical relationship between a data file and an algorithm file, in which the binary file of the data file and the binary file of the algorithm file are stored separately.

After the first binary file corresponding to the data file and the second binary file corresponding to the algorithm file are encoded, the encoded data is fragmented, and primer sequences are added to the head and tail of each base fragment. As shown in Figure 4, both data 1 and data 2 are preprocessed by algorithm 1, and data 3 is preprocessed by algorithm 2. Each file x is identified by a pair of primers, namely the head primer x-F and the tail primer x-R. For example, the head and tail of file 1 carry the first primer sequences 1-F and 1-R, and data file 1 includes data 1 and the second primer sequences 1'-F and 1'-R; the head and tail of file 2 carry the first primer sequences 2-F and 2-R, and data file 2 includes data 2 and the second primer sequences 1'-F and 1'-R; the head and tail of file 3 carry the first primer sequences 3-F and 3-R, and data file 3 includes data 3 and the second primer sequences 2'-F and 2'-R. The second primer sequence is the primer sequence corresponding to the algorithm, and the second primer sequences corresponding to different algorithms may differ in their base sequences. As shown in Figure 4, the second primer sequences 1'-F and 1'-R correspond to algorithm 1, and the second primer sequences 2'-F and 2'-R correspond to algorithm 2.

Here, the primers x-F and x-R corresponding to data file x and the primers x'-F and x'-R corresponding to algorithm file x are two different pairs of primers, and the primer sequences by which the data points to the algorithm are included in the data file.

FIG. 6, which corresponds to FIG. 4, is a schematic diagram of synthesizing DNA fragments provided by another embodiment of the present application. Corresponding to the data-to-algorithm logical relationship shown in Figure 4, the encoded file is divided to obtain DNA fragments that can be stored; the DNA fragment synthesized for a data file is a first fragment, and the DNA fragment synthesized for an algorithm is a second fragment. The data is stored separately from the algorithm, and the first primer sequence may be externally stored data for DNA storage. As shown in Figure 6, the first primer sequences 1-F and 1-R are added to the head and tail of the base sequence of data file 1 to obtain a first fragment; the first primer sequences 2-F and 2-R are added to the head and tail of the base sequence of data file 2 to obtain a first fragment; and the second primer sequences 1'-F and 1'-R are added to the head and tail of the base sequence of algorithm 1 to obtain a second fragment. Likewise, the first primer sequences 3-F and 3-R are added to the head and tail of the base sequence of data file 3 to obtain a first fragment, and the second primer sequences 2'-F and 2'-R are added to the head and tail of the base sequence of algorithm 2 to obtain a second fragment.

It can be understood that the first primer sequence is externally stored information for DNA storage, and the first fragment is amplified and sequenced by means of the first primer sequence to obtain the base sequence of the first fragment. Adding the first primer sequence to the head and tail of the base sequence of the data file, and adding the second primer sequence to the head and tail of the base sequence of the algorithm, are not limited to the head-to-head and tail-to-tail correspondence described above; it suffices that the primer sequences are added at the two ends of the base sequence of the binary file.

Because DNA is stored for a long time in an uncertain environment, as little information as possible should be stored externally, and as much information as possible should be stored in the DNA. It is possible to store only the first primer sequence of the first fragment externally, but this approach requires two rounds of amplification and sequencing. Alternatively, both the first primer sequence of the data file and the second primer sequence of the algorithm can be kept as externally stored information for DNA storage.
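Exemplarily, the indirect index arrangement of FIG. 4 and FIG. 6, together with the two rounds of amplification and sequencing mentioned above, can be sketched as follows. Primer pairs are represented by their labels rather than by real base sequences, and the names `Fragment`, `amplify` and `recover_indirect` and the payload strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Primers = Tuple[str, str]  # (head primer label, tail primer label)


@dataclass(frozen=True)
class Fragment:
    primers: Primers                             # primer pair flanking the fragment
    payload: str                                 # base sequence of the data or algorithm
    algorithm_primers: Optional[Primers] = None  # pointer carried inside a data file


def amplify(pool, primers: Primers):
    """Stand-in for PCR amplification and sequencing with a known primer pair."""
    return [fragment for fragment in pool if fragment.primers == primers]


def recover_indirect(pool, first_primers: Primers):
    """Round 1 reads the data file and its pointer; round 2 reads the algorithm."""
    data = amplify(pool, first_primers)[0]
    algorithm = amplify(pool, data.algorithm_primers)[0]
    return data.payload, algorithm.payload


# Pool following FIG. 4: data 1 and data 2 use algorithm 1, data 3 uses algorithm 2.
ALGO1, ALGO2 = ("1'-F", "1'-R"), ("2'-F", "2'-R")
POOL = [
    Fragment(("1-F", "1-R"), "ACGTAC", ALGO1),
    Fragment(("2-F", "2-R"), "GGTTAA", ALGO1),
    Fragment(("3-F", "3-R"), "CCAATT", ALGO2),
    Fragment(ALGO1, "TTGGAA"),
    Fragment(ALGO2, "AACCGG"),
]
```

Only the first primer pair needs to be known in advance: `recover_indirect(POOL, ("2-F", "2-R"))` returns the payloads of data 2 and algorithm 1, at the cost of a second round of amplification.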

In some embodiments, encoding the binary file to obtain the base sequence includes: encoding the first binary file and the second binary file according to a preset encoding model to obtain the first base sequence of the first binary file and the second base sequence of the second binary file; adding a head primer sequence and a tail primer sequence to the head and tail of the first base sequence to obtain the base sequence for synthesizing the third fragment of the DNA fragments; and adding a universal primer sequence and the tail primer sequences of the one or more first base sequences corresponding to the second base sequence to the head and tail of the second base sequence to obtain the base sequence for synthesizing the fourth fragment of the DNA fragments.

In some embodiments, in order to reduce the number of sequencing runs and the cost, the above encoding method is further improved. FIG. 7 is a schematic diagram of synthesizing DNA fragments provided in another embodiment of the present application. In the logical relationship between the data and the algorithm shown in (a) in Figure 7, different data may correspond to the same algorithm; for example, data 1, data 2, and data 3 are all preprocessed by algorithm 1, and data 4 is preprocessed by algorithm 2. However, the pointer direction between the algorithm file and the data file is reversed, so that the algorithm file points to the data file. The binary files of the data files and the binary files of the algorithm files are stored separately. After the first binary file corresponding to the data file and the second binary file corresponding to the algorithm file are encoded, the encoded data is fragmented, and primer sequences are added to the head and tail of each base fragment.

As shown in (a) of FIG. 7, data 1, data 2, and data 3 are all preprocessed by algorithm 1, and data 4 is preprocessed by algorithm 2. Each file x is identified by a pair of primers, namely the head primer x-F and the tail primer x-R. The primers corresponding to data file x are x-F and x-R, and the primers corresponding to the algorithm file include a universal primer and one or more primers x-R. The primer x-R of data file x and the primer x-R of the corresponding algorithm file are the same primer, so that the algorithm points to the data. File 1, file 2, and so on in (a) in FIG. 7 refer to DNA files, each including a data file and an algorithm file.

In some embodiments, the head primer sequence and the tail primer sequence are added to both ends of the first base sequence, respectively, and the universal primer sequence and the one or more first base sequences corresponding to the second base sequence are added. The tail primer sequences are respectively added to both ends of the second base sequence; wherein, the first base sequence is the fragment corresponding to the data file, and the second base sequence is the fragment corresponding to the algorithm file.

As shown in (b) of FIG. 7, the head primer sequence 1-F and the tail primer sequence 1-R are added to the base sequence of data 1, the head primer sequence 2-F and the tail primer sequence 2-R are added to the base sequence of data 2, and the head primer sequence 3-F and the tail primer sequence 3-R are added to the base sequence of data 3; the universal primer sequence is added to the head of the base sequence of algorithm 1, and the tail primer sequence 1-R of data 1, the tail primer sequence 2-R of data 2, and the tail primer sequence 3-R of data 3 are added to its tail. Likewise, the head primer sequence 4-F and the tail primer sequence 4-R are added to the base sequence of data 4, the universal primer sequence is added to the head of the base sequence of algorithm 2, and the tail primer sequence 4-R corresponding to data 4 is added to its tail. The DNA fragments corresponding to each data file and algorithm file are thus synthesized; the DNA fragments corresponding to the data are classified as third fragments, and the DNA fragments corresponding to the algorithms are classified as fourth fragments.

In some embodiments, the universal primer sequence and the head primer sequences of one or more first base sequences corresponding to the second base sequence may also be added to both ends of the second base sequence, respectively.

It can be understood that the primers x-F and x-R corresponding to data file x, as well as the universal primer and the one or more head (or tail) primer sequences corresponding to the algorithm file, are known primer sequences, which can be externally stored information for DNA storage. With these known primer sequences, the DNA fragments need to be sequenced only once to obtain the base sequences corresponding to the data file and the algorithm file.

In addition, adding a head primer sequence and a tail primer sequence to the head and tail of the first base sequence, and adding a universal primer sequence and the tail (or head) primer sequences of the one or more first base sequences corresponding to the second base sequence to the head and tail of the second base sequence, are not limited to the head-to-head and tail-to-tail correspondence described above; it suffices that the primer sequences are added at the two ends of the base sequence of the binary file.

With the above embodiment, only the primer sequences of the data file need to be saved. In the decoding and recovery process, the DNA sequences of the data file and the algorithm file can be amplified simultaneously according to the primer sequences of the data file and the universal primer sequence, and the data file and the algorithm file can be decoded at the same time, which reduces the amount of information that needs to be saved externally. By storing the data files and the algorithm files separately, concurrent amplification and sequencing of the data files and the algorithm files is realized.
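Exemplarily, the arrangement of (b) in FIG. 7 can be sketched as follows: each algorithm fragment carries the universal primer at its head and the tail primers of all data files it serves, so a single round with the primers x-F, x-R and the universal primer retrieves both the data and the algorithm. Primer labels stand in for real base sequences, and the names `Fragment`, `UNIVERSAL` and `amplify_once` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet

UNIVERSAL = "U"  # label standing in for the universal primer sequence


@dataclass(frozen=True)
class Fragment:
    head: str              # head primer label (x-F, or the universal primer)
    tails: FrozenSet[str]  # tail primer labels carried by the fragment
    payload: str           # base sequence of the data or the algorithm


def amplify_once(pool, head: str, tail: str):
    """One round with primers {head, universal, tail}: data and algorithm together."""
    return [
        fragment for fragment in pool
        if fragment.head in (head, UNIVERSAL) and tail in fragment.tails
    ]


# Pool following FIG. 7: algorithm 1 serves data 1 to 3, algorithm 2 serves data 4.
POOL = [
    Fragment("1-F", frozenset({"1-R"}), "ACGTAC"),
    Fragment("2-F", frozenset({"2-R"}), "GGTTAA"),
    Fragment("3-F", frozenset({"3-R"}), "CCAATT"),
    Fragment(UNIVERSAL, frozenset({"1-R", "2-R", "3-R"}), "TTGGAA"),
    Fragment("4-F", frozenset({"4-R"}), "GATTAC"),
    Fragment(UNIVERSAL, frozenset({"4-R"}), "AACCGG"),
]
```

In this sketch, `amplify_once(POOL, "2-F", "2-R")` returns the fragment of data 2 and the fragment of algorithm 1 in a single pass, whereas the indirect index arrangement of FIG. 6 needs two rounds.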

FIG. 8 is a schematic flowchart of a DNA-based data recovery method provided by another embodiment of the present application. The DNA-based data recovery method, as the inverse operation of DNA-based data storage, realizes self-recovery of the stored original data. The data preprocessing algorithm is stored in the DNA. When the data is read, the system finds the primer sequences of the corresponding file and obtains the data file and the executable algorithm file at the same time through PCR amplification and sequencing. After decoding, the executable algorithm in the same directory can automatically parse the data file and restore the original data, realizing self-interpretation of the data. As shown in Figure 8, the process includes:

In step S801, the DNA fragment to be decoded is acquired, and the DNA fragment to be decoded is used to store the data file and the algorithm file.

In some embodiments, the data files and algorithm files are stored in the in vivo and ex vivo storage media in the form of DNA fragments. When the data needs to be read, the system can find the corresponding DNA storage file and the corresponding primer sequence. Data files can be various types of information such as text, picture, and video. The data files are obtained by preprocessing the original data through the algorithm files. The preprocessing algorithms include compression, redundancy deletion, encryption and other preprocessing algorithms.

Step S802, decoding the DNA segment to be decoded to obtain a binary file conforming to a preset file format, where the file format is used to indicate the index type between the data file and the algorithm file.

In some embodiments, the process of decoding the DNA fragments is the inverse of the encoding process.

In some embodiments, decoding the DNA fragment to be decoded to obtain a binary file conforming to a preset file format includes: sequencing the DNA fragment to be decoded according to the primer sequences in the DNA fragment to be decoded to obtain the base sequence of the DNA fragment; and decoding the base sequence of the DNA fragment according to a preset decoding model to obtain the binary file.

In the decoding operation process corresponding to Fig. 5, according to the primer sequences, the DNA fragments are amplified by PCR technology, and then sequenced to obtain the base sequence of the data and the algorithm. The preset decoding model is an inverse operation model of the encoding model, and through the conversion relationship of the decoding model, the data and the base sequence of the algorithm are converted into corresponding binary files.
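Exemplarily, a decoding model that is the inverse of a quaternary encoding model can be sketched as follows, assuming the mapping 00 to A, 01 to T, 10 to C and 11 to G and assuming the primer sequences have already been removed from the sequenced fragment. The function name `decode_quaternary` is hypothetical.

```python
# Inverse of the assumed quaternary mapping: A->00, T->01, C->10, G->11.
BASE_TO_BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}


def decode_quaternary(bases: str) -> bytes:
    """Convert a base sequence back into the bytes of the binary file."""
    if len(bases) % 4 != 0:
        raise ValueError("a whole number of bytes needs a multiple of four bases")
    bits = "".join(BASE_TO_BITS[base] for base in bases)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

For example, `decode_quaternary("ATCG")` returns the single byte 0x1B, which is the byte that the corresponding quaternary encoding maps to `ATCG`.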

In this case, the data file and the algorithm file are located in the same binary file, and the index type is direct index. FIG. 9 is a schematic structural diagram of a binary file provided by an embodiment of the present application. The data and the algorithm are located in the same binary file, which includes the attribute identification bits of the data file, the valid data bits of the data file, and the valid data bits of the algorithm file. The variable name corresponding to each identification bit is shown in (a) in Figure 9; the attribute identification bits of the data file include the data file start marker, the data file type, the binary attribute flag, the compression method, the data length after compression, the data length before compression, the data start marker, and so on. The offset of the valid data bits of the algorithm file can be determined from the data start marker field and the length of the valid data bits of the data file.
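Exemplarily, locating the algorithm in a direct-index binary file can be sketched as follows. Table 1 is not reproduced in this text, so the field widths, the marker values `DATB` and `DS`, and the names `HEADER`, `pack_direct` and `unpack_direct` are all assumptions chosen only to illustrate how the algorithm offset follows from the data start marker and the valid data length.

```python
import struct

# Assumed layout (not the layout of Table 1), big-endian, no padding:
# start marker (4 bytes), file type (4 bytes), Flag (1 byte), compression mode (1 byte),
# compressed length (4 bytes), original length (4 bytes), data start marker (2 bytes).
HEADER = struct.Struct(">4s4sBBII2s")


def pack_direct(data: bytes, algorithm: bytes, orig_len: int,
                file_type: bytes = b"TXT ", mode: int = 1) -> bytes:
    """Build a direct-index file: header, valid data, then the algorithm."""
    flag = 0x00  # F1=0 (compressed) and F2=0 (direct index)
    header = HEADER.pack(b"DATB", file_type, flag, mode, len(data), orig_len, b"DS")
    return header + data + algorithm


def unpack_direct(blob: bytes):
    """Return (valid data, algorithm) using the offset implied by the header."""
    marker, _type, _flag, _mode, comp_len, _orig_len, start = HEADER.unpack_from(blob, 0)
    if marker != b"DATB" or start != b"DS":
        raise ValueError("not a direct-index file in the assumed layout")
    data_offset = HEADER.size                 # valid data begins after the start marker
    algorithm_offset = data_offset + comp_len  # algorithm follows the valid data
    return blob[data_offset:algorithm_offset], blob[algorithm_offset:]
```

Under this assumed layout the header occupies 20 bytes, so the algorithm begins at byte 20 plus the compressed data length.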

Wherein, as shown in (b) of FIG. 9 , the first bit F1 of the binary attribute flag indicates whether the original data is compressed. For example, F1=0 indicates that the original data is compressed, and F1=1 indicates that the original data is not compressed. The second bit F2 of the binary attribute flag indicates the index type between the data file and the algorithm file, F2=0 indicates the direct index, and F2=1 indicates the indirect index.

In some embodiments, decoding the DNA fragment to be decoded to obtain a binary file conforming to a preset file format includes: sequencing the first fragment of the DNA fragments to be decoded according to the first primer sequence of the first fragment to obtain a first base sequence and a second primer sequence; sequencing the second fragment of the DNA fragments to be decoded according to the second primer sequence to obtain a second base sequence; and decoding the first base sequence and the second base sequence according to a preset decoding model to obtain a first binary file corresponding to the first base sequence and a second binary file corresponding to the second base sequence. The first binary file corresponds to a data file, and the second binary file corresponds to an algorithm file.

In some embodiments, corresponding to the DNA fragments shown in FIG. 6, with the first primer sequence known, PCR technology is used to amplify and sequence the first fragment to obtain the base sequence of data 1 and the second primer sequence. Using the same PCR technology and the second primer sequence, the second fragment is amplified and sequenced to obtain the base sequence corresponding to the algorithm file. According to the inverse operation of the encoding model (the decoding model), the base sequence of the data file and the base sequence of the algorithm file are decoded to obtain the binary file of the data file and the binary file of the algorithm file.

Exemplarily, in the decoding process corresponding to the DNA fragments shown in FIG. 6, the first fragment can be amplified according to the first primer sequences 1-F and 1-R and then sequenced to obtain the base sequence of data file 1, which includes the first base sequence corresponding to data 1 and the second primer sequences 1'-F and 1'-R; according to the second primer sequences 1'-F and 1'-R, the second fragment is amplified and then sequenced to obtain the base sequence corresponding to algorithm 1. Similarly, the first fragment can be amplified according to the first primer sequences 2-F and 2-R and then sequenced to obtain the base sequence of data file 2, which includes the first base sequence corresponding to data 2 and the second primer sequences 1'-F and 1'-R; according to the second primer sequences 1'-F and 1'-R, the second fragment is amplified and then sequenced to obtain the base sequence corresponding to algorithm 1. Likewise, the first fragment can be amplified according to the first primer sequences 3-F and 3-R and then sequenced to obtain the base sequence of data file 3, which includes the first base sequence corresponding to data 3 and the second primer sequences 2'-F and 2'-R; according to the second primer sequences 2'-F and 2'-R, the second fragment is amplified and then sequenced to obtain the base sequence corresponding to algorithm 2.

FIG. 10 is a schematic structural diagram of a binary file provided by another embodiment of the present application. The binary file of the data file and the binary file of the algorithm file are two separate files, and the index type is indirect index. As shown in (a) in Figure 10, the first binary file corresponding to the data file includes the first attribute identification bits of the data file and the first valid data bits of the data file; the first attribute identification bits include variable fields such as the data file start marker, the data file type, the binary attribute flag, the compression method, the data length after compression, the data length before compression, and the data start marker. The binary attribute flag field occupies one byte (eight bits); the first bit F1 indicates whether the original data is preprocessed, and the second bit F2 indicates the index type between the data file and the algorithm file. F1=0 means that the original data is preprocessed by the algorithm, and F1=1 means that the original data is not preprocessed; F2=0 means direct index, and F2=1 means indirect index.

As shown in (c) of FIG. 10, the second binary file corresponding to the algorithm file includes the second attribute identification bits of the algorithm file and the second valid data bits of the algorithm file. The second attribute identification bits include the algorithm file start marker field and the algorithm name field. The second valid data bits represent the specific preprocessing algorithm employed.

The first binary file corresponds to a data file, and the second binary file corresponds to an algorithm file.

In some embodiments, decoding the DNA fragment to be decoded to obtain a binary file conforming to a preset file format includes: sequencing the third fragment of the DNA fragments to be decoded according to the head primer sequence and the tail primer sequence of the third fragment to obtain the first base sequence; sequencing the fourth fragment of the DNA fragments to be decoded according to the tail primer sequence of the third fragment and the universal primer sequence of the fourth fragment to obtain the second base sequence; and decoding the first base sequence and the second base sequence according to the preset decoding model to obtain the first binary file corresponding to the first base sequence and the second binary file corresponding to the second base sequence.

In some embodiments, corresponding to (b) in FIG. 7, only the primer sequences corresponding to the data files need to be stored externally. The DNA fragments are sequenced by reading the primer sequences corresponding to the data files, and the base sequences corresponding to the data files and the algorithm files are obtained at the same time.

Exemplarily, with the head primer sequence and tail primer sequence corresponding to the data file and the universal primer sequence of the algorithm file known, the third fragment and the fourth fragment are amplified simultaneously using PCR technology and then sequenced to obtain the base sequence corresponding to the data file and the base sequence corresponding to the algorithm file. For example, using the head primer sequence 1-F, the tail primer sequence 1-R, and the universal primer sequence, the third fragment corresponding to data 1 and the fourth fragment corresponding to algorithm 1 are amplified simultaneously and then sequenced to obtain the base sequence corresponding to data 1 and the base sequence corresponding to algorithm 1. By decoding the base sequence of the data file and the base sequence of the algorithm file, the first binary file and the second binary file in the file format shown in FIG. 10 are obtained.

Step S803, read the data file and the algorithm file in the binary file, and call the algorithm file according to the index type.

In some embodiments, as shown in FIG. 9, when the index type between the data file and the algorithm file is a direct index, the attribute identification bits of the data file include an identifier indicating the index type. Reading the data file and the algorithm file in the binary file and calling the algorithm file according to the index type includes: reading the attribute identification bits and the valid data bits of the data file in the binary file, and determining the index type based on the attribute identification bits of the data file; and reading the valid data bits of the algorithm file in the binary file, and calling the algorithm of those valid data bits according to the index type.

In some embodiments, as shown in FIG. 10, when the index type between the data file and the algorithm file is an indirect index, the first attribute identification bit of the data file includes an identification indicating the index type. Reading the data file and the algorithm file in the binary file, and calling the algorithm file according to the index type, includes: reading the first attribute identification bit and the first valid data bit of the data file in the first binary file, and determining the index type according to the first attribute identification bit; and reading the second attribute identification bit and the second valid data bit of the algorithm file in the second binary file, and calling the algorithm in the second valid data bit of the second binary file according to the index type.
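The indirect index case can be sketched in the same spirit, with the data file and the algorithm file held in two independent binary files. The one-byte attribute values `INDIRECT_INDEX` and `ALGORITHM_FILE` and the function name `read_indirect` are hypothetical; the real attribute identification bits are those of FIG. 10.

```python
INDIRECT_INDEX = 0x02   # hypothetical identifier for an indirect index
ALGORITHM_FILE = 0xA0   # hypothetical attribute value marking an algorithm file


def read_indirect(first_file: bytes, second_file: bytes):
    """Read the data file and the algorithm file from two separate binary files."""
    # First attribute identification bit and first valid data bits.
    index_type, data = first_file[0], first_file[1:]
    if index_type != INDIRECT_INDEX:
        raise ValueError("first binary file does not use an indirect index")
    # Second attribute identification bit and second valid data bits.
    attribute, algorithm = second_file[0], second_file[1:]
    if attribute != ALGORITHM_FILE:
        raise ValueError("second binary file is not marked as an algorithm file")
    return data, algorithm


first_file = bytes([INDIRECT_INDEX]) + b"payload"
second_file = bytes([ALGORITHM_FILE]) + b"algo"
data, algorithm = read_indirect(first_file, second_file)
```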

Step S804: Perform parsing processing on the data file according to the algorithm file to obtain original data corresponding to the data file.
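To make the idea of parsing the data file with the recovered algorithm file concrete, the sketch below treats the algorithm file as a small piece of Python source that defines a function named `restore`. This representation, the name `restore`, and the use of zlib compression as the preprocessing step are all assumptions for illustration; the application does not prescribe how the algorithm file is encoded or invoked.

```python
import zlib

# Hypothetical algorithm file: source code that undoes the preprocessing.
ALGORITHM_SOURCE = b"""
import zlib

def restore(data):
    return zlib.decompress(data)
"""


def parse_with_algorithm(data_file: bytes, algorithm_file: bytes) -> bytes:
    """Load the recovered algorithm and apply it to the recovered data file."""
    namespace = {}
    # Executing recovered code is only acceptable in a trusted or sandboxed
    # environment; this call is for demonstration only.
    exec(algorithm_file.decode("utf-8"), namespace)
    return namespace["restore"](data_file)


original = b"DNA storage " * 8
data_file = zlib.compress(original)   # preprocessing done before storage
recovered = parse_with_algorithm(data_file, ALGORITHM_SOURCE)
```

Because the algorithm travels with the data, the recovery side does not need to know in advance which preprocessing was applied.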

In practical applications, DNA storage must preserve data for a long time, during which the data preprocessing algorithm may be lost. Through the embodiments of the present application, in order to ensure the security and integrity of the data in a long-term uncertain environment, the compression algorithm is stored in the DNA fragments in a specific file format. On the basis of controlling the amount of data redundancy and reducing the complexity of data reading, this ensures that the data is self-interpreting and self-recoverable.

It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

Corresponding to the DNA-based data storage method described in the above embodiments, FIG. 11 shows a structural block diagram of the DNA-based data storage device provided by an embodiment of the present application. For ease of description, only the parts relevant to the embodiments of the present application are shown.

Referring to FIG. 11, the device includes:

The first obtaining unit 111 is configured to obtain a data file to be stored, where the data file is a file obtained by preprocessing the original data according to the algorithm file;

The first processing unit 112 is configured to edit the data file and the algorithm file according to a preset file format to generate a binary file to be encoded, where the file format is used to indicate the index type between the data file and the algorithm file;

The encoding unit 113 is configured to encode the binary file to obtain a base sequence, and the base sequence is used for synthesizing a DNA fragment storing the data file and the algorithm file.
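The role of the encoding unit can be illustrated with a very simple mapping of two bits per base. This is a sketch under stated assumptions, not the preset encoding model of the application: the mapping `A=00, C=01, G=10, T=11` and the function names are chosen for the example, and practical encoders typically add constraints such as limits on homopolymer runs and GC content, as well as error correction.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_bases(blob: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in blob
        for shift in (6, 4, 2, 0)
    )


def bases_to_bytes(sequence: str) -> bytes:
    """Decode a base sequence produced by bytes_to_bases."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_bases(b"\x1b")   # 0b00011011
round_trip = bases_to_bytes(bytes_to_bases(b"DNA"))
```

Primer sequences would then be attached to the head and tail of the resulting base sequence before synthesis.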

Corresponding to the DNA-based data recovery method described in the above embodiments, FIG. 12 shows a structural block diagram of the DNA-based data recovery apparatus provided by an embodiment of the present application. For ease of description, only the parts relevant to the embodiments of the present application are shown.

Referring to FIG. 12, the device includes:

The second obtaining unit 121 is configured to obtain DNA fragments to be decoded, and the DNA fragments to be decoded are used to store data files and algorithm files;

The decoding unit 122 is configured to perform decoding processing on the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, where the file format is used to indicate the index type between the data file and the algorithm file;

A second processing unit 123, configured to read the data file and the algorithm file in the binary file, and call the algorithm file according to the index type;

The parsing unit 124 is configured to perform parsing processing on the data file according to the algorithm file to obtain original data corresponding to the data file.

It should be noted that the information exchange, execution processes, and other contents between the above-mentioned devices/units are based on the same concept as the method embodiments of the present application. For their specific functions and technical effects, reference may be made to the method embodiments, and details are not repeated here.

Those skilled in the art can clearly understand that, for convenience and simplicity of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above functions may be allocated to different functional units or modules as required, that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated in one processing unit, or each unit may exist physically alone, or two or more units may be integrated in one unit. The above-mentioned integrated units may be implemented in the form of hardware or in the form of software functional units. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above-mentioned system, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

FIG. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in FIG. 13, the terminal device 13 of this embodiment includes: at least one processor 130 (only one is shown in FIG. 13), a memory 131, and a computer program 132 stored in the memory 131 and executable on the at least one processor 130. The processor 130 implements the steps in any of the above method embodiments when executing the computer program 132.

The terminal device 13 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 130 and the memory 131. Those skilled in the art can understand that FIG. 13 is only an example of the terminal device 13 and does not constitute a limitation on the terminal device 13, which may include more or fewer components than shown, combine some components, or use different components. For example, it may also include input and output devices, network access devices, and the like.

The processor 130 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory 131 may be an internal storage unit of the terminal device 13 in some embodiments, such as a hard disk or a memory of the terminal device 13 . The memory 131 may also be an external storage device of the terminal device 13 in other embodiments, such as a plug-in hard disk equipped on the terminal device 13, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Further, the memory 131 may also include both an internal storage unit of the terminal device 13 and an external storage device. The memory 131 is used to store an operating system, an application program, a boot loader (Boot Loader), data, and other programs, for example, program codes of the computer program, and the like. The memory 131 may also be used to temporarily store data that has been output or will be output.

Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.

The embodiments of the present application provide a computer program product. When the computer program product runs on a mobile terminal, the mobile terminal implements the steps in the foregoing method embodiments.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of each of the above method embodiments can be implemented. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc. In some jurisdictions, under legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.

In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are only illustrative. The division of the modules or units is only a logical function division, and in actual implementation there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the above-mentioned embodiments, those of ordinary skill in the art should understand that the technical solutions described in the above embodiments can still be modified, or some technical features thereof can be equivalently replaced. Such modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included within the protection scope of the present application.

Claims

A DNA-based data storage method, comprising:

Obtaining a data file to be stored, the data file is a file obtained by preprocessing the original data according to the algorithm file;

Edit the data file and the algorithm file according to a preset file format to generate a binary file to be encoded, and the file format is used to indicate the index type between the data file and the algorithm file;

The binary file is encoded to obtain a base sequence, and the base sequence is used to synthesize the DNA fragments storing the data file and the algorithm file.
The method according to claim 1, wherein before the acquiring the data file to be stored, the method further comprises:

The original data is subjected to compression, redundancy, or encryption processing according to the algorithm file to obtain the data file.
The method according to claim 1 or 2, wherein the editing of the data file and the algorithm file according to a preset file format to generate a binary file to be encoded comprises:

Edit the attribute identification bits and valid data bits of the data file according to the preset file format;

Determine the offset of the algorithm file relative to the valid data bits of the data file according to the attribute identification bits and the valid data bits of the data file;

Based on the offset, the effective data bits of the algorithm file are edited to obtain a binary file in which the data file and the algorithm file are located in the same file.
The method of claim 3, wherein the encoding the binary file to obtain a base sequence comprises:

According to a preset encoding model, the binary file is encoded to obtain the base sequence of the binary file;

Primer sequences are added to the head and tail of the base sequence of the binary file to obtain base sequences for synthesizing the DNA fragments.
The method according to claim 1 or 2, wherein the editing of the data file and the algorithm file according to a preset file format to generate a binary file to be encoded comprises:

According to the preset file format, edit the first attribute identification bit and the first valid data bit of the data file to obtain the first binary file corresponding to the data file;

According to the preset file format, edit the second attribute identification bit and the second valid data bit of the algorithm file to obtain the second binary file corresponding to the algorithm file;

Wherein, the first binary file and the second binary file are two files independent of each other.
The method of claim 5, wherein the encoding the binary file to obtain a base sequence comprises:

According to a preset encoding model, the first binary file and the second binary file are encoded to obtain the first base sequence of the first binary file and the second base sequence of the second binary file;

adding a first primer sequence to the head and tail of the first base sequence to obtain a base sequence for synthesizing the first fragment of the DNA fragments;

A second primer sequence is added to the head and tail of the second base sequence to obtain a base sequence for synthesizing the second fragment of the DNA fragments.
The method of claim 5, wherein the encoding the binary file to obtain a base sequence comprises:

According to a preset encoding model, the first binary file and the second binary file are encoded to obtain the first base sequence of the first binary file and the second base sequence of the second binary file;

adding a head primer sequence and a tail primer sequence to the head and tail of the first base sequence to obtain a base sequence for synthesizing the third fragment in the DNA fragment;

A universal primer sequence and one or more tail primer sequences of the first base sequence corresponding to the second base sequence are added to the head and tail of the second base sequence to obtain a base sequence for synthesizing the fourth fragment of the DNA fragments.
A DNA-based data recovery method, comprising:

Obtaining the DNA fragment to be decoded, the DNA fragment to be decoded is used to store data files and algorithm files;

Decoding the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, where the file format is used to indicate an index type between the data file and the algorithm file;

Read the data file and the algorithm file in the binary file, and call the algorithm file according to the index type;

Perform parsing processing on the data file according to the algorithm file to obtain original data corresponding to the data file.
The method of claim 8, wherein the decoding process is performed on the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, comprising:

According to the primer sequences in the DNA fragment to be decoded, the DNA fragment to be decoded is sequenced to obtain the base sequence of the DNA fragment;

According to a preset decoding model, the base sequence of the DNA fragment is decoded to obtain the binary file;

The binary file is a file in which the data file and the algorithm file are located in the same file.
The method of claim 9, wherein the attribute identification bit of the data file includes an identification indicating an index type;

The reading the data file and the algorithm file in the binary file, and calling the algorithm file according to the index type, includes:

Read the attribute identification bits and valid data bits of the data file of the binary file, and determine the index type based on the attribute identification bits of the data file;

The valid data bits of the algorithm file of the binary file are read, and the valid data bits of the algorithm file are called according to the index type.
The method of claim 8, wherein the decoding process is performed on the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, comprising:

According to the first primer sequence of the first fragment in the DNA fragments to be decoded, the first fragment is sequenced to obtain a first base sequence and a second primer sequence;

According to the second primer sequence, the second fragment in the DNA fragment to be decoded is sequenced to obtain a second base sequence;

According to a preset decoding model, the first base sequence and the second base sequence are decoded to obtain the first binary file corresponding to the first base sequence and the second binary file corresponding to the second base sequence;

The first binary file corresponds to the data file, and the second binary file corresponds to the algorithm file.
The method of claim 8, wherein the decoding process is performed on the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, comprising:

According to the head primer sequence and the tail primer sequence of the third fragment in the DNA fragment to be decoded, the third fragment is sequenced to obtain the first base sequence;

According to the tail primer sequence of the third fragment and the universal primer sequence of the fourth fragment in the DNA fragments to be decoded, the fourth fragment is sequenced to obtain the second base sequence;

According to a preset decoding model, the first base sequence and the second base sequence are decoded to obtain the first binary file corresponding to the first base sequence and the second binary file corresponding to the second base sequence;

The first binary file corresponds to the data file, and the second binary file corresponds to the algorithm file.
The method according to claim 11 or 12, wherein the first attribute identification bit of the data file includes an identification indicating an index type;

The reading the data file and the algorithm file in the binary file, and calling the algorithm file according to the index type, includes:

Read the first attribute identification bit and the first valid data bit of the data file in the first binary file, and determine the index type according to the first attribute identification bit;

Reading the second attribute identification bit and the second valid data bit of the algorithm file in the second binary file, and calling the algorithm in the second valid data bit of the second binary file according to the index type.
A DNA-based data storage device, comprising:

a first acquiring unit, configured to acquire a data file to be stored, where the data file is a file obtained by preprocessing the original data according to the algorithm file;

The first processing unit is configured to edit the data file and the algorithm file according to a preset file format to generate a binary file to be encoded, where the file format is used to indicate the index type between the data file and the algorithm file;

The coding unit is used for coding the binary file to obtain a base sequence, and the base sequence is used for synthesizing the DNA fragments storing the data file and the algorithm file.
A DNA-based data recovery device, comprising:

a second acquiring unit, configured to acquire DNA fragments to be decoded, and the DNA fragments to be decoded are used to store data files and algorithm files;

a decoding unit, configured to decode the DNA fragment to be decoded to obtain a binary file conforming to a preset file format, where the file format is used to indicate an index type between the data file and the algorithm file;

a second processing unit, configured to read the data file and the algorithm file in the binary file, and call the algorithm file according to the index type;

The parsing unit is configured to perform parsing processing on the data file according to the algorithm file to obtain original data corresponding to the data file.
A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method according to any one of claims 1 to 7 or any one of claims 8 to 13 is implemented.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1 to 7 or any one of claims 8 to 13 is implemented.