WO2019037117A1

WO2019037117A1 - Encoding and decoding method, device and data processing device

Info

Publication number: WO2019037117A1
Application number: PCT/CN2017/099152
Authority: WO
Inventors: 杨焕明; 刘斯奇; 汪建
Original assignee: 深圳华大基因研究院
Priority date: 2017-08-25
Filing date: 2017-08-25
Publication date: 2019-02-28
Also published as: CN111095423A; CN111095423B

Abstract

The present invention discloses an encoding and decoding method, a device and a data processing device relating to the technical field of data processing. The encoding method comprises: performing digital processing on information to generate sequence data (110); dividing the sequence data into N data segments (120), wherein N is an integer greater than 1; searching a genetic database for a nucleic acid segment, corresponding to each data segment, and using position information of the nucleic acid segment in the genetic database as an identifier of each data segment (130); and generating a sequence encoding according to the identifier corresponding to each data segment (140). The method and device can be used to increase the efficiency and security of encryption.

Description

Encoding/decoding method, device and data processing device

Technical field

The present invention relates to the field of data processing technologies, and in particular, to an encoding method, an encoding device, a decoding method, a decoding device, a data processing device, and a computer readable storage medium.

Background technique

With the rapid development of information technology, digital information, carried by digital coding and transmitted over the Internet or other channels, has been widely applied in all aspects of human social life. Protecting the security of digital information has therefore become even more important, especially in special fields such as the military, commerce and medicine.

As an important technical means to ensure the security of digital information, digital encryption technology has received more and more attention. The related technology mainly uses the secret key to convert the plaintext of the information into meaningless ciphertext to achieve the encryption effect.

So far, methods for storing information using DNA require: 1) a computer with a program that encodes and decodes information, storing the information in "computer language" (0 or 1, the binary code of digital information) and then converting it into "biological language" (the nucleotides A, T, C and G in DNA sequences); 2) a DNA synthesizer for storing the "biological language" information in vitro or in vivo; and 3) a DNA sequencer that, after obtaining the "biological language" information, converts it back into "computer language" for further storage. Although this is a fully usable system, the instruments used in steps 2) and 3) are very expensive, and the entire process is time consuming and labor intensive, so it cannot be widely used.

Summary of the invention

The inventors have found that the above-mentioned related art has the following problems: the complicated and cumbersome calculation of information is performed only by a predetermined mathematical method, resulting in low encryption efficiency and low security; and the existing method for storing information by using DNA requires DNA synthesizers and sequencers, which are expensive, and its operation is time consuming and labor intensive. The inventors have proposed a solution to at least one of the above problems.

An object of the present invention is to provide an encoding technology solution with high encryption efficiency and high security, and another object of the present invention is to provide an information storage solution which is simple in operation and low in price.

According to an embodiment of the present invention, there is provided an encoding method comprising: digitizing information to generate sequence data; dividing the sequence data into N data segments, N being an integer greater than 1; for each data segment, searching for a corresponding nucleic acid fragment in a gene database, and using the position information of the nucleic acid fragment in the gene database as an identifier of the data segment; and generating a sequence code according to the identifier corresponding to each data segment.

Optionally, a data segment for which no corresponding nucleic acid fragment is found in the gene database is further divided to obtain M data segments, and the gene database is searched for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.

Optionally, the digitizing process is to transcode the binary code corresponding to the information to generate the sequence data.

Optionally, the sequence data is data consisting of the four deoxyribonucleotides adenine A, cytosine C, guanine G, and thymine T.

Optionally, 0 in the binary code is converted to A or T, and 1 is converted to C or G to generate the sequence data.

Optionally, 01 in the binary code is converted to A, 00 is converted to T, 11 is converted to C, and 10 is converted to G to generate the sequence data.

Optionally, the sequence data is a binary code corresponding to the information.

Optionally, all nucleic acid fragments in the gene database are transcoded into binary code prior to the searching step.

Optionally, A or T in the gene database is converted to binary code 0, and C or G is converted to binary code 1.

Optionally, A in the gene database is converted to binary code 01, T is converted to binary code 00, C is converted to binary code 11, and G is converted to binary code 10.

Optionally, the identifier comprises position information of the first symbol and the last symbol of the nucleic acid fragment in the gene database.

Optionally, the identifier comprises positional information of a first symbol of the nucleic acid fragment in the gene database, and a length of the nucleic acid fragment.

Optionally, the genetic database comprises one or more animal and/or plant and/or microbial genomic data.

Optionally, the gene database comprises wild type genomic data and/or synthetic genomic data.

Optionally, the gene database comprises human genomic data.

According to another embodiment of the present invention, a decoding method is provided, including: obtaining an identifier corresponding to each data segment from the encoded data, where the encoded data is a sequence code generated by the encoding method according to any of the above embodiments; obtaining location information corresponding to each data segment according to the identifier; obtaining a corresponding nucleic acid fragment from a gene database according to the location information; generating sequence data based on the nucleic acid fragments; and obtaining information based on the sequence data.

According to still another embodiment of the present invention, an encoding apparatus is provided, including: an information digitizing module, configured to digitize information to generate sequence data; a data identifier determining module, connected to the information digitizing module and configured to divide the sequence data into N data segments, N being an integer greater than 1, to search for a corresponding nucleic acid fragment in the gene database for each data segment, and to use the position information of the nucleic acid fragment in the gene database as an identifier of the data segment; and a code generation module, connected to the data identifier determining module and configured to generate a sequence code according to the identifier corresponding to each data segment.

Optionally, the data identifier determining module further divides a data segment for which no corresponding nucleic acid fragment is found in the gene database to obtain M data segments, and searches the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.

Optionally, the information digitizing module transcodes the binary code corresponding to the information to generate the sequence data.

Optionally, the information digitizing module converts 0 in the binary code to A or T, and 1 to C or G, to generate the sequence data.

Optionally, the information digitizing module converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G to generate the sequence data.

Optionally, the apparatus further includes a genetic data transcoding module, which is connected to the information digitizing module and the data identifier determining module respectively and is configured to transcode all the nucleic acid fragments in the gene database into binary code.

Optionally, the genetic data transcoding module converts A or T in the gene database into binary code 0, and C or G into binary code 1.

Optionally, the genetic data transcoding module converts A in the gene database into binary code 01, T into binary code 00, C into binary code 11, and G into binary code 10.

Optionally, the identifier includes position information of the first symbol of the nucleic acid fragment in the gene database, and the length of the nucleic acid fragment.

Optionally, the information is at least one of text information, picture information, audio information, or video information.

Optionally, the gene database comprises human genomic data.

According to still another embodiment of the present invention, a decoding apparatus is provided, including: a data identifier obtaining module, configured to acquire an identifier corresponding to each data segment from the encoded data, where the encoded data is a sequence code generated by the encoding method according to any one of the foregoing embodiments or by the encoding device according to any one of the above embodiments; a sequence obtaining module, connected to the data identifier obtaining module and configured to acquire location information corresponding to each data segment according to the identifier and to obtain a corresponding nucleic acid fragment from a gene database according to the location information; and an information generating module, connected to the sequence obtaining module and configured to generate sequence data according to the nucleic acid fragments and to obtain information according to the sequence data.

According to still another embodiment of the present invention, there is provided a data processing apparatus comprising: a memory and a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the encoding method or the decoding method in any of the above embodiments.

According to still another embodiment of the present invention, there is provided a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements an encoding method or a decoding method in any of the above embodiments.

An advantage of the present invention is that the information is encrypted by matching the sequence data of the information to be encrypted to the nucleic acid fragments in the gene database and encoding the corresponding position information as a sequence. Utilizing the ultra-high storage density of nucleic acids and a unique intermolecular recognition mechanism, encryption can be completed without complicated and cumbersome mathematical calculation of information, thereby improving encryption efficiency and security.

Another advantage of the present invention is that using it for information storage eliminates the need for expensive DNA synthesizers and sequencers; only a computer having the associated programs for encoding and decoding information is required to store information in nucleotide sequences, including wild-type or synthetic genomes of humans or other species. The storage capacity is not limited, allowing for the storage of an unlimited amount of information.

DRAWINGS

The accompanying drawings, which form a part of the specification, illustrate embodiments of the present invention.

The invention may be more clearly understood from the following detailed description, in which:

FIG. 1 shows a flow chart of one embodiment of the encoding method of the present invention.

FIG. 2 shows a schematic diagram of one embodiment of the encoding/decoding method of the present invention.

FIG. 3 shows a flow chart of one embodiment of the decoding method of the present invention.

FIG. 4 is a block diagram showing an embodiment of an encoding apparatus of the present invention.

FIG. 5 is a block diagram showing an embodiment of a decoding apparatus of the present invention.

FIG. 6 is a block diagram showing an embodiment of a data processing device of the present invention.

Detailed description

Various exemplary embodiments of the present invention will now be described in detail with reference to the drawings. It should be noted that the relative arrangement of the components and steps, numerical expressions and numerical values set forth in the embodiments are not intended to limit the scope of the invention unless otherwise specified.

At the same time, it should be understood that, for convenience of description, the various parts shown in the drawings are not drawn to actual scale.

The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention, its application, or its uses.

Techniques, methods and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods and apparatus should be considered part of the granted specification.

In all of the examples shown and discussed herein, any specific values are to be construed as illustrative only and not as a limitation. Accordingly, other examples of the exemplary embodiments may have different values.

It should be noted that similar reference numerals and letters indicate similar items in the following figures, and therefore, once an item is defined in one figure, it is not required to be further discussed in the subsequent figures.

As shown in FIG. 1, in step 110, the information is digitized to generate sequence data.

In one embodiment, the digitizing process can include converting the information to a binary code. The binary code is then transcoded to generate sequence data, which may be a series of data arranged in order. The information may be in any form, such as text information, image information, or audio information. The specific steps of the encoding method of the present invention are described below by taking text information as an example.

As shown in FIG. 2, the information to be processed is text information 21, "What I cannot create, I do not understand. Look deep into nature, and then you will understand everything better." This text information is converted into binary code 22, "01010111011010000110000101110100...". Each 0 in this binary code can be converted to A or T, and each 1 to C or G, to generate sequence data 23, "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA...". A, C, G, and T correspond to adenine, cytosine, guanine, and thymine in DNA (deoxyribonucleic acid), respectively. Other forms of sequence data may also be generated according to other conversion modes between 1, 0 and A, C, G, T.
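The conversion in this example can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the inventors: the function names are hypothetical, the text is assumed to be encoded as 8-bit bytes, and the choice between A and T (or C and G) is made at random here because the description leaves that choice open.

```python
import random


def text_to_bits(text: str) -> str:
    """Return the bytes of the text as a string of 8-bit binary groups."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_bases(bits: str, rng: random.Random) -> str:
    """Map each 0 to A or T and each 1 to C or G.

    Assumption: the base within each pair is picked at random.
    """
    choices = {"0": "AT", "1": "CG"}
    return "".join(rng.choice(choices[bit]) for bit in bits)


def bases_to_bits(bases: str) -> str:
    """Inverse mapping: A or T becomes 0, C or G becomes 1."""
    return "".join("0" if base in "AT" else "1" for base in bases)


bits = text_to_bits("What")
sequence = bits_to_bases(bits, random.Random(0))
```

Applying `bases_to_bits` to the sequence data 23 quoted above, "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA", gives back the first 32 bits of binary code 22, which are the bits of "What".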

In another embodiment, the sequence data can be a binary code corresponding to the information. In this case, all gene segments in the gene database need to be transcoded into binary code so that the binary code corresponding to the information can be found in the transformed gene database.
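A small sketch of this alternative, in which the database rather than the information is transcoded, might look as follows. The toy database string and the variable names are assumptions made for illustration; the mapping A or T to 0 and C or G to 1 is the one given in the summary.

```python
def bases_to_bits(bases: str) -> str:
    """Transcode nucleotides to binary: A or T becomes 0, C or G becomes 1."""
    return "".join("0" if base in "AT" else "1" for base in bases)


# Hypothetical toy database; a real gene database would be far larger.
toy_database = "TTAGACGGATCCAT"
binary_database = bases_to_bits(toy_database)

# Search for a binary data segment directly in the transcoded database.
segment = "0101"
position = binary_database.find(segment) + 1  # 1-based; 0 means "not found"
```

With this toy database the segment is found starting at the third character of the transcoded database.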

Through the above steps, any form of information can be mapped to the data stored in DNA, thereby linking the information with the genetic database, providing the necessary technical basis for the encryption of information. Further, the following steps can be used to encrypt and store information.

In step 120, the sequence data is divided into N data segments, N being an integer greater than one.

In step 130, for each data segment, the corresponding nucleic acid fragment is looked up in the gene database, and the position information of the nucleic acid fragment in the gene database is used as the identifier of the data segment. For example, the positions of the first and last symbols of the matching nucleic acid fragment may be saved as the identifier; alternatively, the position of the first symbol and the length of the nucleic acid fragment may be saved as the identifier.
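The two identifier forms can be illustrated with the following sketch. The function names and the toy database are hypothetical, positions are counted from 1 as in the worked example of FIG. 2, and `str.find` returns the first match, whereas the method itself does not prescribe which occurrence to use.

```python
from typing import Optional, Tuple


def find_identifier(database: str, segment: str) -> Optional[Tuple[int, int]]:
    """Return (start position, length) of the first match, or None if absent."""
    index = database.find(segment)
    if index < 0:
        return None
    return (index + 1, len(segment))  # 1-based start position


def to_start_end(identifier: Tuple[int, int]) -> Tuple[int, int]:
    """Convert a (start, length) identifier to the (first, last) position form."""
    start, length = identifier
    return (start, start + length - 1)


toy_database = "CCATAGGT"  # hypothetical, for illustration only
identifier = find_identifier(toy_database, "AG")
```

Both forms carry the same information, so either can be used as long as encoder and decoder agree on it.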

A nucleic acid fragment refers to a fragment formed by a plurality of nucleotides linked end to end, and the nucleotide may be a deoxyribonucleotide or a ribonucleotide. The nucleic acid fragment can be transcoded into a binary code according to certain rules as needed, and the nucleic acid fragment after transcoding refers to the binary code corresponding to the nucleic acid fragment.

The length of the nucleic acid fragment can be expressed by the number of nucleotides, that is, "nt"; each nucleotide is regarded as one character in the present invention, so the length can also be expressed by the number of characters. When the nucleic acid fragment has been transcoded into a binary code as described above, its length is expressed in bytes.

In the above two steps, the longer each data segment (that is, the smaller the value of N for sequence data of a given length), the higher the storage efficiency of the code, but the smaller the probability of finding the corresponding nucleic acid fragment in the gene database. Therefore, the value of N can be adjusted according to the results of searching for nucleic acid fragments in the gene database.

In one embodiment, if no nucleic acid fragment corresponding to a data segment is found in the gene database, the data segment can be further divided to obtain M data segments, and the gene database is searched for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than one.

The length of each re-divided data segment is smaller than the length of the original data segment, so that a nucleic acid fragment corresponding to it can be found in the gene database. For example, a data segment for which no corresponding nucleic acid fragment is found in the gene database can be divided into multiple parts, and the gene database is searched for a nucleic acid fragment corresponding to each part, which improves the probability of a match and the efficiency of the search.

In one embodiment, as shown in FIG. 2, the gene database 24 may be the nucleotide sequence of the human nuclear pore-reporting protein gene (SEQ ID NO: 1), a database containing 4103 characters. The sequence data 23 is divided into a plurality of data segments, each of which contains 2 characters. For each data segment, an identical nucleic acid fragment is looked up in the gene database 24.

If an identical nucleic acid fragment is found, the position of its first character and its length are recorded as the identifier. For example, the data segment composed of the first two characters AG of the sequence data has the identifier 3856 2; that is, AG corresponds to the nucleic acid fragment in the gene database that starts at the 3856th character and has a length of 2 characters.

If no identical nucleic acid fragment is found, the length of the data segment is reduced to 1 character and an identical nucleic acid fragment is looked up in the gene database 24 again. For example, if the data segment AC composed of the 3rd and 4th characters of the sequence data has no identical nucleic acid fragment in the gene database 24, a new data segment A is formed from the 3rd character alone. The identifier of this data segment is 3827 1; that is, A corresponds to the nucleic acid fragment in the gene database that starts at the 3827th character and has a length of 1 character.
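The search with fallback described in the last paragraphs can be sketched as below. This is one possible reading, under stated assumptions: unmatched segments are split in half recursively (the description only requires that the parts be shorter), the first occurrence returned by `str.find` is used, and the function names and toy database are hypothetical.

```python
from typing import List, Tuple

Identifier = Tuple[int, int]  # (1-based start position, length)


def encode_segment(segment: str, database: str) -> List[Identifier]:
    """Encode one data segment, splitting it further if no match is found."""
    index = database.find(segment)
    if index >= 0:
        return [(index + 1, len(segment))]
    if len(segment) == 1:
        raise ValueError(f"symbol {segment!r} does not occur in the database")
    middle = len(segment) // 2  # assumption: split unmatched segments in half
    return (encode_segment(segment[:middle], database)
            + encode_segment(segment[middle:], database))


def encode(sequence: str, database: str, segment_length: int = 2) -> List[Identifier]:
    """Divide the sequence data into fixed-length segments and encode each one."""
    identifiers: List[Identifier] = []
    for i in range(0, len(sequence), segment_length):
        identifiers.extend(encode_segment(sequence[i:i + segment_length], database))
    return identifiers


toy_database = "TTAGGCA"  # hypothetical; contains AG but not AC
identifiers = encode("AGAC", toy_database)
```

Here AG is matched as a whole, while AC is not found and is therefore encoded as the two single characters A and C, mirroring the AG and AC cases of the example above.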

In step 140, a sequence code is generated based on the identifiers of the respective data segments. As shown in FIG. 2, the identifiers of the data segments can be stored in order to obtain the sequence code 25 corresponding to the information: "3856 2 3827 1 3856 1 1313 1 3275 1 1079 1 3906 1 1078 1 3856 2 853 1 949 1 3229 1 2600 1 3755 1 2496 1 714 1 2518 1 2736 1 1713 1 1789 1 1291 1 2153 1 3601 2 1159 1 537 1 2660 1 1962 1 375 1 892 1 1309 1 2620 1 2736 1...". It is also possible to add identification bits to the identifier of each data segment to indicate its position in the original order, and then store the identifiers as a sequence code in any order.
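Both storage options can be sketched as follows. The space-separated text format follows sequence code 25; the use of an integer order tag and a random shuffle is an assumption made for illustration, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

Identifier = Tuple[int, int]  # (start position, length)


def to_sequence_code(identifiers: List[Identifier]) -> str:
    """Store the identifiers in order as 'position length position length ...'."""
    return " ".join(f"{start} {length}" for start, length in identifiers)


def tag_and_shuffle(identifiers: List[Identifier],
                    rng: random.Random) -> List[Tuple[int, int, int]]:
    """Prefix each identifier with its order, then store in arbitrary order."""
    tagged = [(order, start, length)
              for order, (start, length) in enumerate(identifiers)]
    rng.shuffle(tagged)
    return tagged


def restore_order(tagged: List[Tuple[int, int, int]]) -> List[Identifier]:
    """Recover the original order from the order tags."""
    return [(start, length) for _, start, length in sorted(tagged)]


# First four identifiers of sequence code 25 in FIG. 2.
identifiers = [(3856, 2), (3827, 1), (3856, 1), (1313, 1)]
code = to_sequence_code(identifiers)
```

Because every tagged entry carries its own order, the decoder can sort the entries before looking up the nucleic acid fragments.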

The gene database 24 in the embodiment shown in FIG. 2 has a small capacity, so the length of the divided data segments is also relatively small; the embodiment only illustrates the implementation process of the method by way of example. In practical applications, a gene database storing a large number of gene sequences can be used as the database of the encoding method, such as wild-type or synthetic human genomic data, bacterial genomic data, or a combined database of genomic data from multiple species. These gene databases contain billions of nucleotides and can fully support searches for data segments tens or even hundreds of bits long, so that such data segments are encoded with short identifiers. The sequence code contains only the identifier of each data segment, which not only realizes information encryption but also improves storage efficiency.

The encoded data composed of the sequence encoding can be decoded by the inverse of the above steps.

As shown in FIG. 3, in step 310, an identifier corresponding to each data segment is obtained from the encoded data. For example, the encoded data may be the sequence code 25 in FIG. 2.

In step 320, location information corresponding to each data segment is obtained according to the identifier.

In step 330, a corresponding nucleic acid fragment is obtained from the gene database based on the location information. For example, the identifier 3856 2 in the sequence code 25 in FIG. 2 represents a nucleic acid fragment of the gene database 24 starting with the 3856th character and having a length of 2 characters.

In step 340, sequence data is generated from the nucleic acid fragments.

For example, the acquired nucleic acid fragments can be combined to obtain the sequence data 23 "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA..." in FIG. 2. The sequence data 23 is transcoded into the binary code 22 "01010111011010000110000101110100..." according to the transcoding relationship between A, C, G, T and 1, 0 employed in encoding.

In step 350, information is obtained from the sequence data. For example, the binary code 22 can be decoded into text information 21 "What I cannot create, I do not understand", thereby completing the decryption.
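Steps 310 to 350 can be sketched as a single function. The sketch assumes identifiers of the form "position length" with 1-based positions, the mapping A or T to 0 and C or G to 1, and 8-bit text; the function name and the toy database are hypothetical and are not taken from FIG. 2.

```python
def decode(sequence_code: str, database: str) -> str:
    """Recover text from a sequence code of 'position length' pairs."""
    numbers = [int(token) for token in sequence_code.split()]
    if len(numbers) % 2:
        raise ValueError("sequence code must contain position/length pairs")
    # Steps 310-330: read each identifier and fetch the matching fragment.
    fragments = [database[start - 1:start - 1 + length]
                 for start, length in zip(numbers[0::2], numbers[1::2])]
    # Step 340: join the fragments and transcode them back to binary.
    sequence = "".join(fragments)
    bits = "".join("0" if base in "AT" else "1" for base in sequence)
    # Step 350: regroup the bits into bytes and decode them as text.
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


toy_database = "TTAGACTGGCAA"  # hypothetical; contains AGACTGGC at position 3
message = decode("3 2 5 2 7 2 9 2", toy_database)
```

The four identifiers select AG, AC, TG and GC, which together form AGACTGGC, the first eight characters of sequence data 23, and decode to the letter "W".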

In the above embodiment, the sequence data of the information to be encrypted is matched to nucleic acid fragments in the gene database, and a sequence code is generated from the corresponding position information, thereby realizing the encryption of the information. By using the ultra-high storage density of genes and the unique intermolecular recognition mechanism, encryption can be completed without complicated and cumbersome mathematical calculation of information, thereby improving encryption efficiency and security.

Moreover, in the above embodiment, information storage does not require an expensive DNA synthesizer or sequencer; only a computer having the related programs for encoding and decoding information is required to store information in nucleotide sequences, including wild-type or synthetic genomes of humans or other species. The storage capacity is not limited, allowing for the storage of an unlimited amount of information.

As shown in FIG. 4, the encoding apparatus includes an information digitization module 41, a data identification determination module 42, and an encoding generation module 43.

The information digitization module 41 digitizes the information to generate sequence data.

In one embodiment, the information digitization module 41 transcodes the binary code corresponding to the information to generate sequence data. For example, the information digitization module 41 converts 0 in the binary code to A or T and 1 to C or G to generate sequence data, or converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G to generate sequence data. In this case, the sequence data is data composed of A, C, G, and T.
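The second, two-bit mapping is fully determined by the text, so it can be sketched without any free choices. The dictionary and function names are hypothetical, and the input is assumed to contain an even number of bits.

```python
PAIR_TO_BASE = {"01": "A", "00": "T", "11": "C", "10": "G"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}


def bits_to_bases_2bit(bits: str) -> str:
    """Convert each pair of bits to one nucleotide (01->A, 00->T, 11->C, 10->G)."""
    if len(bits) % 2:
        raise ValueError("the binary code must contain an even number of bits")
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bits_2bit(bases: str) -> str:
    """Inverse conversion, as used when transcoding the gene database."""
    return "".join(BASE_TO_PAIR[base] for base in bases)


sequence = bits_to_bases_2bit("01010111")  # the 8 bits of the letter "W"
```

Compared with the one-bit mapping, each nucleotide here carries two bits, so the sequence data is half as long, and the conversion is unambiguous in both directions.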

In another embodiment, the apparatus further includes a genetic data transcoding module 44. In the case where the sequence data is a binary code corresponding to the information, the gene data transcoding module 44 transcodes all of the nucleic acid fragments in the gene database into a binary code.

The data identification determining module 42 divides the sequence data into N data segments, N being an integer greater than 1; for each data segment, it searches for a corresponding nucleic acid fragment in the gene database and uses the position information of the nucleic acid fragment in the gene database as the identifier of the data segment. For example, the identifier may include the position information of the first symbol and the last symbol of the nucleic acid fragment in the gene database, or the identifier may include the position information of the first symbol of the nucleic acid fragment in the gene database and the length of the nucleic acid fragment.

In one embodiment, the data identification determining module 42 further divides a data segment for which no corresponding nucleic acid fragment is found in the gene database to obtain M data segments, and searches the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than one.

The code generation module 43 generates a sequence code based on the identifier corresponding to each data segment. For example, the sequence encoding may be sequentially generated in the order in which the respective data segments are divided.

As shown in FIG. 5, the decoding apparatus includes: a data identifier acquisition module 51, a sequence acquisition module 52, and an information generation module 53.

The data identifier obtaining module 51 acquires an identifier corresponding to each data segment from the encoded data, and the encoded data is a sequence encoding generated by the encoding method in the above embodiment or by the encoding device in the above embodiment.

The sequence obtaining module 52 acquires location information corresponding to each data segment according to the identifier, and acquires a corresponding nucleic acid fragment from the gene database according to the location information.

The information generating module 53 generates sequence data based on the nucleic acid fragments and acquires information based on the sequence data.

In the above embodiment, information storage does not require an expensive DNA synthesizer or sequencer; only a computer having a program for encoding and decoding information is required to store information in nucleotide sequences, including wild-type or synthetic genomes of humans or other species. The storage capacity is not limited, allowing for the storage of an unlimited amount of information.

As shown in FIG. 6, the apparatus 6 of this embodiment includes a memory 61 and a processor 62 coupled to the memory 61, the processor 62 being configured to perform, based on instructions stored in the memory 61, the encoding method or decoding method in any one of the embodiments of the present invention.

The memory 61 may include, for example, a system memory, a fixed non-volatile storage medium, or the like. The system memory stores, for example, an operating system, an application, a boot loader, a database, and other programs.

Those skilled in the art will appreciate that embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Heretofore, the encoding/decoding method, apparatus, and data processing apparatus according to the present invention have been described in detail. In order to avoid obscuring the concepts of the present invention, some details known in the art are not described. Those skilled in the art can fully understand how to implement the technical solutions disclosed herein according to the above description.

The methods and systems of the present invention may be implemented in a number of ways. For example, the methods and systems of the present invention can be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described sequence of steps for the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless otherwise specifically stated. Moreover, in some embodiments, the invention may also be embodied as a program recorded in a recording medium, the program comprising machine readable instructions for implementing the method according to the invention. Thus, the invention also covers a recording medium storing a program for performing the method according to the invention.

While the invention has been described in detail with reference to specific embodiments, it will be appreciated by those skilled in the art that the above embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

An encoding method comprising:

digitizing information to generate sequence data;

dividing the sequence data into N data segments, where N is an integer greater than one;

searching for a corresponding nucleic acid fragment in a gene database for each data segment, and using the position information of the nucleic acid fragment in the gene database as an identifier of each data segment; and

generating a sequence code according to the identifier corresponding to each data segment.
The encoding method according to claim 1, wherein the searching step comprises:

performing further data division on a data segment for which no corresponding nucleic acid fragment is found in the gene database to obtain M data segments, and searching the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than one.
The encoding method according to claim 1, wherein said digitizing processing is to transcode a binary code corresponding to said information to generate said sequence data.
The encoding method according to claim 3, wherein the sequence data is data composed of four deoxyribonucleotides of adenine A, cytosine C, guanine G, and thymine T.
The encoding method according to claim 4, wherein 0 in the binary code is converted to A or T, and 1 is converted to C or G to generate the sequence data.
The encoding method according to claim 4, wherein 01 in the binary code is converted to A, 00 is converted to T, 11 is converted to C, and 10 is converted to G to generate the sequence data.
The encoding method according to claim 1, wherein said sequence data is a binary code corresponding to said information.
The encoding method according to claim 7, further comprising, before the searching step:

Transcoding all nucleic acid fragments in the gene database into binary code.
The encoding method according to claim 8, wherein A or T in said gene database is converted into binary code 0, and C or G is converted into binary code 1.
The encoding method according to claim 8, wherein A in said gene database is converted into binary code 01, T is converted into binary code 00, C is converted into binary code 11, and G is converted into binary code 10.
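For the variant of claims 7 to 9, in which the sequence data stays binary and the database is transcoded instead, a minimal sketch might look like this. It assumes the one-bit mapping of claim 9 (A or T to 0, C or G to 1) and a database held as a single string; `database_to_bits` and `find_bits` are hypothetical names.

```python
# One-bit mapping from claim 9: A or T becomes 0, C or G becomes 1.
BASE_TO_BIT = str.maketrans("ATCG", "0011")


def database_to_bits(database: str) -> str:
    """Transcode a nucleotide database into a string of '0' and '1'."""
    return database.translate(BASE_TO_BIT)


def find_bits(bit_segment: str, database: str) -> int:
    """Return the 0-based position of `bit_segment` in the transcoded
    database, or -1 if it does not occur."""
    return database_to_bits(database).find(bit_segment)


if __name__ == "__main__":
    toy_database = "ACGTTGCA"
    print(database_to_bits(toy_database))   # 01100110
    print(find_bits("1001", toy_database))  # 2
```

With the one-bit mapping each base yields exactly one bit, so a position in the transcoded database is also a position in the original nucleotide database.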
The encoding method according to any one of claims 1 to 10, wherein the identifier comprises the positional information of the first symbol and the last symbol of the nucleic acid fragment in the gene database.
The encoding method according to any one of claims 1 to 10, wherein the identifier comprises positional information of the first symbol of the nucleic acid fragment in the gene database, and a length of the nucleic acid fragment.
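The two identifier forms of claims 11 and 12 carry the same information, as the following sketch shows. It assumes 0-based positions and an inclusive last position, neither of which is fixed by the claims; the type and function names are hypothetical.

```python
from typing import NamedTuple


class SpanId(NamedTuple):
    """Claim 11 form: positions of the first and last symbol (inclusive)."""
    first: int
    last: int


class LengthId(NamedTuple):
    """Claim 12 form: position of the first symbol plus fragment length."""
    first: int
    length: int


def to_length_id(span: SpanId) -> LengthId:
    return LengthId(span.first, span.last - span.first + 1)


def to_span_id(ident: LengthId) -> SpanId:
    return SpanId(ident.first, ident.first + ident.length - 1)


def fetch(database: str, ident: LengthId) -> str:
    """Read the fragment that an identifier points to."""
    return database[ident.first:ident.first + ident.length]


if __name__ == "__main__":
    toy_database = "ACGTTGCAAGCT"
    ident = to_length_id(SpanId(first=5, last=8))
    print(ident, fetch(toy_database, ident))
```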
The encoding method according to any one of claims 1 to 10, wherein the information is at least one of text information, picture information, audio information, or video information.
The encoding method according to any one of claims 1 to 10, wherein the gene database comprises genomic data of one or more animals, plants, and/or microorganisms.
The encoding method according to claim 14, wherein the gene database comprises wild type genomic data and/or synthetic genomic data.
The encoding method according to claim 15, wherein said gene database comprises human genome data.
A decoding method comprising:

Acquiring an identifier corresponding to each data segment from the encoded data, the encoded data being a sequence code generated by the encoding method according to any one of claims 1-16;

Obtaining position information corresponding to each data segment according to the identifier;

Obtaining a corresponding nucleic acid fragment from the gene database according to the position information;

Generating sequence data based on the nucleic acid fragments;

Obtaining information based on the sequence data.
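The decoding steps above can be sketched as follows, assuming (start, length) identifiers with 0-based positions, a database held as a single string, and sequence data produced with the two-bit mapping of claim 6. The function names are hypothetical, and the sender and receiver are assumed to share the same database.

```python
# Inverse of the claim 6 mapping: A is 01, T is 00, C is 11, G is 10.
BASE_TO_TWO_BIT = {"A": "01", "T": "00", "C": "11", "G": "10"}


def identifiers_to_sequence(identifiers, database: str) -> str:
    """Look up each fragment by position and concatenate the results."""
    return "".join(database[start:start + length] for start, length in identifiers)


def sequence_to_bytes(sequence: str) -> bytes:
    """Convert nucleotide sequence data back into the original bytes."""
    if len(sequence) % 4 != 0:
        raise ValueError("expected a multiple of four bases")
    bits = "".join(BASE_TO_TWO_BIT[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def decode(identifiers, database: str) -> bytes:
    return sequence_to_bytes(identifiers_to_sequence(identifiers, database))


if __name__ == "__main__":
    toy_database = "ACGTTGCAATTAGC"
    # Two identifiers pointing at "AT" (position 8) and "TA" (position 10).
    print(decode([(8, 2), (10, 2)], toy_database))  # b'A'
```

Without the database, the identifiers are only a list of positions, which is the basis of the security the abstract attributes to the scheme.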
An encoding device comprising:

An information digitizing module for digitizing information to generate sequence data;

A data identifier determining module, connected to the information digitizing module and configured to divide the sequence data into N data segments, where N is an integer greater than 1, to search the gene database for a nucleic acid fragment corresponding to each data segment, and to use the position information of the nucleic acid fragment in the gene database as an identifier of each data segment;

A code generation module, connected to the data identifier determining module and configured to generate a sequence code according to the identifier corresponding to each data segment.
The encoding device according to claim 18, wherein

The data identifier determining module performs further data partitioning on any data segment for which no corresponding nucleic acid fragment is found in the gene database, obtains M data segments, and searches the gene database for the nucleic acid fragment corresponding to each of the M data segments, where M is an integer greater than one.
The encoding device according to claim 18, wherein said information digitizing module transcodes a binary code corresponding to said information to generate said sequence data.
The encoding device according to claim 20, wherein said sequence data is data composed of the four deoxyribonucleotides adenine (A), cytosine (C), guanine (G), and thymine (T).
The encoding device according to claim 21, wherein

The information digitizing module converts 0 in the binary code to A or T, and 1 to C or G, to generate the sequence data.
The encoding device according to claim 21, wherein

The information digitizing module converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G to generate the sequence data.
The encoding device according to claim 18, wherein said sequence data is a binary code corresponding to said information.
The encoding device of claim 18, further comprising:

A gene data transcoding module, connected to the information digitizing module and to the data identifier determining module, and configured to transcode all nucleic acid fragments in the gene database into binary code.
The encoding device according to claim 25, wherein

The gene data transcoding module converts A or T in the gene database into binary code 0, and C or G into binary code 1.
The encoding device according to claim 25, wherein

The gene data transcoding module converts A in the gene database into binary code 01, T into binary code 00, C into binary code 11, and G into binary code 10.
The encoding device according to any one of claims 18 to 27, wherein the identifier comprises position information of the first symbol and the last symbol of the nucleic acid fragment in the gene database.
The encoding device according to any one of claims 18 to 27, wherein the identifier comprises positional information of the first symbol of the nucleic acid fragment in the gene database, and a length of the nucleic acid fragment.
The encoding device according to any one of claims 18 to 27, wherein the information is at least one of text information, picture information, audio information, or video information.
The encoding device according to any one of claims 18 to 27, wherein the gene database comprises genomic data of one or more animals, plants, and/or microorganisms.
The encoding device according to claim 31, wherein said gene database comprises wild type genomic data and/or synthetic genomic data.
The encoding device according to claim 32, wherein said gene database comprises human genome data.
A decoding device comprising:

A data identifier obtaining module, configured to obtain, from the encoded data, an identifier corresponding to each data segment, the encoded data being a sequence code generated by the encoding method according to any one of claims 1 to 16 or by the encoding device according to any one of claims 18 to 33;

A sequence acquisition module, connected to the data identifier obtaining module and configured to acquire position information corresponding to each data segment according to the identifier, and to obtain a corresponding nucleic acid fragment from the gene database according to the position information;

An information generating module, connected to the sequence acquisition module and configured to generate sequence data according to the nucleic acid fragment, and to obtain information according to the sequence data.
A data processing device comprising:

A memory;

A processor coupled to the memory, the processor being configured to perform, according to instructions stored in the memory, the encoding method of any one of claims 1-16 or the decoding method of claim 17.
A computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the encoding method of any of claims 1-16 or the decoding method of claim 17.