CN114974434A - Information coding method and device based on DNA storage, computer equipment and medium - Google Patents

Information coding method and device based on DNA storage, computer equipment and medium

Info

Publication number
CN114974434A
Authority
CN
China
Prior art keywords
dna
information
binary sequence
base
binary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210432503.0A
Other languages
Chinese (zh)
Inventor
黄奕翼
戴俊彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS
Priority to CN202210432503.0A
Priority to PCT/CN2022/091100 (WO2023201782A1)
Publication of CN114974434A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30 Data warehousing; Computing architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioethics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an information coding method, device, computer equipment and medium based on DNA storage. The method comprises: acquiring a file stored in the form of a binary sequence; based on a binary-DNA conversion mapping table, dividing the binary sequence in units of bytes to obtain binary sequence pieces, mapping each binary sequence piece to a base fragment, and combining all the base fragments in order to form DNA information, wherein each base fragment meets the following composition conditions: its length is five; with t denoting the total number of G and C bases, 0.4 ≤ t/5 ≤ 0.6; there is no repeated base at either boundary; and there are no three consecutive identical bases in the middle; and storing the file in the form of the DNA information. The method strictly ensures that the longest run of consecutive identical bases in the encoded DNA information is at most 2 and that the GC content is between 0.4 and 0.6; the algorithm has linear complexity, and the net information density is 1.60.

Description

Information coding method and device based on DNA storage, computer equipment and medium
Technical Field
The invention relates to the technical field of information storage, in particular to a method, a device, computer equipment and a medium for encoding information based on DNA storage.
Background
With the rapid development of information and digital technologies such as the internet and artificial intelligence, the amount of information is growing exponentially, and traditional storage media such as magnetic disks, hard disks and flash memory can no longer meet worldwide data storage requirements. DNA has natural advantages as a storage medium: first, its information density is high; according to an earlier estimate by Microsoft Research, 1 cubic millimeter of DNA can store 1 EB (exabyte) of data; second, its storage life is long and its stability is strong, so that it can be preserved for thousands of years under suitable conditions; third, the energy consumption of storage is very low.
DNA is a biological macromolecule with a defined sequence. DNA storage technology encodes information into a base sequence and stores that sequence in DNA. The DNA can then be copied and sequenced, and the sequence decoded to read back the information. During these processes, DNA with long runs of repeated bases or extreme GC content is prone to errors in synthesis, replication and sequencing. How to quickly and efficiently avoid long runs of repeated bases and extreme GC content in the DNA sequence produced by information coding has become an urgent problem.
Disclosure of Invention
The embodiment of the invention provides an information coding method, an information coding device, computer equipment and a medium based on DNA storage, and aims to solve the problem of how to avoid long repeated bases or extreme GC content of a DNA sequence after information coding.
A method for encoding information based on DNA storage, comprising:
acquiring a file stored in a binary sequence form;
based on a binary-DNA conversion mapping table, dividing the binary sequence in units of bytes to obtain binary sequence pieces, mapping each binary sequence piece to a base fragment, and combining all the base fragments in order to form DNA information, wherein each base fragment meets the following composition conditions: its length is five; with t denoting the total number of G and C bases, 0.4 ≤ t/5 ≤ 0.6; there is no repeated base at either boundary; and there are no three consecutive identical bases in the middle;
storing the file in the form of the DNA information.
Further, the method includes, after all the base fragments are combined in sequence to form DNA information:
acquiring a DNA information decoding request, and acquiring corresponding DNA information based on the DNA information decoding request;
converting the DNA information into a binary sequence based on the binary-DNA conversion mapping table, and restoring the binary sequence to a file for storage.
Further, after the file is stored in the form of DNA information, the method further includes:
acquiring a timing task, and updating the mapping relation in the binary-DNA conversion mapping table when the system time satisfies the timing task;
or,
establishing, for the mapping relation, a public mapping relation for public documents and a private mapping relation for private documents.
Further, before acquiring the file stored in the form of a binary sequence, the method further includes:
determining a byte of eight bits as the division unit of the binary sequence pieces, based on the principle that the number of kinds of base fragments is greater than or equal to the number of kinds of binary sequence pieces and that conversion is convenient;
determining the base fragments corresponding to the 2^8 binary sequence pieces according to the bit count of 8.
Further, determining the base fragments corresponding to the 2^8 binary sequence pieces comprises:
taking any one of the four bases as the first base of each base fragment, and continuing to determine the second through last bases according to the composition conditions of the base fragment;
taking any one of the remaining three bases as the second base of each base fragment, and repeating the step of determining the second through last bases according to the composition conditions until the number of kinds of base fragments equals the number of kinds of binary sequence pieces.
Further, converting the file into a DNA base sequence code and storing it comprises:
synthesizing the DNA base sequence code into a DNA solution or dry powder with a DNA synthesizer, and storing it.
An information encoding apparatus based on DNA storage, comprising:
the binary sequence file acquisition module is used for acquiring a file stored in a binary sequence form;
a DNA information forming module, used for dividing the binary sequence in units of bytes to obtain binary sequence pieces based on a binary-DNA conversion mapping table, mapping each binary sequence piece to a base fragment, and combining all the base fragments in order to form DNA information, wherein each base fragment meets the following composition conditions: its length is five; with t denoting the total number of G and C bases, 0.4 ≤ t/5 ≤ 0.6; there is no repeated base at either boundary; and there are no three consecutive identical bases in the middle;
and the DNA information storage module is used for storing the file in the form of the DNA information.
A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the above-mentioned method for encoding information based on DNA storage when executing said computer program.
A computer-readable medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method for encoding information based on DNA storage.
According to the information coding method, device, computer equipment and medium based on DNA storage, the file is converted into DNA information for storage under the above composition conditions, so that the longest run of consecutive identical bases in all the encoded DNA information is only 2 and the GC content is between 0.4 and 0.6. The algorithm has linear complexity and the Net Information Density (NID) is 1.60. This effectively safeguards the DNA synthesis, replication and sequencing processes, improves DNA synthesis efficiency, increases the information density of storage, prolongs the stable storage time of the information, and reduces storage energy consumption.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram illustrating an application environment of a DNA storage-based information encoding method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for encoding information based on DNA storage according to an embodiment of the present invention;
FIG. 3 is a first flowchart of a method for encoding information based on DNA storage according to another embodiment of the present invention;
FIG. 4 is a schematic diagram showing the whole flow from encoding to decoding of an information encoding method based on DNA storage according to another embodiment of the present invention;
FIG. 5 is a second flowchart of a method for encoding information based on DNA storage according to another embodiment of the present invention;
FIG. 6 is a third flowchart of a method for encoding information based on DNA storage according to another embodiment of the present invention;
FIG. 7 is a fourth flowchart of a method for encoding information based on DNA storage according to another embodiment of the present invention;
FIG. 8 is a schematic diagram of an information encoding apparatus based on DNA storage according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer apparatus according to an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The information coding method based on DNA storage provided by the embodiment of the invention can be applied in the application environment shown in FIG. 1. The method is applied to an information coding system based on DNA storage, which comprises a client and a server, the client communicating with the server through a network. The client, also called the user side, is a program that corresponds to the server and provides local services to the user. Further, the client may be a computer program, an app on a smart device, or a third-party applet embedded in another app. The client can be installed on computer equipment such as, but not limited to, personal computers, notebook computers, smart phones, tablet computers and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
Conventional storage technologies store information as binary sequences (i.e., sequences composed of 0 and 1) on media such as hard disks, optical disks, USB drives and CDs; files such as music, pictures and videos are all binary sequences at the lowest level of a computer. Due to their mechanical nature, these media fail after a certain number of reads or writes. In addition, they have low information density and large volume, and can hardly satisfy storage requirements given the explosive growth of global information.
In an embodiment, as shown in fig. 2, a method for encoding information based on DNA storage is provided, which is described by taking the example that the method is applied to the server in fig. 1, and specifically includes the following steps:
S10. Acquiring a file stored in the form of a binary sequence.
The binary sequence is the content of the file expressed in binary form, that is, as a sequence consisting only of 0 and 1.
Specifically, this embodiment can perform DNA encoding on binary-stored documents of various forms, including voice, text, image, music and the like; the form is not specifically limited here.
S20. Based on a binary-DNA conversion mapping table, dividing the binary sequence in units of bytes to obtain binary sequence pieces, mapping each binary sequence piece to a base fragment, and combining all the base fragments in order to form DNA information, wherein each base fragment meets the following composition conditions: its length is five; with t denoting the total number of G and C bases, 0.4 ≤ t/5 ≤ 0.6; there is no repeated base at either boundary; and there are no three consecutive identical bases in the middle.
The binary-DNA conversion mapping table is a table in which binary codes and DNA codes are mapped to each other. For example, one entry in the table is 00000000 <--> ATACG, which means that the binary data 00000000 is mapped by this embodiment to the DNA-encoded form ATACG for storage.
The DNA base sequence code is the code obtained by combining all the base fragments, each consisting of n bases.
Specifically, four kinds of nitrogenous bases make up DNA: adenine (A), guanine (G), thymine (T) and cytosine (C). In this embodiment, a file stored as a binary sequence is converted, by means of the binary-DNA conversion mapping table, into DNA information composed of base fragments of n bases each, and the DNA information is stored.
This embodiment can use 8 bits, that is, one byte, as the unit of the binary sequence. For example, the continuous binary source code of the file may be split by bytes into a plurality of binary sequence pieces of 8 bits each; the binary sequence is the concatenation of these 8-bit binary sequence pieces.
For example, the binary code of the file is divided into groups of one byte (i.e., binary sequence pieces of 8 binary bits, of which there are 256 kinds in total), and each group is mapped to a suitable base fragment of 5 bases (of which there are likewise 256 kinds in total); the base fragments are then combined to obtain the DNA information.
It is understood that the decoding process is the reverse of the encoding process: the DNA sequence code is divided into groups of 5 bases, each group being mapped to a binary sequence piece of 8 binary bits.
S30. Storing the file in the form of DNA information.
Specifically, the file is stored in the form of DNA information, that is, in the form of a biological macromolecule, which effectively prolongs the time for which the information can be stably stored and reduces the energy consumed by storage.
In the information encoding method based on DNA storage provided in this embodiment, a file is converted into DNA information for storage under the above composition conditions, so that the longest run of consecutive identical bases in all encoded DNA information is only 2 and the GC content is between 0.4 and 0.6. The algorithm has linear complexity, and the Net Information Density (NID) is 1.60 (8 bits carried by every 5 bases). This effectively safeguards the DNA synthesis, replication and sequencing processes, improves DNA synthesis efficiency, increases the information density of storage, prolongs the stable storage time of the information, and reduces storage energy consumption.
In one embodiment, as shown in FIG. 3, after step S20, i.e., after all the base fragments are combined in sequence to form DNA information, the method further comprises the following steps:
S201. Acquiring a DNA information decoding request, and acquiring the corresponding DNA information based on the DNA information decoding request.
S202. Converting the DNA information into a binary sequence based on the binary-DNA conversion mapping table, and restoring the binary sequence to a file for storage.
The DNA information decoding request is a request to decode a file stored as DNA information back into a binary sequence; that is, decoding is the reverse process of encoding.
Specifically, this embodiment again uses the binary-DNA conversion mapping table: the DNA information is divided into base fragments, which are then mapped to binary sequence pieces. Preferably, the binary-DNA conversion mapping table includes at least one set of mapping relations between base fragments and binary sequence pieces, that is, the specific correspondence between the 8-bit code of each binary sequence piece and the five-base arrangement of each base fragment.
Further, the whole flow from encoding to decoding illustrates how this embodiment is implemented:
A. Encoding process:
Division:
The binary sequence of the input information is divided into groups of 8 binary bits. This yields a whole number of groups, because 8 binary bits make one byte and files are stored in units of bytes, so the file size is necessarily an integer multiple of a byte. Taking the input information in FIG. 4 as an example, "01100011100011010101101110001110101110110011011101011000" is divided into 7 groups in total.
Mapping:
The 5-base fragment corresponding to each binary sequence piece is looked up in the mapping relation provided by the binary-DNA conversion mapping table. For example, "01100011" corresponds to "TCGTA", and so on. This gives 7 base fragments of 5 bases each.
Merging:
The resulting base fragments are combined to obtain the desired DNA information. The DNA information may then be synthesized into DNA as the information recording carrier.
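The three encoding steps (division, mapping, merging) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `BYTE_TO_FRAGMENT` and `encode` are hypothetical, and the table holds only the seven entries needed for the FIG. 4 example, taken from the description.

```python
# Partial mapping table: only the seven entries used by the FIG. 4 example,
# copied from the description. A real encoder needs all 256 entries.
BYTE_TO_FRAGMENT = {
    0b01100011: "TCGTA",
    0b10001101: "CACTG",
    0b01011011: "TCTCT",
    0b10001110: "CACGA",
    0b10111011: "CGTCT",
    0b00110111: "AGTGC",
    0b01011000: "TCTAC",
}


def encode(data: bytes, table: dict[int, str]) -> str:
    """Divide `data` into bytes, map each byte to a fragment, and merge."""
    # Iterating over a bytes object yields one integer (0-255) per byte,
    # which is the "division into groups of 8 bits" step.
    return "".join(table[byte] for byte in data)


bits = "01100011100011010101101110001110101110110011011101011000"
# Convert the bit string to 7 bytes; to_bytes keeps the leading zero bit.
data = int(bits, 2).to_bytes(len(bits) // 8, "big")
dna = encode(data, BYTE_TO_FRAGMENT)
print(dna)
```

Each byte costs one dictionary lookup, so the running time grows linearly with the file size, in line with the linear complexity stated in the description.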
B. Decoding process:
Division:
The DNA information to be decoded is divided into groups of 5 bases. This also yields a whole number of groups, since encoding produced a whole number of fragments. For example, "TCGTA CACTG TCTCT CACGA CGTCT AGTGC TCTAC" is divided into 7 groups.
Mapping:
The 8-bit binary sequence piece corresponding to each base fragment is looked up in the mapping relation provided by the binary-DNA conversion mapping table. For example, "TCGTA" corresponds to "01100011", and so on. This gives 7 binary sequence pieces of 8 bits each.
Merging:
The resulting binary sequence pieces are combined to obtain the decoded information.
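The decoding steps can be sketched the same way. Again this is only an illustration under stated assumptions: `FRAGMENT_TO_BITS` and `decode` are hypothetical names, the table is the inverse of the seven example entries from the description, and the length check is an added safeguard that the description does not mention.

```python
FRAGMENT_LENGTH = 5

# Inverse of the seven example entries given in the description.
FRAGMENT_TO_BITS = {
    "TCGTA": "01100011",
    "CACTG": "10001101",
    "TCTCT": "01011011",
    "CACGA": "10001110",
    "CGTCT": "10111011",
    "AGTGC": "00110111",
    "TCTAC": "01011000",
}


def decode(dna: str, table: dict[str, str]) -> str:
    """Divide `dna` into 5-base fragments, map each to 8 bits, and merge."""
    if len(dna) % FRAGMENT_LENGTH != 0:
        raise ValueError("DNA length is not a multiple of the fragment length")
    fragments = [
        dna[i:i + FRAGMENT_LENGTH]
        for i in range(0, len(dna), FRAGMENT_LENGTH)
    ]
    return "".join(table[fragment] for fragment in fragments)


dna = "TCGTA CACTG TCTCT CACGA CGTCT AGTGC TCTAC".replace(" ", "")
bits = decode(dna, FRAGMENT_TO_BITS)
print(bits)
```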
This embodiment reads and decodes data from a carrier that uses DNA as the storage medium for subsequent processing, and provides a fast, efficient and robust coding and decoding scheme.
In a specific embodiment, as shown in fig. 5, after step S30, that is, after the file is saved in the form of DNA information, the method further includes the following steps:
S3011. Acquiring a timing task, and updating the mapping relation in the binary-DNA conversion mapping table when the system time satisfies the timing task;
or,
S3012. Establishing, for the mapping relation, a public mapping relation for public documents and a private mapping relation for private documents.
Specifically, to enhance the security and reliability of file storage, the method provided in this embodiment may periodically update the mapping relation in the binary-DNA conversion mapping table, that is, change which five-base arrangement corresponds to each binary sequence piece. Alternatively, a public mapping relation is established for public documents, for general use, while a private mapping relation is established for documents with security or privacy requirements.
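One way to picture a private or periodically updated mapping relation is to reassign the same set of base fragments to the binary sequence pieces in a different order. The description does not say how the new mapping is derived; the seeded shuffle below, and the names `derive_private_table` and `secret_seed`, are assumptions made purely for illustration. Python's `random` module is not cryptographically secure, so this is not a security mechanism as written.

```python
import random

# First four entries of the mapping table in the description, used here
# as a tiny stand-in for the full 256-entry public table.
PUBLIC_TABLE = {
    0b00000000: "ATACG",
    0b00000001: "ATAGC",
    0b00000010: "ATCAC",
    0b00000011: "ATCAG",
}


def derive_private_table(public_table: dict[int, str], secret_seed: int) -> dict[int, str]:
    """Reassign the same fragments to the same keys in a seed-dependent order."""
    keys = sorted(public_table)
    fragments = [public_table[key] for key in keys]
    # A seeded generator makes the permutation reproducible for the key holder.
    random.Random(secret_seed).shuffle(fragments)
    return dict(zip(keys, fragments))


private_table = derive_private_table(PUBLIC_TABLE, secret_seed=2022)
print(private_table)
```

Because only the assignment changes, every fragment still satisfies the composition conditions, and the mapping stays one-to-one, so decoding with the matching table remains possible.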
In a specific embodiment, as shown in fig. 6, before step S10, that is, before obtaining the file stored in the form of the binary sequence, the method further includes the following steps:
S101. Based on the principle that the number of kinds of base fragments is greater than or equal to the number of kinds of binary sequence pieces and that conversion is convenient, determining a byte of eight bits as the division unit of the binary sequence pieces.
S102. Determining the base fragments corresponding to the 2^8 binary sequence pieces according to the bit count of 8.
Specifically, to make the complexity of the algorithm linear, the binary sequence of the document is cut into binary sequence pieces of a fixed length, and each binary sequence piece is mapped to a suitable DNA base fragment. The principle that the number of kinds of base fragments is greater than or equal to the number of kinds of binary sequence pieces means that 4^n ≥ 2^8, that is, 2n ≥ 8, where n is the number of bases in a base fragment and 2^8 is the total number of kinds of binary sequence pieces. To save base resources, the smallest feasible value of n can be taken as the number of bases of the base fragment in this example.
For example, since files on a computer are stored in units of bytes and each byte is 8 bits, the length of a binary sequence piece can be fixed at 8, so the total number of kinds of binary sequence pieces is 2^8 = 256. Next, 256 suitable base fragments are determined.
When one byte is the unit of the binary sequence piece, n is at least 4. However, a large fraction of the base fragments of length 4 have long runs of consecutive identical bases (e.g., AAAT has 3 consecutive identical bases) or extreme GC content (e.g., GCCT has a GC content of 75%). A base sequence obtained by combining such base fragments would have even longer runs of identical bases, and its GC content could not be controlled. Therefore n cannot be 4 and should be at least 5.
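The choice of n can be checked with a short count. The sketch below finds the smallest n with 4^n ≥ 2^8 and then counts how many fragments of length 4 and 5 have a GC fraction between 0.4 and 0.6. Applying that same GC window to length 4 is an assumption of this sketch (the description states the window only for length 5), and `count_gc_balanced` is a hypothetical helper name.

```python
from itertools import product

BITS_PER_PIECE = 8

# Smallest fragment length n such that 4**n >= 2**8.
n_min = next(n for n in range(1, 10) if 4 ** n >= 2 ** BITS_PER_PIECE)


def count_gc_balanced(n: int) -> int:
    """Count length-n fragments whose GC fraction t/n lies in [0.4, 0.6]."""
    count = 0
    for fragment in product("ATCG", repeat=n):
        t = sum(base in "GC" for base in fragment)
        # Integer form of 0.4 <= t/n <= 0.6, avoiding floating-point rounding.
        if 2 * n <= 5 * t <= 3 * n:
            count += 1
    return count


print("smallest n:", n_min)
print("GC-balanced fragments, n=4:", count_gc_balanced(4))
print("GC-balanced fragments, n=5:", count_gc_balanced(5))
print("net information density for n=5:", BITS_PER_PIECE / 5)
```

Under this assumption, length 4 leaves only 96 GC-balanced fragments, fewer than the 256 required, even before any rule on repeated bases is applied, while length 5 leaves 640. With n = 5, each fragment carries 8 bits, which gives the net information density of 8/5 = 1.60.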
The method provided by this embodiment treats the file as a binary sequence composed of a plurality of binary sequence pieces, and the mapping relation between binary sequence pieces and base fragments enables fast conversion of the storage format. Preferably, the mapping relation between binary sequence pieces and base fragments is recorded in a binary-DNA conversion mapping table, which includes at least one set of mapping relations between base fragments and binary sequence pieces. This embodiment can match the number of base fragments to the total number of binary sequence pieces, thereby saving base fragment storage resources.
In an embodiment, as shown in FIG. 7, step S102, determining the base fragments corresponding to the 2^8 binary sequence pieces, includes the following steps:
S1021. Taking any one of the four bases as the first base of each base fragment, and continuing to determine the second through last bases according to the composition conditions of the base fragment.
S1022. Taking any one of the remaining three bases as the second base of each base fragment, and repeating the step of determining the second through last bases according to the composition conditions until the number of kinds of base fragments equals the number of kinds of binary sequence pieces.
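Steps S1021 and S1022 amount to building fragments one base at a time and stopping once 256 have been collected. A minimal depth-first sketch follows, using the composition conditions stated for the method (length five, GC count of 2 or 3, no repeated base at either boundary, no three identical bases in the middle). The function name `grow` and the trial order A, T, C, G are assumptions of this sketch, and the 256 fragments it selects are one valid choice that need not coincide with the table given in the description.

```python
BASES = "ATCG"        # order in which candidate bases are tried (an assumption)
FRAGMENT_LENGTH = 5
TARGET = 2 ** 8       # one fragment per possible byte value


def grow(prefix: str, found: list[str]) -> None:
    """Extend `prefix` one base at a time, keeping fragments that meet the conditions."""
    if len(found) == TARGET:
        return  # enough fragments: one for every binary sequence piece
    if len(prefix) == FRAGMENT_LENGTH:
        gc = sum(base in "GC" for base in prefix)
        if 2 <= gc <= 3:  # equivalent to 0.4 <= gc/5 <= 0.6
            found.append(prefix)
        return
    position = len(prefix)
    for base in BASES:
        if position == 1 and base == prefix[0]:
            continue  # no repeated base at the left boundary
        if position == 3 and prefix[1] == prefix[2] == base:
            continue  # no three identical bases in the middle
        if position == 4 and base == prefix[3]:
            continue  # no repeated base at the right boundary
        grow(prefix + base, found)


fragments: list[str] = []
grow("", fragments)
print(len(fragments), fragments[0])
```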
Specifically, the explanation continues with n = 5 as an example. 256 suitable base fragments that can serve as the storage medium are selected from all base fragments of length 5.
Let abcde be a base fragment of length 5, where a, b, c, d and e each take a value in the set {A, T, C, G}. To limit the length of runs of consecutive identical bases, they should satisfy:
Condition 1: a ≠ b and d ≠ e;
Condition 2: b, c and d are not all the same.
Condition 1 means that there are no repeated bases at the boundaries of a base fragment, which prevents runs of 3 or more consecutive identical bases from appearing after fragments are combined (e.g., TCGGA followed by AATCG would give TCGGAAATCG, which contains the run AAA). Condition 2 means that three consecutive identical bases do not occur in the interior of a base fragment, which likewise keeps runs of identical bases from becoming too long. In addition, let t be the total number of G and C bases in the base fragment. To control the GC content of the encoded base sequence, t should satisfy:
Condition 3: 0.4 ≤ t/5 ≤ 0.6.
That is, the GC content of each base fragment is between 0.4 and 0.6, so the GC content of the encoded base sequence also lies in this interval. Since t is an integer, Condition 3 reduces to
2 ≤ t ≤ 3,
meaning that the total number of G and C bases in a base fragment can only be 2 or 3.
Suitable base fragments can be selected according to the above conditions.
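Conditions 1 to 3 translate directly into a predicate, sketched below with the hypothetical name `satisfies_conditions`. The second half of the sketch rests on an observation rather than on anything the description states: the entries listed in the mapping table of this description contain no two adjacent identical bases at all. Filtering the candidates by that stricter rule (helper `has_adjacent_repeat`, also hypothetical) and numbering them in the base order A, T, C, G leaves exactly 256 fragments whose numbering agrees with the table entries checked here.

```python
from itertools import product


def satisfies_conditions(fragment: str) -> bool:
    """Check Conditions 1-3 for a base fragment abcde of length 5."""
    a, b, c, d, e = fragment
    t = sum(base in "GC" for base in fragment)
    return (
        a != b and d != e          # Condition 1: no repeat at either boundary
        and not (b == c == d)      # Condition 2: b, c, d not all the same
        and 2 <= t <= 3            # Condition 3: 0.4 <= t/5 <= 0.6
    )


def has_adjacent_repeat(fragment: str) -> bool:
    """True if any two neighbouring bases are identical (stricter, assumed rule)."""
    return any(x == y for x, y in zip(fragment, fragment[1:]))


# product() enumerates fragments in the order A, T, C, G at every position.
candidates = ["".join(p) for p in product("ATCG", repeat=5)
              if satisfies_conditions("".join(p))]
strict = [f for f in candidates if not has_adjacent_repeat(f)]
table = dict(enumerate(strict))

print(len(candidates), "fragments satisfy Conditions 1-3")
print(len(strict), "of them have no adjacent identical bases")
print(table[0b00000000], table[0b01100011])
```

Under Conditions 1 to 3 alone there are more candidates than needed, so any 256 of them would form a valid table. The stricter filter is shown only because it appears to match the listed entries.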
Starting with a: it can take any value in {A, T, C, G}, so there are four possibilities. b has only 3 possibilities, since b cannot be the same as a; that is, if a is A, b cannot be A and can only be T, C or G. c can take any of the four values, since it is in the middle of the base fragment and only two bases precede it. d must be chosen carefully so that b, c and d are not all the same and so that the total number of G and C bases so far is not less than 1 and not more than 3 (here it may be 1, 2 or 3, since one base, e, still follows). For example, if the preceding bases are GTT, then d cannot be T, which would violate Condition 2; d can be A, C or G. The value of e likewise depends on the preceding bases: e cannot be the same as d, and the total number of G and C bases must end up as 2 or 3. For example, if the preceding bases are GTTA, e cannot be A, which would violate Condition 1; moreover, e can only be G or C in order to satisfy Condition 3. The suitable base fragments are selected in this way, giving the mapping relation of the binary-DNA conversion mapping table shown below, where the symbol "<-->" denotes a mapping, with a binary sequence piece of length 8 (an 8-bit binary sequence piece) on the left and a base fragment of length 5 (a 5-base fragment) on the right:
00000000 <--> ATACG 00000001 <--> ATAGC 00000010 <--> ATCAC
00000011 <--> ATCAG 00000100 <--> ATCTC 00000101 <--> ATCTG
00000110 <--> ATCGA 00000111 <--> ATCGT 00001000 <--> ATCGC
00001001 <--> ATGAC 00001010 <--> ATGAG 00001011 <--> ATGTC
00001100 <--> ATGTG 00001101 <--> ATGCA 00001110 <--> ATGCT
00001111 <--> ATGCG 00010000 <--> ACATC 00010001 <--> ACATG
00010010 <--> ACACA 00010011 <--> ACACT 00010100 <--> ACACG
00010101 <--> ACAGA 00010110 <--> ACAGT 00010111 <--> ACAGC
00011000 <--> ACTAC 00011001 <--> ACTAG 00011010 <--> ACTCA
00011011 <--> ACTCT 00011100 <--> ACTCG 00011101 <--> ACTGA
00011110 <--> ACTGT 00011111 <--> ACTGC 00100000 <--> ACGAT
00100001 <--> ACGAC 00100010 <--> ACGAG 00100011 <--> ACGTA
00100100 <--> ACGTC 00100101 <--> ACGTG 00100110 <--> ACGCA
00100111 <--> ACGCT 00101000 <--> AGATC 00101001 <--> AGATG
00101010 <--> AGACA 00101011 <--> AGACT 00101100 <--> AGACG
00101101 <--> AGAGA 00101110 <--> AGAGT 00101111 <--> AGAGC
00110000 <--> AGTAC 00110001 <--> AGTAG 00110010 <--> AGTCA
00110011 <--> AGTCT 00110100 <--> AGTCG 00110101 <--> AGTGA
00110110 <--> AGTGT 00110111 <--> AGTGC 00111000 <--> AGCAT
00111001 <--> AGCAC 00111010 <--> AGCAG 00111011 <--> AGCTA
00111100 <--> AGCTC 00111101 <--> AGCTG 00111110 <--> AGCGA
00111111 <--> AGCGT 01000000 <--> TATCG 01000001 <--> TATGC
01000010 <--> TACAC 01000011 <--> TACAG 01000100 <--> TACTC
01000101 <--> TACTG 01000110 <--> TACGA 01000111 <--> TACGT
01001000 <--> TACGC 01001001 <--> TAGAC 01001010 <--> TAGAG
01001011 <--> TAGTC 01001100 <--> TAGTG 01001101 <--> TAGCA
01001110 <--> TAGCT 01001111 <--> TAGCG 01010000 <--> TCATC
01010001 <--> TCATG 01010010 <--> TCACA 01010011 <--> TCACT
01010100 <--> TCACG 01010101 <--> TCAGA 01010110 <--> TCAGT
01010111 <--> TCAGC 01011000 <--> TCTAC 01011001 <--> TCTAG
01011010 <--> TCTCA 01011011 <--> TCTCT 01011100 <--> TCTCG
01011101 <--> TCTGA 01011110 <--> TCTGT 01011111 <--> TCTGC
01100000 <--> TCGAT 01100001 <--> TCGAC 01100010 <--> TCGAG
01100011 <--> TCGTA 01100100 <--> TCGTC 01100101 <--> TCGTG
01100110 <--> TCGCA 01100111 <--> TCGCT 01101000 <--> TGATC
01101001 <--> TGATG 01101010 <--> TGACA 01101011 <--> TGACT
01101100 <--> TGACG 01101101 <--> TGAGA 01101110 <--> TGAGT
01101111 <--> TGAGC 01110000 <--> TGTAC 01110001 <--> TGTAG
01110010 <--> TGTCA 01110011 <--> TGTCT 01110100 <--> TGTCG
01110101 <--> TGTGA 01110110 <--> TGTGT 01110111 <--> TGTGC
01111000 <--> TGCAT 01111001 <--> TGCAC 01111010 <--> TGCAG
01111011 <--> TGCTA 01111100 <--> TGCTC 01111101 <--> TGCTG
01111110 <--> TGCGA 01111111 <--> TGCGT 10000000 <--> CATAC
10000001 <--> CATAG 10000010 <--> CATCA 10000011 <--> CATCT
10000100 <--> CATCG 10000101 <--> CATGA 10000110 <--> CATGT
10000111 <--> CATGC 10001000 <--> CACAT 10001001 <--> CACAC
10001010 <--> CACAG 10001011 <--> CACTA 10001100 <--> CACTC
10001101 <--> CACTG 10001110 <--> CACGA 10001111 <--> CACGT
10010000 <--> CAGAT 10010001 <--> CAGAC 10010010 <--> CAGAG
10010011 <--> CAGTA 10010100 <--> CAGTC 10010101 <--> CAGTG
10010110 <--> CAGCA 10010111 <--> CAGCT 10011000 <--> CTATC
10011001 <--> CTATG 10011010 <--> CTACA 10011011 <--> CTACT
10011100 <--> CTACG 10011101 <--> CTAGA 10011110 <--> CTAGT
10011111 <--> CTAGC 10100000 <--> CTCAT 10100001 <--> CTCAC
10100010 <--> CTCAG 10100011 <--> CTCTA 10100100 <--> CTCTC
10100101 <--> CTCTG 10100110 <--> CTCGA 10100111 <--> CTCGT
10101000 <--> CTGAT 10101001 <--> CTGAC 10101010 <--> CTGAG
10101011 <--> CTGTA 10101100 <--> CTGTC 10101101 <--> CTGTG
10101110 <--> CTGCA 10101111 <--> CTGCT 10110000 <--> CGATA
10110001 <--> CGATC 10110010 <--> CGATG 10110011 <--> CGACA
10110100 <--> CGACT 10110101 <--> CGAGA 10110110 <--> CGAGT
10110111 <--> CGTAT 10111000 <--> CGTAC 10111001 <--> CGTAG
10111010 <--> CGTCA 10111011 <--> CGTCT 10111100 <--> CGTGA
10111101 <--> CGTGT 10111110 <--> CGCAT 10111111 <--> CGCTA
11000000 <--> GATAC 11000001 <--> GATAG 11000010 <--> GATCA
11000011 <--> GATCT 11000100 <--> GATCG 11000101 <--> GATGA
11000110 <--> GATGT 11000111 <--> GATGC 11001000 <--> GACAT
11001001 <--> GACAC 11001010 <--> GACAG 11001011 <--> GACTA
11001100 <--> GACTC 11001101 <--> GACTG 11001110 <--> GACGA
11001111 <--> GACGT 11010000 <--> GAGAT 11010001 <--> GAGAC
11010010 <--> GAGAG 11010011 <--> GAGTA 11010100 <--> GAGTC
11010101 <--> GAGTG 11010110 <--> GAGCA 11010111 <--> GAGCT
11011000 <--> GTATC 11011001 <--> GTATG 11011010 <--> GTACA
11011011 <--> GTACT 11011100 <--> GTACG 11011101 <--> GTAGA
11011110 <--> GTAGT 11011111 <--> GTAGC 11100000 <--> GTCAT
11100001 <--> GTCAC 11100010 <--> GTCAG 11100011 <--> GTCTA
11100100 <--> GTCTC 11100101 <--> GTCTG 11100110 <--> GTCGA
11100111 <--> GTCGT 11101000 <--> GTGAT 11101001 <--> GTGAC
11101010 <--> GTGAG 11101011 <--> GTGTA 11101100 <--> GTGTC
11101101 <--> GTGTG 11101110 <--> GTGCA 11101111 <--> GTGCT
11110000 <--> GCATA 11110001 <--> GCATC 11110010 <--> GCATG
11110011 <--> GCACA 11110100 <--> GCACT 11110101 <--> GCAGA
11110110 <--> GCAGT 11110111 <--> GCTAT 11111000 <--> GCTAC
11111001 <--> GCTAG 11111010 <--> GCTCA 11111011 <--> GCTCT
11111100 <--> GCTGA 11111101 <--> GCTGT 11111110 <--> GCGAT
11111111 <--> GCTTA
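The composition conditions that govern the table can be sketched in a few lines of Python. This is only an illustrative check, not the procedure used to build the table above: the function names `is_valid_piece` and `enumerate_valid_pieces` and the A, T, C, G enumeration order are assumptions made for this example. It reads "no repeated base at the boundary" as a ≠ b and d ≠ e, and "no three consecutive repeated bases in the middle" as b, c and d not all being equal, as in the construction described before the table.

```python
from itertools import product

BASES = "ATCG"  # enumeration order is an assumption of this sketch


def is_valid_piece(piece: str) -> bool:
    """Check the composition conditions for one 5-base piece (positions a..e)."""
    a, b, c, d, e = piece
    gc = sum(base in "GC" for base in piece)
    return (
        a != b                 # no repeated base at the leading boundary
        and d != e             # no repeated base at the trailing boundary
        and not (b == c == d)  # no three identical bases in the middle
        and gc in (2, 3)       # GC content of 40% or 60%
    )


def enumerate_valid_pieces() -> list:
    """Return every 5-base piece that satisfies the conditions."""
    candidates = ("".join(p) for p in product(BASES, repeat=5))
    return [piece for piece in candidates if is_valid_piece(piece)]


pieces = enumerate_valid_pieces()
# A byte needs 2**8 = 256 distinct pieces; the conditions leave room for that.
print(len(pieces), len(pieces) >= 256)
```

Under this reading, 400 of the 1024 possible 5-base pieces qualify, so selecting 256 of them for the byte mapping is always possible.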
In an embodiment, in step S20, converting the file into a DNA base sequence code for storage includes the following step:
S21, synthesizing the DNA base sequence code into a DNA solution or dry powder using a DNA synthesizer, and storing it.
Specifically, various techniques are available for artificially synthesizing a specific DNA sequence by splicing oligonucleotides; chemical methods are already mature, and enzymatic synthesis methods are under development. The chemical method comprises four steps: deprotection, coupling, capping (optional) and oxidation. It appeared early and requires toxic reagents. The enzymatic method is comparatively mild: it damages DNA less, is more accurate, and produces fewer by-products.
DNA is closely tied to living organisms and, unlike other storage media, will not be made obsolete over time. The storage density of DNA is very high: the storage density of the most compact hard disk in the world is currently only one thousandth of that of DNA. Using DNA to store data, 10 complete high-definition movies can be stored in the volume of a grain of salt. DNA is at the core of biological research, and as time goes on and the technology matures, accessing data stored in DNA will become more and more convenient.
The complexity of the method provided herein is analyzed as follows. Let n be the length of the binary sequence of the input file. The number of steps required to perform the encoding or decoding mapping is a linear function of n, so both encoding and decoding have linear complexity. The method of the present invention maps every 8 binary bits to a piece of 5 quaternary symbols (bases), giving a net information density of 8/5 = 1.60 bits per base.
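A table-lookup codec makes the linear cost and the 8/5 density concrete. The sketch below is a minimal illustration under stated assumptions: `build_table`, `encode` and `decode` are hypothetical names, and the table is simply the first 256 qualifying pieces in A, T, C, G order. It is not the mapping table listed above and only obeys the same composition conditions.

```python
from itertools import product


def build_table() -> list:
    """Illustrative byte -> 5-base table (NOT the patent's table):
    the first 256 pieces, in A, T, C, G order, that meet the conditions."""
    valid = []
    for p in product("ATCG", repeat=5):
        a, b, c, d, e = p
        gc = sum(x in "GC" for x in p)
        if a != b and d != e and not (b == c == d) and gc in (2, 3):
            valid.append("".join(p))
    return valid[:256]


ENCODE = build_table()  # list index = byte value
DECODE = {piece: value for value, piece in enumerate(ENCODE)}


def encode(data: bytes) -> str:
    """One table lookup per byte, so the work grows linearly with the input."""
    return "".join(ENCODE[byte] for byte in data)


def decode(dna: str) -> bytes:
    """One dictionary lookup per 5-base piece, again linear in the input."""
    if len(dna) % 5:
        raise ValueError("DNA length must be a multiple of 5")
    return bytes(DECODE[dna[i:i + 5]] for i in range(0, len(dna), 5))


payload = b"DNA storage"
dna = encode(payload)
assert decode(dna) == payload
# Net information density: bits stored per base.
print(len(payload) * 8 / len(dna))
```

Each byte costs exactly five bases, so the density is fixed at 8/5 regardless of the file contents.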
Furthermore, the information encoding method based on DNA storage proposed in this embodiment converts information originally stored in binary form into DNA information quickly, with high encoding efficiency and under robust composition conditions. The net information density (NID) reaches 1.60, the longest run of consecutive repeated bases (homopolymer) in any encoded DNA sequence is 2 bases, and the GC content is strictly controlled between 40% and 60%.
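The homopolymer and GC claims can be checked mechanically on any encoded sequence. The following sketch uses two hypothetical helper functions, `max_homopolymer` and `gc_content`, and concatenates three pieces taken from the mapping table (for the bytes 00000000, 11111111 and 11000000). The bound of 2 follows from the conditions: a run that crosses a piece boundary can only contain the last base of one piece and the first base of the next, because the first two and last two bases of every piece differ, and three identical bases in the middle of a piece are excluded.

```python
from itertools import groupby


def max_homopolymer(dna: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    return max((len(list(run)) for _, run in groupby(dna)), default=0)


def gc_content(dna: str) -> float:
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in dna) / len(dna)


# Pieces for 00000000, 11111111 and 11000000 from the mapping table.
dna = "ATACG" + "GCTTA" + "GATAC"
print(max_homopolymer(dna), gc_content(dna))
```

Because every piece individually has a GC content of 40% or 60%, any concatenation of pieces stays within the same range.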
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, an information encoding apparatus based on DNA storage is provided, which corresponds one-to-one to the information encoding method based on DNA storage in the above-described embodiments. As shown in fig. 8, the information encoding apparatus based on DNA storage includes a binary sequence file obtaining module 10, a DNA information forming module 20, and a DNA information saving module 30. The functional modules are described in detail as follows:
a binary sequence file obtaining module 10, configured to obtain a file stored in a binary sequence form;
a DNA information forming module 20, configured to divide the binary sequence into binary sequence pieces in units of bytes based on a binary-to-DNA conversion mapping table, map each binary sequence piece to a base piece, and sequentially merge all the base pieces to form DNA information, where each base piece satisfies the following composition conditions: the length is five; the sum of the numbers of G bases and C bases is t, with 0.4 ≤ t/5 ≤ 0.6; no repeated base exists at the boundary; and no three consecutive repeated bases exist in the middle;
and a DNA information saving module 30 for saving the file in the form of the DNA information.
For specific limitations of the information encoding apparatus based on DNA storage, reference may be made to the above limitations of the information encoding method based on DNA storage, which are not repeated here. The modules of the above-described information encoding apparatus based on DNA storage may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile medium and an internal memory. The non-volatile medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile medium. The database of the computer device is used to store data related to the information encoding method based on DNA storage. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements an information encoding method based on DNA storage.
In an embodiment, a computer device is provided, which includes a memory, a processor and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the method for encoding information based on DNA storage according to the above embodiments, such as S10 to S20 shown in fig. 2. Alternatively, the processor, when executing the computer program, realizes the functions of each module/unit of the information encoding apparatus based on DNA storage in the above-described embodiments, for example, the functions of the modules 10 to 20 shown in fig. 8. To avoid repetition, further description is omitted here.
In one embodiment, a computer readable medium is provided, on which a computer program is stored, and the computer program is executed by a processor to implement the information encoding method based on DNA storage of the above embodiments, such as S10 to S20 shown in fig. 2. Alternatively, the computer program may be executed by a processor to implement the functions of each module/unit in the information encoding apparatus based on DNA storage in the above-described apparatus embodiment, for example, the functions of the modules 10 to 20 shown in fig. 8. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer readable medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments of the present application may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for encoding information based on DNA storage, comprising:
acquiring a file stored in a binary sequence form;
based on a binary-DNA conversion mapping table, dividing the binary sequence in units of bytes to obtain binary sequence pieces, mapping each binary sequence piece to a base piece, and sequentially combining all the base pieces to form DNA information, wherein each base piece meets the following composition conditions: the length is five; the sum of the numbers of G bases and C bases is t, with 0.4 ≤ t/5 ≤ 0.6; no repeated base exists at the boundary; and no three consecutive repeated bases exist in the middle;
and storing the file in the form of the DNA information.
2. The method for encoding information based on DNA storage according to claim 1, further comprising, after the sequentially combining all the base pieces to form DNA information:
acquiring a DNA information decoding request, and acquiring corresponding DNA information based on the DNA information decoding request;
and converting the DNA information into a binary sequence based on the binary-DNA conversion mapping table, and converting the binary sequence into a file for storage.
3. The method of claim 1, wherein the binary-DNA conversion mapping table comprises at least one mapping relationship between the base pieces and the binary sequence pieces.
4. The method for encoding information based on DNA storage according to claim 1, further comprising, after the storing the file in the form of the DNA information:
acquiring a timing task, and updating the mapping relation in the binary-DNA conversion mapping table when the system time meets the timing task;
or,
and establishing a public mapping relation based on the public document and a private mapping relation based on the private document for the mapping relation.
5. The method for encoding information based on DNA storage according to claim 1, further comprising, before the obtaining the file stored in the form of binary sequence:
determining a byte carrying eight bits as the division unit of the binary sequence pieces, based on the principle that the number of kinds of base pieces is greater than or equal to the number of kinds of binary sequence pieces and that conversion is convenient;
and determining the base pieces corresponding to the 2^8 binary sequence pieces according to the number of bits being 8.
6. The method of claim 5, wherein the determining the base pieces corresponding to the 2^8 binary sequence pieces comprises:
taking any one of the four bases as the first base of each base piece, and continuing to synthesize the second base through the last base according to the composition conditions of the base piece;
repeating the step of continuing to synthesize the second base through the last base according to the composition conditions, with any one of the remaining three bases as the second base of each base piece, until the number of kinds of base pieces is equal to the number of kinds of binary sequence pieces.
7. The method for encoding information based on DNA storage according to claim 1, wherein the storing the file in the form of the DNA information comprises:
synthesizing the DNA base sequence code into DNA solution or dry powder by a DNA synthesizer and storing.
8. An information encoding apparatus based on DNA storage, comprising:
the binary sequence file acquisition module is used for acquiring a file stored in a binary sequence form;
a DNA information forming module, configured to divide the binary sequence in units of bytes to obtain binary sequence pieces based on a binary-DNA conversion mapping table, map each binary sequence piece to a base piece, and then combine all the base pieces in sequence to form DNA information, wherein each base piece meets the following composition conditions: the length is five; the sum of the numbers of G bases and C bases is t, with 0.4 ≤ t/5 ≤ 0.6; no repeated base exists at the boundary; and no three consecutive repeated bases exist in the middle;
and the DNA information storage module is used for storing the file in the form of the DNA information.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method for encoding information based on DNA storage according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable medium, in which a computer program is stored, which, when being executed by a processor, implements the method for encoding information based on DNA storage according to any one of claims 1 to 7.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210432503.0A CN114974434A (en) 2022-04-23 2022-04-23 Information coding method and device based on DNA storage, computer equipment and medium
PCT/CN2022/091100 WO2023201782A1 (en) 2022-04-23 2022-05-06 Information coding method and apparatus based on dna storage, and computer device and medium

Publications (1)

Publication Number Publication Date
CN114974434A (en) 2022-08-30

Family

ID=82979489

Country Status (2)

Country Link
CN (1) CN114974434A (en)
WO (1) WO2023201782A1 (en)

Also Published As

Publication number Publication date
WO2023201782A1 (en) 2023-10-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination