WO2023201782A1 - Information coding method and apparatus based on dna storage, and computer device and medium - Google Patents

Info

Publication number
WO2023201782A1
Authority
WO
WIPO (PCT)
Prior art keywords
dna
base
information
binary sequence
bases
Prior art date
Application number
PCT/CN2022/091100
Other languages
French (fr)
Chinese (zh)
Inventor
黄奕翼
戴俊彪
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2023201782A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/172: Caching, prefetching or hoarding of files
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present invention relates to the field of information storage technology, and in particular to an information encoding method, device, computer equipment and medium based on DNA storage.
  • DNA is a biological macromolecule with a defined sequence.
  • DNA storage technology encodes information into base sequences and stores them in DNA. The DNA can then be copied, sequenced, and decoded to read the information inside.
  • DNA with long repetitive bases or extreme GC content is prone to errors during synthesis, replication, and sequencing. How to quickly and efficiently avoid long repetitive bases or extreme GC content in DNA sequences after information encoding has become an urgent problem to be solved.
  • Embodiments of the present invention provide an information encoding method, device, computer equipment and medium based on DNA storage to solve the problem of how to avoid long repeated bases or extreme GC content in DNA sequences after information encoding.
  • An information encoding method based on DNA storage including:
  • the binary sequence is divided into byte units to obtain binary sequence slices, each binary sequence slice is mapped to a base slice, and then all the base slices are combined in sequence to form DNA information.
  • Each base piece satisfies the following conditions: its length is five; the number of G and C bases, t, satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle;
  • the DNA information is converted into a binary sequence, which is used to convert the DNA information into files for storage.
  • After the file is saved in the form of DNA information, the method further includes:
  • the DNA base sequence code is synthesized into DNA solution or dry powder through a DNA synthesis device and stored.
  • An information encoding device based on DNA storage including:
  • A DNA information forming module, which is used to divide the binary sequence in units of bytes to obtain binary sequence slices, map each binary sequence slice to a base piece based on the binary-DNA conversion mapping table, and then combine all the base pieces in sequence to form DNA information.
  • Each base piece meets the following conditions: its length is five; the number of G and C bases, t, satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle;
  • a module for saving DNA information is used to save the file in the form of the DNA information.
  • A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, it implements the above information encoding method based on DNA storage.
  • A computer-readable medium stores a computer program.
  • When the computer program is executed by a processor, the above information encoding method based on DNA storage is implemented.
  • The above information encoding method, device, computer equipment and medium based on DNA storage convert a file into DNA information for storage by means of composition conditions, with an algorithm of linear complexity. They strictly guarantee that the longest run of consecutive repeated bases in all encoded DNA information is only 2 and that the GC content is between 0.4 and 0.6, and the net information density (NID) is 1.60. This helps ensure reliable synthesis, replication and sequencing of the DNA, improves DNA synthesis efficiency, increases the information density of storage, extends the stable storage time of information, and reduces storage energy consumption.
  • Figure 1 is a schematic diagram of the application environment of the information encoding method based on DNA storage in one embodiment of the present invention
  • Figure 2 illustrates a flow chart of an information encoding method based on DNA storage in one embodiment of the present invention
  • Figure 3 illustrates a first flow chart of an information encoding method based on DNA storage in another embodiment of the present invention
  • Figure 4 is a schematic diagram illustrating the entire process from encoding to decoding of an information encoding method based on DNA storage in another embodiment of the present invention
  • Figure 5 illustrates a second flow chart of an information encoding method based on DNA storage in another embodiment of the present invention
  • Figure 6 illustrates a third flow chart of an information encoding method based on DNA storage in another embodiment of the present invention
  • Figure 7 illustrates a fourth flow chart of an information encoding method based on DNA storage in another embodiment of the present invention.
  • Figure 8 is a schematic diagram of an information encoding device based on DNA storage in one embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a computer device according to an embodiment of the present invention.
  • the information encoding method based on DNA storage can be applied in the application environment as shown in Figure 1.
  • the information encoding method based on DNA storage is applied in the information encoding system based on DNA storage.
  • The information encoding system based on DNA storage includes a client and a server, where the client communicates with the server through the network.
  • The client, also known as the user end, refers to the program that corresponds to the server and provides local services to the user.
  • the client is a computer program, an APP program of a smart device, or a third-party applet embedded in other APPs.
  • the client can be installed on, but is not limited to, various computer devices such as personal computers, laptops, smartphones, tablets, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • an information encoding method based on DNA storage is provided.
  • the application of this method to the server in Figure 1 is used as an example to illustrate, specifically including the following steps:
  • the binary sequence is the digital sequence encoding of the file saved in binary form (composed only of 0 and 1).
  • this embodiment can perform DNA encoding on various forms of documents that have been saved in binary.
  • the various forms of documents include voice, text, images, music, etc., and are not specifically limited here.
  • Each base piece meets the following composition conditions: its length is five; the number of G and C bases, t, satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle.
  • the binary-DNA conversion mapping table is a table that maps binary codes and DNA codes to each other.
  • For example, one entry in the binary-DNA conversion mapping table is 00000000 <--> ATACG, which means that data with the binary code 00000000 can be saved in the DNA-encoded form ATACG after mapping in this embodiment.
  • the DNA base sequence code is the code synthesized from all n base units.
  • a binary-DNA conversion mapping table is used to convert a file saved as a binary sequence into DNA information synthesized into base slices in units of n bases for storage.
  • 8 bits that is, one byte, can be used as the unit of the binary sequence.
  • The continuous binary source code corresponding to the file can be split by bytes to obtain multiple binary sequence slices, each containing 8 bits.
  • The binary sequence is the sequence formed from these 8-bit binary sequence slices.
  • The binary code of the file is divided into bytes (that is, binary sequence slices of 8 binary bits, of which there are 256 possible kinds). Each slice is mapped to 5 appropriate bases.
  • The encoded sequence is then obtained by combining the base pieces (of which there are likewise 256 kinds).
  • the decoding process is the reverse process of the encoding process: the DNA sequence code is divided into groups of 5 bases, and each group is mapped to a binary sequence piece with 8 binary bits.
  • saving files in the form of DNA information that is, in the form of biological macromolecules, can effectively extend the stable storage time of information and reduce storage energy consumption.
  • the information encoding method based on DNA storage provided in this embodiment can strictly ensure that the longest length of consecutive repeating bases in all encoded DNA information is only 2.
  • The GC content is between 0.4 and 0.6, the algorithm has linear complexity, and the net information density (NID) is 1.60. This helps ensure reliable synthesis, replication and sequencing of the DNA, improves DNA synthesis efficiency, increases the information density of storage, extends the stable storage time of information, and reduces storage energy consumption.
  • After step S20, that is, after all the base pieces are combined in sequence to form DNA information, the method further includes the following steps:
  • the DNA decoding request is a request to decode the file with DNA information stored into a binary sequence, that is, decoding is the reverse process of encoding.
  • this embodiment still uses a binary-DNA conversion mapping table to divide the DNA information into base slices and then map it into a binary sequence.
  • The binary-DNA conversion mapping table includes at least one set of mapping relationships between base pieces and binary sequence slices, that is, the specific correspondence between the binary code of each binary sequence slice and the arrangement of the five bases in each base piece.
  • Using the mapping relationship provided by the binary-DNA conversion mapping table, look up the 5-base piece corresponding to each binary sequence slice. For example, "01100011" corresponds to "TCGTA", and so on. In this way, seven 5-base pieces are obtained.
  • the required DNA information is obtained. This DNA information can then be synthesized into DNA as an information recording carrier.
  • The DNA information to be decoded is divided into groups of 5 bases. This always yields a whole number of groups, because the encoding produced a whole number of groups. For example, "TCGTA CACTG TCTCT CACGA CGTCT AGTGC TCTAC" is divided into 7 groups.
  • Using the mapping relationship provided by the binary-DNA conversion mapping table, look up the 8-bit binary sequence slice corresponding to each base piece. For example, "TCGTA" corresponds to "01100011", and so on. In this way, 7 binary sequence slices of 8 bits each are obtained.
  • the resulting binary sequence slices are combined to obtain the decoded information.
  • This embodiment is used to read and decode data from DNA as a storage medium carrier for subsequent processing. It is a fast, efficient and robust encoding and decoding method.
  • After step S30, that is, after saving the file in the form of DNA information, the method further includes the following steps:
  • S3012: for the mapping relationships, establish a public mapping relationship based on public documents and a private mapping relationship based on private documents.
  • The method provided in this embodiment can regularly update the mapping relationship in the binary-DNA conversion mapping table, that is, change the correspondence between each binary sequence slice and the order of the five bases in each base piece.
  • Alternatively, a set of public mapping relationships is established for public documents, and at the same time a private mapping relationship is established for documents that require security or privacy.
  • Before step S10, that is, before obtaining the file stored in the form of a binary sequence, the method further includes the following steps:
  • this application cuts the binary sequence of the file into binary sequence slices of a certain length, and then maps each binary sequence slice to an appropriate DNA base slice.
  • The principle that the number of kinds of base pieces must be greater than or equal to the number of kinds of binary sequence slices gives 4^n ≥ 2^8, from which it follows that 2n ≥ 8, that is, n ≥ 4.
  • Here 2^8 is the total number of kinds of binary sequence slices.
  • The minimum value of n can be taken as the number of bases in a base piece.
  • Because each byte is 8 bits, 256 suitable base pieces need to be determined.
  • When one byte is the unit of a binary sequence slice, n must be at least 4. However, a large part of the length-4 base pieces have long consecutive repeating bases (for example, AAAT has 3 consecutive repeating bases) or extreme GC content (for example, the GC content of GCCT is 75%). A base sequence obtained by combining such pieces has longer runs of repeating bases, and its GC content cannot be controlled. So n cannot be 4; it should be at least 5.
  • the method provided in this embodiment converts the file into a binary sequence synthesized from multiple binary sequence slices, which facilitates subsequent rapid conversion of the storage format through the mapping relationship between the binary sequence slices and the base slices.
  • the mapping relationship between the binary sequence slices and the base slices is recorded and the binary-DNA conversion mapping table includes at least one set of mapping relationships between the base slices and the binary sequence slices. This embodiment can adapt the number of corresponding base slices based on the total number of binary sequence slices, thereby saving storage resources of base slices.
  • In step S102, determining the base pieces corresponding to the 2^8 binary sequence slices specifically includes the following steps:
  • Let abcde be a base piece of length 5, in which a, b, c, d and e each take a value in the set {A, T, C, G}. To limit the length of consecutive repeating bases, they should satisfy the following conditions:
  • Condition 1 means that there must be no repeated bases at the boundaries of a base piece (a differs from b, and d differs from e), so that combining pieces cannot produce 3 or more consecutive repeating bases. For example, placing TCGGA before AATCG gives TCGGAAATCG, which contains the 3 consecutive repeating bases AAA.
  • Condition 2 means that three consecutive repeating bases cannot appear inside a base piece (b, c and d are not all the same). This also avoids overly long runs of repeating bases.
  • Let t be the number of G and C bases in the base piece. To control the GC content of the encoded base sequence, t should satisfy the following condition (condition 3):
  • The value of d depends on the preceding bases: if choosing d would make b, c and d identical, condition 2 is violated. With b = c = T, for instance, d can be A, C or G.
  • The value of e also depends on the preceding bases: e cannot be the same as d, and the total number of G and C bases must be 2 or 3. For example, if the piece begins with GTTA, then e cannot be A, otherwise condition 1 is violated; further, e can only be G or C to satisfy condition 3.
  • In this way, suitable base pieces are screened out and recorded in the binary-DNA conversion mapping table, where the symbol "<-->" represents the mapping: the left side of the symbol is a binary sequence slice of length 8 (called an 8-binary sequence slice), and the right side is a base piece of length 5 (called a 5-base piece).
  • In step S20, converting the file into the form of DNA base sequence code and saving it specifically includes the following steps:
  • Through the splicing of oligonucleotides, a variety of existing technologies can already artificially synthesize specific DNA sequences; among them, chemical methods are mature and enzymatic synthesis methods are still developing.
  • the chemical method is divided into four steps: deprotection, coupling, capping (optional) and oxidation. It is characterized by early appearance and the use of toxic reagents. Enzymatic methods are relatively mild, less damaging to DNA, more accurate, and have fewer by-products.
  • DNA is closely related to biology and will not be eliminated by the times like other storage media.
  • the storage density of DNA is very high.
  • the storage density of the most compact hard disk in the world is only one thousandth of it.
  • 10 complete high-definition movies can be stored in the size of a grain of salt.
  • DNA is the core of biological research. As time goes by and technology matures, it will become more and more convenient to access data on DNA.
  • The information encoding method based on DNA storage proposed in this embodiment can convert information originally stored as binary into DNA information using construction conditions that are fast, efficient to encode and robust. The net information density (NID) reaches 1.60, the longest run of consecutive repeating bases (homopolymer) in all encoded DNA sequences is only 2 bases, and the GC content is strictly controlled between 40% and 60%.
  • sequence number of each step in the above embodiment does not mean the order of execution.
  • the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present invention.
  • an information encoding device based on DNA storage corresponds to the information encoding method based on DNA storage in the above embodiment.
  • the information encoding device based on DNA storage includes a module 10 for obtaining a binary sequence file, a module 20 for forming DNA information, and a module 30 for saving DNA information.
  • the detailed description of each functional module is as follows:
  • A DNA information forming module 20, which is used to divide the binary sequence in units of bytes to obtain binary sequence slices, map each binary sequence slice to a base piece based on the binary-DNA conversion mapping table, and then combine all the base pieces in sequence to form DNA information.
  • Each base piece meets the following conditions: its length is five; the number of G and C bases, t, satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle;
  • the DNA information saving module 30 is used to save the file in the form of the DNA information.
  • Each module in the above-mentioned DNA storage-based information encoding device can be implemented in whole or in part by software, hardware, and combinations thereof.
  • Each of the above modules may be embedded in or independent of the processor of the computer device in the form of hardware, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in Figure 9.
  • the computer device includes a processor, memory, network interface, and database connected through a system bus. Wherein, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes non-volatile media and internal memory. This non-volatile medium stores the operating system, computer programs and databases. This internal memory provides an environment for the execution of operating systems and computer programs in non-volatile media.
  • the computer device's database is used for data related to information encoding methods based on DNA storage.
  • the network interface of the computer device is used to communicate with external terminals through a network connection.
  • the computer program when executed by the processor, implements an information encoding method based on DNA storage.
  • A computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • When the processor executes the computer program, the information encoding method based on DNA storage in the above embodiment is implemented, for example steps S10 to S20 shown in Figure 2.
  • Alternatively, when the processor executes the computer program, the functions of each module/unit of the information encoding device based on DNA storage in the above embodiments are implemented, such as the functions of modules 10 to 20 shown in FIG. 8. To avoid repetition, they are not described again here.
  • A computer-readable medium is provided with a computer program stored thereon.
  • When the computer program is executed by a processor, the information encoding method based on DNA storage in the above embodiment is implemented, such as steps S10 to S20 shown in FIG. 2.
  • Alternatively, when the computer program is executed by the processor, it realizes the functions of each module/unit in the DNA storage-based information encoding device in the above device embodiment, such as the functions of modules 10 to 20 shown in Figure 8. To avoid repetition, they are not described again here.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
  • Module completion means dividing the internal structure of the device into different functional units or modules to complete all or part of the functions described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioethics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An information coding method and apparatus based on DNA storage, and a computer device and a medium. The method comprises: acquiring a file, which is stored in the form of a binary sequence (S10); on the basis of a binary-DNA conversion mapping table, segmenting the binary sequence in bytes to obtain binary sequence segments, mapping each binary sequence segment to a base segment, and then sequentially combining all the base segments to form DNA information, wherein each base segment satisfies the following composition conditions: the length is 5; the sum of the number of G bases and the number of C bases is t, which satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at a boundary; and there are no three consecutive repeated bases in the middle (S20); and saving the file in the form of DNA information (S30). The method can strictly guarantee that consecutive repeated bases in all coded DNA information have a maximum length of only 2 and a GC content that is between 0.4 and 0.6, an algorithm has linear complexity, and the net information density is 1.60.

Description

Information encoding method, device, computer equipment and medium based on DNA storage
Technical field
The present invention relates to the field of information storage technology, and in particular to an information encoding method, device, computer equipment and medium based on DNA storage.
Background
With the rapid development of information and digital technologies such as the Internet and artificial intelligence, the amount of information is growing exponentially, and traditional storage media such as magnetic disks, hard drives and flash memory are increasingly unable to meet worldwide data storage needs. DNA has natural advantages as a storage medium. First, its information density is high: according to an earlier estimate by Microsoft Research, 1 cubic millimeter of DNA can store 1 EB (exabyte, 10^18 bytes) of data. Second, it is long-lived and stable: under suitable conditions it can be stored for tens of thousands of years. Third, its storage energy consumption is very low.
DNA is a biological macromolecule with a defined sequence. DNA storage technology encodes information into base sequences and stores them in DNA. The DNA can then be copied and sequenced, and the information inside is read out by decoding. During these processes, DNA with long repetitive bases or extreme GC content is prone to errors in synthesis, replication and sequencing. How to quickly and efficiently avoid long repetitive bases or extreme GC content in DNA sequences after information encoding has become an urgent problem to be solved.
Summary of the invention
Embodiments of the present invention provide an information encoding method, device, computer equipment and medium based on DNA storage to solve the problem of how to avoid long repeated bases or extreme GC content in DNA sequences after information encoding.
An information encoding method based on DNA storage, including:
Obtaining a file stored in the form of a binary sequence;
Based on a binary-DNA conversion mapping table, dividing the binary sequence in units of bytes to obtain binary sequence slices, mapping each binary sequence slice to a base piece, and then combining all the base pieces in sequence to form DNA information, where each base piece satisfies the following composition conditions: its length is five; the number of G and C bases, t, satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle;
Saving the file in the form of DNA information.
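The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the actual binary-DNA conversion mapping table is not reproduced in this text, so the hypothetical `build_table` simply assigns byte values 0 to 255 to the first 256 valid 5-base pieces, enumerated in the assumed base order A, T, C, G. The names `is_valid_piece`, `build_table`, `encode` and `longest_run` are all illustrative.

```python
from itertools import product

BASES = "ATCG"  # enumeration order is an assumption of this sketch


def is_valid_piece(piece: str) -> bool:
    """Check the composition conditions for a 5-base piece abcde."""
    a, b, c, d, e = piece
    gc = sum(base in "GC" for base in piece)
    return (
        a != b                 # no repeated bases at the left boundary
        and d != e             # no repeated bases at the right boundary
        and not (b == c == d)  # no three identical bases in the middle
        and 2 <= gc <= 3       # 0.4 <= t/5 <= 0.6
    )


def build_table() -> list[str]:
    """Stand-in mapping table (assumption): byte i -> i-th valid piece."""
    candidates = ("".join(p) for p in product(BASES, repeat=5))
    return [p for p in candidates if is_valid_piece(p)][:256]


def encode(data: bytes, table: list[str]) -> str:
    """Map each byte to its 5-base piece and concatenate in order."""
    return "".join(table[byte] for byte in data)


def longest_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    best = run = 0
    prev = ""
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


table = build_table()
dna = encode(b"DNA", table)
print(dna, longest_run(dna))
```

Because every piece individually satisfies the boundary and GC conditions, any concatenation of pieces has runs of at most 2 identical bases and an overall GC content between 0.4 and 0.6, which is the guarantee the method relies on.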
Further, after all the base pieces are combined in sequence to form DNA information, the method also includes:
Obtaining a DNA information decoding request and, based on the request, obtaining the corresponding DNA information;
Based on the binary-DNA conversion mapping table, converting the DNA information into a binary sequence, so that the DNA information is converted back into a file for storage.
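Decoding is the reverse lookup. The sketch below assumes the same stand-in table as an encoder would use (the first 256 valid pieces in the assumed order A, T, C, G, since the patent's real table is not reproduced here); `build_table` and `decode` are hypothetical names.

```python
from itertools import product


def build_table() -> list[str]:
    """Stand-in table (assumption): byte i -> i-th valid 5-base piece."""
    pieces = []
    for p in product("ATCG", repeat=5):
        a, b, c, d, e = p
        gc = sum(x in "GC" for x in p)
        if a != b and d != e and not (b == c == d) and gc in (2, 3):
            pieces.append("".join(p))
    return pieces[:256]


def decode(dna: str, table: list[str]) -> bytes:
    """Split the DNA string into 5-base pieces and map each to one byte."""
    if len(dna) % 5 != 0:
        raise ValueError("DNA length must be a multiple of 5")
    reverse = {piece: value for value, piece in enumerate(table)}
    # A piece that is not in the table raises KeyError, flagging corruption.
    return bytes(reverse[dna[i:i + 5]] for i in range(0, len(dna), 5))


table = build_table()
dna = "".join(table[b] for b in b"DNA")  # encode three bytes
assert decode(dna, table) == b"DNA"      # round trip restores the bytes
```

Because only 256 of the valid pieces are used, a piece outside the table signals an error in synthesis or sequencing rather than silently decoding to wrong data.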
Further, after the file is saved in the form of DNA information, the method also includes:
Obtaining a scheduled task and, when the system time satisfies the scheduled task, updating the mapping relationship in the binary-DNA conversion mapping table;
or,
For the mapping relationship, establishing a public mapping relationship based on public documents and a private mapping relationship based on private documents.
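The text does not say how an updated or private mapping is derived. One possible approach, shown purely as an assumption, is to permute the valid base pieces with a seeded shuffle so that different seeds give different tables; `valid_pieces` and `keyed_table` are hypothetical names, and Python's `random` module is not cryptographically secure, so this illustrates the idea rather than a security mechanism.

```python
import random
from itertools import product


def valid_pieces() -> list[str]:
    """All 5-base pieces that satisfy the composition conditions."""
    out = []
    for p in product("ATCG", repeat=5):
        a, b, c, d, e = p
        gc = sum(x in "GC" for x in p)
        if a != b and d != e and not (b == c == d) and gc in (2, 3):
            out.append("".join(p))
    return out


def keyed_table(seed: int) -> list[str]:
    """Select and order 256 pieces with a seeded shuffle (illustrative)."""
    pieces = valid_pieces()
    random.Random(seed).shuffle(pieces)  # deterministic for a given seed
    return pieces[:256]


public_table = keyed_table(seed=0)      # shared openly
private_table = keyed_table(seed=1234)  # seed kept secret (hypothetical)
```

The encoder and decoder must use the same table (here, the same seed) for decoding to succeed; any table built this way still satisfies the repeat and GC guarantees because every candidate piece does.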
Further, before obtaining the file stored in the form of a binary sequence, the method also includes:
Based on the principle that the number of kinds of base pieces is greater than or equal to the number of kinds of binary sequence slices, and on convenience of conversion, determining that one byte carrying eight bits is used as the division unit of binary sequence slices;
According to the number of bits, 8, determining the base pieces corresponding to each of the 2^8 binary sequence slices.
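The counting argument behind this choice can be checked with a few lines of Python. The sketch finds the smallest piece length n with 4^n ≥ 2^8 and computes the net information density for the length of 5 that the method uses; the variable names are illustrative.

```python
BITS_PER_SLICE = 8  # one byte per binary sequence slice

# Smallest n such that there are at least as many n-base pieces (4^n)
# as there are distinct 8-bit slices (2^8).
n = 1
while 4 ** n < 2 ** BITS_PER_SLICE:
    n += 1
min_length = n  # 4, since 4^4 = 256 = 2^8

# The method uses length 5 so that the composition conditions can be met.
chosen_length = 5
nid = BITS_PER_SLICE / chosen_length  # 1.6 bits per base

print(min_length, nid)
```

Length 4 offers exactly 256 pieces, so every one would have to be used, including pieces with long repeats or extreme GC content; moving to length 5 leaves room to discard unsuitable pieces at the cost of lowering the density from 2 to 1.6 bits per base.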
Further, determining the base pieces corresponding to each of the 2^8 binary sequence slices includes:
Taking any one of the four bases as the first base of each base piece, and continuing to build the second base through the last base according to the composition conditions of the base piece;
Taking any one of the remaining three bases as the second base of each base piece, and repeating the step of continuing to build the second base through the last base according to the composition conditions, until the number of kinds of base pieces equals the number of kinds of binary sequence slices.
Further, converting the file into the form of a DNA base sequence code for storage includes:
synthesizing the DNA base sequence code into a DNA solution or dry powder by means of a DNA synthesis device, and storing it.
An information encoding apparatus based on DNA storage, including:
a binary sequence file acquisition module, used to obtain a file stored in the form of a binary sequence;
a DNA information forming module, used to divide the binary sequence into binary sequence slices in units of bytes based on the binary-DNA conversion mapping table, map each binary sequence slice to one base piece, and merge all the base pieces in order to form DNA information, where each base piece satisfies the following composition conditions: its length is five; the total number t of G and C bases satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle;
a DNA information saving module, used to save the file in the form of the DNA information.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the above information encoding method based on DNA storage when executing the computer program.
A computer-readable medium storing a computer program, where the computer program implements the above information encoding method based on DNA storage when executed by a processor.
The above information encoding method, apparatus, computer device and medium based on DNA storage convert a file into DNA information for storage through composition conditions with linear complexity. This strictly guarantees that, across all encoded DNA information, the longest run of consecutive repeated bases is only 2 and the GC content lies between 0.4 and 0.6. The encoding has linear complexity and a net information density (NID) of 1.60, which effectively safeguards the DNA synthesis, replication and sequencing workflow, improves DNA synthesis efficiency, raises the information density of storage, extends the time over which information can be stored stably, and reduces storage energy consumption.
Description of the Drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic diagram of the application environment of the information encoding method based on DNA storage in one embodiment of the present invention;
Figure 2 is a flow chart of the information encoding method based on DNA storage in one embodiment of the present invention;
Figure 3 is a first flow chart of the information encoding method based on DNA storage in another embodiment of the present invention;
Figure 4 is a schematic diagram of the full encoding-to-decoding process of the information encoding method based on DNA storage in another embodiment of the present invention;
Figure 5 is a second flow chart of the information encoding method based on DNA storage in another embodiment of the present invention;
Figure 6 is a third flow chart of the information encoding method based on DNA storage in another embodiment of the present invention;
Figure 7 is a fourth flow chart of the information encoding method based on DNA storage in another embodiment of the present invention;
Figure 8 is a schematic diagram of the information encoding apparatus based on DNA storage in one embodiment of the present invention;
Figure 9 is a schematic diagram of the computer device in one embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
The information encoding method based on DNA storage provided by the embodiments of the present invention can be applied in the application environment shown in Figure 1. The method is applied in an information encoding system based on DNA storage that includes a client and a server, where the client communicates with the server over a network. The client, also called the user end, is the program that corresponds to the server and provides local services to the user. Further, the client is a desktop program, an app on a smart device, or a third-party mini program embedded in another app. The client can be installed on, but is not limited to, computer devices such as personal computers, laptops, smartphones, tablets and portable wearable devices. The server can be implemented as a standalone server or as a cluster of multiple servers.
Traditional storage technology stores information as binary sequences (that is, sequences of 0s and 1s) on media such as hard disks, optical discs, USB flash drives and CDs; files such as music, pictures and videos are all binary sequences at the lowest level of a computer. Because of their mechanical properties, these media fail after a certain number of reads or writes. They also have low information density and large volume, and with the explosive growth of global information they can no longer meet storage needs.
In one embodiment, as shown in Figure 2, an information encoding method based on DNA storage is provided. The method is described as applied to the server in Figure 1 and includes the following steps:
S10. Obtain a file stored in the form of a binary sequence.
Here, the binary sequence is the numeric sequence encoding (consisting only of 0s and 1s) in which the file is saved in binary form.
Specifically, this embodiment can DNA-encode documents of any kind that are saved in binary, including speech, text, images and music; no specific limitation is imposed here.
S20. Based on the binary-DNA conversion mapping table, divide the binary sequence into binary sequence slices in units of bytes, map each binary sequence slice to one base piece, and merge all the base pieces in order to form DNA information. Each base piece satisfies the following composition conditions: its length is five; the total number t of G and C bases satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle.
Here, the binary-DNA conversion mapping table is a table that maps binary codes and DNA codes to each other. For example, one entry in the table is 00000000<-->ATACG, which means that data whose binary value is 00000000 is stored, after the mapping of this embodiment, in the DNA-encoded form ATACG.
The DNA base sequence code is the code obtained by merging all the base pieces, each of which is n bases long.
Specifically, DNA is built from four nitrogenous bases: adenine (A), guanine (G), thymine (T) and cytosine (C). This embodiment uses the binary-DNA conversion mapping table to convert a file saved as a binary sequence into DNA information formed by merging base pieces of n bases each, and saves it in that form.
This embodiment can use 8 bits, that is, one byte, as the unit of the binary sequence. For example, the continuous binary source code of a file can be split byte by byte into multiple binary sequence slices of 8 bits each; the binary sequence is then the sequence code formed by these 8-bit slices.
For example, the binary code of the file is split into groups of one byte (a binary sequence slice of 8 bits, of which there are 256 in total). Each group is mapped to one suitable base piece of 5 bases (there are likewise 256 such base pieces), and the base pieces are then merged to obtain the DNA information.
Decoding is simply the reverse of encoding: the DNA sequence code is split into groups of 5 bases, and each group is mapped to a binary sequence slice of 8 bits.
S30. Save the file in the form of DNA information.
Specifically, saving the file in the form of DNA information, that is, as a biological macromolecule, effectively extends the time over which the information can be stored stably and reduces storage energy consumption.
The information encoding method based on DNA storage provided in this embodiment converts a file into DNA information for storage through composition conditions with linear complexity. This strictly guarantees that, across all encoded DNA information, the longest run of consecutive repeated bases is only 2 and the GC content lies between 0.4 and 0.6. The encoding has linear complexity and a net information density (NID) of 1.60, which effectively safeguards the DNA synthesis, replication and sequencing workflow, improves DNA synthesis efficiency, raises the information density of storage, extends the time over which information can be stored stably, and reduces storage energy consumption.
In a specific embodiment, as shown in Figure 3, after step S20, that is, after all the base pieces are merged in order to form the DNA information, the method further includes the following steps:
S201. Obtain a DNA information decoding request and, based on that request, obtain the corresponding DNA information.
S202. Based on the binary-DNA conversion mapping table, convert the DNA information into a binary sequence, so that the DNA information can be restored to a file and saved.
Here, a DNA decoding request is a request to decode a file stored as DNA information into one stored as a binary sequence; decoding is the reverse of encoding.
Specifically, this embodiment again uses the binary-DNA conversion mapping table: the DNA information is split into base pieces, which are then mapped to a binary sequence. Preferably, the binary-DNA conversion mapping table includes at least one set of mapping relationships between base pieces and binary sequence slices, that is, the specific correspondence between the binary code of each binary sequence slice and the arrangement of the five bases of each base piece.
Further, the implementation of this embodiment is illustrated with the full process from encoding to decoding:
A. Encoding process:
Segmentation:
Split the binary sequence of the input information into groups of 8 bits. This always yields a whole number of groups, because 8 bits make one byte and files are stored in bytes, so the file size is always a whole multiple of a byte. Taking the input information in Figure 4 as an example, "01100011 10001101 01011011 10001110 10111011 00110111 01011000" is split into 7 groups.
Mapping:
Using the mapping relationships in the binary-DNA conversion mapping table, look up the 5-base piece corresponding to each binary sequence slice. For example, "01100011" maps to "TCGTA", and likewise for the others. This yields seven 5-base pieces.
Merging:
Merge the resulting base pieces to obtain the required DNA information. DNA can then be synthesized from this DNA information to serve as the information carrier.
B. Decoding process:
Segmentation:
Split the DNA information to be decoded into groups of 5 bases. This also always yields a whole number of groups, because encoding produced a whole number of groups. For example, "TCGTA CACTG TCTCT CACGA CGTCT AGTGC TCTAC" is split into 7 groups.
Mapping:
Using the mapping relationships in the binary-DNA conversion mapping table, look up the 8-binary sequence slice corresponding to each base piece. For example, "TCGTA" maps to "01100011", and likewise for the others. This yields seven 8-binary sequence slices.
Merging:
Merge the resulting binary sequence slices to obtain the decoded information.
This embodiment reads data from DNA used as the storage carrier and decodes it for subsequent processing; it is a fast, efficient and robust encoding and decoding scheme.
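By way of illustration only, the segmentation, mapping and merging steps above can be sketched in Python as follows. The table below is deliberately partial: it contains only the pair 00000000<-->ATACG quoted in the description and the seven pairs read off the Figure 4 example by position; the complete 256-entry table is given in the figures and is not reproduced here. The names `KNOWN_PAIRS`, `encode` and `decode` are hypothetical and are not part of the disclosed method.

```python
# Partial binary-DNA conversion mapping table (assumption: only the pairs
# quoted in the description and the Figure 4 example are listed here).
KNOWN_PAIRS = {
    0b00000000: "ATACG",
    0b01100011: "TCGTA",
    0b10001101: "CACTG",
    0b01011011: "TCTCT",
    0b10001110: "CACGA",
    0b10111011: "CGTCT",
    0b00110111: "AGTGC",
    0b01011000: "TCTAC",
}
PIECE_LEN = 5  # bases per base piece


def encode(data: bytes, table: dict) -> str:
    """Split the input into bytes, map each byte to a base piece, merge."""
    return "".join(table[byte] for byte in data)


def decode(dna: str, table: dict) -> bytes:
    """Split the DNA into 5-base pieces, map each back to a byte, merge."""
    if len(dna) % PIECE_LEN != 0:
        raise ValueError("DNA length must be a multiple of 5")
    inverse = {piece: byte for byte, piece in table.items()}
    pieces = (dna[i:i + PIECE_LEN] for i in range(0, len(dna), PIECE_LEN))
    return bytes(inverse[piece] for piece in pieces)


# The seven bytes of the Figure 4 example.
payload = bytes([0b01100011, 0b10001101, 0b01011011, 0b10001110,
                 0b10111011, 0b00110111, 0b01011000])
dna = encode(payload, KNOWN_PAIRS)
restored = decode(dna, KNOWN_PAIRS)
```

Each byte costs one dictionary lookup in either direction, which is why the work grows in proportion to the input length.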
In a specific embodiment, as shown in Figure 5, after step S30, that is, after the file is saved in the form of DNA information, the method further includes the following steps:
S3011. Obtain a scheduled task and, when the system time satisfies the scheduled task, update the mapping relationships in the binary-DNA conversion mapping table;
or,
S3012. Establish, for the mapping relationships, a public mapping relationship for public documents and a private mapping relationship for private documents.
Specifically, to strengthen the security and reliability of file storage, the method provided in this embodiment can periodically update the mapping relationships in the binary-DNA conversion mapping table, that is, change which arrangement of five bases each binary sequence slice is paired with. Alternatively, one set of mapping relationships can be published for general use with public documents, while a separate private set is established for documents that require security or confidentiality.
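One conceivable way to realize updated or private mapping relationships is to assign the admissible base pieces to byte values in an order derived from a key. The Python sketch below is an assumption for illustration, not the disclosed procedure: the function `build_table`, the key strings, and the use of a seeded shuffle are all hypothetical, and Python's `random` module is not a cryptographically secure generator.

```python
import itertools
import random

BASES = "ATCG"


def is_valid_piece(piece: str) -> bool:
    """Composition conditions for a 5-base piece, as stated in the description."""
    a, b, c, d, e = piece
    gc = sum(base in "GC" for base in piece)
    return a != b and d != e and not (b == c == d) and 2 <= gc <= 3


def build_table(key: str) -> dict:
    """Hypothetical helper: derive a byte -> base piece table from a key."""
    candidates = sorted(
        "".join(p) for p in itertools.product(BASES, repeat=5)
        if is_valid_piece("".join(p))
    )
    random.Random(key).shuffle(candidates)  # order depends only on the key
    return {byte: candidates[byte] for byte in range(256)}


public_table = build_table("public-v1")        # published for public documents
private_table = build_table("private-secret")  # kept for private documents
```

Under this sketch a scheduled update would amount to calling `build_table` with a new key; any table that has been used for encoding must be retained, since it is needed to decode the DNA information written with it.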
In a specific embodiment, as shown in Figure 6, before step S10, that is, before obtaining the file stored in the form of a binary sequence, the method further includes the following steps:
S101. Based on the principle that the number of distinct base pieces is greater than or equal to the number of distinct binary sequence slices, and on ease of conversion, determine that one byte carrying eight bits is used as the division unit of the binary sequence slices.
S102. According to the bit count of 8, determine the base pieces corresponding to each of the 2^8 binary sequence slices.
Specifically, to make the complexity of the algorithm linear, this application cuts the binary sequence of the file into binary sequence slices of a fixed length and maps each slice to a suitable DNA base piece. The principle that the number of distinct base pieces is greater than or equal to the number of distinct binary sequence slices means 4^n ≥ 2^8, from which 2n ≥ 8, that is, n ≥ 4. Here 2^8 is the total number of binary sequence slices. To save base piece resources, this embodiment takes the smallest workable value of n as the number of bases in a base piece.
For example, because files on a computer are stored in bytes and each byte is 8 bits, the length of a binary sequence slice can be fixed at 8, so there are 2^8 = 256 distinct binary sequence slices in total. The next task is to determine 256 suitable base pieces.
When one byte is the unit of the binary sequence slices, n must be at least 4. However, a large share of the base pieces of length 4 have long runs of consecutive repeated bases (for example, AAAT has 3 consecutive repeated bases) or extreme GC content (for example, the GC content of GCCT is 75%). A base sequence assembled from such pieces would have even longer runs of repeated bases, and its GC content could not be controlled. Therefore n cannot be 4 and should be at least 5.
The method provided in this embodiment converts the file into a binary sequence formed by multiple binary sequence slices, which allows the storage format to be converted quickly through the mapping relationships between binary sequence slices and base pieces. Preferably, these mapping relationships are recorded in the binary-DNA conversion mapping table, which includes at least one set of mapping relationships between base pieces and binary sequence slices. This embodiment can derive the required number of base pieces from the total number of binary sequence slices, which saves base piece storage resources.
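The counting argument for the piece length can be checked with a short Python sketch. It is illustrative only; the helper `gc_balanced` is a hypothetical name, and the sketch applies just the GC-content condition to the length-4 pieces, which is already enough to show that fewer than 256 remain.

```python
from itertools import product

BASES = "ATCG"
SLICE_BITS = 8

# Smallest n such that 4**n >= 2**SLICE_BITS.
min_len = 1
while 4 ** min_len < 2 ** SLICE_BITS:
    min_len += 1


def gc_balanced(piece: str) -> bool:
    """True when 0.4 <= t/len <= 0.6, written with integers: 2*len <= 5*t <= 3*len."""
    t = sum(base in "GC" for base in piece)
    return 2 * len(piece) <= 5 * t <= 3 * len(piece)


four_mers = ["".join(p) for p in product(BASES, repeat=4)]
balanced_four_mers = [p for p in four_mers if gc_balanced(p)]

print(min_len, len(four_mers), len(balanced_four_mers))  # prints: 4 256 96
```

With n = 4 there are exactly 256 pieces, so all of them would have to be used; once any quality condition is imposed, too few remain to cover the 256 byte values.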
In a specific embodiment, as shown in Figure 7, step S102, that is, determining the base pieces corresponding to each of the 2^8 binary sequence slices, includes the following steps:
S1021. Take any one of the four bases as the first base of each base piece, and build the second through the last base according to the composition conditions of the base piece.
S1022. Take any one of the remaining three bases as the second base of each base piece, and repeat the step of building the remaining bases up to the last base according to the composition conditions, until the number of distinct base pieces equals the number of distinct binary sequence slices.
Specifically, the description continues with n = 5 as the example. From all base pieces of length 5, 256 pieces suitable for use as the storage medium are selected.
Let abcde be a base piece of length 5, where a, b, c, d and e each take a value in the set {A, T, C, G}. With regard to the length of runs of repeated bases, they should satisfy the following conditions:
Condition 1: a ≠ b and d ≠ e;
Condition 2: b, c and d are not all the same.
Condition 1 says that there must be no repeated bases at the boundaries of a base piece, so that merging pieces cannot produce 3 or more consecutive repeated bases (for example, placing AATCG directly after TCGGA gives TCGGAAATCG, which contains the run AAA). Condition 2 says that 3 consecutive repeated bases must not occur inside a base piece, again to keep runs of repeated bases short. In addition, let t be the total number of G and C bases in the base piece. To control the GC content of the encoded base sequence, t should satisfy the following condition:
Condition 3: 0.4 ≤ t/5 ≤ 0.6
That is, the GC content of every base piece lies between 0.4 and 0.6, so the GC content of the encoded base sequence also lies in this interval. Solving Condition 3 gives
2 ≤ t ≤ 3,
which means that the total number of G and C bases in a base piece can only be 2 or 3.
With the above conditions, suitable base pieces can be selected.
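As a minimal sketch, Conditions 1 to 3 can be written as a single predicate in Python. The function name `is_valid_piece` is hypothetical; the conditions themselves are those stated above.

```python
def is_valid_piece(piece: str) -> bool:
    """Check Conditions 1 to 3 for a base piece abcde over {A, T, C, G}."""
    if len(piece) != 5 or any(base not in "ATCG" for base in piece):
        raise ValueError("expected exactly five bases from A, T, C, G")
    a, b, c, d, e = piece
    t = sum(base in "GC" for base in piece)  # number of G and C bases
    boundaries_ok = a != b and d != e        # Condition 1
    middle_ok = not (b == c == d)            # Condition 2
    gc_ok = 2 <= t <= 3                      # Condition 3: 0.4 <= t/5 <= 0.6
    return boundaries_ok and middle_ok and gc_ok


examples = {p: is_valid_piece(p) for p in ("ATACG", "AATCG", "GTTTA", "ATATA")}
```

Here ATACG passes all three conditions, AATCG fails Condition 1, GTTTA fails Condition 2, and ATATA fails Condition 3 because it contains no G or C.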
Start with a: it can be any one of {A, T, C, G}, giving four possibilities. Because b cannot equal a, b has only three possibilities; if a is A, then b cannot be A and must be T, C or G. There are four choices for c, because it sits in the middle of the base piece with only two bases before it. The value of d must be chosen with care: b, c and d must not all be the same, and the running total of G and C must be no less than 1 and no more than 3 (at this point it may be 1, 2 or 3, because one more base, e, is still to come). For example, if the first three bases are GTT, d cannot be T, since that would violate Condition 2; d may be A, C or G. Finally, the value of e also depends on the preceding bases: e cannot equal d, and the total number of G and C must end up as 2 or 3. For example, if the first four bases are GTTA, e cannot be A, since that would violate Condition 1; moreover, e can only be G or C in order to satisfy Condition 3. In this way the suitable base pieces are selected, giving the mapping relationships in the binary-DNA conversion mapping table shown below, where the symbol "<-->" denotes the mapping, with a binary sequence slice of length 8 (called an 8-binary sequence slice) on its left and a base piece of length 5 (called a 5-base piece) on its right:
[Figures PCTCN2022091100-appb-000001 through PCTCN2022091100-appb-000003: binary-DNA conversion mapping table (images not reproduced)]
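The position-by-position selection described above can be sketched as nested loops with early pruning. This Python sketch is illustrative only: `build_pieces` and `example_table` are hypothetical names, and the assignment of pieces to byte values at the end is an arbitrary example, not the mapping table shown in the figures.

```python
BASES = "ATCG"


def gc_count(seq: str) -> int:
    """Number of G and C bases in seq."""
    return sum(base in "GC" for base in seq)


def build_pieces() -> list:
    """Enumerate every 5-base piece that satisfies Conditions 1 to 3."""
    pieces = []
    for a in BASES:                       # four choices for a
        for b in BASES:
            if b == a:                    # Condition 1: a != b
                continue
            for c in BASES:               # c is unconstrained at this point
                for d in BASES:
                    if b == c == d:       # Condition 2
                        continue
                    if not 1 <= gc_count(a + b + c + d) <= 3:
                        continue          # e could no longer bring t to 2 or 3
                    for e in BASES:
                        if e == d:        # Condition 1: d != e
                            continue
                        piece = a + b + c + d + e
                        if 2 <= gc_count(piece) <= 3:  # Condition 3
                            pieces.append(piece)
    return pieces


pieces = build_pieces()
# Arbitrary example assignment of the first 256 pieces to the byte values.
example_table = dict(zip(range(256), pieces))
print(len(pieces))  # prints: 400
```

The enumeration yields 400 admissible pieces, more than the 256 needed, so the selection can stop as soon as the number of pieces equals the number of binary sequence slices.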
In a specific embodiment, step S20, that is, converting the file into the form of a DNA base sequence code for storage, includes the following step:
S21. Synthesize the DNA base sequence code into a DNA solution or dry powder by means of a DNA synthesis device, and store it.
Specifically, by assembling oligonucleotides, a variety of existing technologies can already synthesize a specified DNA sequence artificially; chemical synthesis is mature, and enzymatic synthesis is under development. The chemical method consists of four steps: deprotection, coupling, capping (optional) and oxidation. It appeared early and requires toxic reagents. The enzymatic method is comparatively mild, causes less damage to DNA, is more accurate and produces fewer by-products.
DNA is intrinsic to living organisms and will not become obsolete in the way other storage media do. The storage density of DNA is very high: the most compact hard disk in the world today reaches only one-thousandth of it. Using DNA to store data, 10 complete high-definition movies can be stored in the volume of a grain of salt. DNA is at the core of biological research, and as time passes and the technology matures, reading and writing data on DNA will become increasingly convenient.
The complexity of the method provided in this application is analyzed as follows. Let N be the length of the binary sequence of the input file. The number of steps required to carry out the encoding or decoding mapping is then a linear function of N, so both encoding and decoding have linear complexity. The method maps every 8 binary bits to 5 quaternary symbols (bases), giving a net information density of 8/5 = 1.60.
Further, the information encoding method based on DNA storage proposed in this embodiment can store information of any kind that was originally stored in binary into DNA information through fast, coding-efficient and robust composition conditions. The net information density (NID) reaches 1.60, the longest run of consecutive repeated bases (homopolymer) in all encoded DNA sequences is only 2 bases, and the GC content is strictly controlled between 40% and 60%.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present invention.
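The arithmetic behind these two statements can be written out as follows. This is a worked restatement rather than additional disclosure; the symbol N for the input length in bits and the comparison with the unconstrained maximum of log2(4) = 2 bits per base are added here for illustration.

```latex
% Steps: one table lookup per 8-bit slice, so the step count is linear in N.
T(N) = \frac{N}{8}\cdot O(1) = O(N)

% Net information density: 8 bits carried by every 5 bases.
\mathrm{NID} = \frac{8\ \text{bits}}{5\ \text{bases}} = 1.60\ \text{bits per base},
\qquad
\frac{\mathrm{NID}}{\log_2 4} = \frac{1.60}{2} = 0.80
```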
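The two sequence-level properties stated above can be checked on any encoded string. The Python sketch below is illustrative; the helper names `max_homopolymer` and `gc_fraction` are hypothetical, and the sample input is the 35-base sequence from the Figure 4 example.

```python
from itertools import groupby


def max_homopolymer(dna: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    return max(len(list(run)) for _, run in groupby(dna))


def gc_fraction(dna: str) -> float:
    """Share of G and C bases in the sequence."""
    return sum(base in "GC" for base in dna) / len(dna)


encoded = "TCGTACACTGTCTCTCACGACGTCTAGTGCTCTAC"  # Figure 4 example
longest_run = max_homopolymer(encoded)
gc = gc_fraction(encoded)
```

For this sample, 18 of the 35 bases are G or C, so the GC content is about 0.51, and no base is repeated consecutively.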
In one embodiment, an information encoding apparatus based on DNA storage is provided, which corresponds one-to-one with the information encoding method based on DNA storage in the above embodiments. As shown in Figure 8, the apparatus includes a binary sequence file acquisition module 10, a DNA information forming module 20 and a DNA information saving module 30. The functional modules are described in detail as follows:
the binary sequence file acquisition module 10 is used to obtain a file stored in the form of a binary sequence;
the DNA information forming module 20 is used to divide the binary sequence into binary sequence slices in units of bytes based on the binary-DNA conversion mapping table, map each binary sequence slice to one base piece, and merge all the base pieces in order to form DNA information, where each base piece satisfies the following composition conditions: its length is five; the total number t of G and C bases satisfies 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at either boundary; and there are no three consecutive repeated bases in the middle;
the DNA information saving module 30 is used to save the file in the form of the DNA information.
For the specific limitations of the information encoding apparatus based on DNA storage, refer to the limitations of the information encoding method based on DNA storage above; they are not repeated here. Each module of the apparatus can be implemented in whole or in part in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile medium and an internal memory. The non-volatile medium stores an operating system, a computer program, and the database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile medium. The database of the computer device is used to store data related to the DNA storage-based information encoding method. The network interface of the computer device is used to communicate with external terminals through a network connection. When executed by the processor, the computer program implements a DNA storage-based information encoding method.
In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the DNA storage-based information encoding method of the above embodiments, for example steps S10 to S20 shown in FIG. 2. Alternatively, when the processor executes the computer program, it implements the functions of the modules/units of the DNA storage-based information encoding device of the above embodiments, for example the functions of modules 10 to 20 shown in FIG. 8. To avoid repetition, details are not repeated here.
In one embodiment, a computer-readable medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the DNA storage-based information encoding method of the above embodiments, for example steps S10 to S20 shown in FIG. 2. Alternatively, when executed by the processor, the computer program implements the functions of the modules/units of the DNA storage-based information encoding device of the above device embodiments, for example the functions of modules 10 to 20 shown in FIG. 8. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or other media used in the embodiments of this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, only the above division into functional units and modules is used as an example. In practical applications, the above functions may be assigned to different functional units or modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to complete all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.

Claims (10)

  1. An information encoding method based on DNA storage, characterized in that it comprises:
    acquiring a file stored in the form of a binary sequence;
    based on a binary-DNA conversion mapping table, dividing the binary sequence into binary sequence slices in units of bytes, mapping each binary sequence slice to one base slice, and then merging all the base slices in order to form DNA information, where each base slice satisfies the following constituent conditions: its length is five; the total number of G bases and C bases is t, with 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at the boundary; and there are no three consecutive repeated bases in the middle;
    saving the file in the form of the DNA information.
  2. The information encoding method based on DNA storage according to claim 1, characterized in that, after the merging of all the base slices in order to form DNA information, the method further comprises:
    acquiring a DNA information decoding request and, based on the DNA information decoding request, acquiring the corresponding DNA information;
    based on the binary-DNA conversion mapping table, converting the DNA information into a binary sequence, so that the DNA information can be converted into a file and saved.
  3. The information encoding method based on DNA storage according to claim 1, characterized in that the binary-DNA conversion mapping table includes at least one set of mapping relationships between the base slices and the binary sequence slices.
  4. The information encoding method based on DNA storage according to claim 1, characterized in that, after the saving of the file in the form of the DNA information, the method further comprises:
    acquiring a scheduled task and, when the system time satisfies the scheduled task, updating the mapping relationship in the binary-DNA conversion mapping table;
    or,
    establishing, for the mapping relationship, a public mapping relationship based on public documents and a private mapping relationship based on private documents.
  5. The information encoding method based on DNA storage according to claim 1, characterized in that, before the acquiring of the file stored in the form of a binary sequence, the method further comprises:
    based on the principle that the number of types of base slices is greater than or equal to the number of types of binary sequence slices, and on the principle of convenient conversion, determining that one byte carrying eight bits is used as the division unit of the binary sequence slices;
    according to the number of bits, 8, determining the base slices respectively corresponding to the 2^8 (that is, 256) binary sequence slices.
  6. The information encoding method based on DNA storage according to claim 5, characterized in that the determining of the base slices respectively corresponding to the 2^8 binary sequence slices comprises:
    taking any one of the four bases as the first base of each base slice, and continuing to synthesize the second base through the last base according to the constituent conditions of the base slice;
    taking any one of the remaining three bases as the second base of each base slice, and repeating the step of continuing to synthesize the second base through the last base according to the constituent conditions, until the number of types of base slices equals the number of types of binary sequence slices.
  7. The information encoding method based on DNA storage according to claim 1, characterized in that the saving of the file in the form of the DNA information comprises:
    synthesizing the DNA base sequence code into a DNA solution or dry powder by means of a DNA synthesis device, and storing it.
  8. An information encoding device based on DNA storage, characterized in that it comprises:
    a binary sequence file acquisition module, configured to acquire a file stored in the form of a binary sequence;
    a DNA information forming module, configured to, based on a binary-DNA conversion mapping table, divide the binary sequence into binary sequence slices in units of bytes, map each binary sequence slice to one base slice, and then merge all the base slices in order to form DNA information, where each base slice satisfies the following constituent conditions: its length is five; the total number of G bases and C bases is t, with 0.4 ≤ t/5 ≤ 0.6; there are no repeated bases at the boundary; and there are no three consecutive repeated bases in the middle;
    a DNA information saving module, configured to save the file in the form of the DNA information.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the information encoding method based on DNA storage according to any one of claims 1 to 7.
  10. A computer-readable medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the information encoding method based on DNA storage according to any one of claims 1 to 7.
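
For illustration only, and not as part of the claims, the stepwise table construction described in claim 6 and the decoding described in claim 2 might be sketched in Python as below. The sketch assumes that "no repeated bases at the boundary" means the first two bases differ and the last two bases differ, and that "the middle" refers to the second through fourth bases. The order in which slices are assigned to byte values is a hypothetical choice, as are the names `extend`, `build_table`, `encode` and `decode`.

```python
BASES = "ACGT"
SLICE_LEN = 5


def extend(prefix: str, found: list[str], limit: int) -> None:
    """Grow a slice one base at a time, pruning prefixes that break a condition."""
    if len(found) >= limit:
        return  # enough slice types: one per binary sequence slice type
    if len(prefix) == SLICE_LEN:
        if sum(b in "GC" for b in prefix) in (2, 3):  # 0.4 <= t/5 <= 0.6
            found.append(prefix)
        return
    for base in BASES:
        cand = prefix + base
        n = len(cand)
        if n == 2 and cand[0] == cand[1]:             # assumed boundary rule (start)
            continue
        if n == 4 and cand[1] == cand[2] == cand[3]:  # no triple in the middle
            continue
        if n == 5 and cand[3] == cand[4]:             # assumed boundary rule (end)
            continue
        extend(cand, found, limit)


def build_table(limit: int = 256) -> dict[int, str]:
    """Hypothetical table: byte value i -> i-th slice found by the extension."""
    found: list[str] = []
    extend("", found, limit)
    if len(found) < limit:
        raise ValueError("not enough valid base slices")
    return dict(enumerate(found))


def encode(data: bytes, table: dict[int, str]) -> str:
    """Map each byte to its base slice and merge the slices in order."""
    return "".join(table[byte] for byte in data)


def decode(dna: str, table: dict[int, str]) -> bytes:
    """Split the DNA information into five-base slices and invert the table."""
    if len(dna) % SLICE_LEN:
        raise ValueError("DNA information length must be a multiple of five")
    inverse = {base_slice: value for value, base_slice in table.items()}
    return bytes(inverse[dna[i:i + SLICE_LEN]] for i in range(0, len(dna), SLICE_LEN))


if __name__ == "__main__":
    table = build_table()
    dna = encode(b"hello", table)
    print(dna, decode(dna, table))
```

Because every base slice has the same fixed length, decoding needs no delimiters: the DNA information is cut every five bases and each slice is looked up in the inverted table. A slice that is not in the table raises `KeyError` in this sketch; how such errors should be handled is not specified in the text above.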
PCT/CN2022/091100 2022-04-23 2022-05-06 Information coding method and apparatus based on dna storage, and computer device and medium WO2023201782A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210432503.0 2022-04-23
CN202210432503.0A CN114974434A (en) 2022-04-23 2022-04-23 Information coding method and device based on DNA storage, computer equipment and medium

Publications (1)

Publication Number Publication Date
WO2023201782A1 true WO2023201782A1 (en) 2023-10-26

Family

ID=82979489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/091100 WO2023201782A1 (en) 2022-04-23 2022-05-06 Information coding method and apparatus based on dna storage, and computer device and medium

Country Status (2)

Country Link
CN (1) CN114974434A (en)
WO (1) WO2023201782A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190130280A1 (en) * 2017-09-01 2019-05-02 Seagate Technology Llc Timing recovery for dna storage
CN111443869A (en) * 2020-03-24 2020-07-24 中国科学院长春应用化学研究所 File storage method, device, equipment and computer readable storage medium
CN112382340A (en) * 2020-11-25 2021-02-19 中国科学院深圳先进技术研究院 Coding and decoding method and coding and decoding device for binary information to base sequence for DNA data storage
CN112711935A (en) * 2020-12-11 2021-04-27 中国科学院深圳先进技术研究院 Encoding method, decoding method, apparatus and computer readable storage medium
CN113539370A (en) * 2021-06-29 2021-10-22 中国科学院深圳先进技术研究院 Encoding method, decoding method, device, terminal device and readable storage medium
CN113744804A (en) * 2021-06-21 2021-12-03 深圳先进技术研究院 Method and device for storing data by using DNA and storage equipment
CN114254748A (en) * 2021-09-27 2022-03-29 清华大学 Extended coding method, system and related device for storage channel

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KUN BI, GU WANJUN, LU ZUHONG: "Coding algorithms in DNA storage", SHENGWU XINXIXUE - CHINESE JOURNAL OF BIOINFORMATICS, HARBIN GONGYE DAXUE, CN, vol. 18, no. 2, 20 April 2020 (2020-04-20), CN , pages 76 - 85, XP093101073, ISSN: 1672-5565, DOI: 10.12113/202003002 *

Also Published As

Publication number Publication date
CN114974434A (en) 2022-08-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22938017

Country of ref document: EP

Kind code of ref document: A1