WO2019037117A1 - Encoding and decoding method, device and data processing device - Google Patents

Encoding and decoding method, device and data processing device

Info

Publication number
WO2019037117A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
information
nucleic acid
sequence
binary code
Prior art date
Application number
PCT/CN2017/099152
Other languages
French (fr)
Chinese (zh)
Inventor
杨焕明
刘斯奇
汪建
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院
Priority to PCT/CN2017/099152
Priority to CN201780094012.7A (CN111095423B)
Publication of WO2019037117A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/40: Encryption of genetic data

Definitions

  • the present invention relates to the field of data processing technologies, and in particular, to an encoding method, an encoding device, a decoding method, a decoding device, a data processing device, and a computer readable storage medium.
  • the related technology mainly uses the secret key to convert the plaintext of the information into meaningless ciphertext to achieve the encryption effect.
  • the inventors have found that the above-mentioned related art has the following problems: complicated and cumbersome calculations are performed on the information using only predetermined mathematical methods, resulting in low encryption efficiency and low security; and the existing methods for storing information in DNA require DNA synthesizers and sequencers, which are expensive, and are time-consuming and labor-intensive to operate. The inventors have proposed a solution to at least one of the above problems.
  • An object of the present invention is to provide an encoding technology solution with high encryption efficiency and high security, and another object of the present invention is to provide an information storage solution which is simple in operation and low in price.
  • an encoding method comprising: digitizing information to generate sequence data; dividing the sequence data into N data segments, N being an integer greater than 1; for each data segment, searching a gene database for a corresponding nucleic acid fragment and using the position information of the nucleic acid fragment in the gene database as the identifier of that data segment; and generating a sequence code according to the identifiers corresponding to the data segments.
  • the digitizing process is to transcode the binary code corresponding to the information to generate the sequence data.
  • sequence data is data consisting of four deoxyribonucleotides of adenine A, cytosine C, guanine G, and thymine T.
  • 0 in the binary code is converted to A or T, and 1 is converted to C or G to generate the sequence data.
  • 01 in the binary code is converted to A, 00 is converted to T, 11 is converted to C, and 10 is converted to G to generate the sequence data.
  • the sequence data is a binary code corresponding to the information.
  • nucleic acid fragments in the gene database are transcoded into binary code prior to the searching step.
  • A or T in the gene database is converted to binary code 0, and C or G is converted to binary code 1.
  • A in the gene database is converted to binary code 01, T to binary code 00, C to binary code 11, and G to binary code 10.
  • the identifier comprises position information of the first symbol and the last symbol of the nucleic acid fragment in the gene database.
  • the identifier comprises positional information of a first symbol of the nucleic acid fragment in the gene database, and a length of the nucleic acid fragment.
  • the genetic database comprises one or more animal and/or plant and/or microbial genomic data.
  • the gene database comprises wild type genomic data and/or synthetic genomic data.
  • the gene database comprises human genomic data.
  • a decoding method including: obtaining an identifier corresponding to each data segment from encoded data, the encoded data being a sequence code generated by the encoding method according to any of the above embodiments; obtaining position information corresponding to each data segment according to the identifier; obtaining a corresponding nucleic acid fragment from the gene database according to the position information; generating sequence data based on the nucleic acid fragments; and obtaining information based on the sequence data.
  • an encoding apparatus including: an information digitizing module, configured to digitize information to generate sequence data; a data identifier determining module, connected to the information digitizing module and configured to divide the sequence data into N data segments, N being an integer greater than 1, to search the gene database for a corresponding nucleic acid fragment for each data segment, and to use the position information of the nucleic acid fragment in the gene database as the identifier of that data segment; and a code generation module, connected to the data identifier determining module and configured to generate a sequence code according to the identifiers corresponding to the data segments.
  • the data identifier determining module performs further data division on any data segment for which no corresponding nucleic acid fragment is found in the gene database, obtains M data segments, and searches the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.
  • the information digitizing module transcodes the binary code corresponding to the information to generate the sequence data.
  • sequence data is data consisting of four deoxyribonucleotides of adenine A, cytosine C, guanine G, and thymine T.
  • the information digitizing module converts 0 in the binary code to A or T, and 1 to C or G, to generate the sequence data.
  • the information digitizing module converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G to generate the sequence data.
  • the sequence data is a binary code corresponding to the information.
  • the apparatus further includes a genetic data transcoding module, which is connected to the information digitizing module and to the data identifier determining module and is configured to transcode all the nucleic acid fragments in the gene database into binary code.
  • the genetic data transcoding module converts A or T in the gene database into binary code 0, and C or G into binary code 1.
  • the genetic data transcoding module converts A in the gene database into binary code 01, T into binary code 00, C into binary code 11, and G into binary code 10.
  • the identifier comprises position information of the first symbol and the last symbol of the nucleic acid fragment in the gene database.
  • the identifier includes position information of the first symbol of the nucleic acid fragment in the gene database, and the length of the nucleic acid fragment.
  • the information is at least one of text information, picture information, audio information, or video information.
  • the genetic database comprises one or more animal and/or plant and/or microbial genomic data.
  • the gene database comprises wild type genomic data and/or synthetic genomic data.
  • the gene database comprises human genomic data.
  • a decoding apparatus including: a data identifier obtaining module, configured to acquire an identifier corresponding to each data segment from encoded data, the encoded data being a sequence code generated by the encoding method according to any one of the foregoing embodiments or by the encoding device according to any one of the above embodiments; and a sequence obtaining module, connected to the data identifier obtaining module and configured to acquire position information corresponding to each data segment according to the identifier.
  • a data processing apparatus comprising: a memory and a processor coupled to the memory, the processor being configured to perform the encoding method or the decoding method in any of the above embodiments based on instructions stored in the memory.
  • a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements an encoding method or a decoding method in any of the above embodiments.
  • An advantage of the present invention is that the information is encrypted by matching the sequence data of the information to be encrypted to the nucleic acid fragments in the gene database and encoding the corresponding position information as a sequence. Utilizing the ultra-high storage density of nucleic acids and a unique intermolecular recognition mechanism, encryption can be completed without complicated and cumbersome mathematical calculation of information, thereby improving encryption efficiency and security.
  • Another advantage of the present invention is that using it for information storage eliminates the need for expensive DNA synthesizers and sequencers; only a computer with the relevant programs for encoding and decoding information is required to store information in nucleotide sequences. The sequences include wild-type or synthetic genomes of humans or other species, so storage capacity is not limited and an unlimited amount of information can be stored.
  • Figure 1 shows a flow chart of one embodiment of the encoding method of the present invention.
  • Fig. 2 shows a schematic diagram of one embodiment of the encoding/decoding method of the present invention.
  • Figure 3 shows a flow chart of one embodiment of the decoding method of the present invention.
  • Fig. 4 is a block diagram showing an embodiment of an encoding apparatus of the present invention.
  • Fig. 5 is a block diagram showing an embodiment of a decoding apparatus of the present invention.
  • Fig. 6 is a block diagram showing an embodiment of a data processing device of the present invention.
  • Figure 1 shows a flow chart of one embodiment of the encoding method of the present invention.
  • in step 110, the information is digitized to generate sequence data.
  • the digitizing process can include converting the information to a binary code.
  • the binary code is transcoded to generate sequence data, which may be a series of data arranged in order.
  • the information may be in any form such as text information, image information, or audio information.
  • Fig. 2 shows a schematic diagram of one embodiment of the encoding/decoding method of the present invention.
  • the information to be processed is text information 21, "What I cannot create, I do not understand. Look deep into nature, and then you will understand everything better.”
  • text information 21 is first converted into binary code 22; each 0 in this binary code can then be converted to A or T, and each 1 to C or G, to generate sequence data 23 "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA".
  • A, C, G, and T correspond to adenine, cytosine, guanine, and thymine in DNA (deoxyribonucleic acid), respectively.
  • Other forms of sequence data may also be generated according to other conversion modes between 1, 0 and A, C, G, and T.
  • the sequence data can be a binary code corresponding to the information.
  • all gene segments in the gene database need to be transcoded into binary code so that the binary code corresponding to the information can be found in the transformed gene database.
  • any form of information can be mapped to the data stored in DNA, thereby linking the information with the genetic database, providing the necessary technical basis for the encryption of information. Further, the following steps can be used to encrypt and store information.
  • in step 120, the sequence data is divided into N data segments, N being an integer greater than one.
  • in step 130, for each data segment, the corresponding nucleic acid fragment is looked up in the gene database, and the position information of the nucleic acid fragment in the gene database is used as the identifier of that data segment.
  • the position information of the first symbol of the nucleic acid fragment that matches the data fragment and the length of the nucleic acid fragment may also be saved as an identification of the data fragment.
  • a nucleic acid fragment refers to a fragment formed by a plurality of nucleotides linked end to end, and the nucleotide may be a deoxyribonucleotide or a ribonucleotide.
  • the nucleic acid fragment can be transcoded into a binary code according to certain rules as needed, and the nucleic acid fragment after transcoding refers to the binary code corresponding to the nucleic acid fragment.
  • the length of the nucleic acid fragment can be expressed by the number of nucleotides, that is, "nt"; each nucleotide is regarded as one character in the present invention, and the number of nucleotides can also be expressed by the number of characters.
  • when the nucleic acid fragment has been transcoded into binary code in this way, the length of the nucleic acid fragment can be expressed in bytes.
  • the size of N can be adjusted according to the results of searching for nucleic acid fragments in the gene database.
  • if no nucleic acid fragment corresponding to a data segment is found in the gene database, the sequence data can be re-divided to obtain M data segments, M being an integer greater than one, and the gene database is searched for a nucleic acid fragment corresponding to each of the M data segments.
  • the re-divided data segments are shorter than the original data segments, so that corresponding nucleic acid fragments can be found in the gene database.
  • for example, a data segment for which no corresponding nucleic acid fragment is found in the gene database can be divided into multiple parts, and the gene database is searched separately for a nucleic acid fragment corresponding to each part, which improves the probability of a match and the efficiency of the search.
  • the gene database 24 may be the nucleotide sequence of the human nuclear pore-reporting protein gene (SEQ ID NO: 1), a database containing 4103 characters.
  • the sequence data 23 is divided into a plurality of data segments, each of which contains 2 characters. The same nucleic acid fragment as each data fragment is looked up in the gene database 24.
  • the position corresponding to the first character in the nucleic acid fragment and the length of the nucleic acid fragment are recorded as an identifier.
  • the data segment composed of the first two characters, AG, of the sequence data corresponds to the identifier 3856 2; that is, AG corresponds to the nucleic acid fragment of length 2 characters that starts at the 3856th character of the gene database.
  • when no identical nucleic acid fragment is found for a two-character data segment, the length of the data segment is reduced to 1 character and an identical nucleic acid fragment is looked up in the gene database 24.
  • for example, the new data segment A consists of the third character alone.
  • the identifier of this data segment is 3827 1; that is, A corresponds to the nucleic acid fragment of length 1 character at the 3827th character of the gene database.
  • in step 140, a sequence code is generated based on the identifiers of the respective data segments.
  • the identifiers of the data segments can be stored in order to obtain the sequence code corresponding to the information.
  • the gene database 24 in the embodiment shown in FIG. 2 described above has a small capacity, so the divided data segments are also relatively short; the embodiment serves only to illustrate how the method is carried out.
  • a gene database storing a large number of gene sequences can be used as a database of encoding methods.
  • the sequence code consisting of these identifiers only contains the identifier of each data segment, which not only can realize information encryption, but also can improve storage efficiency.
  • the encoded data composed of the sequence encoding can be decoded by the inverse of the above steps.
  • Figure 3 shows a flow chart of one embodiment of the decoding method of the present invention.
  • in step 310, an identifier corresponding to each data segment is obtained from the encoded data.
  • the encoded data may be the sequence code 25 in FIG.
  • in step 320, location information corresponding to each data segment is obtained according to the identifier.
  • a corresponding nucleic acid fragment is obtained from the gene database based on the location information.
  • the identifier 3856 2 in the sequence code 25 in FIG. 2 represents a nucleic acid fragment of the gene database 24 starting with the 3856th character and having a length of 2 characters.
  • in step 340, sequence data is generated from the nucleic acid fragments.
  • the acquired gene fragments can be combined to obtain the sequence data 23 "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA" in FIG.
  • the sequence data 23 is transcoded into a binary code 22 "01010111011010000110000101110100" according to the transcoding relationship between A, C, G, T and 1, 0 employed in encoding.
  • in step 350, information is obtained from the sequence data.
  • the binary code 22 can be decoded into text information 21 "What I cannot create, I do not understand", thereby completing the decryption.
  • the sequence data of the information to be encrypted is matched to gene segments in the gene database, and the corresponding position information is used as the sequence code, thereby encrypting the information.
  • Fig. 4 is a block diagram showing an embodiment of an encoding apparatus of the present invention.
  • the apparatus includes an information digitization module 41, a data identification determination module 42, and an encoding generation module 43.
  • the information digitization module 41 digitizes the information to generate sequence data.
  • the information digitization module 41 transcodes the binary code corresponding to the information to generate sequence data. For example, the information digitization module 41 converts 0 in the binary code to A or T and 1 to C or G to generate sequence data, or converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G to generate sequence data.
  • the sequence data is data composed of A, C, G, and T.
  • the apparatus further includes a genetic data transcoding module 44.
  • the gene data transcoding module 44 transcodes all of the nucleic acid fragments in the gene database into a binary code.
  • the data identification determining module 42 divides the sequence data into N data segments, N being an integer greater than 1; for each data segment, it searches for a corresponding nucleic acid fragment in the gene database and uses the position information of the nucleic acid fragment in the gene database as the identifier of that data segment.
  • the identification may include positional information of the first symbol and the last symbol of the nucleic acid fragment in the gene database, or the identification may include positional information of the first symbol of the nucleic acid fragment in the gene database, and the length of the nucleic acid fragment.
  • the data identification determining module 42 performs further data division on any data segment for which no corresponding nucleic acid fragment is found in the gene database, obtains M data segments, and searches the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than one.
  • the code generation module 43 generates a sequence code based on the identifier corresponding to each data segment. For example, the sequence encoding may be sequentially generated in the order in which the respective data segments are divided.
  • Fig. 5 is a block diagram showing an embodiment of a decoding apparatus of the present invention.
  • the apparatus includes: a data identifier acquisition module 51, a sequence acquisition module 52, and an information generation module 53.
  • the data identifier obtaining module 51 acquires an identifier corresponding to each data segment from the encoded data, and the encoded data is a sequence encoding generated by the encoding method in the above embodiment or by the encoding device in the above embodiment.
  • the sequence obtaining module 52 acquires location information corresponding to each data segment according to the identifier, and acquires a corresponding nucleic acid fragment from the gene database according to the location information.
  • the information generating module 53 generates sequence data based on the nucleic acid fragments and acquires information based on the sequence data.
  • the sequence data of the information to be encrypted is matched to gene segments in the gene database, and the corresponding position information is used as the sequence code, thereby encrypting the information.
  • Fig. 6 is a block diagram showing an embodiment of a data processing device of the present invention.
  • the apparatus 6 of this embodiment includes a memory 61 and a processor 62 coupled to the memory 61, the processor 62 being configured to perform the encoding method or decoding method of any one of the embodiments of the present invention based on instructions stored in the memory 61.
  • the memory 61 may include, for example, a system memory, a fixed non-volatile storage medium, or the like.
  • the system memory stores, for example, an operating system, an application, a boot loader, a database, and other programs.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer usable program code.
  • the methods and systems of the present invention may be implemented in a number of ways.
  • the methods and systems of the present invention can be implemented in software, hardware, firmware, or any combination of software, hardware, and firmware.
  • the above-described sequence of steps for the method is for illustrative purposes only, and the steps of the method of the present invention are not limited to the order specifically described above unless otherwise specifically stated.
  • the invention may also be embodied as a program recorded in a recording medium, the program comprising machine readable instructions for implementing the method according to the invention.
  • the invention also covers a recording medium storing a program for performing the method according to the invention.
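The Fig. 2 walk-through above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: SEQ ID NO: 1 is not reproduced here, so `TOY_DB` is a made-up nine-character stand-in, and the function names, the 1-based positions, the ASCII text encoding, and the "try 2 characters, then fall back to 1" search order are all illustrative choices. The quoted binary code 22 and sequence data 23 are 32 symbols long, which corresponds to the first four characters, "What", of text information 21.

```python
BINARY_22 = "01010111011010000110000101110100"    # binary code 22 as quoted in the text
SEQUENCE_23 = "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA"  # sequence data 23 as quoted in the text
TOY_DB = "TTAGCTGGA"  # hypothetical stand-in for gene database 24 (not SEQ ID NO: 1)


def text_to_bits(text: str) -> str:
    """Digitize ASCII text as a string of 0/1 characters, 8 bits per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bases_to_bits(bases: str) -> str:
    """Undo the one-bit transcoding: A or T -> 0, C or G -> 1."""
    return "".join("0" if base in "AT" else "1" for base in bases)


def encode(bases: str, database: str, seg_len: int = 2) -> list[tuple[int, int]]:
    """Return (1-based start, length) identifiers for consecutive data segments.

    Each segment is first tried at seg_len characters; if the database has no
    identical fragment, the segment is shortened one character at a time.
    """
    identifiers = []
    i = 0
    while i < len(bases):
        for n in range(min(seg_len, len(bases) - i), 0, -1):
            pos = database.find(bases[i:i + n])
            if pos != -1:
                identifiers.append((pos + 1, n))
                i += n
                break
        else:
            raise LookupError(f"{bases[i]!r} does not occur in the database")
    return identifiers


def decode(identifiers: list[tuple[int, int]], database: str) -> str:
    """Rebuild the sequence data by reading each fragment back from the database."""
    return "".join(database[start - 1:start - 1 + n] for start, n in identifiers)
```

With the toy database, `encode(SEQUENCE_23, TOY_DB)` begins with `(3, 2)` for AG followed by `(3, 1)` for the lone A, mirroring the "3856 2" and "3827 1" pattern of the example, and `decode` restores the sequence data exactly.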

Abstract

The present invention discloses an encoding and decoding method, a device and a data processing device relating to the technical field of data processing. The encoding method comprises: performing digital processing on information to generate sequence data (110); dividing the sequence data into N data segments (120), wherein N is an integer greater than 1; searching a genetic database for a nucleic acid segment, corresponding to each data segment, and using position information of the nucleic acid segment in the genetic database as an identifier of each data segment (130); and generating a sequence encoding according to the identifier corresponding to each data segment (140). The method and device can be used to increase the efficiency and security of encryption.

Description

Encoding/decoding method, device and data processing device
Technical field
The present invention relates to the field of data processing technologies, and in particular to an encoding method, an encoding device, a decoding method, a decoding device, a data processing device, and a computer-readable storage medium.
Background art
With the rapid development of information technology, digitized information, composed of digital codes and carried over the Internet or other transmission channels, has been widely applied in all aspects of human social life. Protecting the security of digitized information has therefore become more important, especially in special fields such as the military, commerce, and medicine.
As an important technical means of ensuring the security of digitized information, digital encryption technology has received increasing attention. The related art mainly uses a secret key to convert the plaintext of the information into meaningless ciphertext to achieve encryption.
So far, methods for storing information using DNA all require: 1) a computer with programs for encoding and decoding information, which stores the information as "computer language" (0 or 1, the binary code of digital information) and then converts it into "biological language" (the nucleotides A, T, C, and G in DNA sequences); 2) a DNA synthesizer, for storing the "biological language" information in vitro or in vivo; and 3) a DNA sequencer, which, after the "biological language" information is obtained, converts it back into "computer language" for further storage. Although this is a fully usable system, the instruments used in steps 2) and 3) are very expensive, and the whole workflow is time-consuming and labor-intensive, so it cannot be widely used.
Summary of the invention
The inventors have found that the above-mentioned related art has the following problems: complicated and cumbersome calculations are performed on the information using only predetermined mathematical methods, resulting in low encryption efficiency and low security; and the existing methods for storing information in DNA require DNA synthesizers and sequencers, which are expensive, and are time-consuming and labor-intensive to operate. The inventors have proposed a solution to at least one of the above problems.
An object of the present invention is to provide an encoding solution with high encryption efficiency and high security, and another object of the present invention is to provide an information storage solution that is simple to operate and inexpensive.
According to an embodiment of the present invention, there is provided an encoding method comprising: digitizing information to generate sequence data; dividing the sequence data into N data segments, N being an integer greater than 1; for each data segment, searching a gene database for a corresponding nucleic acid fragment and using the position information of the nucleic acid fragment in the gene database as the identifier of that data segment; and generating a sequence code according to the identifiers corresponding to the data segments.
Optionally, for a data segment for which no corresponding nucleic acid fragment is found in the gene database, further data division is performed to obtain M data segments, and the gene database is searched for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.
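The optional re-division step can be sketched as a recursive search. This is only an illustration: the function name `find_identifiers`, splitting into two halves (that is, M = 2), 1-based positions, and modelling the gene database as a plain Python string are assumptions not stated in the specification.

```python
def find_identifiers(segment: str, database: str) -> list[tuple[int, int]]:
    """Return (1-based start, length) identifiers that together cover `segment`.

    If the whole segment has no identical fragment in the database, it is
    divided into two halves and each half is searched for separately.
    """
    pos = database.find(segment)
    if pos != -1:
        return [(pos + 1, len(segment))]
    if len(segment) == 1:
        # A single symbol that never occurs cannot be divided any further.
        raise LookupError(f"{segment!r} does not occur in the database")
    mid = len(segment) // 2
    return (find_identifiers(segment[:mid], database)
            + find_identifiers(segment[mid:], database))
```

For a toy database `"TTAGCA"`, the segment `"AG"` is found directly, while `"AC"` is not and is covered by one identifier for `"A"` and one for `"C"`.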
Optionally, the digitizing process is to transcode the binary code corresponding to the information to generate the sequence data.
Optionally, the sequence data is data consisting of the four deoxyribonucleotides adenine A, cytosine C, guanine G, and thymine T.
Optionally, 0 in the binary code is converted to A or T, and 1 is converted to C or G, to generate the sequence data.
Optionally, 01 in the binary code is converted to A, 00 to T, 11 to C, and 10 to G, to generate the sequence data.
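The two transcoding options can be sketched as lookup tables in Python. The mappings themselves come from the text; the function names are illustrative, and because the text does not say how to choose between A and T (or between C and G) in the one-bit option, the random choice below is an assumption.

```python
import random

# Two-bit mapping stated in the text: 01 -> A, 00 -> T, 11 -> C, 10 -> G.
TWO_BIT = {"01": "A", "00": "T", "11": "C", "10": "G"}
# One-bit mapping stated in the text: 0 -> A or T, 1 -> C or G.
ONE_BIT = {"0": "AT", "1": "CG"}


def bits_to_bases_two_bit(bits: str) -> str:
    """Map each pair of bits to one base; needs an even number of bits."""
    if len(bits) % 2:
        raise ValueError("two-bit transcoding needs an even number of bits")
    return "".join(TWO_BIT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bits_to_bases_one_bit(bits: str, rng: random.Random) -> str:
    """Map each bit to one of its two allowed bases (choice made by rng)."""
    return "".join(rng.choice(ONE_BIT[bit]) for bit in bits)


def bases_to_bits_one_bit(bases: str) -> str:
    """Inverse of the one-bit mapping: A or T -> 0, C or G -> 1."""
    return "".join("0" if base in "AT" else "1" for base in bases)
```

The one-bit option yields one base per bit and leaves a free choice at every position, whereas the two-bit option is deterministic and yields half as many bases. In both cases the inverse mapping is unambiguous, so decoding does not need to know which choices were made.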
Optionally, the sequence data is a binary code corresponding to the information.
Optionally, before the searching step, all nucleic acid fragments in the gene database are transcoded into binary code.
Optionally, A or T in the gene database is converted to binary code 0, and C or G is converted to binary code 1.
Optionally, A in the gene database is converted to binary code 01, T to binary code 00, C to binary code 11, and G to binary code 10.
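When the sequence data stays in binary form, the database is transcoded instead and the search runs on bit strings. A minimal sketch, assuming the two-bit mapping, a database held as a Python string, and identifiers expressed as a 1-based bit position plus a length in bits (the specification does not fix these units; both function names are illustrative):

```python
# Two-bit mapping stated in the text: A -> 01, T -> 00, C -> 11, G -> 10.
BASE_TO_BITS = {"A": "01", "T": "00", "C": "11", "G": "10"}


def transcode_database(database: str) -> str:
    """Transcode every nucleotide of the database into its two-bit code."""
    return "".join(BASE_TO_BITS[base] for base in database)


def find_bits(bits: str, db_bits: str):
    """Return (1-based bit position, length in bits), or None if not found."""
    pos = db_bits.find(bits)
    return None if pos == -1 else (pos + 1, len(bits))
```

One consequence of searching at the bit level is that a match may begin in the middle of a nucleotide's two-bit code: `"0011"` is found at bit 3 of the transcoded `"ATCG"`, which is aligned to the second nucleotide, but other patterns can match at even bit positions that straddle two nucleotides.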
Optionally, the identifier comprises position information of the first symbol and the last symbol of the nucleic acid fragment in the gene database.
Optionally, the identifier comprises position information of the first symbol of the nucleic acid fragment in the gene database, and the length of the nucleic acid fragment.
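The two identifier forms carry the same information and can be converted into one another. A short sketch, assuming 1-based inclusive positions (consistent with phrases such as "the 3856th character"); the type and function names are hypothetical:

```python
from typing import NamedTuple


class SpanId(NamedTuple):
    """Identifier holding the positions of the first and last symbol (1-based)."""
    first: int
    last: int


class StartLenId(NamedTuple):
    """Identifier holding the position of the first symbol and the length."""
    start: int
    length: int


def to_start_len(ident: SpanId) -> StartLenId:
    return StartLenId(ident.first, ident.last - ident.first + 1)


def to_span(ident: StartLenId) -> SpanId:
    return SpanId(ident.start, ident.start + ident.length - 1)


def fetch(database: str, ident: StartLenId) -> str:
    """Read the identified fragment back out of the database."""
    return database[ident.start - 1:ident.start - 1 + ident.length]
```

For short fragments the start-plus-length form is usually the more compact of the two, since the length is a small number while both positions in the first form can be large.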
可选地,所述基因数据库包括一个或多个动物和/或植物和/或微生物基因组数据。Optionally, the genetic database comprises one or more animal and/or plant and/or microbial genomic data.
可选地,所述基因数据库包括野生型基因组数据和/或合成型基因组数据。Optionally, the gene database comprises wild type genomic data and/or synthetic genomic data.
可选地,所述基因数据库包括人类基因组数据。Optionally, the gene database comprises human genomic data.
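The two-bit conversion listed among the options above (01 to A, 00 to T, 11 to C, 10 to G, and the reverse for the gene database) can be sketched as follows. This is only an illustrative sketch: the function names and the requirement that the bit string have even length are assumptions, not part of the specification.

```python
# Two-bit mapping stated in the text: 01 -> A, 00 -> T, 11 -> C, 10 -> G.
BITS_TO_BASE = {"01": "A", "00": "T", "11": "C", "10": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bits_to_bases(bits: str) -> str:
    """Convert a bit string to nucleotides, two bits per nucleotide."""
    if len(bits) % 2 != 0:
        # Assumption: the caller pads or rejects odd-length input.
        raise ValueError("bit string length must be even for the two-bit mapping")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bases_to_bits(bases: str) -> str:
    """Convert nucleotides back to bits (the gene database direction)."""
    return "".join(BASE_TO_BITS[base] for base in bases)
```

Because each two-bit pair maps to exactly one nucleotide, this mapping is reversible without any extra bookkeeping, unlike the one-bit mapping in which 0 may become either A or T.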
According to another embodiment of the present invention, a decoding method is provided, comprising: obtaining, from encoded data, the identifier corresponding to each data segment, the encoded data being a sequence code generated by the encoding method of any of the above embodiments; obtaining the position information corresponding to each data segment according to the identifier; obtaining the corresponding nucleic acid fragment from the gene database according to the position information; generating sequence data from the nucleic acid fragments; and obtaining the information from the sequence data.
According to yet another embodiment of the present invention, an encoding apparatus is provided, comprising: an information digitization module configured to digitize information to generate sequence data; a data identifier determination module, connected to the information digitization module and configured to divide the sequence data into N data segments, N being an integer greater than 1, to search the gene database for the nucleic acid fragment corresponding to each data segment, and to use the position information of the nucleic acid fragment in the gene database as the identifier of that data segment; and a code generation module, connected to the data identifier determination module and configured to generate a sequence code from the identifiers of the data segments.
Optionally, for a data segment for which no corresponding nucleic acid fragment is found in the gene database, the data identifier determination module divides the data further to obtain M data segments and searches the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.
Optionally, the information digitization module transcodes the binary code corresponding to the information to generate the sequence data.
Optionally, the sequence data consists of the four deoxyribonucleotides adenine (A), cytosine (C), guanine (G), and thymine (T).
Optionally, the information digitization module converts each 0 in the binary code to A or T, and each 1 to C or G, to generate the sequence data.
Optionally, the information digitization module converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G to generate the sequence data.
Optionally, the sequence data is the binary code corresponding to the information.
Optionally, the apparatus further comprises a gene data transcoding module, connected to the information digitization module and to the data identifier determination module, and configured to transcode all nucleic acid fragments in the gene database into binary code.
Optionally, the gene data transcoding module converts A or T in the gene database to binary code 0, and C or G to binary code 1.
Optionally, the gene data transcoding module converts A in the gene database to binary code 01, T to binary code 00, C to binary code 11, and G to binary code 10.
Optionally, the identifier comprises the position information, in the gene database, of the first symbol and the last symbol of the nucleic acid fragment.
Optionally, the identifier comprises the position information of the first symbol of the nucleic acid fragment in the gene database and the length of the nucleic acid fragment.
Optionally, the information is at least one of text information, picture information, audio information, or video information.
Optionally, the gene database comprises genomic data of one or more animals, plants, and/or microorganisms.
Optionally, the gene database comprises wild-type genomic data and/or synthetic genomic data.
Optionally, the gene database comprises human genomic data.
According to yet another embodiment of the present invention, a decoding apparatus is provided, comprising: a data identifier acquisition module configured to obtain, from encoded data, the identifier corresponding to each data segment, the encoded data being a sequence code generated by the encoding method or the encoding apparatus of any of the above embodiments; a sequence acquisition module, connected to the data identifier acquisition module and configured to obtain the position information corresponding to each data segment according to the identifier and to obtain the corresponding nucleic acid fragment from the gene database according to the position information; and an information generation module, connected to the sequence acquisition module and configured to generate sequence data from the nucleic acid fragments and to obtain the information from the sequence data.
According to a further embodiment of the present invention, a data processing apparatus is provided, comprising a memory and a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the encoding method or decoding method of any of the above embodiments.
According to a further embodiment of the present invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the encoding method or decoding method of any of the above embodiments.
One advantage of the present invention is that the sequence data of the information to be encrypted is matched to nucleic acid fragments in a gene database, and the corresponding position information is used as the sequence code, thereby encrypting the information. By exploiting the ultra-high storage density of nucleic acids and their unique intermolecular recognition mechanism, encryption is completed without complicated mathematical computation on the information, which improves encryption efficiency and security.
Another advantage of the present invention is that storing information with the invention does not require an expensive DNA synthesizer or sequencer. A computer with the relevant programs for encoding and decoding information is enough to store information in nucleotide sequences, including wild-type or synthetic genomes of humans or other species; the storage capacity is not limited, allowing an unlimited amount of information to be stored.
Description of the drawings
The accompanying drawings, which form a part of the specification, illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the invention.
The present invention can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
FIG. 1 shows a flowchart of one embodiment of the encoding method of the present invention.
FIG. 2 shows a schematic diagram of one embodiment of the encoding/decoding method of the present invention.
FIG. 3 shows a flowchart of one embodiment of the decoding method of the present invention.
FIG. 4 shows a structural diagram of one embodiment of the encoding apparatus of the present invention.
FIG. 5 shows a structural diagram of one embodiment of the decoding apparatus of the present invention.
FIG. 6 shows a structural diagram of one embodiment of the data processing apparatus of the present invention.
Detailed description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. Note that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
It should also be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to scale.
The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the invention, its application, or its uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific value should be interpreted as merely exemplary and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
Note that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing it need not be discussed further in subsequent drawings.
FIG. 1 shows a flowchart of one embodiment of the encoding method of the present invention.
As shown in FIG. 1, in step 110 the information is digitized to generate sequence data.
In one embodiment, the digitization may include converting the information into binary code. The binary code is then transcoded to generate the sequence data, which may be a series of data arranged in order. The information may take any form, such as text, image, or audio information. The specific steps of the encoding method of the present invention are described below using text information as an example.
FIG. 2 shows a schematic diagram of one embodiment of the encoding/decoding method of the present invention.
As shown in FIG. 2, the information to be processed is text information 21: "What I cannot create, I do not understand. Look deep into nature, and then you will understand everything better." This text is converted into binary code 22, "01010111011010000110000101110100……". Each 0 in this binary code can be converted to A or T, and each 1 to C or G, to generate sequence data 23, "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA……". A, C, G, and T correspond to adenine, cytosine, guanine, and thymine in DNA (deoxyribonucleic acid), respectively. Other forms of sequence data can also be generated with other conversion schemes between 1, 0 and A, C, G, T.
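The digitization step of FIG. 2 can be illustrated with a short Python sketch. It assumes 8-bit UTF-8/ASCII encoding of the text (which is consistent with the bit string quoted for "What") and chooses between A/T and between C/G at random; the specification does not say how that choice is made, so the random choice and all function names are assumptions.

```python
import random

# One-bit mapping from the text: 0 -> A or T, 1 -> C or G.
BIT_TO_BASES = {"0": "AT", "1": "CG"}
BASE_TO_BIT = {"A": "0", "T": "0", "C": "1", "G": "1"}


def text_to_bits(text: str) -> str:
    """Encode text as UTF-8 and write each byte as 8 bits (assumed encoding)."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_sequence(bits: str, rng: random.Random) -> str:
    """Map each bit to one of its two allowed nucleotides (choice is arbitrary)."""
    return "".join(rng.choice(BIT_TO_BASES[bit]) for bit in bits)


def sequence_to_bits(sequence: str) -> str:
    """Inverse mapping: A/T -> 0, C/G -> 1."""
    return "".join(BASE_TO_BIT[base] for base in sequence)


bits = text_to_bits("What")
sequence = bits_to_sequence(bits, random.Random(0))
```

Whichever nucleotide is chosen for each bit, mapping the sequence back with `sequence_to_bits` recovers the same bit string, so the sequence quoted in FIG. 2 is one of many valid encodings of the same text.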
In another embodiment, the sequence data may be the binary code corresponding to the information. In this case, all gene fragments in the gene database need to be transcoded into binary code so that the binary code corresponding to the information can be found in the transformed gene database.
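As a rough illustration of this variant, the sketch below transcodes a nucleotide string into binary code (A or T to 0, C or G to 1) and searches it for a bit string. The toy genome, the function names, and the use of 1-based positions are assumptions made for the example.

```python
from typing import Optional

BASE_TO_BIT = {"A": "0", "T": "0", "C": "1", "G": "1"}


def transcode_database(genome: str) -> str:
    """Transcode every nucleotide of the database into one bit."""
    return "".join(BASE_TO_BIT[base] for base in genome)


def find_bits(binary_db: str, segment: str) -> Optional[int]:
    """Return the 1-based position of the first match, or None if absent."""
    index = binary_db.find(segment)
    return None if index < 0 else index + 1


toy_genome = "ATGCCGTA"  # hypothetical stand-in for a real gene database
binary_db = transcode_database(toy_genome)
```

With the database held as bits, the information's binary code can be searched directly, without first converting it to nucleotides.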
Through the above steps, information in any form can be mapped to data stored in DNA form, which links the information to the gene database and provides the technical basis for encrypting it. Encryption and storage of the information can then be carried out through the following steps.
In step 120, the sequence data is divided into N data segments, N being an integer greater than 1.
In step 130, for each data segment, the gene database is searched for the corresponding nucleic acid fragment, and the position information of that nucleic acid fragment in the gene database is used as the identifier of the data segment. Alternatively, the position of the first symbol of the nucleic acid fragment matching the data segment, together with the length of the nucleic acid fragment, can be stored as the identifier of the data segment.
A nucleic acid fragment is a fragment formed by multiple nucleotides joined end to end; the nucleotides may be deoxyribonucleotides or ribonucleotides. As needed, a nucleic acid fragment can be transcoded into binary code according to certain rules, in which case the transcoded nucleic acid fragment refers to the binary code corresponding to that fragment.
The length of a nucleic acid fragment can be expressed as a number of nucleotides ("nt"). In the present invention each nucleotide is treated as one character, so the number of nucleotides can also be expressed as a number of characters. When a nucleic acid fragment has been transcoded into binary code as described above, its length is expressed in bytes.
In the above two steps, the larger the value of N, the higher the storage efficiency of the code, but the lower the probability of finding the corresponding nucleic acid fragment in the gene database. The value of N can therefore be adjusted according to how the search for nucleic acid fragments in the gene database turns out.
In one embodiment, when no nucleic acid fragment corresponding to a data segment is found in the gene database, the sequence data can be re-divided into M data segments, and the gene database is searched for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.
The re-divided data segments are shorter than the original data segment, so that nucleic acid fragments corresponding to them can be found in the gene database. For example, a data segment for which no corresponding nucleic acid fragment can be found may be divided into several parts, and the gene database is searched for the nucleic acid fragment corresponding to each part separately, which improves the match probability and the search efficiency.
In one embodiment, as shown in FIG. 2, the gene database 24 may be the nucleotide sequence of the human nuclear pore complex protein gene (SEQ ID NO: 1), a database of 4103 characters in total. The sequence data 23 is divided into multiple data segments of 2 characters each, and the gene database 24 is searched for a nucleic acid fragment identical to each data segment.
If an identical nucleic acid fragment is found, the position of its first character and its length are recorded as the identifier. For example, the data segment formed by the first two characters of the sequence data, AG, has the identifier 3856 2; that is, AG corresponds to the nucleic acid fragment 2 characters long that starts at character 3856 of the gene database.
If no identical nucleic acid fragment is found, the length of the data segment is reduced to 1 character and the gene database 24 is searched again. For example, the data segment AC, formed by the 3rd and 4th characters of the sequence data, has no identical nucleic acid fragment in the gene database 24, so the 3rd character alone forms a new data segment, A. Its identifier is 3827 1; that is, A corresponds to the nucleic acid fragment 1 character long that starts at character 3827 of the gene database.
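The search-with-fallback procedure of steps 120 and 130 can be sketched as below. This is a minimal sketch under stated assumptions: the database is a plain string, positions are 1-based, the first occurrence is always used (the specification does not say which occurrence is chosen), and the function name, `segment_length` parameter, and toy database are hypothetical.

```python
from typing import List, Tuple


def encode_sequence(sequence: str, database: str,
                    segment_length: int = 2) -> List[Tuple[int, int]]:
    """Return (position, length) identifiers, shortening segments on a miss."""
    identifiers = []
    i = 0
    while i < len(sequence):
        length = min(segment_length, len(sequence) - i)
        while length >= 1:
            index = database.find(sequence[i:i + length])
            if index >= 0:
                # Record the 1-based start position and the matched length.
                identifiers.append((index + 1, length))
                break
            length -= 1  # fall back to a shorter data segment
        else:
            raise ValueError(f"{sequence[i]!r} does not occur in the database")
        i += length
    return identifiers


toy_database = "TTAGCA"  # hypothetical stand-in for gene database 24
identifiers = encode_sequence("AGACTG", toy_database)
```

In this toy run the segment AG is found as a whole, while AC, CT, and TG are not present in the toy database and are each encoded one character at a time, mirroring the fallback described for FIG. 2.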
In step 140, the sequence code is generated from the identifiers of the data segments. As shown in FIG. 2, storing the identifiers of the data segments in order yields the sequence code 25 corresponding to the information: "3856 2 3827 1 3856 1 1313 1 3275 1 1079 1 3906 1 1078 1 3856 2 853 1 949 1 3229 1 2600 1 3755 1 2496 1 714 1 2518 1 2736 1 1713 1 1789 1 1291 1 2153 1 3601 2 1159 1 537 1 2660 1 1962 1 375 1 892 1 1309 1 2620 1 2736 1……". Alternatively, an index field can be added to each data segment to indicate the order in which it was generated, and the data segments can then be stored in the sequence code in any order.
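Both storage options in step 140 can be illustrated with a small sketch. The space-separated layout follows the sequence code shown in FIG. 2; the tuple layout `(order, position, length)` for the order-tagged variant and the function names are assumptions for illustration only.

```python
import random
from typing import List, Tuple

Identifier = Tuple[int, int]  # (position, length)


def to_sequence_code(identifiers: List[Identifier]) -> str:
    """Store identifiers in order as space-separated 'position length' pairs."""
    return " ".join(f"{position} {length}" for position, length in identifiers)


def to_tagged_code(identifiers: List[Identifier],
                   rng: random.Random) -> List[Tuple[int, int, int]]:
    """Attach an order index to each identifier, then store in arbitrary order."""
    tagged = [(order, position, length)
              for order, (position, length) in enumerate(identifiers)]
    rng.shuffle(tagged)
    return tagged


def from_tagged_code(tagged: List[Tuple[int, int, int]]) -> List[Identifier]:
    """Restore the original order from the order index."""
    return [(position, length) for _, position, length in sorted(tagged)]
```

The order-tagged form costs one extra field per data segment, but it lets the identifiers be stored or transmitted in any order and still be reassembled correctly.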
The gene database 24 in the embodiment of FIG. 2 is small, so the data segments are also short; the embodiment serves only to illustrate how the method is carried out. In practice, a gene database storing a very large number of gene sequences can be used, such as wild-type or synthetic human genome data, bacterial genome data, or a combined database of genome data from multiple species. Such gene databases contain billions of nucleotides and can support searches for data segments tens or even hundreds of bits long, which are then encoded with short identifiers. The sequence code formed from these identifiers contains only the identifiers of the data segments, which both encrypts the information and improves storage efficiency.
The encoded data formed by the sequence code can be decoded by reversing the above steps.
FIG. 3 shows a flowchart of one embodiment of the decoding method of the present invention.
As shown in FIG. 3, in step 310 the identifier corresponding to each data segment is obtained from the encoded data. For example, the encoded data may be the sequence code 25 in FIG. 2.
In step 320, the position information corresponding to each data segment is obtained according to the identifier.
In step 330, the corresponding nucleic acid fragment is obtained from the gene database according to the position information. For example, the identifier 3856 2 in the sequence code 25 of FIG. 2 denotes the nucleic acid fragment in the gene database 24 that starts at character 3856 and is 2 characters long.
In step 340, sequence data is generated from the nucleic acid fragments.
For example, the retrieved gene fragments can be concatenated to obtain the sequence data 23 in FIG. 2, "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA……". According to the transcoding relationship between A, C, G, T and 1, 0 that was used during encoding, the sequence data 23 is transcoded into binary code 22, "01010111011010000110000101110100……".
In step 350, the information is obtained from the sequence data. For example, binary code 22 can be decoded into text information 21, "What I cannot create, I do not understand", which completes the decryption.
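Steps 310 to 350 can be sketched end to end as follows. The sketch assumes the space-separated "position length" layout of sequence code 25, 1-based positions, the one-bit mapping (A/T to 0, C/G to 1), and 8-bit UTF-8 text; the function names and the toy database (the 32 characters of sequence data 23 quoted above, used here in place of a real gene database) are illustrative assumptions.

```python
from typing import List, Tuple

BASE_TO_BIT = {"A": "0", "T": "0", "C": "1", "G": "1"}


def parse_sequence_code(code: str) -> List[Tuple[int, int]]:
    """Step 310/320: split the code into (position, length) identifiers."""
    numbers = [int(token) for token in code.split()]
    if len(numbers) % 2 != 0:
        raise ValueError("sequence code must contain position/length pairs")
    return list(zip(numbers[0::2], numbers[1::2]))


def decode(code: str, database: str) -> str:
    """Steps 330-350: fetch fragments, rebuild the bits, decode the text."""
    sequence = "".join(database[position - 1:position - 1 + length]
                       for position, length in parse_sequence_code(code))
    bits = "".join(BASE_TO_BIT[base] for base in sequence)
    if len(bits) % 8 != 0:
        raise ValueError("decoded bit string is not a whole number of bytes")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


# Toy database: the first 32 characters of sequence data 23 from FIG. 2.
toy_database = "AGACTGGCAGCTCTTTTGGTTTAGAGCGACTA"
```

Decoding needs the same gene database and the same transcoding rule that were used for encoding; without them the position/length pairs carry no recoverable content.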
In the above embodiments, the sequence data of the information to be encrypted is matched to gene fragments in the gene database, and the corresponding position information is used as the sequence code, thereby encrypting the information. By exploiting the ultra-high storage density of genes and their unique intermolecular recognition mechanism, encryption is completed without complicated mathematical computation on the information, which improves encryption efficiency and security.
Moreover, storing information with the above embodiments does not require an expensive DNA synthesizer or sequencer. A computer with the relevant programs for encoding and decoding information is enough to store information in nucleotide sequences, including wild-type or synthetic genomes of humans or other species; the storage capacity is not limited, allowing an unlimited amount of information to be stored.
FIG. 4 shows a structural diagram of one embodiment of the encoding apparatus of the present invention.
As shown in FIG. 4, the apparatus includes an information digitization module 41, a data identifier determination module 42, and a code generation module 43.
The information digitization module 41 digitizes the information to generate sequence data.
In one embodiment, the information digitization module 41 transcodes the binary code corresponding to the information to generate the sequence data. For example, the information digitization module 41 converts each 0 in the binary code to A or T and each 1 to C or G, or converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G, to generate the sequence data. In this case the sequence data consists of A, C, G, and T.
In another embodiment, the apparatus further includes a gene data transcoding module 44. When the sequence data is the binary code corresponding to the information, the gene data transcoding module 44 transcodes all nucleic acid fragments in the gene database into binary code.
The data identifier determination module 42 divides the sequence data into N data segments, N being an integer greater than 1, searches the gene database for the nucleic acid fragment corresponding to each data segment, and uses the position information of the nucleic acid fragment in the gene database as the identifier of that data segment. For example, the identifier may include the positions of the first and last symbols of the nucleic acid fragment in the gene database, or it may include the position of the first symbol of the nucleic acid fragment in the gene database together with the length of the nucleic acid fragment.
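The two identifier forms carry the same information and can be converted into each other. The sketch below assumes 1-based positions with an inclusive last position; that convention and the function names are assumptions, since the specification does not define them.

```python
from typing import Tuple


def to_start_end(start: int, length: int) -> Tuple[int, int]:
    """(first position, length) -> (first position, last position), inclusive."""
    return (start, start + length - 1)


def to_start_length(start: int, end: int) -> Tuple[int, int]:
    """(first position, last position) -> (first position, length)."""
    return (start, end - start + 1)
```

Under this convention the identifier 3856 2 from FIG. 2 would be written as first position 3856 and last position 3857 in the other form.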
In one embodiment, for a data segment for which no corresponding nucleic acid fragment is found in the gene database, the data identifier determination module 42 divides the data further to obtain M data segments and searches the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.
The code generation module 43 generates the sequence code from the identifiers of the data segments. For example, the sequence code can be generated in the order in which the data segments were divided.
FIG. 5 shows a structural diagram of one embodiment of the decoding apparatus of the present invention.
As shown in FIG. 5, the apparatus includes a data identifier acquisition module 51, a sequence acquisition module 52, and an information generation module 53.
The data identifier acquisition module 51 obtains the identifier corresponding to each data segment from the encoded data, the encoded data being a sequence code generated by the encoding method or the encoding apparatus of the above embodiments.
The sequence acquisition module 52 obtains the position information corresponding to each data segment according to the identifier, and obtains the corresponding nucleic acid fragment from the gene database according to the position information.
The information generation module 53 generates sequence data from the nucleic acid fragments and obtains the information from the sequence data.
In the above embodiments, the sequence data of the information to be encrypted is matched to gene fragments in the gene database, and the corresponding position information is used as the sequence code, thereby encrypting the information. By exploiting the ultra-high storage density of genes and their unique intermolecular recognition mechanism, encryption is completed without complicated mathematical computation on the information, which improves encryption efficiency and security.
In the above embodiments, storing information does not require an expensive DNA synthesizer or sequencer. A computer with the relevant programs for encoding and decoding information is enough to store information in nucleotide sequences, including wild-type or synthetic genomes of humans or other species; the storage capacity is not limited, allowing an unlimited amount of information to be stored.
FIG. 6 shows a structural diagram of one embodiment of the data processing apparatus of the present invention.
As shown in FIG. 6, the apparatus 6 of this embodiment includes a memory 61 and a processor 62 coupled to the memory 61; the processor 62 is configured to execute, based on instructions stored in the memory 61, the encoding method or decoding method of any embodiment of the present invention.
The memory 61 may include, for example, system memory and fixed non-volatile storage media. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The encoding/decoding method, apparatus, and data processing apparatus according to the present invention have thus been described in detail. Some details well known in the art are not described, in order to avoid obscuring the concept of the present invention. From the above description, those skilled in the art can fully understand how to implement the technical solutions disclosed herein.
The method and system of the present invention may be implemented in many ways, for example in software, hardware, firmware, or any combination of software, hardware, and firmware. The order of the method steps given above is for illustration only; the steps of the method of the present invention are not limited to the order specifically described above unless otherwise stated. In addition, in some embodiments the invention may be implemented as programs recorded in a recording medium, the programs comprising machine-readable instructions for implementing the method according to the invention. The invention therefore also covers a recording medium storing a program for performing the method according to the invention.
Although some specific embodiments of the present invention have been described in detail by way of example, those skilled in the art should understand that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art should also understand that the above embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (36)

  1. An encoding method, comprising:
    digitizing information to generate sequence data;
    dividing the sequence data into N data segments, N being an integer greater than 1;
    for each data segment, searching a gene database for a corresponding nucleic acid fragment, and using position information of the nucleic acid fragment in the gene database as an identifier of that data segment; and
    generating a sequence code according to the identifiers corresponding to the respective data segments.
  2. The encoding method according to claim 1, wherein the searching step comprises:
    for a data segment for which no corresponding nucleic acid fragment is found in the gene database, further dividing that data segment to obtain M data segments, and searching the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.
  3. The encoding method according to claim 1, wherein the digitizing comprises transcoding a binary code corresponding to the information to generate the sequence data.
  4. The encoding method according to claim 3, wherein the sequence data is data composed of the four deoxyribonucleotides adenine (A), cytosine (C), guanine (G), and thymine (T).
  5. The encoding method according to claim 4, wherein 0 in the binary code is converted to A or T, and 1 is converted to C or G, to generate the sequence data.
  6. The encoding method according to claim 4, wherein 01 in the binary code is converted to A, 00 is converted to T, 11 is converted to C, and 10 is converted to G, to generate the sequence data.
  7. The encoding method according to claim 1, wherein the sequence data is a binary code corresponding to the information.
  8. The encoding method according to claim 7, further comprising, before the searching step:
    transcoding all nucleic acid fragments in the gene database into binary code.
  9. The encoding method according to claim 8, wherein A or T in the gene database is converted to binary code 0, and C or G is converted to binary code 1.
  10. The encoding method according to claim 8, wherein A in the gene database is converted to binary code 01, T is converted to binary code 00, C is converted to binary code 11, and G is converted to binary code 10.
  11. The encoding method according to any one of claims 1 to 10, wherein the identifier comprises position information, in the gene database, of the first symbol and the last symbol of the nucleic acid fragment.
  12. The encoding method according to any one of claims 1 to 10, wherein the identifier comprises position information, in the gene database, of the first symbol of the nucleic acid fragment, and the length of the nucleic acid fragment.
  13. The encoding method according to any one of claims 1 to 10, wherein the information is at least one of text information, picture information, audio information, or video information.
  14. The encoding method according to any one of claims 1 to 10, wherein the gene database comprises genomic data of one or more animals and/or plants and/or microorganisms.
  15. The encoding method according to claim 14, wherein the gene database comprises wild-type genomic data and/or synthetic genomic data.
  16. The encoding method according to claim 15, wherein the gene database comprises human genome data.
  17. A decoding method, comprising:
    obtaining, from encoded data, an identifier corresponding to each data segment, the encoded data being a sequence code generated by the encoding method according to any one of claims 1 to 16;
    obtaining position information corresponding to each data segment according to the identifier;
    obtaining the corresponding nucleic acid fragment from the gene database according to the position information;
    generating sequence data according to the nucleic acid fragments; and
    obtaining the information according to the sequence data.
  18. An encoding device, comprising:
    an information digitizing module, configured to digitize information to generate sequence data;
    a data identifier determining module, connected to the information digitizing module and configured to divide the sequence data into N data segments, N being an integer greater than 1, to search, for each data segment, a gene database for a corresponding nucleic acid fragment, and to use position information of the nucleic acid fragment in the gene database as an identifier of that data segment; and
    a code generating module, connected to the data identifier determining module and configured to generate a sequence code according to the identifiers corresponding to the respective data segments.
  19. The encoding device according to claim 18, wherein
    the data identifier determining module, for a data segment for which no corresponding nucleic acid fragment is found in the gene database, further divides that data segment to obtain M data segments, and searches the gene database for a nucleic acid fragment corresponding to each of the M data segments, M being an integer greater than 1.
  20. The encoding device according to claim 18, wherein the information digitizing module transcodes a binary code corresponding to the information to generate the sequence data.
  21. The encoding device according to claim 20, wherein the sequence data is data composed of the four deoxyribonucleotides adenine (A), cytosine (C), guanine (G), and thymine (T).
  22. The encoding device according to claim 21, wherein
    the information digitizing module converts 0 in the binary code to A or T, and 1 to C or G, to generate the sequence data.
  23. The encoding device according to claim 21, wherein
    the information digitizing module converts 01 in the binary code to A, 00 to T, 11 to C, and 10 to G, to generate the sequence data.
  24. The encoding device according to claim 18, wherein the sequence data is a binary code corresponding to the information.
  25. The encoding device according to claim 18, further comprising:
    a gene data transcoding module, connected to the information digitizing module and to the data identifier determining module, respectively, and configured to transcode all nucleic acid fragments in the gene database into binary code.
  26. The encoding device according to claim 25, wherein
    the gene data transcoding module converts A or T in the gene database to binary code 0, and C or G to binary code 1.
  27. The encoding device according to claim 25, wherein
    the gene data transcoding module converts A in the gene database to binary code 01, T to binary code 00, C to binary code 11, and G to binary code 10.
  28. The encoding device according to any one of claims 18 to 27, wherein the identifier comprises position information, in the gene database, of the first symbol and the last symbol of the nucleic acid fragment.
  29. The encoding device according to any one of claims 18 to 27, wherein the identifier comprises position information, in the gene database, of the first symbol of the nucleic acid fragment, and the length of the nucleic acid fragment.
  30. The encoding device according to any one of claims 18 to 27, wherein the information is at least one of text information, picture information, audio information, or video information.
  31. The encoding device according to any one of claims 18 to 27, wherein the gene database comprises genomic data of one or more animals and/or plants and/or microorganisms.
  32. The encoding device according to claim 31, wherein the gene database comprises wild-type genomic data and/or synthetic genomic data.
  33. The encoding device according to claim 32, wherein the gene database comprises human genome data.
  34. A decoding device, comprising:
    a data identifier obtaining module, configured to obtain, from encoded data, an identifier corresponding to each data segment, the encoded data being a sequence code generated by the encoding method according to any one of claims 1 to 16 or by the encoding device according to any one of claims 18 to 33;
    a sequence obtaining module, connected to the data identifier obtaining module and configured to obtain position information corresponding to each data segment according to the identifier, and to obtain the corresponding nucleic acid fragment from the gene database according to the position information; and
    an information generating module, connected to the sequence obtaining module and configured to generate sequence data according to the nucleic acid fragments, and to obtain the information according to the sequence data.
  35. A data processing device, comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the encoding method according to any one of claims 1 to 16 or the decoding method according to claim 17.
  36. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the encoding method according to any one of claims 1 to 16 or the decoding method according to claim 17.
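
For illustration only, the encoding and decoding recited in claims 1, 2, 6, 12 and 17 might be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the applicant's implementation: the function names, the fixed segment length, the halving strategy used for further division, and the short `REFERENCE` string standing in for a gene database are all hypothetical. Only the two-bit mapping (01 to A, 00 to T, 11 to C, 10 to G) and the (start position, length) form of the identifier are taken from the claims.

```python
# Hypothetical toy "gene database": a short string containing all four bases.
REFERENCE = "ATGCGTACCTAGGATCCA"

# Two-bit mapping recited in claim 6.
BITS_TO_BASE = {"01": "A", "00": "T", "11": "C", "10": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def to_bases(data: bytes) -> str:
    """Transcode bytes into an A/C/G/T string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def from_bases(seq: str) -> bytes:
    """Inverse of to_bases: map each base back to two bits, then to bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in seq)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def locate(segment: str, reference: str) -> list[tuple[int, int]]:
    """Return (start, length) identifiers for a segment (claim 12 style).

    If the segment is not found, it is halved and each half is looked up
    again (an assumed way of realising the further division of claim 2).
    """
    pos = reference.find(segment)
    if pos >= 0:
        return [(pos, len(segment))]
    if len(segment) == 1:
        raise ValueError(f"base {segment!r} does not occur in the reference")
    mid = len(segment) // 2
    return locate(segment[:mid], reference) + locate(segment[mid:], reference)


def encode(data: bytes, reference: str, segment_len: int = 8) -> list[tuple[int, int]]:
    """Digitize, split into segments, and replace each segment by its identifiers."""
    seq = to_bases(data)
    identifiers = []
    for i in range(0, len(seq), segment_len):
        identifiers.extend(locate(seq[i:i + segment_len], reference))
    return identifiers


def decode(identifiers: list[tuple[int, int]], reference: str) -> bytes:
    """Fetch each fragment from the reference by position and rebuild the data."""
    seq = "".join(reference[start:start + length] for start, length in identifiers)
    return from_bases(seq)


if __name__ == "__main__":
    code = encode(b"Hi", REFERENCE, segment_len=4)
    print(code)
    print(decode(code, REFERENCE))
```

In this sketch the sequence code is simply the list of identifiers, and decoding works only if the decoder holds the same reference data as the encoder. A real gene database would be far larger, so longer segments would match directly and less further division would be needed.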
PCT/CN2017/099152 2017-08-25 2017-08-25 Encoding and decoding method, device and data processing device WO2019037117A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2017/099152 WO2019037117A1 (en) 2017-08-25 2017-08-25 Encoding and decoding method, device and data processing device
CN201780094012.7A CN111095423B (en) 2017-08-25 2017-08-25 Encoding/decoding method, apparatus and data processing apparatus

Publications (1)

Publication Number Publication Date
WO2019037117A1 (en) 2019-02-28

Family

ID=65439286

Country Status (2)

Country Link
CN (1) CN111095423B (en)
WO (1) WO2019037117A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687338B (en) * 2020-12-31 2022-01-11 云舟生物科技(广州)有限公司 Method for storing and restoring gene sequence, computer storage medium and electronic device
CN113380322B (en) * 2021-06-25 2023-10-24 倍生生物科技(深圳)有限公司 Artificial nucleic acid sequence watermark coding system, watermark character string and coding and decoding method
CN113782102B (en) * 2021-08-13 2022-12-13 中科碳元(深圳)生物科技有限公司 Method, device and equipment for storing DNA data and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080120079A1 (en) * 2005-02-11 2008-05-22 Smartgene Gmbh Computer-Implemented Method and Computer-Based System for Validating Dna Sequencing Data
CN103114127A (en) * 2011-11-16 2013-05-22 中国科学院华南植物园 DNA chip based cipher system
CN105022935A (en) * 2014-04-22 2015-11-04 中国科学院青岛生物能源与过程研究所 Encoding method and decoding method for performing information storage by means of DNA
CN106845158A (en) * 2017-02-17 2017-06-13 苏州泓迅生物科技股份有限公司 A kind of method that information Store is carried out using DNA

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05324738A (en) * 1992-05-20 1993-12-07 Fujitsu Ltd Homogeneity classifying method of gene database
CN101420614B (en) * 2008-11-28 2010-08-18 同济大学 Image compression method and device integrating hybrid coding and wordbook coding
WO2013074658A1 (en) * 2011-11-15 2013-05-23 Citrix Systems, Inc. Systems and methods for compressing short text by dictionaries in a network
CN106506007A (en) * 2015-09-08 2017-03-15 联发科技(新加坡)私人有限公司 A kind of lossless data compression and decompressing device and its method

Also Published As

Publication number Publication date
CN111095423B (en) 2023-07-21
CN111095423A (en) 2020-05-01

Legal Events

Date Code Title Description
NENP Non-entry into the national phase Ref country code: DE
122 Ep: pct application non-entry in european phase Ref document number: 17922130; Country of ref document: EP; Kind code of ref document: A1
Kind code of ref document: A1