WO2018039938A1 - Method for biologically storing and restoring data - Google Patents

Method for biologically storing and restoring data

Info

Publication number
WO2018039938A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
data
dna sequence
base
datadna
Prior art date
Application number
PCT/CN2016/097398
Other languages
English (en)
French (fr)
Inventor
戴俊彪
吴庆余
伊加提⋅乃哥麦提
孙凯文
董俊凯
秦怡然
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Priority to EP16914505.9A (patent EP3509018B1)
Priority to PCT/CN2016/097398 (WO2018039938A1)
Priority to US16/328,745 (patent US11177019B2)
Publication of WO2018039938A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/123 DNA computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models

Definitions

  • the invention belongs to the fields of bioinformatics, synthetic biology and computing, and particularly relates to a method capable of converting data into a biocompatible DNA sequence and restoring the DNA sequence library to the original data.
  • the 21st century is the century of life sciences and the century of information and big data.
  • information technology is booming, and an important issue is how to deal with the growing data.
  • the agency also predicts that by 2020 the total amount of global data will reach 40 ZB.
  • Faced with such a large amount of data, existing data storage technologies have exposed their low storage density, high energy consumption, and short storage lifetime.
  • DNA, a long-lived molecule that has long been responsible for storing biological genetic information, has gradually attracted the attention of scientists.
  • As the carrier of genetic information, DNA has a data storage density far beyond that of existing storage technology; it can maintain the integrity of stored information in a sub-optimal environment; and its lifetime can be long, with copies of the information produced by self-replication or artificial amplification.
  • the present invention provides a method of converting data into a DNA sequence, using the DNA sequence as an information storage medium to store data.
  • the DNA sequence obtained by the method of the present invention is adapted to be stored in an organism, for example, stored in a cell as a plasmid, or integrated into a genome of a cell.
  • each data conversion unit is converted into a short single-stranded DNA sequence, thereby converting the data into a series of short single-stranded DNA sequences.
  • the length of each short single-stranded DNA sequence is suitable for genetic manipulation, for example, for cloning into a plasmid or for integration into a cellular genome, thereby facilitating storage of the converted DNA sequence in an organism.
  • a specially designed dataDNA sequence conversion rule is used to convert a data conversion unit into a dataDNA sequence representing the data content of that unit, and to restore the dataDNA sequence in a short single-stranded DNA sequence to the binary number sequence of the data conversion unit.
  • the dataDNA sequence conversion rule prevents start codons from arising in the dataDNA sequence and prevents consecutive single-base repeats in the DNA sequence.
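The two constraints named above can be illustrated with a small checker. This is only a sketch that tests a finished sequence; it is not the conversion rule itself, which this text does not reproduce. The function name `find_violations` and the `max_run` threshold are hypothetical, since the permitted repeat length is not stated here, and only the canonical start codon ATG on the given strand is checked.

```python
import re

START_CODON = "ATG"  # canonical start codon; other start codons are not checked


def find_violations(seq: str, max_run: int = 3) -> list[str]:
    """Return readable descriptions of constraint violations in a DNA string.

    max_run is an assumed threshold: a run of more than max_run identical
    bases is flagged. The source text does not give the actual limit.
    """
    problems = []
    if START_CODON in seq:
        problems.append(f"start codon at index {seq.index(START_CODON)}")
    # (.)\1{n,} matches one base followed by at least n copies of itself.
    run = re.search(r"(.)\1{%d,}" % max_run, seq)
    if run:
        problems.append(
            f"run of {len(run.group(0))} x {run.group(1)} at index {run.start()}"
        )
    return problems
```

A sequence that passes returns an empty list, so the checker could be used to validate the output of any candidate conversion rule.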
  • the dataDNA sequence conversion rule is:
  • each short single-stranded DNA sequence may further comprise an index DNA sequence indicating the position, within the entire data, of the data conversion unit whose information it contains, so that once a series of short single-stranded DNA sequences has been restored to a series of data conversion units, those units can be spliced into the original data.
  • when the index DNA sequence is generated, the position number of the data conversion unit in the data is first converted into a ternary number sequence with a fixed number of digits, and the ternary number sequence is then converted, using a specially designed index DNA sequence conversion rule, into an index DNA sequence with the same number of bases as the ternary sequence has digits.
  • when restoring, the index DNA sequence is first converted into a ternary number sequence using the index DNA sequence conversion rule, and the ternary number sequence is then converted into the position number of the data conversion unit in the data.
  • the index DNA sequence conversion rule is:
  • index DNA sequence conversion rule refers to the above index DNA sequence conversion rule.
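The numeric half of the index step (position number to fixed-width ternary digits and back) can be sketched as follows. The function names and the choice of most-significant-digit-first ordering are assumptions; the mapping from ternary digits to bases is the index DNA sequence conversion rule, which is not reproduced in this text and is therefore left out.

```python
def to_ternary(position: int, width: int) -> list[int]:
    """Encode a non-negative position number as `width` ternary digits,
    most significant digit first (ordering is an assumption)."""
    if position < 0 or position >= 3 ** width:
        raise ValueError("position does not fit in the given number of digits")
    digits = []
    for _ in range(width):
        position, remainder = divmod(position, 3)
        digits.append(remainder)
    return digits[::-1]


def from_ternary(digits: list[int]) -> int:
    """Decode ternary digits (most significant first) back to a position number."""
    value = 0
    for digit in digits:
        if digit not in (0, 1, 2):
            raise ValueError("not a ternary digit")
        value = value * 3 + digit
    return value
```

With a width of n digits the index can address 3^n data conversion units, so the width would need to be chosen from the number of units in the data.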
  • the invention also specifically designs a safeguard against mutations that may occur during in vitro manipulation and cell passage: each short single-stranded DNA sequence includes a correction DNA sequence for detecting whether that short sequence has mutated and for correcting the mutation.
  • a method of converting data into data DNA sequences, comprising dividing the data into one or more data conversion units, providing a sequence of binary numbers for each data conversion unit, and converting each data conversion unit into a data DNA sequence, thereby obtaining a data DNA sequence library; the data DNA sequence library comprises one or more data DNA sequences, each converted from one data conversion unit;
  • the steps include: converting the binary number sequence of each data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule; that dataDNA sequence is the data DNA sequence.
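The preparatory step described above, turning data into binary number sequences grouped into conversion units, might look like the following sketch. The unit size is not given in this text, so `unit_bits` is a hypothetical parameter, and the last unit is simply left shorter rather than padded, which is also an assumption.

```python
def to_bits(data: bytes) -> str:
    """Render raw bytes as a string of '0'/'1' characters, 8 bits per byte."""
    return "".join(f"{byte:08b}" for byte in data)


def split_into_units(bits: str, unit_bits: int) -> list[str]:
    """Cut a bit string into consecutive data conversion units.

    unit_bits is an assumed, caller-chosen unit size; the final unit may be
    shorter because no padding scheme is described in the source text.
    """
    if unit_bits <= 0:
        raise ValueError("unit_bits must be positive")
    return [bits[i:i + unit_bits] for i in range(0, len(bits), unit_bits)]
```

Each resulting unit would then be passed to the dataDNA sequence conversion rule, and its list index would serve as the position number.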
  • the present invention also provides another method of converting data into data DNA sequences, the method comprising dividing the data into one or more data conversion units, providing a sequence of binary numbers for each data conversion unit, and converting each data conversion unit into a data DNA sequence according to the following steps, thereby obtaining a data DNA sequence library; the data DNA sequence library comprises one or more data DNA sequences, each converted from one data conversion unit; the steps comprise:
  • the index DNA sequence of the data conversion unit is ligated to the dataDNA sequence, and a protection sequence 2 bases long is added at the junction to obtain an index+dataDNA sequence, which is the data DNA sequence.
  • the present invention also provides a method of converting data into data DNA sequences comprising a mutation correction sequence, the method comprising dividing the data into one or more data conversion units, providing a sequence of binary numbers for each data conversion unit, and converting each data conversion unit into a data DNA sequence comprising a mutation correction sequence according to the following steps, thereby obtaining a data DNA sequence library; the data DNA sequence library comprises one or more data DNA sequences, each converted from one data conversion unit; the steps include:
  • N(i) is the number of i bases present in the preliminary data DNA sequence
  • the position-weighted base sum (sum) of the preliminary data DNA sequence is calculated according to the following formula:
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of base i
  • N is the total length of the preliminary data DNA sequence
  • the preliminary judgment sequence is connected to the depth judgment sequence, and the protective base C is added at the junction to obtain a correction DNA sequence;
  • the preliminary data DNA sequence is ligated to the correction DNA sequence, and a protection sequence of 2 bases in length is added at the junction to obtain a DNA sequence containing the mutation correction sequence.
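The two quantities that feed the correction DNA sequence, the per-base counts N(i) and the weighted sum, can be sketched as below. The formula itself is missing from this text; the code assumes it is the sum over all positions of val(base) times the 1-based position, which matches the definitions of val(i), position(i) and N given above. How these numbers are then encoded into the preliminary judgment and depth judgment sequences is not described here and is not attempted.

```python
from collections import Counter

# Base values as listed in the text: A, T, C, G correspond to 1, 2, 3, 4.
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}


def base_counts(seq: str) -> dict[str, int]:
    """N(i): how many times each base occurs in the sequence."""
    counts = Counter(seq)
    return {base: counts[base] for base in "ATCG"}


def weighted_sum(seq: str) -> int:
    """Assumed formula: sum of val(base) * position, positions starting at 1."""
    return sum(VAL[base] * pos for pos, base in enumerate(seq, start=1))
```

For the sequence ATCGA the counts are A=2, T=1, C=1, G=1 and the weighted sum is 1·1 + 2·2 + 3·3 + 4·4 + 1·5 = 35.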
  • step (1) comprises: converting the binary number sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule; the dataDNA sequence serves as the preliminary data DNA sequence that does not contain a mutation correction sequence.
  • step (1) comprises:
  • the index DNA sequence of the data conversion unit is ligated to the dataDNA sequence, and a protection sequence 2 bases long is added at the junction to obtain an index+dataDNA sequence; the resulting index+dataDNA sequence serves as the preliminary data DNA sequence containing no mutation correction sequence.
  • each data conversion unit is converted into a data DNA sequence including the position information of the unit, its data content information, and a mutation correction sequence, wherein the correction DNA sequence is preferably connected in step (1-3) to the dataDNA end of the index+dataDNA sequence.
  • in step (1), the binary number sequence of the data conversion unit can also be converted by other methods into a preliminary data DNA sequence that does not comprise the mutation correction sequence.
  • the present invention still further provides an encrypted data DNA sequence conversion method, comprising:
  • any of the foregoing data conversion methods is a method implemented on a computer.
  • a method of storing data using a DNA sequence, comprising: converting data into a data DNA sequence using any of the data conversion methods of the present invention, synthesizing the DNA sequence, and storing the synthesized DNA sequence.
  • storing the synthesized DNA sequence means storing the DNA sequence in a cell as a plasmid, or integrating the DNA sequence into the genome of the cell.
  • a method of restoring a DNA sequence obtained by sequencing to data, comprising:
  • step (2) may be restoring the dataDNA sequence to data in binary form, or step (2) may comprise restoring the dataDNA sequence to data in binary form and further restoring that binary data to the original data.
  • the invention also provides another method for restoring a DNA sequence obtained by sequencing to data, comprising:
  • step (3) may be restoring the dataDNA sequence to data in binary form, or may further comprise restoring the data in binary form to a character string.
  • the restored data obtained in step (4) may be data in binary form, or the original data further restored from that binary data, or may also be obtained from step (3).
  • the invention also provides a method for correcting a DNA sequence obtained by sequencing and restoring it to data, comprising:
  • the DNA sequence obtained by sequencing comprises a preliminary data DNA sequence and a mutation correction sequence, wherein the preliminary data DNA sequence comprises the data content information of a data conversion unit; the preliminary data DNA sequence within the sequenced DNA sequence has at most one base mutation;
  • N(i) is the number of i bases present in the sequencing sequence of the preliminary data DNA sequence
  • the base number judgment value X'(i) of the sequenced preliminary data DNA sequence is compared with the base number judgment value X(i) restored, by the same rule, from the preliminary judgment sequence in the mutation correction sequence contained in the sequenced DNA sequence:
  • if the base number judgment value of only one base changes, the sequenced preliminary data DNA sequence has an insertion or deletion of that base relative to the unmutated preliminary data DNA sequence;
  • the position-weighted base sum (sum') of the sequenced preliminary data DNA sequence is calculated according to the following formula:
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of the base i
  • N is the total length of the sequencing sequence of the preliminary data DNA sequence
  • the position-weighted base sum (sum') of the sequenced preliminary data DNA sequence is compared with the position-weighted base sum (sum) restored, by the same rule, from the depth judgment sequence in the mutation correction sequence contained in the sequenced DNA sequence;
  • if the sequenced preliminary data DNA sequence has a base substitution relative to the unmutated preliminary data DNA sequence: if sum' > sum, a base with a smaller val(i) value was replaced by a base with a larger val(i) value; if sum' < sum, a base with a larger val(i) value was replaced by a base with a smaller val(i) value. The position coordinate of the substitution is the absolute value of the quotient obtained by dividing the difference between sum' and sum by the difference between the val(i) values of the two bases; replacing the base at that position with the other of the two bases corrects the sequenced sequence to the unmutated preliminary data DNA sequence;
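The substitution case can be sketched in code. This is an illustration under stated assumptions: the weighted sum is taken to be the sum of val(base) times 1-based position, and the reference base counts and reference sum are passed in directly rather than decoded from the correction DNA sequence, whose encoding is not given in this text. The function names are hypothetical.

```python
from collections import Counter

VAL = {"A": 1, "T": 2, "C": 3, "G": 4}  # base values as listed in the text


def weighted_sum(seq: str) -> int:
    """Assumed formula: sum of val(base) * position, positions starting at 1."""
    return sum(VAL[b] * pos for pos, b in enumerate(seq, start=1))


def correct_substitution(seq: str, ref_counts: dict[str, int], ref_sum: int) -> str:
    """Undo a single base substitution using reference counts and weighted sum."""
    counts = Counter(seq)
    gained = [b for b in "ATCG" if counts[b] - ref_counts.get(b, 0) == 1]
    lost = [b for b in "ATCG" if counts[b] - ref_counts.get(b, 0) == -1]
    if len(gained) != 1 or len(lost) != 1:
        raise ValueError("not a single substitution")
    new, old = gained[0], lost[0]
    # sum' - sum = (val(new) - val(old)) * position, so divide to get position.
    pos, rem = divmod(abs(weighted_sum(seq) - ref_sum), abs(VAL[new] - VAL[old]))
    if rem or not 1 <= pos <= len(seq) or seq[pos - 1] != new:
        raise ValueError("checksum inconsistent with a single substitution")
    return seq[:pos - 1] + old + seq[pos:]
```

For example, if ATCGA (sum 35) is read as ATGGA (sum' 38), G has gained one count and C has lost one, so the position is |38 − 35| / |4 − 3| = 3 and the G there is replaced with C.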
  • the base insertion position is determined as follows: starting from the position where the base first appears in the sequenced preliminary data DNA sequence, the base at each position where that base appears is deleted in turn, and after each deletion the position-weighted base sum (sum'') of the resulting preliminary data DNA sequence is calculated according to the following formula:
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of the base i
  • N is the total length of the preliminary data DNA sequence after deleting the base;
  • the position at which sum'' equals the sum restored from the depth judgment sequence is the base insertion mutation position; the base at that position is deleted, correcting the sequenced sequence to the unmutated preliminary data DNA sequence;
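The trial-deletion search for an inserted base can be sketched as follows. As assumptions, the weighted sum is the sum of val(base) times 1-based position, and the identity of the extra base and the reference sum are supplied by the caller rather than decoded from the correction DNA sequence. The function names are hypothetical.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}  # base values as listed in the text


def weighted_sum(seq: str) -> int:
    """Assumed formula: sum of val(base) * position, positions starting at 1."""
    return sum(VAL[b] * pos for pos, b in enumerate(seq, start=1))


def correct_insertion(seq: str, extra_base: str, ref_sum: int) -> str:
    """Delete each occurrence of extra_base in turn, scanning from the first,
    and return the first candidate whose weighted sum equals ref_sum."""
    for i, base in enumerate(seq):
        if base != extra_base:
            continue
        candidate = seq[:i] + seq[i + 1:]
        if weighted_sum(candidate) == ref_sum:
            return candidate
    raise ValueError("no single deletion reproduces the reference sum")
```

Note that two different candidates can occasionally share the same weighted sum (deleting either T from TACT gives a sum of 13), so the first match found by this scan is not guaranteed to be the only one.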
  • the base deletion position is determined by inserting the base at each position in turn, starting from the first position of the sequenced preliminary data DNA sequence, and after each insertion calculating the position-weighted base sum (sum''') of the resulting preliminary data DNA sequence according to the following formula:
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of the base i
  • N is the total length of the preliminary data DNA sequence after inserting the base
  • if the sum''' calculated after inserting the base at a certain position equals the position-weighted base sum (sum) restored, by the same rule, from the depth judgment sequence in the mutation correction sequence contained in the sequenced DNA sequence, that position is the base deletion mutation position; inserting the base at that position corrects the sequenced sequence to the unmutated preliminary data DNA sequence;
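The trial-insertion search for a deleted base mirrors the insertion case. The sketch below again assumes the weighted sum is the sum of val(base) times 1-based position, and takes the missing base and the reference sum as inputs instead of decoding them from the correction DNA sequence. The function names are hypothetical.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}  # base values as listed in the text


def weighted_sum(seq: str) -> int:
    """Assumed formula: sum of val(base) * position, positions starting at 1."""
    return sum(VAL[b] * pos for pos, b in enumerate(seq, start=1))


def correct_deletion(seq: str, missing_base: str, ref_sum: int) -> str:
    """Insert missing_base at each position in turn, starting from the first,
    and return the first candidate whose weighted sum equals ref_sum."""
    for i in range(len(seq) + 1):  # len(seq) + 1 possible insertion points
        candidate = seq[:i] + missing_base + seq[i:]
        if weighted_sum(candidate) == ref_sum:
            return candidate
    raise ValueError("no single insertion reproduces the reference sum")
```

For example, if ATCGA (sum 35) is read as ATGA, inserting C before the first two positions gives sums of 32 and 34, and inserting it at the third position gives 35, which restores ATCGA.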
  • the preliminary data DNA sequence comprises a dataDNA sequence representing the data content information of the data conversion unit, and step (4) comprises restoring, according to the dataDNA sequence conversion rule, the dataDNA sequence contained in the unmutated preliminary data DNA sequence to data.
  • step (4) may be restoring the dataDNA sequence contained in the unmutated preliminary data DNA sequence to data in binary form, or may further comprise restoring the data in binary form to the original data.
  • the DNA sequences obtained by sequencing are a plurality of data DNA sequences, and the preliminary data DNA sequence of each data DNA sequence includes an index DNA sequence indicating position information of the data conversion unit and a dataDNA sequence indicating data content information of the data conversion unit; step (4) includes:
  • (4-1) restoring the index DNA sequence in each data DNA sequence to a ternary number sequence according to the index DNA sequence conversion rule, and then restoring the ternary number sequence to the position number of the conversion unit in the data;
  • step (4-2) may be: restoring the dataDNA sequence to data in binary form, or further restoring the data in binary form to a character string; and the restored data in step (4-3) is data in binary form, or the original data obtained by further restoring that binary data, or the string data obtained by connecting the strings restored from the dataDNA sequences in order of their position numbers, or the original data further restored from that string data.
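The final reassembly by position number can be sketched as follows. The representation of each restored unit as a `(position, bits)` pair, the function name, and the assumption that numbering starts at 0 with no gaps are all illustrative choices not stated in the text.

```python
def reassemble(units: list[tuple[int, str]]) -> str:
    """Join restored data conversion units in order of their position numbers.

    Each unit is a (position, bits) pair. Positions are assumed to run
    0, 1, 2, ... without gaps or duplicates; anything else is rejected.
    """
    ordered = sorted(units, key=lambda unit: unit[0])
    positions = [pos for pos, _ in ordered]
    if positions != list(range(len(ordered))):
        raise ValueError("missing or duplicate position numbers")
    return "".join(bits for _, bits in ordered)
```

Because sequencing reads arrive in no particular order, sorting on the decoded index is what allows the units to be spliced back into the original data.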
  • the invention also provides a method of restoring an encrypted DNA sequence obtained by sequencing to data, comprising:
  • step (2): using any of the foregoing data restoration methods to restore the encrypted DNA sequence obtained by sequencing to data, wherein, when the dataDNA sequence in each DNA sequence is restored to data according to the dataDNA sequence conversion rule, each specific base is restored to its corresponding specific binary number in the correspondence determined in step (1)
  • any of the data restoration methods of the present invention is a method implemented on a computer.
  • a method for obtaining data from a cell, comprising: extracting from the cell a DNA sequence storing the data information, sequencing it, and restoring the sequenced DNA sequence to the original data by any of the data restoration methods of the present invention.
  • a system for converting data into a data DNA sequence comprising an input device and a data DNA sequence conversion device;
  • the input device is configured to provide a sequence of binary numbers of the data conversion unit
  • the dataDNA sequence conversion device is configured to convert the binary number sequence of the data conversion unit into a data DNA sequence according to a data DNA sequence conversion rule.
  • the system for converting data into a data DNA sequence further comprises an index DNA generating device and a first integration device; wherein the index DNA generating device is configured to convert the position number of the data conversion unit in the data into a ternary number sequence with a fixed number of digits, and to convert the ternary number sequence into an index DNA sequence having the same number of bases as the ternary sequence has digits according to the index DNA sequence conversion rule; wherein the first integration device is used to ligate the index DNA sequence of the data conversion unit to the dataDNA sequence and to add a protection sequence 2 bases long at the junction to obtain an index+dataDNA sequence.
  • the invention also provides a system for converting data into a data DNA sequence comprising a mutation correction sequence, the system comprising an input device, a preliminary data DNA conversion device, a correction DNA sequence generation device, and a second integration device;
  • the input device is configured to provide a sequence of binary numbers of the data conversion unit
  • the preliminary data DNA conversion device is configured to convert the binary number sequence of the data conversion unit into a preliminary data DNA sequence that does not include the mutation correction sequence, the preliminary data DNA sequence comprising data content information of the data conversion unit;
  • the correction DNA sequence generating device is for generating a correction DNA sequence by the following method:
  • N(i) is the number of i bases present in the preliminary data DNA sequence
  • the position-weighted base sum (sum) of the preliminary data DNA sequence is calculated according to the following formula:
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of base i
  • N is the total length of the preliminary data DNA sequence
  • the preliminary judgment sequence is connected to the depth judgment sequence, and the protective base C is added at the junction to obtain a correction DNA sequence;
  • the second integration device is configured to connect the preliminary data DNA sequence to the correction DNA sequence, and add a protection sequence of 2 bases in length to obtain a DNA sequence containing the mutation correction sequence.
  • the preliminary data DNA conversion device is a dataDNA sequence conversion device for converting the binary number sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule, the dataDNA sequence serving as the preliminary data DNA sequence that does not include the mutation correction sequence;
  • the preliminary data DNA conversion device comprises an index DNA sequence generating device, a dataDNA sequence conversion device, and a third integration device; wherein the index DNA sequence generating device is configured to convert the position number of the data conversion unit in the data into a ternary number sequence with a fixed number of digits, and to convert the ternary number sequence into an index DNA sequence having the same number of bases as the ternary sequence has digits according to the index DNA sequence conversion rule; wherein the dataDNA sequence conversion device is configured to convert the binary number sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule; wherein the third integration device is configured to ligate the index DNA sequence of the data conversion unit to the dataDNA sequence and to add a protection sequence 2 bases long at the junction to obtain an index+dataDNA sequence, which serves as the preliminary data DNA sequence containing no mutation correction sequence.
  • the second integration device is for ligating the correction DNA sequence to the dataDNA end of the preliminary data DNA sequence.
  • any of the foregoing data conversion systems further comprises an encryption device, the encryption device comprising a username and password input device and a dataDNA sequence conversion rule random generation device; wherein the username and password input device is configured to provide a username and password; wherein the dataDNA sequence conversion rule random generation device is configured to randomly generate, according to the username and password, the correspondence between specific binary numbers and specific bases within each group of correspondences of the dataDNA sequence conversion rule; wherein the dataDNA sequence conversion device is used to convert the binary number sequence of the data conversion unit into an encrypted dataDNA sequence according to the dataDNA sequence conversion rule, with each specific binary number converted to its corresponding specific base in the correspondence generated by the random generation device.
  • a system for restoring a DNA sequence obtained by sequencing to data, comprising an input device and a dataDNA sequence restoration device; wherein the input device is for providing a DNA sequence obtained by sequencing, wherein the DNA sequence comprises a dataDNA sequence representing data content information of the data conversion unit; wherein the dataDNA sequence restoration device is configured to restore the dataDNA sequence to data according to the dataDNA sequence conversion rule;
  • the dataDNA sequence restoration device is for restoring dataDNA sequences to data in binary form, or for restoring dataDNA sequences to data in binary form and further restoring that binary data to the original data.
  • the present invention also provides another system for restoring a DNA sequence obtained by sequencing to data, comprising an input device, an index DNA sequence restoration device and a fourth integration device; wherein the input device is for providing the DNA sequences obtained by sequencing, which are a plurality of data DNA sequences, each including an index DNA sequence indicating position information of the data conversion unit and a dataDNA sequence indicating data content information of the data conversion unit; wherein the index DNA sequence restoration device is used to restore, according to the index DNA sequence conversion rule, the index DNA sequence in each data DNA sequence to a ternary number sequence, and to restore the ternary number sequence to the position number of the conversion unit in the data;
  • the dataDNA sequence restoration device is used to restore, according to the dataDNA sequence conversion rule, the dataDNA sequence in each data DNA sequence to data; wherein the fourth integration device is used to connect the data restored from the dataDNA sequence of each data DNA sequence in order of position number to obtain the restored data.
  • the dataDNA sequence restoration device is for restoring dataDNA sequences to data in binary form, or for restoring dataDNA sequences to data in binary form and further restoring that binary data to character strings; the restored data obtained by the fourth integration device is data in binary form, or the original data further restored from that binary data, or the string data obtained by connecting the strings restored by the dataDNA sequence restoration device in order of position number, or the original data further restored from that string data.
  • the invention also provides a system for correcting a DNA sequence obtained by sequencing and restoring it to data, comprising an input device, an error correction device and a preliminary data DNA sequence restoration device;
  • the input device is for providing a DNA sequence obtained by sequencing, the DNA sequence comprising a preliminary data DNA sequence and a mutation correction sequence, wherein the preliminary data DNA sequence comprises data content information of the data conversion unit; the preliminary data DNA sequence within the sequenced DNA sequence has a mutation of at most one base;
  • the error correction device is used to restore the sequenced preliminary data DNA sequence to the unmutated preliminary data DNA sequence by the following method:
  • N(i) is the number of i bases present in the sequencing sequence of the preliminary data DNA sequence
  • the base number judgment value X'(i) of the sequenced preliminary data DNA sequence is compared with the base number judgment value X(i) restored, by the same rule, from the preliminary judgment sequence in the mutation correction sequence contained in the sequenced DNA sequence:
  • if the base number judgment value of only one base changes, the sequenced preliminary data DNA sequence has an insertion or deletion of that base relative to the unmutated preliminary data DNA sequence;
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of the base i
  • N is the total length of the sequencing sequence of the preliminary data DNA sequence
  • the position-weighted base sum (sum') of the sequenced preliminary data DNA sequence is compared with the position-weighted base sum (sum) restored, by the same rule, from the depth judgment sequence in the mutation correction sequence contained in the sequenced DNA sequence;
  • if the sequenced preliminary data DNA sequence has a base substitution relative to the unmutated preliminary data DNA sequence: if sum' > sum, a base with a smaller val(i) value was replaced by a base with a larger val(i) value; if sum' < sum, a base with a larger val(i) value was replaced by a base with a smaller val(i) value. The position coordinate of the substitution is the absolute value of the quotient obtained by dividing the difference between sum' and sum by the difference between the val(i) values of the two bases; replacing the base at that position with the other of the two bases corrects the sequenced sequence to the unmutated preliminary data DNA sequence;
  • the base insertion position is determined as follows: starting from the position where the base first appears in the sequenced preliminary data DNA sequence, the base at each position where that base appears is deleted in turn, and after each deletion the position-weighted base sum (sum'') of the resulting preliminary data DNA sequence is calculated according to the following formula:
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of the base i
  • N is the total length of the preliminary data DNA sequence after deleting the base;
  • the position at which sum'' equals the sum restored from the depth judgment sequence is the base insertion mutation position; the base at that position is deleted, correcting the sequenced sequence to the unmutated preliminary data DNA sequence;
  • the base deletion position is determined by inserting the base at each position in turn, starting from the first position of the sequenced preliminary data DNA sequence, and after each insertion calculating the position-weighted base sum (sum''') of the resulting preliminary data DNA sequence according to the following formula:
  • val(i) is the value of base i
  • val(A), val(T), val(C), val(G) correspond to 1, 2, 3, 4 respectively
  • position(i) is the position coordinate of the base i
  • N is the total length of the preliminary data DNA sequence after inserting the base
  • the base-bit weighted summation sum'′′ calculated after inserting the base at a certain position and the depth judgment sequence in the mutation correction sequence contained in the DNA sequence obtained by sequencing are restored by the same rule.
  • the position is the base deletion mutation position at which the insertion of the base is to be corrected to the unmutated preliminary data DNA sequence;
  • the preliminary data DNA sequence reduction device is used to restore the unmutated preliminary data DNA sequence to data.
  • the preliminary data DNA sequence comprises a dataDNA sequence representing data content information of a data conversion unit, and the preliminary data DNA sequence reduction device is a dataDNA sequence reduction device for restoring the dataDNA sequence contained in the unmutated preliminary data DNA sequence to data according to the dataDNA sequence conversion rule.
  • the dataDNA sequence reduction device is used to restore the dataDNA sequence contained in the unmutated preliminary data DNA sequence to data in binary form, or to restore that dataDNA sequence to binary data and further restore the binary data to the original data.
  • the DNA sequences obtained by sequencing are a plurality of data DNA sequences, and the preliminary data DNA sequence of each data DNA sequence comprises an index DNA sequence representing position information of a data conversion unit and a dataDNA sequence representing data content information of the data conversion unit; the preliminary data DNA sequence reduction device comprises an index DNA reduction device, a dataDNA sequence reduction device, and a fifth integration device;
  • the index DNA reducing device is configured to restore the index DNA sequence in each data DNA sequence to a ternary number sequence according to an index DNA sequence conversion rule, and then restore the ternary number sequence to a position number of the conversion unit in the data;
  • the dataDNA sequence reduction device is configured to restore the dataDNA sequence in each data DNA sequence to data according to the dataDNA sequence conversion rule;
  • the fifth integration device is configured to connect, in order of their position numbers, the data obtained by restoring the dataDNA sequence of each data DNA sequence, to obtain the restored data.
  • the dataDNA sequence reduction device is configured to restore the dataDNA sequence to data in binary form, or to restore the dataDNA sequence to data in binary form and further restore the binary-form data to a character string;
  • the restored data obtained by the fifth integration device is data in binary form, or original data obtained by further restoring the binary-form data, or string data obtained by connecting, in order of position number, the strings restored by the dataDNA sequence reduction device, or data further restored from that string data.
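  • The role of the fifth integration device can be sketched as follows, assuming each restored unit is available as a (position number, string) pair. The function name `assemble` and the pair representation are hypothetical choices made for illustration only.

```python
def assemble(units):
    """Join restored units in order of their position numbers.

    units: iterable of (position_number, restored_string) pairs in any order.
    """
    ordered = sorted(units, key=lambda unit: unit[0])
    return "".join(text for _, text in ordered)


# Units may arrive in any order, for example after sequencing a mixed library.
restored = assemble([(2, "lo wor"), (1, "Hel"), (3, "ld")])
```

  • Sorting by position number is what allows the data DNA sequences to be stored and sequenced in any order.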
  • any one of the data restoration systems of the present invention may further comprise a decryption device, the decryption device comprising an input device and a data DNA sequence conversion rule determining device;
  • the input device is used to provide a username and password
  • the dataDNA sequence conversion rule determining device is configured to obtain, according to the user name and the password, the correspondence between a specific binary number and a specific base in each group of correspondences of the dataDNA sequence conversion rule, this correspondence being the one used when the data was converted into the encrypted data DNA sequence.
  • the dataDNA sequence reduction device is configured to restore the dataDNA sequence in the sequence obtained by sequencing to data according to the dataDNA sequence conversion rule, wherein each specific base is restored to the corresponding specific binary number according to the correspondence determined by the dataDNA sequence conversion rule determining device.
  • an executable software product containing program instructions and stored on a computer readable storage medium, which, when executed by a computer, converts data into a data DNA sequence, the software product including program instructions for executing any of the data conversion methods of the present invention.
  • an executable software product containing program instructions and stored on a computer readable storage medium, which, when executed by a computer, restores a sequenced DNA sequence to data, the software product including program instructions for executing any of the data restoration methods of the present invention.
  • a computer readable storage medium storing any of the software products of the present invention is provided.
  • the method and apparatus of the present invention are capable of preventing the generation of start codons in data DNA sequences and preventing single sequences in data DNA sequences.
  • the present invention ultimately generates the integrated data DNA sequence, from which the original data can be restored, thereby realizing the storage of large amounts of data in living organisms.
  • Figure 1 is a schematic diagram of an example of data conversion and data restoration according to the present invention.
  • Figure 2 is a schematic diagram of text-type data conversion.
  • Figure 3 shows the process of generating an index DNA sequence.
  • Figure 4 shows the reduction process of an index DNA sequence.
  • Figure 5 shows the process of generating a dataDNA sequence.
  • Figure 6 shows the reduction process of a dataDNA sequence.
  • Figure 7 is a schematic diagram of the generation of a complete data DNA sequence.
  • Figure 8 is a schematic diagram of the reduction of a complete data DNA sequence.
  • Figure 9 shows the sequencing result of data-storing DNA fragments extracted from cells.
  • Figure 10 is the emblem of Tsinghua University.
  • Figure 11 is a data DNA sequence library obtained by converting the Tsinghua University emblem and the school song lyrics by the method of the present invention, then scrambling the sequence order and introducing single-base mutations.
  • "data" means any form of carrier capable of expressing information.
  • Data includes, but is not limited to, symbols, text, numbers, voice, images, video, and the like.
  • the data can be in binary form, hexadecimal form or string form, or any other form that can be converted directly or indirectly into binary form.
  • "base" and "nucleotide" are used interchangeably and refer to the A, T, C or G constituting a DNA sequence.
  • "data DNA sequence" refers to a DNA sequence converted from data, that is, data in the form of a DNA sequence. For storage, DNA is synthesized according to the data DNA sequence and stored in cells.
  • "data conversion unit" and "conversion unit" are used interchangeably in the present invention to refer to a component of the data.
  • conversion is performed in units of data conversion units, and each data conversion unit is converted into one data DNA sequence.
  • all the data may consist of one data conversion unit, which is converted into one data DNA sequence for storage.
  • the data may be divided into a plurality of conversion units, the sequence of binary numbers corresponding to each conversion unit having a specific length; each conversion unit is converted into one data DNA sequence, thereby converting the entire data into a plurality of data DNA sequences, so that each DNA sequence is separately synthesized and stored in cells.
  • the sequence of binary numbers corresponding to the data content information of each conversion unit preferably has the same length.
  • the plurality of data DNA sequences constitute a data DNA library.
  • a collection comprising the plurality of data DNA sequences, such as cells for storing the plurality of data DNA sequences, may also be referred to as a data DNA library.
  • all data can be composed of one data conversion unit, that is, all data is divided into one data conversion unit.
  • the data is first converted into binary numbers byte by byte, and all the bytes are connected in order to form the binary sequence of the data.
  • in each 8-bit byte of the binary number converted from the original data, the data information may occupy only 7 bits, for example where the original data is a character string or can be converted into a character string; in this case, the data information can be stored using only the 7-bit binary number sequences, and all the 7-bit binary sequences representing the data content information are connected in order to form the binary sequence of the data conversion unit.
  • the data is divided into a plurality of conversion units, and the sequence of binary numbers corresponding to the data content information of each conversion unit has a specific length.
  • the "specific length" may be from 70 to 240 bits, preferably from 140 to 175 bits.
  • the original data can first be converted into a sequence of binary numbers and then divided into multiple conversion units, or first divided into multiple string units, each of which is then converted into a sequence of binary numbers.
  • the raw data can be first converted to a binary number in bytes, and then a specific number of bytes are sequentially connected in sequence to form a binary sequence of the conversion unit.
  • One byte is a sequence of 8-bit binary numbers, as is well known to those skilled in the art.
  • in each 8-bit byte, the data information may occupy only 7 bits, for example if the original data is a string or can be converted into a string; in this case, the data information can be stored using only the 7-bit binary number sequences, and a specific number of 7-bit binary number sequences are connected in order to form the binary sequence of the conversion unit.
  • where the original data is a string or can be converted into a string, the original data can first be divided into string units of a specific length, each character in a string unit is then converted into a sequence of binary numbers, and the sequences of binary numbers corresponding to the characters in the string unit are connected in order to form the sequence of binary numbers of the conversion unit.
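  • Dividing a string into units of a specific length can be sketched as follows. The default of 20 characters per unit is an assumption chosen so that 20 × 7 = 140 bits falls within the preferred range given above; the function names are hypothetical. How a shorter final unit is handled is not stated here, so the sketch leaves it unpadded.

```python
def split_into_units(text, chars_per_unit=20):
    """Split text into consecutive string units of chars_per_unit characters.

    The last unit may be shorter; no padding rule is assumed.
    """
    if chars_per_unit <= 0:
        raise ValueError("chars_per_unit must be positive")
    return [text[i:i + chars_per_unit] for i in range(0, len(text), chars_per_unit)]


def unit_bit_length(unit, bits_per_char=7):
    """Length in bits of the binary sequence for one unit at 7 bits per character."""
    return len(unit) * bits_per_char


units = split_into_units("A" * 45)
```

  • With 7 bits per character, the stated range of 70 to 240 bits corresponds to units of 10 to 34 characters.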
  • the index DNA sequence contains position information of each data conversion unit in the data.
  • the position number of each data conversion unit in the data is first converted into a ternary number sequence, and the ternary number sequence is converted into an index DNA sequence.
  • the number of digits of the ternary number sequence converted from the position number of the conversion unit in the data, that is, the number of bases of the index DNA sequence, may be 5-15, preferably 11-15, and most preferably 15.
  • each data DNA library can store up to approximately 300 MB of data.
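  • A figure of this order can be reproduced with simple arithmetic, assuming a 15-digit ternary index and, as a further assumption taken from Embodiment 1, conversion units of 20 one-byte characters. The names below are hypothetical.

```python
INDEX_LENGTH = 15      # ternary digits (bases) in the index DNA sequence
CHARS_PER_UNIT = 20    # assumption: 20 one-byte characters per conversion unit


def library_capacity(index_length=INDEX_LENGTH, chars_per_unit=CHARS_PER_UNIT):
    """Return (number of addressable units, total capacity in bytes)."""
    max_units = 3 ** index_length
    return max_units, max_units * chars_per_unit


max_units, max_bytes = library_capacity()
print(max_units, max_bytes, round(max_bytes / 1e6), "MB")
```

  • Under these assumptions the library addresses 14,348,907 units, or about 287 million characters, which is of the same order as the stated figure.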
  • the length of the index DNA sequence can also be reduced or increased as needed. Reducing the length of the index DNA sequence can increase the conversion efficiency, and increasing the length of the index DNA sequence can increase the amount of information stored in the DNA sequence.
  • the "protection sequence" of the present invention is a sequence added at the junction of the index DNA sequence and the dataDNA sequence, and at the junction of the dataDNA sequence and the correction DNA sequence.
  • the protection sequence is preferably CG.
  • the order in which the index DNA sequence and the dataDNA sequence are joined in the index+dataDNA sequence is not limited: the index DNA sequence may be at the 5' end and the dataDNA sequence at the 3' end, or the dataDNA sequence may be at the 5' end and the index DNA sequence at the 3' end.
  • the order of the preliminary judgment sequence and the depth judgment sequence in the correction DNA sequence is not limited: the preliminary judgment sequence may be at the 5' end and the depth judgment sequence at the 3' end, or the depth judgment sequence may be at the 5' end and the preliminary judgment sequence at the 3' end.
  • each member of a collection corresponds to any member in another collection.
  • those skilled in the art should understand that, in steps that are performed in succession, compared with one another, or otherwise related, if the correspondence between a set and its corresponding set needs to be applied, the correspondence between specific members of the set and specific members of the corresponding set should remain consistent.
  • in each group of correspondences between ternary or binary numbers and bases, different bases correspond to different ternary or binary numbers, for the purpose of storing data information.
  • the corresponding ternary number or binary number should correspond to the specific base under the same conditions.
  • the "same condition" means that the groups of conditions in the conversion rule table (including the index DNA sequence conversion rule table and the dataDNA conversion rule table) belong to the same group. A set of behaviors in the conversion rules table.
  • when a sequence is restored, the correspondence between the numbers involved and the bases, and between the variables involved and their values, should be the same as the correspondence between numbers and bases, and between variables and values, used when the data DNA sequence was generated.
  • when the base bitwise weighted summation values of different sequences are compared to determine which mutation has occurred, the values of val(i) in the calculation formulas of the compared summation values should be the same.
  • the index DNA sequence conversion rule used to restore the index DNA sequence to the ternary number sequence is the same as that used when the index DNA sequence was generated, and the dataDNA sequence conversion rule used to restore the original dataDNA sequence to a binary number sequence is the same as that used when the original dataDNA sequence was generated.
  • the term "index DNA sequence conversion rule is the same” or "dataDNA sequence conversion rule is the same” as used herein means that the correspondence between a specific binary number and a specific base in each group correspondence relationship among these conversion rules is the same.
  • the "corresponding manner between a specific binary number and a specific base in each group correspondence" means a corresponding manner of which specific base corresponds to each specific binary number.
  • the base number judgment values and the base bitwise weighted summation value calculated from the sequencing sequence are compared with the base number and base bitwise weighted summation information contained in the correction DNA sequence of the sequencing sequence, which represents the corresponding values of the unmutated sequence; the comparison indicates whether the sequencing sequence has mutated relative to the unmutated sequence.
  • the calculation formulas and correspondences used to calculate the base number judgment values and the base bitwise weighted summation value of the sequencing sequence should be the same as those used to obtain the base number judgment values and the base bitwise weighted summation value contained in the correction DNA sequence of the sequencing sequence.
  • the term "corresponding mode" as used herein means: (1) a specific correspondence method for the number of bases, C/G and -1/1; and/or (2) For the base bitwise weighted summation, the specific correspondence of val(A), val(T), val(C), val(G), and 1, 2, 3, and 4.
  • the "position number” is preferably a decimal number, but may be any number that can indicate the position order and can be converted to and from the ternary number.
  • mutation of one base means that one base is replaced with another base, or one base is inserted or deleted.
  • "data conversion method" refers to any method of converting data into a data DNA sequence, of converting data into a DNA sequence containing a mutation correction sequence, or of converting data into an encrypted data DNA sequence.
  • "data reduction method" refers to any method of restoring a DNA sequence obtained by sequencing to data, or of restoring an encrypted DNA sequence obtained by sequencing to data.
  • the DNA sequence obtained by the data conversion method of the present invention is suitable for storage in cells.
  • the cells for storing the DNA sequence in the present invention may be microbial cells, for example bacteria such as E. coli cells or fungal cells such as yeast cells, or any other suitable cells or cell lines, such as insect cells, mammalian cells or cell lines.
  • the DNA sequence obtained by the data conversion method of the present invention may be stored in a cell as a plasmid or integrated into a genome of a cell.
  • the DNA sequence obtained by the data conversion method of the present invention can be introduced into a cell for storage by any suitable means; for example, the DNA sequence is cloned into a eukaryotic expression vector and then directly transformed into yeast cells for subculture, or the DNA sequence is directly integrated into the yeast genome for storage.
  • the DNA sequence stored in the cell can be extracted by any suitable means, for example by directly extracting the plasmid from yeast, transforming it into E. coli for amplification, and extracting the plasmid again for sequencing, or by directly extracting the yeast genome, performing PCR amplification, and taking the target fragment for sequencing.
  • the following steps can be performed: synthesizing a plurality of single-stranded DNA sequences based on a data DNA sequence library converted from data, each synthesized single-stranded DNA sequence having at both ends restriction sites corresponding to the plasmid; each single-stranded DNA sequence and the plasmid are then digested and ligated, so that one single-stranded DNA sequence is inserted into each plasmid. The ligated plasmid was transferred to E.
  • the plasmid containing each single-stranded DNA sequence can be mixed and transformed into yeast cells together.
  • the following steps can be performed: synthesizing a plurality of single-stranded DNA sequences based on a data DNA sequence library converted from data, each synthesized single-stranded DNA sequence having at both ends restriction sites corresponding to the plasmid; each single-stranded DNA sequence and the plasmid are then digested and ligated, so that one single-stranded DNA sequence is inserted into each plasmid. The ligated plasmid is transferred into E.
  • the amplified plasmid is extracted and verified by restriction enzyme digestion, and the verified plasmid is digested to obtain the target fragment (i.e., a single-stranded DNA sequence), to both ends of which homologous sequences are ligated.
  • the target fragment having a homologous sequence linked to both ends is homologously recombined with the yeast cell to integrate the target fragment into the yeast cell genome.
  • Yeast cells are then subcultured.
  • the fragment containing each single-stranded DNA sequence can be mixed and homologously recombined with the yeast cell.
  • DNA sequences can be introduced into cells by other methods.
  • the cells used to store the DNA sequence are also not limited to yeast cells. Suitable methods for introducing DNA sequences into cells and suitable cells for storing DNA sequences are well known to those skilled in the art.
  • "One or more" as used in the present invention means one, two or more than two.
  • “One or more strips” as used in the present invention means one, two or more than two.
  • Each short single-stranded DNA sequence is mainly composed of three parts:
  • indexDNA, which contains the position information of the DNA sequence within the entire DNA sequence set, that is, the position information of its data content within the entire data;
  • dataDNA, which contains the data content information;
  • correctionDNA, which is used to check for mutations in the DNA sequence.
  • a protection sequence CG of 2 bases in length is provided between the index DNA sequence and the dataDNA sequence and between the dataDNA sequence and the correction DNA sequence, respectively.
  • Embodiment 1 Conversion and Reduction of Text Data
  • the information stored in the index DNA sequence is a decimal number indicating which string unit of the data text the single-stranded data DNA corresponds to.
  • the length of the indexDNA sequence is set to 15 nt.
  • the index DNA generation module accepts the decimal sequence number N as the starting data for encoding (step a in Figure 3); then, using the decimal-to-ternary conversion algorithm, the decimal number N is converted to a ternary number (process b in Figure 3; the core of the algorithm is to divide N by three and take the remainder, then continue taking the remainder of the resulting quotient, looping until the quotient is less than 3); the ternary number is then written into a fifteen-digit ternary number sequence whose initial state is set to "000000000000000", the unused digits keeping their "0" padding (process c in Figure 3); after that, the fifteen-digit ternary sequence is encoded by a set of conversion algorithms into an index DNA sequence of length 15 nt that does not contain the sequence combinations in the set S = {ATG, CTG, TTG, CAT, CAG, CAA, AAA, TTT, CCC, GGG}.
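  • The decimal-to-ternary step can be sketched as follows: repeated division by three with the remainders collected, followed by left-padding with "0" to a fixed width. The function name `to_ternary_sequence` is hypothetical, and writing the most significant digit first is an assumption about the digit order.

```python
def to_ternary_sequence(n, width=15):
    """Convert a non-negative decimal number to a fixed-width ternary digit string."""
    if n < 0:
        raise ValueError("position number must be non-negative")
    digits = []
    while n >= 3:                 # loop until the quotient is less than 3
        n, remainder = divmod(n, 3)
        digits.append(str(remainder))
    digits.append(str(n))         # the final quotient is the leading digit
    ternary = "".join(reversed(digits))
    if len(ternary) > width:
        raise ValueError("number does not fit in the requested width")
    return ternary.rjust(width, "0")   # unused digits keep the "0" padding


print(to_ternary_sequence(5))    # 000000000000012
```

  • The subsequent mapping from ternary digits to bases depends on the conversion rule table and is not reproduced in this sketch.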
  • for the encoding of the i-th position of the index DNA sequence, the base types at the already encoded (i-2)-th and (i-1)-th positions are examined first, and it is then decided which base is encoded at the position. In other words, the encoding of the i-th base is subject both to the information of its two preceding bases and to the ternary digit that needs to be stored at that site.
  • the number of elements of the candidate base set Sd becomes three, which are A, T, and C, respectively.
  • the number of elements of the candidate base set Sd is reduced to two, but three elements are needed to store the information at the site; the complementary base T is therefore selected as the third element of the candidate base set.
  • This design corresponds to a column with column number 6 in the algorithm table.
  • the situation corresponds to a column with column number 7 in the algorithm table, and G, A, and T are used to store 0, 1, and 2, respectively, thereby reducing the frequency of occurrence of base C.
  • the 15-digit ternary number sequence is encoded digit by digit, starting from the first digit, into a 15-nt index DNA sequence.
  • each position of the two sequences corresponds one-to-one, and the required indexDNA is finally generated.
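  • As a hedged illustration of how the two preceding bases constrain the next base, the sketch below assumes that the ten triplets of the set S are the combinations the encoding is designed to avoid. The helper names are hypothetical, and the sketch does not reproduce the conversion rule table, which also handles the cases where fewer than three candidates remain.

```python
# Assumption: these triplets must not appear in an encoded sequence.
FORBIDDEN = {"ATG", "CTG", "TTG", "CAT", "CAG", "CAA", "AAA", "TTT", "CCC", "GGG"}


def contains_forbidden(seq):
    """Return True if any three consecutive bases of seq form a forbidden triplet."""
    return any(seq[i:i + 3] in FORBIDDEN for i in range(len(seq) - 2))


def candidate_bases(prev_two):
    """Bases that can follow the two preceding bases without forming a forbidden triplet."""
    if len(prev_two) != 2:
        raise ValueError("exactly two preceding bases are required")
    return [base for base in "ATCG" if prev_two + base not in FORBIDDEN]


print(candidate_bases("AT"))   # ['A', 'T', 'C']
```

  • After the bases AT, for instance, G is excluded because it would complete ATG, leaving A, T and C as candidates.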
  • the reduction of the index DNA sequence, i.e. the decoding of the index DNA sequence, is the inverse of the above encoding process, as shown in Figure 4.
  • the module starts when a data DNA sequence is input into the program. First, the index DNA sequence of length 15 nt at the head end is extracted from the entire sequence (step a in Figure 4); it is then converted, by the conversion algorithm between the index DNA sequence and the ternary number sequence, into a fifteen-digit ternary number sequence (process b in Figure 4); after that, the ternary sequence is reduced to a ternary number (process c in Figure 4); the ternary number is further decoded into the decimal sequence number N (process d in Figure 4). The ternary number is restored to the decimal number by
  • $N = \sum_i X_i \times 3^i$, where $X_i$ is the ternary digit at the i-th position and i, counted from 0, indicates which position the digit occupies. Finally, the decimal sequence number N is output, the string data obtained by simultaneously decoding the dataDNA sequence of the same data DNA sequence is stored at the N-th position of the data array, and the program takes in a new data DNA sequence to enter the next cycle (processes e/f in Figure 4).
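  • The restoration of the decimal sequence number from the ternary digits can be sketched as follows. The function name `ternary_to_decimal` is hypothetical, and the sketch assumes that position 0 is the rightmost (least significant) digit of the sequence.

```python
def ternary_to_decimal(digits):
    """Evaluate N = sum of X_i * 3**i, with i = 0 at the rightmost digit."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        x = int(ch)
        if x not in (0, 1, 2):
            raise ValueError("not a ternary digit: " + ch)
        total += x * 3 ** i
    return total


print(ternary_to_decimal("000000000000012"))   # 5
```

  • Leading "0" padding contributes nothing to the sum, so the fifteen-digit sequence can be evaluated directly.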
  • the core part of the above process is the decoding of the 15-nt index DNA sequence into the fifteen-digit ternary number sequence.
  • the dataDNA sequence is generated by using every 20 characters in the string sequence as a conversion unit, and each dataDNA sequence stores 20 characters of information.
  • the process of generating a dataDNA sequence is shown in Figure 5.
  • the encoding of the dataDNA sequence starts when a string of 20 characters is input into the algorithm. Each character is first converted into its corresponding decimal number in the ASCII code table (process a in Figure 5); each decimal number is then converted in turn into binary format, for which the conversion algorithm can call a built-in system function, the generated binary numbers starting with "0b" (process b in Figure 5); after that, each binary number is converted in turn into a 7-bit binary number sequence.
  • the algorithm of this step is to fill the digits following the prefix "0b" of the binary number, in order, into a 7-bit binary number sequence whose initial value is set to "0000000".
  • the 7-bit binary number sequences obtained from all 20 decimal numbers are connected in order into a 140-bit binary number sequence (process c in Figure 5), which is then converted into a dataDNA sequence according to the conversion algorithm between binary number sequences and dataDNA sequences (process d in Figure 5); finally, the dataDNA sequence is output for the next step, the variables in the module are reset to their initial values, and the module waits for the input of the next string conversion unit.
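  • Steps a to c of this process can be sketched in Python, where the built-in `bin` happens to produce the "0b" prefix mentioned above. Whether the original software used this function is not stated; the names `char_to_7bits` and `unit_to_bits` are hypothetical. The final conversion into bases depends on Table 2 and is not included.

```python
def char_to_7bits(ch):
    """Convert one ASCII character to a 7-bit binary number sequence."""
    code = ord(ch)                   # decimal number in the ASCII table
    if code > 127:
        raise ValueError("character does not fit in 7 bits")
    digits = bin(code)[2:]           # drop the "0b" prefix
    return digits.rjust(7, "0")      # fill into a field initialised to "0000000"


def unit_to_bits(unit):
    """Concatenate the 7-bit sequences of all characters in a conversion unit."""
    return "".join(char_to_7bits(ch) for ch in unit)


bits = unit_to_bits("Tsinghua University!")   # a 20-character example unit
print(len(bits))                               # 140
```

  • A 20-character unit therefore always yields a 140-bit sequence, whatever the characters are.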
  • the core part of the above process is the part of the 140-bit binary number sequence converted to the dataDNA sequence (as shown in the d process in Figure 5).
  • the algorithm design is shown in Table 2.
  • the first two bases of the dataDNA sequence are encoded according to the algorithm in the case of X2 ⁇ B.
  • the decoding of the dataDNA sequence is the inverse of the above process.
  • the programming flow is shown in Figure 6.
  • the module starts when a data DNA sequence is input into the program.
  • the module first grabs the dataDNA sequence within the data DNA sequence, that is, positions [17:-17] (step a in Figure 6); the dataDNA sequence is then decoded into a 140-bit binary number sequence by the conversion algorithm between the dataDNA sequence and the binary number sequence (Table 2) (see Figure 6).
  • this 140-bit binary number sequence is in fact the concatenation of twenty 7-bit binary number sequences, which are now separated from one another, and the binary number stored in each is restored in turn (process c in Figure 6); the binary number identifier "0b" is then added to each binary number, and a built-in system function is called to decode it into a decimal number (process d in Figure 6); the character corresponding to the decimal number in the ASCII table is written out (process e in Figure 6); finally, the 20 characters are joined in order into a 20-byte string, the string is output from the module, and all variables of the module are reset to their initial state (processes f/g in Figure 6).
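  • Steps c to e of the decoding can be sketched as the inverse operation, assuming the 140-bit sequence has already been recovered from the bases. The function name `bits_to_unit` is hypothetical; Python's built-in `int` accepts the "0b" identifier when base 2 is given.

```python
def bits_to_unit(bits, bits_per_char=7):
    """Split a binary sequence into 7-bit groups and restore the characters."""
    if len(bits) % bits_per_char != 0:
        raise ValueError("length is not a multiple of the group size")
    chars = []
    for i in range(0, len(bits), bits_per_char):
        group = bits[i:i + bits_per_char]
        code = int("0b" + group, 2)   # add the "0b" identifier, decode to decimal
        chars.append(chr(code))       # look up the character in the ASCII table
    return "".join(chars)


print(bits_to_unit("10000010100000"))   # "A "
```

  • Applied to a 140-bit sequence, the function returns the 20-character string of the conversion unit.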
  • the decoding of the dataDNA sequence into the 140-bit binary number sequence is the core of the module; its algorithm design is shown in Table 2.
  • the base C only functions as a placeholder, does not store any information, and therefore does not restore any content; the above process is continued until the last two bases of the dataDNA sequence, and the two bases at the end are decoded according to Table 4.
  • the correctionDNA is mainly composed of two parts, a preliminary judgment sequence of 4 nt in length and a depth judgment sequence of 10 nt in length.
  • the function of the preliminary judgment sequence is to determine the type of single-base mutation in the sequence (base substitution, base deletion or base insertion) and the base type involved in the mutation (that is, between which two bases it occurs).
  • the function of the depth judgment sequence is to determine, on the basis of the result of the preliminary judgment sequence, the site of the mutation and the specific mutation. After the mutation is corrected, the original sequence can be restored.
  • the preliminary judgment sequence generation algorithm relies on the mathematical function:
  • N(i) is the number of i bases present in the index DNA sequence and the dataDNA sequence.
  • the preliminary judgment sequence is CGGC
  • val(i) is the value of base i, as shown in Table 5; position(i) is the position coordinate of base i; N is the total length of the index DNA sequence and the dataDNA sequence.
  • For each data DNA sequence, a decimal summation result sum is generated; the decimal number is converted into a ternary number and written into a 10-digit ternary number sequence, which is then converted into a 10 nt depth judgment sequence according to the indexDNA sequence conversion algorithm (the conversion algorithm between ternary number sequences and DNA sequences, Table 1).
  • a protective base C is added between the two moieties.
  • a 15 nt correction sequence is generated which will be ligated to the end of the data DNA sequence, resulting in a complete DNA sequence containing the three portions of index DNA, dataDNA, and correction DNA.
  • the protective base C at the two partial junctions is added between the preliminary judgment sequence and the depth judgment sequence, and the correction DNA sequence is CGGCcGGCGAATCCT.
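  • The layout of a complete sequence in this embodiment can be sketched as a set of slices, assuming a 15 nt index DNA, a 2 nt protection sequence on each side of the dataDNA, and a 15 nt correction DNA made up of a 4 nt preliminary judgment sequence, one protective base and a 10 nt depth judgment sequence. The function name `split_fields` and the dictionary keys are hypothetical.

```python
INDEX_LEN = 15        # index DNA length in this embodiment
PROTECT_LEN = 2       # protection sequence on each side of the dataDNA
CORRECTION_LEN = 15   # 4 nt preliminary + 1 protective base + 10 nt depth


def split_fields(seq):
    """Split a complete data DNA sequence into its parts by position."""
    seq = seq.upper()
    if len(seq) <= INDEX_LEN + 2 * PROTECT_LEN + CORRECTION_LEN:
        raise ValueError("sequence too short for the assumed layout")
    correction = seq[-CORRECTION_LEN:]
    head = INDEX_LEN + PROTECT_LEN          # 17
    tail = PROTECT_LEN + CORRECTION_LEN     # 17
    return {
        "index": seq[:INDEX_LEN],
        "data": seq[head:-tail],
        "preliminary": correction[:4],
        "depth": correction[5:],
    }


fields = split_fields("ACGTACGTACGTACG" + "CG" + "TTAGGCTA" + "CG" + "CGGCcGGCGAATCCT")
```

  • The sketch does not check that the protection bases are actually present; it only separates the fields by position.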
  • the module begins by inputting a data DNA sequence into the program.
  • the module first grabs the correction DNA sequence at the end of the data DNA sequence. The preliminary judgment sequence is restored to a judgment sequence consisting of 1 and -1, also four digits long, which stores the judgment values of the numbers of each base in the original data DNA sequence; at the same time, the 10 nt depth judgment sequence is restored to a decimal number (the algorithm is entirely analogous to the reduction of the index DNA sequence and is not described again), which represents the base bitwise weighted summation of the original data DNA sequence.
  • the index DNA and dataDNA portions of the data DNA received by the module are evaluated with the preliminary judgment function and the depth judgment function to obtain the base number judgment values and the base bitwise weighted summation value of the received data DNA sequence. These results are then compared with the results for the original data DNA restored from the correction DNA sequence.
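  • The formula of the preliminary judgment function is not reproduced in this text. The sketch below assumes, purely for illustration, that the judgment value of a base is 1 when it occurs an even number of times and -1 when it occurs an odd number of times, that 1 is written as G and -1 as C, and that the four values are ordered A, T, C, G. This assumption reproduces the preliminary judgment sequence CGGC for the 16 nt example sequence ATGCTTCGACGTCGAG, but it is not confirmed by the source; all names are hypothetical.

```python
BASE_ORDER = "ATCG"   # assumed order of the four judgment values


def judgment_values(seq):
    """Assumed rule: 1 for an even count of a base, -1 for an odd count."""
    return [1 if seq.count(base) % 2 == 0 else -1 for base in BASE_ORDER]


def preliminary_sequence(seq):
    """Write the judgment values as bases: 1 as G, -1 as C."""
    return "".join("G" if value == 1 else "C" for value in judgment_values(seq))


def changed_bases(stored_preliminary, observed_seq):
    """Bases whose judgment value differs between the stored and observed sequences."""
    observed = preliminary_sequence(observed_seq)
    return [base for base, s, o in zip(BASE_ORDER, stored_preliminary, observed) if s != o]


print(changed_bases("CGGC", "ATCCTTCGACGTCGAG"))   # ['C', 'G']
```

  • Under this assumption a substitution changes the values of two bases while the length stays the same, and an insertion or deletion changes the value of one base while the length changes by one.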
  • the mutated sequence ATCCTTCGACGTCGAGgcCGGCcGGCGAATCCT was sequenced, and the correction DNA sequence was first restored to obtain:
  • the mutation site was obtained as $|\mathrm{sum}' - \mathrm{sum}| / (4 - 3) = 3$. Further, from sum' < sum, the mutation is determined to be from G to C. Therefore, it was finally determined that the third base in the dataDNA sequence had mutated from G to C, and the original sequence was obtained by restoring this site.
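  • The substitution example can be checked with a short sketch. It assumes the weighted summation is the sum of val(base) × position with positions counted from 1, and it takes the pair of bases involved (C and G) as already known from the preliminary judgment. The function names are hypothetical.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}


def weighted_sum(seq):
    """Assumed form: sum of val(base) * position, positions counted from 1."""
    return sum(VAL[base] * pos for pos, base in enumerate(seq, start=1))


def correct_substitution(observed, expected_sum, base_a, base_b):
    """Locate a single substitution between base_a and base_b and revert it.

    Returns (position, corrected sequence), or None if no consistent
    position exists.
    """
    step = abs(VAL[base_a] - VAL[base_b])
    pos, rest = divmod(abs(weighted_sum(observed) - expected_sum), step)
    if rest != 0 or not 1 <= pos <= len(observed):
        return None
    current = observed[pos - 1]
    if current not in (base_a, base_b):
        return None
    original = base_b if current == base_a else base_a
    return pos, observed[:pos - 1] + original + observed[pos:]


expected = weighted_sum("ATGCTTCGACGTCGAG")   # stands in for the restored depth value
print(correct_substitution("ATCCTTCGACGTCGAG", expected, "C", "G"))
```

  • For the mutated 16 nt sequence the sketch locates position 3 and restores G there, in agreement with the result stated above.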
  • the mutated sequence -ATGCTATCGACGTCGAGgcCGGCcGGCGAATCCT was sequenced, and the correction DNA sequence was first restored to obtain:
  • the mutated sequence -ATGCTT-GACGTCGAGgcCGGCcGGCGAATCCT was sequenced, and the correction DNA sequence was first restored to obtain:
  • the conversion of the data text to data DNA sequences is performed using 20 characters as one conversion unit.
  • the serial number of the unit is passed to the indexDNA generation module, which generates an indexDNA sequence identifying that serial number; at the same time, the string is passed to the dataDNA generation module, which generates a dataDNA sequence storing the characters of the unit.
  • the converter then accepts the next 20-byte string conversion unit, and so on until the whole txt text has been converted into data DNA sequences, giving a data DNA sequence library that stores all the information of the original data.
  • a schematic representation of the restoration of the complete data DNA sequence is shown in FIG. 8.
  • the data DNA sequence library stored in the data DNA cell bank is sequenced and saved in txt format. Each line of the text is one data DNA sequence, and the sequences appear in no particular order.
  • the conversion software starts from the first line of the txt text.
  • the complete data DNA sequence first passes through the correction module, where the indexDNA sequence and the dataDNA sequence are checked and restored by the error correction mechanism.
  • the indexDNA sequence and the dataDNA sequence extracted by the program are then passed to the index module and the data module respectively: the former restores the serial number of the data segment, and the latter restores the data stored in the DNA of that segment, that is, a 20-byte character string.
  • the character string is stored at the position corresponding to its serial number in the generated text, and the converter takes the next line of the txt text, repeating the loop. The result is text data made up of characters from the ASCII table, which then undergoes data format conversion to give the final restored data.
  • the first-generation converter had no index or correction module, so it could only convert very short text.
  • because there is no indexDNA or correctionDNA portion, the DNA sequence is shorter and the conversion is more efficient, which lowers cost at the application level.
  • in the short term, application to the biological storage of short text is expected to be the more common case.
  • the test text was translated into "Dai Lab, Tsinghua University, Synthetic Yeast, Synthetic Biology" and converted into the dataDNA sequence as shown in Table 6:
  • the above data DNA sequence was transformed into yeast and stored in a plasmid form and integrated into the genome for storage and subculture. After 100 generations, these fragments were extracted and sequenced.
  • the sequences of the recovered data DNA were essentially the same as in the initial state. Only one group integrated into the genome showed a single-base loss, as shown in Figure 9. This also confirms the need for the error correction mechanism added later.
  • an encryption mechanism was introduced in the second-generation converter and tested with the text "Hello, World!". As shown in Table 7, the same text generates different dataDNA sequences under different usernames and passwords. To restore the dataDNA data, the correct username and password must be provided before decoding can succeed, which keeps the user's data more secure and confidential.
  • the third generation of biotransformation software is mainly for large-scale data storage tasks.
  • the index module and the correction module were added.
  • the Tsinghua University emblem shown in Figure 10
  • the Tsinghua University song lyrics were used as test objects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

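Provided are a method, an apparatus, and a software product for converting data into data DNA sequences and for restoring the resulting DNA sequence library to the original data, together with a storage medium storing the software product. The method for converting data into data DNA sequences comprises: dividing the data into one or more data conversion units, providing the binary sequence of each data conversion unit, and converting each data unit into one data DNA sequence according to the dataDNA sequence conversion rule, thereby obtaining a data DNA sequence library. This makes it possible to store data in living organisms by constructing a data DNA library.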

Description

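Method for biologically storing and restoring data
Technical Field
The present invention belongs to the fields of bioinformatics, synthetic biology, and computing, and in particular relates to a conversion method capable of converting data into biologically compatible DNA sequences and of restoring the resulting DNA sequence library to the original data.
Background Art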
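The 21st century is the century of the life sciences, and also of information and big data. Information technology is developing rapidly, and one important accompanying problem is how to handle ever-growing volumes of data. According to International Data Corporation, the total amount of information data generated worldwide had reached about 0.8 ZB by 2009 (1 ZB = 1.18×10^21 B), and the same organization predicts that the global total will reach 40 ZB by 2020. At this scale, existing data storage technologies show their shortcomings: low storage density, high energy consumption, and short storage lifetime. A new approach to data storage is increasingly needed. Against this background, DNA, the biological macromolecule that has long carried the task of storing genetic information, has gradually attracted scientists' attention. As the carrier of genetic information, DNA has a data storage density far beyond existing storage technologies; it keeps stored information intact even in suboptimal environments; its lifetime can be very long; and its information can be copied through self-replication or artificial amplification.
Much effort has gone into using DNA information storage to store data biologically. Church et al., through "fragmentation" of the data DNA and ASCII-based binary conversion, departed from the earlier idea of converting all the data into one complete long single-stranded DNA and instead stored data in a series of partially overlapping short DNA sequences (the full set of sequences represents the complete data). Building on this, Goldman et al. further optimized the strategy: they adopted a ternary conversion algorithm to raise the information storage rate, used "free bases" to prevent single-base consecutive repeats, and used partially overlapping short sequences to produce fourfold redundancy, increasing the copies of data DNA to guard against errors in DNA synthesis, preservation, and sequencing. Church, Goldman, and their colleagues held that the resulting data DNA should be kept in vitro, and that transferring data DNA into a biological carrier has no economic benefit and would instead cause many problems. Storage of synthetic data DNA inside a biological carrier was actually achieved by David Haughton et al., who inserted data DNA into unused regions of the noncoding DNA of carrier cells, used a "quasi-quaternary" algorithm to obtain a high information storage rate while preventing start codons, and used LDPC codes plus a modified watermark synchronisation code for resynchronization and error correction after gene mutation. By these means the foreign DNA does not significantly affect the life processes of the carrier organism, and mutations that passaging of the carrier organism introduces into the data DNA sequence can be tolerated.
Although previous work on storing data in DNA has made great progress, many problems remain. First, the binary algorithm adopted by Church et al. leaves much room for improvement in information storage density, and the higher mutation rate caused by single-base consecutive repeats was not addressed. Second, although Professor Goldman's team improved both problems with a ternary algorithm, the information storage density they obtained, 2.2 PB per gram of single-stranded DNA, is still far from the theoretical value of 445 EB per gram. This is partly a limitation of ternary conversion itself, and partly because the fourfold-redundancy error correction mechanism increases the sequence length to 4 times the original, cutting conversion efficiency to one quarter and raising the cost of DNA synthesis and sequencing fourfold. Moreover, Church, Goldman, and their colleagues only solved the problem of storing data in DNA kept in vitro; they gave no good solution to the biological compatibility and error correction problems that arise when data DNA is placed in a living organism. Finally, David Haughton et al., from the computing field, combined a "quasi-quaternary" algorithm with channel coding to raise the information storage density significantly and gave a near-optimal solution for biological compatibility and error correction, but problems remain here too: in the "quasi-quaternary" algorithm the last 1 or 2 bits of the 0/1 binary sequence cannot be encoded correctly, and start codons must be prevented when the position information sequence is generated and merged. In addition, David Haughton et al. only gave a scheme for converting data into data DNA sequences; they gave no scheme for the complete process of biological storage and made no practical attempt or test.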
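Summary of the Invention
The present invention provides a method for converting data into data DNA sequences, using DNA sequences as the information storage medium. DNA sequences obtained with the method are suitable for storage in living organisms, for example stored in cells as plasmids or integrated into the cell genome.
In the method of the invention, data of large volume is divided into data conversion units, and each data conversion unit is converted into one short single-stranded DNA sequence, so that the data becomes a set of short single-stranded DNA sequences. The length of each short sequence is suitable for genetic manipulation, for example for cloning into a plasmid or for integration into the cell genome, which makes it convenient to store the converted DNA sequences in living organisms.
In the invention, a specially designed dataDNA sequence conversion rule is used to convert a data conversion unit into a dataDNA sequence representing the data content of that unit, and to restore the dataDNA sequence in a short single-stranded DNA sequence to the binary sequence of the data conversion unit. The rule prevents start codons and single-base consecutive repeats from arising in the dataDNA sequence. The dataDNA sequence conversion rule is: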
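(a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d = [i-2, i-1];
(b) for the first two positions of the dataDNA sequence, binary digits are converted to bases according to the correspondence given in the table below for the condition d ∉ {AT, CT, TT, CA, AA, GG, CC};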
Figure PCTCN2016097398-appb-000002
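* where, when d = [C, A], position i holds base C, and this base C does not correspond to any binary digits
(c) from the third position of the dataDNA sequence onward, conversion proceeds according to the table above: first determine which condition in the table position i satisfies, then convert between binary digits and the base at position i according to the correspondence for that condition;
(d) when 1 or 2 bits of the binary sequence remain, binary digits are converted to bases according to the table below
Bases            AC   TC   CG   GA   GT   GC
Binary sequence  0    1    00   01   10   11
Unless otherwise specified, the "dataDNA sequence conversion rule" mentioned in any scheme below refers to the dataDNA sequence conversion rule above.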
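In the invention, each short single-stranded DNA sequence may further contain an indexDNA sequence representing the position of the data conversion unit, indicating where the data conversion unit carried by that short sequence lies within the whole data, so that when a set of short single-stranded DNA sequences is restored to a series of data conversion units, the units can be spliced back into the original data. To obtain the indexDNA sequence, the position number of the data conversion unit in the data is first converted into a ternary sequence with a fixed number of digits, and a specially designed indexDNA sequence conversion rule then converts that ternary sequence into an indexDNA sequence with as many bases as the ternary sequence has digits. During restoration, the indexDNA sequence is first converted into a ternary sequence by the same rule, and the ternary sequence is then converted into the position number of the data conversion unit in the data. The indexDNA sequence conversion rule is:
(a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d = [i-2, i-1];
(b) for the first two positions of the indexDNA sequence, ternary digits are converted to bases according to the correspondence given in the table below for the condition d ∉ {AT, CT, TT, CA, AA, CC, GG};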
Figure PCTCN2016097398-appb-000004
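(c) from the third position of the indexDNA sequence onward, conversion proceeds according to the table above: first determine which group of conditions in the table position i satisfies, then convert between the ternary digit and the base at position i according to the correspondence for that condition.
Unless otherwise specified, the "indexDNA sequence conversion rule" mentioned in any scheme below refers to the indexDNA sequence conversion rule above.
The invention also provides a specially designed means of guarding against mutations that may arise during in vitro handling and cell passaging: each short single-stranded DNA sequence contains a correctionDNA sequence used to check whether that sequence has mutated and to correct the mutation.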
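According to one aspect of the invention, a method for converting data into data DNA sequences is provided, comprising dividing the data into one or more data conversion units, providing the binary sequence of each data conversion unit, and converting each data unit into one data DNA sequence by the following step, thereby obtaining a data DNA sequence library; the library contains one or more data DNA sequences, each converted from one data conversion unit; the step comprises: converting the binary sequence of each data conversion unit into one dataDNA sequence according to the dataDNA sequence conversion rule, this dataDNA sequence being one data DNA sequence.
The invention also provides another method for converting data into data DNA sequences, comprising dividing the data into one or more data conversion units, providing the binary sequence of each data conversion unit, and converting each data conversion unit into one data DNA sequence by the following steps, thereby obtaining a data DNA sequence library; the library contains one or more data DNA sequences, each converted from one data conversion unit; the steps comprise:
(1) converting the position number of the data conversion unit in the data into a ternary sequence with a fixed number of digits, and converting that ternary sequence, according to the indexDNA sequence conversion rule, into an indexDNA sequence with as many bases as the ternary sequence has digits;
(2) converting the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule;
(3) joining the indexDNA sequence and the dataDNA sequence of the data conversion unit, with a protective sequence 2 bases long inserted at the junction, to obtain an index+dataDNA sequence, which is one data DNA sequence.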
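The invention also provides a method for converting data into data DNA sequences that contain a mutation correction sequence, comprising dividing the data into one or more data conversion units, providing the binary sequence of each data conversion unit, and converting each data conversion unit into one data DNA sequence containing a mutation correction sequence by the following steps, thereby obtaining a data DNA sequence library; the library contains one or more data DNA sequences, each converted from one data conversion unit; the steps comprise:
(1) converting the binary sequence of the data conversion unit into a preliminary data DNA sequence that does not contain a mutation correction sequence, the preliminary data DNA sequence containing the data content of the data conversion unit;
(2) first generating a 4-base preliminary judgment sequence from the preliminary data DNA sequence: the base-count judgment value X(i) is calculated for i = A, T, C, G by the formula
$$X(i) = (-1)^{N(i)}$$
where i = A, T, C, G, and N(i) is the number of occurrences of base i in the preliminary data DNA sequence;
the 4 bases of the preliminary judgment sequence store the base-count judgment values X(i) for i = A, T, C, G respectively, with bases C and G storing -1 and 1 respectively, which gives the preliminary judgment sequence;
then generating a 10-base depth judgment sequence from the preliminary data DNA sequence: the position-weighted sum of the bases of the preliminary data DNA sequence is calculated by the formula
$$\mathrm{sum} = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) corresponding to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence;
the value of the position-weighted sum is converted into a 10-digit ternary sequence, from which the depth judgment sequence is generated;
the preliminary judgment sequence and the depth judgment sequence are joined, with a protective base C inserted at the junction, to obtain the correctionDNA sequence;
(3) joining the preliminary data DNA sequence and the correctionDNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain a data DNA sequence containing a mutation correction sequence.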
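In some preferred embodiments of the method for converting data into data DNA sequences containing a mutation correction sequence, step (1) comprises: converting the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule, and using that dataDNA sequence as the preliminary data DNA sequence without a mutation correction sequence.
In other preferred embodiments of the method, step (1) comprises:
(1-1) converting the position number of the data conversion unit in the data into a ternary sequence with a fixed number of digits, and converting that ternary sequence, according to the indexDNA sequence conversion rule, into an indexDNA sequence with as many bases as the ternary sequence has digits;
(1-2) converting the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule;
(1-3) joining the indexDNA sequence and the dataDNA sequence of the data conversion unit, with a protective sequence 2 bases long inserted at the junction, to obtain an index+dataDNA sequence, which serves as the preliminary data DNA sequence without a mutation correction sequence.
In this embodiment, each conversion unit of the data is converted into one data DNA sequence containing the position information of the data conversion unit, its data content, and a mutation correction sequence; preferably, the correctionDNA is joined at the dataDNA end of the index+dataDNA sequence.
In further embodiments of the method, other methods may be used in step (1) to convert the binary sequence of the data conversion unit into the preliminary data DNA sequence without a mutation correction sequence.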
本发明还进一步提供加密的数据DNA序列转换方法,包括:
(1)提供用户名和密码,根据用户名和密码随机生成dataDNA序列转换规则中每一组对应关系中特定二进制数和特定碱基之间的对应方式;
(2)利用前述任一种方法将数据转换为数据DNA序列,其中按照dataDNA序列转换规则将数据转换单元的二进制数序列转换为dataDNA序列时,按照步骤(1)生成的对应方式将特定二进制数转换为相应的特定碱基。
在一些实施方案中,前述的任一种数据转换方法是在计算机上实施的方法。
根据本发明的另一个方面,提供利用DNA序列存储数据的方法,包括:利用本发明所述的任何一种数据转换方法将数据转换为数据DNA序列,合成所述DNA序列,以及储存合成的DNA序列。
在一个实施方案中,所述储存合成的DNA序列是将DNA序列以质粒形式储存在细胞中,或者是将DNA序列整合到细胞基因组中。
根据本发明的另一方面,提供将测序获得的DNA序列还原为数据的方法,包括:
(1)提供测序获得的DNA序列,其中所述DNA序列包括表示数据转换单元的数据内容信息的dataDNA序列;
(2)按照本发明的dataDNA序列转换规则将dataDNA序列还原为数据。
在一些实施方案中,步骤(2)可以是将dataDNA序列还原为二进制数形式的数据,或者步骤(2)可以包括将dataDNA序列还原为二进制数形式的数据以及进一步由该二进制数形式的数据还原为原始数据。
本发明还提供另一种将测序获得的DNA序列还原为数据的方法,包括:
(1)提供测序获得的DNA序列,所述DNA序列的序列为多条数据DNA序列,每条数据DNA序列包括表示数据转换单元位置信息的indexDNA序列和表示数据转换单元的数据内容信息的dataDNA序列;
(2)按照indexDNA序列转换规则将每条数据DNA序列中的indexDNA序列还原为三进制数序列,再将该三进制数序列还原为该转换单元在数据中的位置编号;
(3)按照dataDNA序列转换规则将每条数据DNA序列中的dataDNA序列还原为数据;
(4)将由每条数据DNA序列的dataDNA序列还原而来的数据按照其位置编号顺序相连,获得还原后的数据。
在一些实施方案中,步骤(3)可以是将dataDNA序列还原为二进制数形式的数据,或者可以进一步包括将该二进制数形式的数据进一步还原成的字符串。步骤(4)中所获得的还原后的数据,可以是二进制数形式的数据,或者可以是由该二进制数形式的数据进一步还原而成的原始数据,或者还可以是由步骤(3)获得的字符串按照其位置编号顺序相连获得的字符串数据或由该字符串数据进一步还原而成的数据。
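The invention also provides a method for correcting a sequenced DNA sequence and restoring it to data, comprising:
(1) providing a DNA sequence obtained by sequencing, the DNA sequence containing a preliminary data DNA sequence and a mutation correction sequence, where the preliminary data DNA sequence contains the data content of a data conversion unit, and the preliminary data DNA sequence within the sequenced DNA sequence has at most one single-base mutation;
(2) from the sequenced preliminary data DNA sequence, calculating its base-count judgment values X'(i) by the formula
$$X'(i) = (-1)^{N(i)}$$
where i = A, T, C, G, and N(i) is the number of occurrences of base i in the sequenced preliminary data DNA sequence;
the values X'(i) are compared with the base-count judgment values X(i) restored, by the same rule, from the preliminary judgment sequence in the mutation correction sequence contained in the sequenced DNA sequence:
if the judgment values of two bases have changed, the sequenced preliminary data DNA sequence has undergone a base substitution relative to the unmutated sequence, and the substitution is one of these two bases replaced by the other;
if the judgment value of only one base has changed, the sequenced preliminary data DNA sequence has undergone an insertion or deletion of that base relative to the unmutated sequence;
if no judgment value has changed, the sequenced preliminary data DNA sequence has not mutated;
(3) from the sequenced preliminary data DNA sequence, calculating its position-weighted sum of bases sum' by the formula
$$\mathrm{sum}' = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) corresponding to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the sequenced preliminary data DNA sequence;
sum' is compared with the position-weighted sum restored, by the same rule, from the depth judgment sequence in the mutation correction sequence contained in the sequenced DNA sequence;
where the sequenced preliminary data DNA sequence has undergone a base substitution relative to the unmutated sequence: if sum' > sum, a base with a smaller val(i) was replaced by a base with a larger val(i); if sum' < sum, a base with a larger val(i) was replaced by a base with a smaller val(i); the position of the substitution is the absolute value of the quotient obtained by dividing the difference between sum' and sum by the difference between the val(i) of the two bases; the base at that position is replaced by the other of the two bases, correcting the sequenced sequence to the unmutated preliminary data DNA sequence;
where the sequenced preliminary data DNA sequence has undergone an insertion or deletion of one base relative to the unmutated sequence:
if sum' > sum, a base was inserted, and the insertion position is found as follows: starting from the first occurrence of that base in the sequenced preliminary data DNA sequence, the base is deleted at each position where it occurs, one position at a time, and after each deletion the position-weighted sum sum″ of the resulting sequence is calculated by the formula
$$\mathrm{sum}'' = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) corresponding to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence after the deletion;
when the sum″ calculated after deleting the base at some position equals the sum restored, by the same rule, from the depth judgment sequence in the mutation correction sequence, that position is the insertion site; the base at that position is deleted, correcting the sequenced sequence to the unmutated preliminary data DNA sequence;
if sum' < sum, a base was deleted, and the deletion position is found as follows: starting from the first position of the sequenced preliminary data DNA sequence, the base is inserted at each position in turn, and after each insertion the position-weighted sum sum″′ of the resulting sequence is calculated by the formula
$$\mathrm{sum}''' = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) corresponding to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence after the insertion;
when the sum″′ calculated after inserting the base at some position equals the sum restored, by the same rule, from the depth judgment sequence in the mutation correction sequence, that position is the deletion site; the base is inserted at that position, correcting the sequenced sequence to the unmutated preliminary data DNA sequence;
(4) restoring the unmutated preliminary data DNA sequence to data.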
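Steps (2) and (3) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's program: the function names `parity`, `weighted_sum`, and `correct` are hypothetical, positions are counted from 1, and the reference values `x_ref` and `sum_ref` are assumed to have already been recovered from the correction sequence.

```python
# Hypothetical helper names; val(A), val(T), val(C), val(G) = 1, 2, 3, 4 as in the text.
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}
BASES = "ATCG"  # order in which the preliminary judgment sequence stores X(i)

def parity(seq):
    """Base-count judgment values X(i) = (-1)^N(i) for i = A, T, C, G."""
    return tuple((-1) ** seq.count(b) for b in BASES)

def weighted_sum(seq):
    """Position-weighted sum of base values, positions counted from 1."""
    return sum(VAL[b] * pos for pos, b in enumerate(seq, start=1))

def correct(seq, x_ref, sum_ref):
    """Undo at most one single-base mutation in seq, given the reference values."""
    changed = [b for b, old, new in zip(BASES, x_ref, parity(seq)) if old != new]
    s_now = weighted_sum(seq)
    if not changed:
        return seq                                   # no mutation detected
    if len(changed) == 2:                            # substitution between two bases
        lo, hi = sorted(changed, key=VAL.get)
        pos = abs(s_now - sum_ref) // (VAL[hi] - VAL[lo])
        # sum' > sum: a smaller-valued base was replaced by a larger one, so restore lo.
        original = lo if s_now > sum_ref else hi
        return seq[:pos - 1] + original + seq[pos:]
    if len(changed) == 1:
        base = changed[0]
        if s_now > sum_ref:                          # insertion: try deleting each copy
            candidates = (seq[:p] + seq[p + 1:] for p, b in enumerate(seq) if b == base)
        else:                                        # deletion: try inserting everywhere
            candidates = (seq[:p] + base + seq[p:] for p in range(len(seq) + 1))
        for cand in candidates:
            if weighted_sum(cand) == sum_ref:
                return cand                          # first match, as the text describes
    raise ValueError("not correctable as a single-base mutation")

original = "ATGCTTCGACGTCGAG"
x_ref, sum_ref = parity(original), weighted_sum(original)
print(correct("ATCCTTCGACGTCGAG", x_ref, sum_ref))   # substitution at position 3
```

The sketch returns the first candidate whose weighted sum matches, which follows the scan order in the text. For a deleted base, more than one insertion position can give the same weighted sum, so the first match is not guaranteed to be the true site in every case.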
在将测序获得的DNA序列校正还原为数据的方法的优选的实施方案中,初步数据DNA序列包含表示数据转换单元的数据内容信息的dataDNA序列,步骤(4)包括按照dataDNA序列转换规则将未突变的初步数据DNA序列包含的dataDNA序列还原为数据。在一些实施方案中,步骤(4)可以是将未突变的初步数据DNA序列包含的dataDNA序列还原为二进制数形式的数据,或者可以进一步包括将该二进制数形式的数据还原为原始数据。
在将测序获得的DNA序列校正还原为数据的方法的另一些优选的实施方案中,在所述方法中,测序获得的DNA序列的序列为多条数据DNA序列,每条数据DNA序列的初步数据DNA序列包含表示数据转换单元位置信息的indexDNA序列和表示数据转换单元的数据内容信息的dataDNA序列,步骤(4)包括:
(4-1)按照indexDNA序列转换规则将每条数据DNA序列中的indexDNA序列还原为三进制数序列,再将该三进制数序列还原为该转换单元在数据中的位置编号;
(4-2)按照dataDNA序列转换规则将每条数据DNA序列中的dataDNA序列还原为数据;
(4-3)将由每条数据DNA序列的dataDNA序列还原而来的数据按照其位置编号顺序相连,获得还原后的数据。
其中,步骤(4-2)可以是将dataDNA序列还原为二进制数形式的数据,或者进一步包括将该二进制数形式的数据还原成字符串;且步骤(4-3)中还原后的数据是二进制数形式的数据,或者是由该二进制数形式的数据进一步还原而成的原始数据,或者是由dataDNA序列还原获得的字符串按照其位置编号顺序相连获得的字符串数据或由该字符串数据进一步还原而成的数据。
本发明还提供将测序获得的加密DNA序列还原为数据的方法,包括:
(1)提供用户名和密码,得到dataDNA序列转换规则中每一组对应关系中特定二进制数和特定碱基之间的对应方式,所述对应方式是将数据转换为所述加密DNA序列时针对同一用户名和密码设定的对应方式;
(2)用前述任一种数据还原方法方法将测序获得的加密DNA序列还原为数据,且其中按照dataDNA序列转换规则将每一条DNA序列中的dataDNA序列还原为数据时,按照步骤(1)得到的对应方式将特定碱基还原为相应的特定二进制数。
在一些实施方案中,本发明的任一种数据还原方法是在计算机上实施的方法。
根据本发明的另一个方面,提供从细胞中获取数据的方法,包括:从细胞中提取储存有数据信息的DNA序列,测序,通过本发明的任一种数据还原方法将测序获得的DNA序列还原为原始数据。
根据本发明的另一个方面,提供用于将数据转换为数据DNA序列的系统,包括输入装置和dataDNA序列转换装置;
其中输入装置用于提供数据转换单元的二进制数序列;
其中dataDNA序列转换装置用于按照dataDNA序列转换规则将所述数据转换单元的二进制数序列转换为dataDNA序列。
在一些实施方案中,所述用于将数据转换为数据DNA序列的系统进一步包括indexDNA生成装置和第一整合装置;其中indexDNA生成装置用于将所述数据转换单元在数据中的位置编号转换为固定位数的三进制数序列,并根据indexDNA序列转换规则将所述三进制数序列转换为碱基数与三进制数序列的位数相同的indexDNA序列;其中第一整合装置用于将所述数据转换单元的indexDNA序列与dataDNA序列相连,并在连接处加入长度为2个碱基的保护序列,得到index+dataDNA序列。
本发明还提供用于将数据转换为包含突变校正序列的数据DNA序列的系统,所述系统包括输入装置、初步数据DNA转换装置、correctionDNA序列生成装置和第二整合装置;
其中输入装置用于提供数据转换单元的二进制数序列;
其中初步数据DNA转换装置用于将数据转换单元的二进制数序列转换为不包含突变校正序列的初步数据DNA序列,所述初步数据DNA序列包含数据转换单元的数据内容信息;
其中correctionDNA序列生成装置用于通过下述方法生成correctionDNA序列:
首先根据初步数据DNA序列生成4位碱基的初步判断序列:根据下式计算i=A,T,C,G时的碱基数量判断值X(i):
$$X(i) = (-1)^{N(i)}$$
其中i=A,T,C,G;N(i)为i碱基在初步数据DNA序列中出现的个数;
用初步判断序列的4位碱基分别储存i=A,T,C,G时的碱基数量判断值X(i),用碱基C和G分别储存-1和1,生成初步判断序列;
然后根据初步数据DNA序列生成10位碱基的深度判断序列:根据下式计算初步数据DNA序列的碱基按位加权求和值sum:
$$\mathrm{sum} = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
其中i=A,T,C,G;val(i)为碱基i的值,val(A)、val(T)、val(C)、val(G)分别对应1、2、3、4;position(i)为碱基i的位置坐标;N为初步数据DNA序列的总长;
将碱基按位加权求和值sum的值转换为10位的三进制数序列,生成深度判断序列;
将初步判断序列与深度判断序列相连,并在连接处加入保护碱基C,获得correctionDNA序列;
其中第二整合装置用于将初步数据DNA序列与correctionDNA序列相连,并在连接处加入长度为2个碱基的保护序列,获得包含突变校正序列的数据DNA序列。
在一些优选的实施方案中,所述初步数据DNA转换装置是dataDNA序列转换装置,用于按照dataDNA序列转换规则将所述数据转换单元的二进制数序列转换为dataDNA序列,以该dataDNA序列作为不包含突变校正序列的初步数据DNA序列;
在另一些优选的实施方案中,所述初步数据DNA转换装置包括indexDNA序列生成装置、dataDNA序列转换装置和第三整合装置;其中indexDNA序列生成装置用于将所述数据转换单元在数据中的位置编号转换为固定位数的三进制数序列,并根据indexDNA序列转换规则将所述三进制数序列转换为碱基数与三进制数序列的位数相同的indexDNA序列;其中dataDNA序列转换装置用于按照dataDNA序列转换规则将所述数据转换单元的二进制数序列转换为dataDNA序列;其中第三整合装置用于将所述数据转换单元的indexDNA序列与dataDNA序列相连,并在连接处加入长度为2个碱基的保护序列,得到index+dataDNA序列,以获得的index+dataDNA序列作为不包含突变校正序列的初步数据DNA序列。优选地,第二整合装置用于将correctionDNA序列连接在初步数据DNA序列中的dataDNA序列一端,并在连接处加入长度为2个碱基的保护序列,获得包含突变校正序列的数据DNA序列。
在前述任一种数据转换系统中,还可以进一步包括加密装置,所述加密装置用户名和密码输入装置和dataDNA序列转换规则随机生成装置;其中用户名和密码输入装置用于提供用户名和密码;其中dataDNA序列转换规则随机生成装置用于根据用户名和密码随机生成dataDNA序列转换规则中每一组对应关系中特定二进制数和特定碱基之间的对应方式;其中dataDNA序列转换装置用于按照dataDNA序列转换规则将数据转换单元的二进制数序列转换为加密的dataDNA序列,其中按照dataDNA序列转换规则随机生成装置生成的对应方式将特定碱基转换为相应的特定二进制数。
根据本发明的另一方面,提供将测序获得的DNA序列还原为数据的系统,包括输入装置和dataDNA序列还原装置;其中输入装置用于提供测序获得的DNA序列,其中所述DNA序列包括表示数据转换单元的数据内容信息的dataDNA序列;其中dataDNA序列还原装置用于按照dataDNA序列转换规则将dataDNA序列还原为数据;
在一些实施方案中,所述dataDNA序列还原装置用于将dataDNA序列还原为二进制数形式的数据,或者用于将dataDNA序列还原为二进制数形式的数据以及进一步将该二进制数形式的数据还原为原始数据。
本发明还提供另一种将测序获得的DNA序列还原为数据的系统,包括输入装置、indexDNA序列还原装置和第四整合装置;其中输入装置用于提供测序获得的DNA序列,所述DNA序列的序列为多条数据DNA序列,每条数据DNA序列包括表示数据转换单元位置信息的indexDNA序列和表示数据转换单元的数据内容信息的dataDNA序列;其中indexDNA序列还原装置用于按照indexDNA序列转换规则将每条数据DNA序列中的indexDNA序列还原为三进制数序列,再将该三进制数序列还原为该转换单元在数据中的位置编号;其中dataDNA序列还原装置用于按照dataDNA序列转换规则将每条数据DNA序列中的dataDNA序列还原为数据;其中第四整合装置用于将由每条数据DNA序列的dataDNA序列还原而来的数据按照其位置编号顺序相连,获得还原后的数据。
在一些实施方案中,所述dataDNA序列还原装置用于将dataDNA序列还原为二进制数形式的数据,或者用于将dataDNA序列还原为二进制数形式的数据以及进一步将该二进制数形式的数据还原成字符串;第四整合装置用于还原获得的数据是二进制数形式的数据,或者是进一步由该二进制数形式的数据还原获得原始数据,或者是由dataDNA序列还原装置还原获得的字符串按照其位置编号顺序相连获得的字符串数据或是进一步由该字符串数据还原获得的原始数据。
本发明还提供将测序获得的DNA序列校正还原为数据的系统,包括输入装置、纠错装置和初步数据DNA序列还原装置;
其中输入装置用于提供测序获得的DNA序列,所述DNA序列包含初步数据DNA序列和突变校正序列,其中所述初步数据DNA序列包含数据转换单元的数据内容信息;所述测序获得的DNA序列中初步数据DNA序列最多具有一个碱基的突变;
其中纠错装置用于通过下述方法将初步数据DNA序列的测序序列还原为未突变的初步数据DNA序列:
(a)根据该初步数据DNA序列的测序序列,按照下式规则计算获得该初步数据DNA序列的测序序列的碱基数量判断值X'(i):
$$X'(i) = (-1)^{N(i)}$$
其中i=A,T,C,G;N(i)为i碱基在该初步数据DNA序列的测序序列中出现的个数;
将该初步数据DNA序列的测序序列的碱基数量判断值X'(i)与由测序获得的DNA序列中包含的突变校正序列中的初步判断序列按相同规则还原获得的碱基数量判断值X(i)对比:
如果有两个碱基的碱基数量判断值发生变化,则表明该初步数据DNA序列的测序序列相对于未突变的初步数据DNA序列发生了碱基替换,且该替换是这两个碱基之一被另一个替换;
如果仅有一个碱基的碱基数量判断值发生变化,则表明该初步数据DNA序列的测序序列相对于未突变的初步数据DNA序列发生了这一个碱基的插入或删除;
如果没有碱基的碱基数量判断值发生变化,则表明该初步数据DNA序列的测序序列未发生突变;
(b)根据该初步数据DNA序列的测序序列,按照下式规则计算获得该初步数据DNA序列的测序序列的碱基按位加权求和值sum':
$$\mathrm{sum}' = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
其中i=A,T,C,G;val(i)为碱基i的值,val(A)、val(T)、val(C)、val(G)分别对应1、2、3、4;position(i)为碱基i的位置坐标;N为该初步数据DNA序列的测序序列的总长;
将该初步数据DNA序列的测序序列的碱基按位加权求和值sum'与由测序获得的DNA序列中包含的突变校正序列中的深度判断序列按相同规则还原获得的碱基按位加权求和值sum对比;
在该初步数据DNA序列的测序序列相对于未突变的初步数据DNA序列发生两个碱基的替换的情况下:如果sum'>sum,则所发生的碱基替换是val(i)值较小的碱基被替换为val(i)值较大的碱基,如果sum'<sum,则所发生的碱基替换是val(i)值较大的碱基被替换为val(i)值较小的碱基,发生碱基替换的位置坐标是sum'和sum之差除以所述两个碱基的val(i)之差所得除数的绝对值,将该位置上的碱基替换为所述两个碱基中的另一个,将测序序列校正为未突变的初步数据DNA序列;
在该初步数据DNA序列的测序序列相对于未突变的初步数据DNA序列发生一个碱基的插入或删除的情况下:
如果sum'>sum,则发生碱基插入,所述碱基插入位置按下述方法判断:从该初步数据DNA序列的测序序列中第一次出现所述碱基的位置开始,逐个删除每一个出现所述碱基的位置上的所述碱基,并在删除后按照下式规则计算获得删除后的初步数据DNA序列的碱基按位加权求和值sum″:
$$\mathrm{sum}'' = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
其中i=A,T,C,G;val(i)为碱基i的值,val(A)、val(T)、val(C)、val(G)分别对应1、2、3、4;position(i)为碱基i的位置坐标;N为删除所述碱基后初步数据DNA序列的总长;
当删除某个位置上的所述碱基之后计算获得的碱基按位加权求和值sum″与由测序获得的DNA序列中包含的突变校正序列中的深度判断序列按相同规则还原获得的碱基按位加权求和值sum相等时,该位置即为所述碱基插入突变位置,将该位置上的所述碱基删除,将测序序列校正为未突变的初步数据DNA序列;
如果sum'<sum,则发生碱基删除,所述碱基删除位置按下述方法判断:从该初步数据DNA序列的测序序列的第一位开始,逐个位置上插入所述碱基,并在插入后按照下式规则计算获得插入后的初步数据DNA序列的碱基按位加权求和值sum″′:
$$\mathrm{sum}''' = \sum_{n=1}^{N} \mathrm{val}(i_n)\cdot \mathrm{position}(i_n)$$
其中i=A,T,C,G;val(i)为碱基i的值,val(A)、val(T)、val(C)、val(G)分别对应1、2、3、4;position(i)为碱基i的位置坐标;N为插入所述碱基后初步数据DNA序列的总长;
当在某个位置上插入所述碱基之后计算获得的碱基按位加权求和值sum″′与由测序获得的DNA序列中包含的突变校正序列中的深度判断序列按相同规则还原获得的碱基按位加权求和值sum相等时,该位置即为所述碱基删除突变位置,在该位置上插入所述碱基即将测序序列校正为未突变的初步数据DNA序列;
其中初步数据DNA序列还原装置用于将未突变的初步数据DNA序列还原为数据。
在将测序获得的DNA序列校正还原为数据的系统的一些优选的实施方案中,所述初步数据DNA序列包含表示数据转换单元的数据内容信息的dataDNA序列,所述初步数据DNA序列还原 装置是dataDNA序列还原装置,用于按照dataDNA序列转换规则将未突变的初步数据DNA序列包含的dataDNA序列还原为数据。在进一步的实施方案中,所述dataDNA序列还原装置用于将未突变的初步数据DNA序列包含的dataDNA序列还原为二进制数形式的数据,或者用于将未突变的初步数据DNA序列包含的dataDNA序列还原为二进制数形式的数据并进一步将该二进制数形式的数据还原成原始数据。
在将测序获得的DNA序列校正还原为数据的系统的另一些优选的实施方案中,测序获得的DNA序列的序列为多条数据DNA序列,每条数据DNA序列的初步数据DNA序列包含表示数据转换单元位置信息的indexDNA序列和表示数据转换单元的数据内容信息的dataDNA序列,所述初步数据DNA序列还原装置包括indexDNA还原装置、dataDNA序列还原装置和第五整合装置;
其中indexDNA还原装置用于按照indexDNA序列转换规则将每条数据DNA序列中的indexDNA序列还原为三进制数序列,再将该三进制数序列还原为该转换单元在数据中的位置编号;
其中dataDNA序列还原装置用于按照dataDNA序列转换规则将每条数据DNA序列中的dataDNA序列还原为数据;
其中第五整合装置,用于将由每条数据DNA序列的dataDNA序列还原而来的数据按照其位置编号顺序相连,获得还原后的数据。
其中,所述dataDNA序列还原装置用于将dataDNA序列还原为二进制数形式的数据,或者用于将dataDNA序列还原为二进制数形式的数据并进一步将该二进制数形式的数据还原成字符串;所述第五整合装置用于获得的还原后的数据是二进制数形式的数据,或者是由该二进制数形式的数据进一步还原而成的原始数据,或者是由dataDNA序列还原装置还原获得的字符串按照其位置编号顺序相连获得的字符串数据或由该字符串数据进一步还原而成的数据。
前述的本发明的任一种数据还原系统还可以进一步包括解密装置,所述解密装置包括输入装置和dataDNA序列转换规则确定装置;
其中输入装置用于提供用户名和密码;
其中dataDNA序列转换规则确定装置用于根据用户名和密码得到dataDNA序列转换规则中每一组对应关系中特定二进制数和特定碱基之间的对应方式,所述对应方式是将数据转换为所述加密DNA序列时针对同一用户名和密码设定的对应方式。
在包括解密装置的系统中,dataDNA序列还原装置用于按照dataDNA序列转换规则将测序获得的加密DNA序列中的dataDNA序列转换为数据,且其中按照dataDNA序列转换规则确定装置确定的对应方式将特定碱基还原为相应的特定二进制数。
根据本发明的另一个方面,提供储存在计算机可读存储介质上的含有程序指令的可执行的软件产品,当由计算机执行时,可将数据转换为数据DNA序列,所述软件产品包括执行本发明所述的任一种数据转换方法的程序指令。
根据本发明的另一个方面,提供储存在计算机可读存储介质上的含有程序指令的可执行的软件产品,当由计算机执行时,可将测序获得的DNA序列还原为数据,所述软件产品包括执行本发明所述的任一种数据还原方法的程序指令。
根据本发明的另一个方面,提供一种计算机可读存储介质,其中储存有本发明所述的任一种软件产品。
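The method and apparatus of the invention prevent start codons and single-base consecutive repeats from arising in data DNA sequences, and cope with mutations that may occur in the data DNA. By designing separate dataDNA, indexDNA, and correctionDNA modules and merging their outputs, the invention produces data DNA sequences that can be restored to the original data, and makes it possible to store large volumes of data in living organisms.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of one example of data conversion and data restoration according to the invention.
FIG. 2 is a schematic diagram of the conversion of text-type data.
FIG. 3 shows the generation of the indexDNA sequence.
FIG. 4 shows the restoration of the indexDNA sequence.
FIG. 5 shows the generation of the dataDNA sequence.
FIG. 6 shows the restoration of the dataDNA sequence.
FIG. 7 is a schematic diagram of the generation of the complete data DNA sequence.
FIG. 8 is a schematic diagram of the restoration of the complete data DNA sequence.
FIG. 9 shows the sequencing result of a data-carrying DNA fragment extracted from cells.
FIG. 10 is the Tsinghua University emblem.
FIG. 11 is the data DNA sequence library obtained by converting the Tsinghua University emblem and the university song lyrics with the method of the invention, then shuffling the sequence order and introducing single-base mutations.
Detailed Description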
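In the invention, the term "data" refers to any form of carrier capable of expressing information. "Data" includes but is not limited to symbols, text, numbers, speech, images, and video. Data may be in binary, hexadecimal, or string form, or in any other form that can be converted to binary form directly or indirectly.
In the invention, the terms "base" and "nucleotide" are used interchangeably and refer to the A, T, C, or G that make up a DNA sequence.
The term "data DNA sequence" as used in the invention refers to a DNA sequence converted from data, that is, a DNA sequence in data form. For storage, a physical DNA molecule is synthesized according to the data DNA sequence and stored in cells.
The terms "data conversion unit" and "conversion unit" are used interchangeably in the invention and refer to a component part of the data. Data is converted to data DNA sequences one data conversion unit at a time, and one data conversion unit is converted into one data DNA sequence. When the data volume is small, all of the data forms one data conversion unit, which is converted into one data DNA sequence for storage. When the data volume is large, the DNA sequence converted from the complete data would be very long and inconvenient to synthesize and store in cells, so the data is divided into multiple conversion units whose binary sequences have a specific length; each unit is converted into one data DNA sequence, so that the whole data becomes multiple data DNA sequences that can be synthesized separately and stored in cells. When data is divided into multiple conversion units, the binary sequences corresponding to the data content of the units preferably have the same length. The multiple data DNA sequences make up a data DNA library. A collection containing those sequences, for example the cells used to store them, may also be called a data DNA library.
When the data volume is small, all of the data may form one data conversion unit. In that case, for example, the data is first converted into binary numbers in units of bytes, and all the bytes are then joined in order into the binary sequence of the data. In some cases, within each 8-bit byte of the binary numbers converted from the original data, the data may occupy only 7 bits, for example when the original data is a character string or can be converted into one; then only 7-bit binary sequences need be used, and all the 7-bit sequences representing the data content are joined in order into the binary sequence of the data conversion unit.
When the data volume is large, the data is divided into multiple conversion units, and the binary sequence for the data content of each unit has a specific length. The "specific length" may be 70-240 bits, preferably 140-175 bits. The original data may first be converted into a binary sequence and then divided into conversion units, or first divided into string units, each of which is then converted into a binary sequence. For example, the original data may first be converted into binary numbers in units of bytes, and a specific number of bytes then joined in order into the binary sequence of a conversion unit. As is well known to those skilled in the art, a byte is an 8-bit binary sequence. Where the data occupies only 7 bits of each byte, as described above, only 7-bit binary sequences need be used, and a specific number of 7-bit sequences are joined in order into a conversion unit. As a further example, when the original data is a character string or can be converted into one, the original data may first be divided into string units of a specific length, each character of a string unit converted into a binary sequence, and the binary sequences of the characters joined in order into the binary sequence of the conversion unit.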
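In the invention, the indexDNA sequence contains the position of each data conversion unit within the data. During conversion, the position number of each data conversion unit is first converted into a ternary sequence, which is then converted into the indexDNA sequence. The number of digits of that ternary sequence, equal to the number of bases of the indexDNA sequence, may be 5-15, preferably 11-15, and most preferably the maximum of 15. The number of bases of the indexDNA sequence determines the size of the library that can be built: with a 15 nt indexDNA, one data DNA library can contain at most 3^15 - 1 = 14,348,906 data DNA sequences, and since each data DNA sequence stores 20 characters of data text, each data DNA library can store at most about 300 MB of data. When less or more data is to be converted, the length of the indexDNA sequence may be reduced or increased as needed. Reducing it raises conversion efficiency; increasing it raises the amount of information the DNA sequences can store.
The "protective sequence" of the invention is a sequence inserted at the junction between the indexDNA sequence and the dataDNA sequence and at the junction between the dataDNA sequence and the correctionDNA sequence. The protective sequence should ensure that no sequence combination from the set S = {ATG, CTG, TTG, CAT, CAG, CAA, AAA, TTT, CCC, GGG} forms at these junctions. In the invention, the protective sequence is preferably CG.
In the invention, the order in which the indexDNA sequence and the dataDNA sequence are joined in the index+dataDNA sequence is not limited: the indexDNA sequence may be at the 5' end and the dataDNA sequence at the 3' end, or the dataDNA sequence at the 5' end and the indexDNA sequence at the 3' end.
In the invention, the order in which the preliminary judgment sequence and the depth judgment sequence are joined in the correctionDNA sequence is not limited: the preliminary judgment sequence may be at the 5' end and the depth judgment sequence at the 3' end, or the reverse.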
本发明中,当提及一个集合中的多个成员分别与另一个集合中的多个成员相对应(例如某些数分别与某些碱基相对应,或某些变量分别与某些值相对应),或用一个集合中的多个成员分别存储另一个集合中的多个成员(例如用某些碱基分别存储某些数)时,如无特别说明,一个集合中的每一个成员所对应的另一个集合中的具体成员没有限制,一个集合中的每一个成员都可以与另一个集合中的任何一个成员相对应。但本领域技术人员应当理解,在连续进行、相互比较、或者有呼应 关系的步骤中,如果都需要应用某个集合和其相应集合的对应关系时,该集合中的特定成员与其相应集合中的特定成员之间的对应方式应当保持一致。
具体来说,例如,在本发明中,在indexDNA序列转换规则和dataDNA转换规则中,在每组三进制数或二进制数与碱基的对应关系中,不同的碱基分别对应不同的三进制数或二进制数,以实现储存数据信息的目的。每组三进制数或二进制数中的每一个数对应的具体碱基的没有限制,每组三进制数或二进制数中的每一个数都可以与对应组的碱基中的任何一个相对应。例如,当一组三进制数0、1、2与一组碱基A、T、C相互对应时,可以是0=A、1=T、2=C,也可以是0=T、1=C、2=A,还可以是0=T、1=A、2=C,或者还可以以其它方式相对应。但是,当对同一套数据中的不同转换单元中应用转换规则时,在相同条件下,具体三进制数或二进制数与具体碱基的对应方式应当相同。所述“相同条件”是指按转换规则表(包括indexDNA序列转换规则表、dataDNA转换规则表)中条件的分组属于同一组。转换规则表中每一行为一组。
再例如,将测序获得的数据DNA序列还原为原始数据时,其中所涉及的某些数与某些碱基的对应方式,以及某些变量与某些值的对应方式,应当与生成该数据DNA序列时所使用的所述数与所述碱基的对应方式,以及所述变量与所述值的对应方式相同。
再例如,在将测序获得的数据DNA序列还原为原始数据的方法中,在通过比较不同序列的碱基按位加权求和值,以确定发生何种突变时,所比较的碱基按位加权求和值的计算公式中val(i)的取值方式应当相同。
本领域技术人员将会理解,在本发明中,将测序获得的数据DNA序列还原为原始数据时,将indexDNA序列转换为三进制数序列所依据的indexDNA序列转换规则与生成该indexDNA序列时使用的indexDNA序列转换规则相同,将原始dataDNA序列转换为二进制数序列所依据的dataDNA序列转换规则与生成该原始dataDNA序列时使用的dataDNA序列转换规则相同。这里所说的“indexDNA序列转换规则相同”或“dataDNA序列转换规则相同”是指在这些转换规则中每一组对应关系中特定二进制数和特定碱基之间的对应方式相同。
本发明所述的“每一组对应关系中特定二进制数和特定碱基之间的对应方式”是指每一个特定的二进制数对应哪一个特定碱基的对应方式。
在本发明方法的加密和解密过程中,针对不同的用户名设定不同的dataDNA序列转换规则中每一组对应关系中特定二进制数和特定碱基之间的对应方式(在本段落中简称对应方式)。在加密的数据转换方法中,根据输入的用户名随机生成一种对应方式,在数据还原方法的解密过程中,根据输入的用户名获取之前对于该用户名生成的那种对应方式,随后按照该对应方式进行进行还原。
在本发明中,当对测序序列进行突变检验和校正时,需要计算测序序列的碱基数量判断值和碱基按位加权求和值,并与测序序列中包含的correctionDNA序列中所含有的碱基数量判断值和碱基按位加权求和值信息进行比较,其中测序序列中包含的correctionDNA序列中所含有的碱基数量判断值和碱基按位加权求和值信息代表着未突变序列的相应值,通过比较可知测序序列相对于未突变序列是否发生了突变。本领域技术人员应当理解,再进行比较时,计算测序序列的碱基数量判断值和碱基按位加权求和值所使用的计算公式和对应方式应当与获得测序序列中包含的correctionDNA序列中的碱基数量判断值和碱基按位加权求和值所使用的计算公式和对应方式相同。这里所说的“对应方式”是指:(1)对于碱基数量判断值,C/G和-1/1的具体对应方式;和/或(2) 对于碱基按位加权求和值,val(A)、val(T)、val(C)、val(G)和1、2、3、4的具体对应方式。
本发明中,“位置编号”优选为十进制数,但也可以是任何能够表明位置顺序并且能与三进制数互相转换的任何数。
本发明中,“一个碱基的突变”是指一个碱基被替换为另一个碱基、或者一个碱基的插入或删除。
本发明中,术语“数据转换方法”是指任何一种将数据转换为数据DNA序列的方法、将数据转换为包含突变校正序列的数据DNA序列的方法、将数据转换为加密的数据DNA序列的方法或加密的数据转换方法。术语“数据还原方法”是指任何一种将测序获得的DNA序列还原为数据的方法或将测序获得的加密DNA序列还原为数据的方法。
利用本发明的数据转换方法获得的DNA序列适合于储存在细胞中。本发明中用于储存该DNA序列的细胞可以是微生物细胞,例如细菌例如大肠杆菌细胞或真菌细胞例如酵母细胞,也可以是任何适合的其它细胞或细胞系,例如昆虫细胞或哺乳动物细胞或细胞系。利用本发明的数据转换方法获得的DNA序列可以以质粒形式储存在细胞中,或者是将DNA序列整合到细胞基因组中。
利用本发明的数据转换方法获得的DNA序列可以通过任何适当的方式被导入到细胞中进行储存,例如将DNA序列克隆到真核表达载体上,然后直接转化进酵母细胞中进行传代储存,或是将DNA序列直接整合进酵母基因组中进行储存。储存在细胞中的DNA序列可以通过任何适当的方式被提取出来,例如直接提取酵母中的质粒后转化进大肠杆菌中进行扩增,再次提取质粒进行测序,或是直接提取酵母基因组后进行PCR扩增,拿目的片段进行测序。
作为将利用本发明的数据转换方法获得的DNA序列以质粒的形式储存在细胞中的操作实例,可以按以下步骤进行:根据由数据转换而成的数据DNA序列库合成多条单链DNA序列,所合成的每条单链DNA序列两端均带有与质粒对应的酶切位点,然后将每条单链DNA序列和质粒酶切并进行连接,每个质粒中插入一条单链DNA序列,将连接好的质粒转入大肠杆菌进行扩增,提取扩增的质粒并通过酶切检测,将检测无误的质粒转化进酵母细胞中。酵母细胞随后被传代培养。其中可以将包含每一条单链DNA序列的质粒混合后一起转化进酵母细胞中。
作为将利用本发明的数据转换方法获得的DNA序列整合到细胞基因组中的操作实例,可以按以下步骤进行:根据由数据转换而成的数据DNA序列库合成多条单链DNA序列,所合成的每条单链DNA序列两端均带有与质粒对应的酶切位点,然后将每条单链DNA序列和质粒酶切并进行连接,每个质粒中插入一条单链DNA序列,将连接好的质粒转入大肠杆菌进行扩增,提取扩增的质粒并通过酶切检测,将检测无误的质粒进行酶切,获得目的片段(即单链DNA序列)后两端连接上同源序列,使两端连接有同源序列的目的片段与酵母细胞进行同源重组,使目的片段整合到酵母细胞基因组中。酵母细胞随后被传代培养。其中可以将包含每一条单链DNA序列的片段混合后一起与酵母细胞进行同源重组。
本领域技术人员知道,上述步骤仅仅是举例说明,可以通过其它方法将DNA序列导入到细胞中。用于储存DNA序列的细胞也不限于酵母细胞。适合的将DNA序列导入到细胞中的方法和适合的用于储存DNA序列的细胞是本领域技术人员公知的。
本发明中所述的“一个或更多个”意指一个、两个或多于两个。本发明中所述的“一条或更多条”意指一条、两条或多于两条。
应当理解,以下描述仅仅是举例说明,并非是限制本发明的范围,本发明的保护范围由权利 要求确定。在不背离本发明的范围和精神的条件下,本发明还可以以其它方式实现。本领域技术人员可以对下述实例进行各种修改和改进,例如改变本发明中使用的具体参数,而不背离本发明的范围和精神。
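FIG. 1 is a schematic diagram of one example of data conversion and data restoration according to the invention. The conversion algorithm turns the data into a set of short single-stranded DNA sequences (the data DNA sequences), and the restoration algorithm turns that set back into the original data. Each short single-stranded DNA sequence consists mainly of three parts: indexDNA, which holds the position of this DNA sequence within the whole set, that is, the position of its data content within the whole data; dataDNA, which holds the data content; and correctionDNA, which is used to check for mutations in the DNA sequence. A protective sequence CG, 2 bases long, sits between the indexDNA and dataDNA sequences and between the dataDNA and correctionDNA sequences.
Example 1: Conversion and restoration of text data
The data conversion and restoration processes of the invention are described below using text-type data as an example.
Data of different types has been preprocessed so that its format is a text file "written" in characters from the ASCII table. The converter therefore faces a text of characters, which can also be regarded as one very long string. The data text is converted into data DNA sequences one string unit at a time. As shown in FIG. 2, every 20 characters form one string, which is one conversion unit and is encoded as one single-stranded data DNA sequence. Starting from the first conversion unit of the data text (#1), each conversion unit (#2, #3, and so on) is encoded in turn, producing multiple single-stranded data DNA sequences.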
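The division into numbered 20-character units can be sketched as follows. The function name `split_units` is hypothetical, and the text does not say how a final unit shorter than 20 characters is handled, so this sketch simply leaves it short.

```python
def split_units(text, size=20):
    """Split text into (serial number, unit) pairs; serial numbers start at 1.

    The last unit is left short if len(text) is not a multiple of size,
    because the source does not specify any padding.
    """
    return [
        (number, text[start:start + size])
        for number, start in enumerate(range(0, len(text), size), start=1)
    ]

for number, unit in split_units("Dai Lab, Tsinghua University, Synthetic Yeast"):
    print(number, repr(unit))
```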
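Generation and restoration of the indexDNA sequence
(1) Generation algorithm for the indexDNA sequence
The information stored in the indexDNA sequence is a decimal number indicating which string unit of the data text this single-stranded data DNA corresponds to. The length of the indexDNA sequence is set to 15 nt, so one data DNA library can contain at most 3^15 - 1 = 14,348,906 data DNA sequences; since each data DNA sequence stores 20 characters of data text, each data DNA library can store at most about 300 MB of data.
The generation of the indexDNA sequence is shown in FIG. 3. When encoding reaches the N-th conversion unit of the data text, the indexDNA generation module takes the decimal serial number N as its starting input (process a in FIG. 3). An algorithm then converts the decimal number N to ternary (process b in FIG. 3; the core of the conversion is to divide N by three and take the remainder, then keep taking remainders of the successive quotients until the quotient is less than 3). The ternary number is then written into a fifteen-digit ternary sequence whose initial state is "000000000000000", with unused digits left as "0" (process c in FIG. 3). The fifteen-digit ternary sequence is next encoded by a conversion algorithm into a 15 nt indexDNA sequence, while the ternary sequence returns to its initial state to await the next cycle (process d in FIG. 3). Finally, the indexDNA sequence is output and merged with its corresponding dataDNA sequence before the next operation, while the indexDNA generation module moves on to the next string unit, setting N = N + 1 and repeating the flow above (processes e/f in FIG. 3).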
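Processes b and c can be sketched as below. The name `to_ternary` is hypothetical, and writing the most significant digit first is an assumption about digit order; the subsequent base encoding of process d depends on Table 1 and is not attempted here.

```python
def to_ternary(n, width=15):
    """Convert a non-negative decimal number to a fixed-width ternary string.

    Repeatedly divide by three and keep the remainders until the quotient
    is less than 3, then left-pad with "0" to the requested width.
    """
    if n < 0:
        raise ValueError("serial number must be non-negative")
    digits = []
    while n >= 3:
        n, remainder = divmod(n, 3)
        digits.append(str(remainder))
    digits.append(str(n))                 # the final quotient is the leading digit
    if len(digits) > width:
        raise ValueError("serial number does not fit in the ternary sequence")
    return "".join(reversed(digits)).rjust(width, "0")

print(to_ternary(10))
```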
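Process d in FIG. 3, the encoding of the fifteen-digit ternary sequence into the fifteen-base indexDNA sequence, is the key to this part; its algorithm design is shown in Table 1.
The indexDNA sequence must avoid start codon sequences and, as far as possible, single-base consecutive repeats; that is, the sequence combinations in the set S = {ATG, CTG, TTG, CAT, CAG, CAA, AAA, TTT, CCC, GGG} must be prevented. To this end, when encoding position i of the indexDNA sequence, the bases already encoded at positions i-2 and i-1 are examined first, and only then is the base for this position chosen. In other words, the base at position i is constrained both by the preceding two bases and by the ternary digit to be stored at that site.
Table 1. indexDNA sequence conversion algorithm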
Figure PCTCN2016097398-appb-000013
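For each position i, the two preceding bases of the indexDNA sequence are written d = [i-2, i-1]. When d ∈ D = {AT, CT, TT, CA, AA, CC, GG}, the base type at position i is constrained by d; when d ∉ D, it is not. Take d = [A, T] as an example, which corresponds to column 0 of the algorithm table: because ATG is a start codon and must not appear in the indexDNA sequence, this site cannot be encoded as G, so the candidate base set Sd shrinks to three elements, A, T, and C, and the conversion is designed as 2 = A, 1 = T, 0 = C. When d = [T, T], which corresponds to column 2, Sd shrinks to two elements, yet three kinds of information must be storable at the site. Because the number of bases in the indexDNA sequence must stay fixed, one base has to be restored to the set: restoring T could introduce a single-base repeat, while restoring G could introduce a start codon. Since avoiding start codons takes priority, T is restored as the third element, and the conversion in this case is 0 = C, 1 = A, 2 = T. Another special case is d = [C, A], which corresponds to column 3: Sd contains only base C, restoring any other base would introduce a start codon, and the fixed length of the indexDNA sequence means information cannot be stored in this case. An additional design therefore ensures that the sequence -CA- never appears in an indexDNA sequence; it corresponds to column 6. When the second element of d is base C, the conversion is 0 = G, 1 = T, 2 = C, which avoids generating CA. Finally, when d ∉ D, which corresponds to column 7, G, A, and T store 0, 1, and 2 respectively, which reduces the frequency of base C. The first two bases are encoded with the column 7 conversion, that is, G = 0, A = 1, T = 2.
On the basis of this algorithm, the 15-digit ternary sequence is encoded digit by digit, starting from the first, into the 15-base indexDNA sequence. Each position of one sequence corresponds to the same position of the other, and the result is the required indexDNA.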
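(2) Restoration algorithm for the indexDNA sequence
Restoration of the indexDNA sequence, that is, its decoding, is the reverse of the encoding process above, as shown in FIG. 4.
The module starts when the program receives a data DNA sequence. It first extracts the 15 nt indexDNA sequence at the head of the sequence (process a in FIG. 4), then decodes it into a fifteen-digit ternary sequence using the conversion algorithm between indexDNA sequences and ternary sequences (process b in FIG. 4). The ternary sequence is then reduced to a ternary serial number (process c in FIG. 4), which is further decoded into the decimal serial number N (process d in FIG. 4). The core of the ternary-to-decimal algorithm is $N = \sum_i X_i \cdot 3^i$, where $X_i$ is the ternary digit at position i and i, counted from 0, indicates which position the digit occupies. Finally, the decimal serial number N is output, the string data obtained from the dataDNA sequence decoded in parallel from the same data DNA sequence is saved at position N of the data array, and the program takes a new data DNA sequence into the next cycle (processes e/f in FIG. 4).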
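The ternary-to-decimal step (process d in FIG. 4) can be illustrated directly from the formula. The name `from_ternary` is hypothetical, and treating the rightmost digit as position i = 0 is an assumption about digit order.

```python
def from_ternary(digits):
    """Evaluate N = sum(X_i * 3**i), with i = 0 at the rightmost digit."""
    if set(digits) - set("012"):
        raise ValueError("not a ternary sequence")
    return sum(int(x) * 3 ** i for i, x in enumerate(reversed(digits)))

serial = from_ternary("000000000000101")
print(serial, serial == int("000000000000101", 3))
```

Python's built-in `int(digits, 3)` performs the same conversion; the explicit sum is shown only to mirror the formula.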
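Likewise, the core of this flow is the decoding of the fifteen-base indexDNA sequence into the fifteen-digit ternary sequence, whose algorithm design is shown in Table 1. As in encoding, the first two bases are decoded with the column 7 conversion, that is, G = 0, A = 1, T = 2. After that, converting the base at position i of the indexDNA sequence into the ternary digit at position i of the ternary sequence is constrained by the base sequence d = [i-2, i-1]: different d sequences select different conversion algorithms at position i. So when decoding the base at position i, d = [i-2, i-1] is examined first. When d ∉ D = {AT, CT, TT, CA, AA, GG, CC, GC, TC, AC}, decoding follows column 7, that is, G = 0, A = 1, T = 2; when d ∈ D, decoding uses the conversion in the column that corresponds to the specific sequence d.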
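Generation and restoration of the dataDNA sequence
(1) Generation algorithm for the dataDNA sequence
The dataDNA sequence is generated with every 20 characters of the string as one conversion unit, so each dataDNA sequence stores the information of 20 characters. The generation process is shown in FIG. 5.
Encoding of the dataDNA sequence starts when a string of 20 characters is passed into the algorithm. Each character is first converted in turn into its decimal number in the ASCII table (process a in FIG. 5). Each decimal number is then converted into binary format; this conversion can call a built-in system function, and the binary number it produces begins with "0b" (process b in FIG. 5). Each binary number is next converted into a 7-digit binary sequence by filling the digits after the "0b" prefix into a 7-digit sequence whose initial value is "0000000", and the 7-digit sequences of all 20 decimal numbers are joined in order into one 140-digit binary sequence (process c in FIG. 5). This is converted into the dataDNA sequence according to the conversion algorithm between binary sequences and dataDNA sequences (process d in FIG. 5). Finally, the dataDNA sequence is output for the next operation, the variables of the module return to their initial values, and the module waits for the next string conversion unit.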
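Processes a to c can be sketched as follows. The name `unit_to_bits` is hypothetical; the sketch assumes every character has an ASCII code below 128 so that 7 bits suffice, and uses Python's `bin`, whose output carries the "0b" prefix mentioned above.

```python
def unit_to_bits(unit, size=20):
    """Turn a 20-character unit into a 140-digit binary string (7 bits per character)."""
    if len(unit) != size:
        raise ValueError("a conversion unit must hold exactly %d characters" % size)
    chunks = []
    for ch in unit:
        code = ord(ch)                          # process a: ASCII decimal number
        if code > 127:
            raise ValueError("character outside the ASCII table: %r" % ch)
        binary = bin(code)                      # process b: e.g. "0b1000001"
        chunks.append(binary[2:].zfill(7))      # process c: drop "0b", pad to 7 digits
    return "".join(chunks)

bits = unit_to_bits("Hello, World!".ljust(20))
print(len(bits), bits[:7])
```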
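The core of this process is the conversion of the 140-digit binary sequence into the dataDNA sequence (process d in FIG. 5), whose algorithm design is shown in Table 2.
Table 2. dataDNA sequence conversion algorithm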
Figure PCTCN2016097398-appb-000017
Figure PCTCN2016097398-appb-000018
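Conversion of the dataDNA sequence follows the "quasi-quaternary" algorithm above: apart from a few cases, each site of the dataDNA sequence stores two binary digits. As with the indexDNA sequence, start codon sequences and single-base consecutive repeats must be prevented during encoding, so the sequences in the set S = {ATG, CTG, TTG, CAT, CAG, CAA, AAA, TTT, CCC, GGG} must be avoided, and the sequences in the set D = {AT, CT, TT, CA, AA, GG, CC} therefore constrain the next site. The first two bases of the dataDNA sequence are encoded with the algorithm of the X2\B case, in which the candidate base set Sd has 4 elements and no constraint applies: following 00 = A, 01 = T, 10 = C, 11 = G, 4 binary digits are stored in two dataDNA bases. In the rest of the sequence, when encoding the base at site i, the value of d = [i-2, i-1] is examined first. If d ∉ D, the X2\B algorithm is still used. If d ∈ D, the encoding at site i is constrained by d. If d = AT, CT, or GG, the candidate set has 3 elements, A, T, and C, and can store only three kinds of information, so the rule degrades from full quaternary to "quasi-quaternary" and encodes 0 = A, 10 = T, 11 = C. If d = AA, the same analysis gives 0 = T, 10 = C, 11 = G. If d = CC, it gives 0 = A, 10 = T, 11 = G. If d = TT, the candidate set has 2 elements, A and C, and can store only two kinds of information, so the rule degrades to binary and encodes 0 = A, 1 = C. If d = CA, the candidate set has 1 element, C, which cannot store even one binary digit, so in this case base C stores no information and is encoded at site i purely as a placeholder.
On top of this conversion algorithm, an encryption function was added to make data storage more secure. In the encrypted version, the conversion rule is still designed as in Table 2, but the bases of the candidate set Sd are no longer in a fixed order; they are arranged randomly within each column. The number of conversion rules thus grows from 1 to 6×6×2×1×6×6×6×24 = 373,248. When storing data biologically, the user requests a randomly generated conversion rule through a username and password, and restoration succeeds only if the username and password are supplied to obtain the correct rule.
Because the algorithm above mixes binary and quaternary conversion, the last two digits of the binary sequence may well be impossible to encode (for example, only a single digit 1 remains, and the applicable conversion has no such case). For the final conversion at the end of the sequence, the algorithm of Table 3 is therefore used instead. The two-base sequences in that table are stated not to form a start codon sequence whatever bases precede or follow them. At this point the 20 characters of the string text have been encoded and stored in the dataDNA sequence, which passes to the next module of the program for further processing, while this module receives new text to convert.
Table 3. Conversion algorithm for the end of the binary sequence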
Figure PCTCN2016097398-appb-000020
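The encoding rules described above can be sketched in Python as follows. The per-context mappings are taken from the text, and the end-of-sequence pairs (AC, TC, CG, GA, GT, GC for 0, 1, 00, 01, 10, 11) are those listed in the claims; the names `THREE_WAY`, `FREE`, `TAIL` and `encode_bits` are hypothetical, and switching to the tail rule as soon as at most two bits remain is this sketch's reading of the text.

```python
# Contexts d that leave three candidate bases, ordered as codes 0, 10, 11.
THREE_WAY = {"AT": "ATC", "CT": "ATC", "GG": "ATC", "AA": "TCG", "CC": "ATG"}
# Unconstrained rule (the X2\B column).
FREE = {"00": "A", "01": "T", "10": "C", "11": "G"}
# Rule for the final 1 or 2 bits.
TAIL = {"0": "AC", "1": "TC", "00": "CG", "01": "GA", "10": "GT", "11": "GC"}


def encode_bits(bits: str) -> str:
    """Encode a non-empty bit string into a dataDNA sequence."""
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("expected a non-empty string of 0s and 1s")
    dna = []
    pos = 0
    while len(bits) - pos > 2:
        d = "".join(dna[-2:])               # the two preceding bases
        if d == "CA":
            dna.append("C")                 # placeholder base, stores no bits
        elif d == "TT":
            dna.append("AC"[int(bits[pos])])    # binary rule: 0=A, 1=C
            pos += 1
        elif d in THREE_WAY:
            options = THREE_WAY[d]
            if bits[pos] == "0":
                dna.append(options[0])
                pos += 1
            else:                           # '10' or '11'
                dna.append(options[1] if bits[pos + 1] == "0" else options[2])
                pos += 2
        else:
            dna.append(FREE[bits[pos:pos + 2]])
            pos += 2
    dna.append(TAIL[bits[pos:]])            # last 1 or 2 bits
    return "".join(dna)


print(encode_bits("1001000"))   # 7-bit code of 'H'
```

For the bits of "H" the steps are 10→C, 01→T, then d=CT forces the three-way rule (0→A), and the remaining 00 is written with the tail pair CG.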
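(2) Algorithm for restoring the dataDNA sequence
Decoding of the dataDNA sequence is the reverse of the above process; the program flow is shown in Figure 6. The module starts when a data DNA sequence is passed into the program. It extracts the dataDNA sequence, which is the [17:-17] portion of the data DNA sequence (process a in Figure 6). The dataDNA sequence is then decoded into a 140-bit binary sequence using the conversion algorithm between dataDNA sequences and binary sequences (Table 2) (process b in Figure 6). This 140-bit binary sequence is in fact a concatenation of 20 seven-bit binary sequences; these are separated and the binary number stored in each is restored in order (process c in Figure 6). Each binary number is given the binary prefix "0b" and decoded into a decimal number by a built-in system function (process d in Figure 6). A built-in system function then writes out the character corresponding to each decimal number in the ASCII table (process e in Figure 6). Finally, the 20 characters are assembled in order into a 20-byte string, the string is output from the module, and all variables of the module return to their initial state (processes f/g in Figure 6).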
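Processes c to e of the restoration flow can be sketched as below. The function name `bits_to_unit` is hypothetical; the "0b" prefix is added only to mirror the description, since Python's `int` accepts it when the base is 2.

```python
def bits_to_unit(bits: str) -> str:
    """Turn a 140-bit string back into its 20-character unit."""
    if len(bits) != 140:
        raise ValueError("expected exactly 140 bits")
    chars = []
    for start in range(0, 140, 7):          # process c: split into 7-bit pieces
        piece = bits[start:start + 7]
        code = int("0b" + piece, 2)         # process d: binary -> decimal
        chars.append(chr(code))             # process e: decimal -> ASCII character
    return "".join(chars)


sample = "".join(format(ord(c), "07b") for c in "Hello,World!".ljust(20))
print(repr(bits_to_unit(sample)))
```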
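Decoding the dataDNA sequence into the 140-bit binary sequence is the core of this module; the algorithm is shown in Table 2. Decoding is again constrained by the elements of the set D={AT,CT,TT,CA,AA,GG,CC}. The first two bases of the dataDNA sequence are decoded with the rule in the X2\B column of the table, i.e. A=00, T=01, C=10, G=11. Thereafter, when the base at position i is decoded, the sequence d=[i-2,i-1] is examined first. If d ∉ D, the same rule is used. If d ∈ D, decoding is constrained by d and simply follows the conversion rule in the column of the table for that value of d. The case d=CA deserves a special note: the base C at position i is only a placeholder and stores no information, so nothing is restored from it. This process continues until the last two bases of the dataDNA sequence, which are decoded according to Table 4.
Table 4. Conversion algorithm for the last two bases of the dataDNA sequence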
Figure PCTCN2016097398-appb-000022
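The decoding direction can be sketched in the same way. The mappings come from the rule description in the text; Table 4 is assumed here to be the inverse of the end-of-sequence pairs listed in the claims, and the names `THREE_WAY`, `FREE_INV`, `TAIL_INV` and `decode_dna` are hypothetical.

```python
# Contexts d with three candidate bases, ordered as codes 0, 10, 11.
THREE_WAY = {"AT": "ATC", "CT": "ATC", "GG": "ATC", "AA": "TCG", "CC": "ATG"}
FREE_INV = {"A": "00", "T": "01", "C": "10", "G": "11"}
# Assumed inverse of the end-of-sequence table.
TAIL_INV = {"AC": "0", "TC": "1", "CG": "00", "GA": "01", "GT": "10", "GC": "11"}


def decode_dna(dna: str) -> str:
    """Decode a dataDNA sequence (body plus two-base tail) into a bit string."""
    if len(dna) < 2:
        raise ValueError("a dataDNA sequence ends with a two-base tail")
    body, tail = dna[:-2], dna[-2:]
    bits = []
    for i, base in enumerate(body):
        d = body[i - 2:i] if i >= 2 else ""  # the two preceding bases, if any
        if d == "CA":
            continue                          # placeholder C carries no bits
        if d == "TT":
            bits.append(str("AC".index(base)))        # A=0, C=1
        elif d in THREE_WAY:
            bits.append(("0", "10", "11")[THREE_WAY[d].index(base)])
        else:
            bits.append(FREE_INV[base])
    bits.append(TAIL_INV[tail])
    return "".join(bits)


print(decode_dna("CTACG"))
```

A base that is not allowed in its context raises `ValueError` or `KeyError`, which is one way a mutated read shows up before any correction step.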
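Generation and restoration of the correctionDNA sequence
(1) Algorithm for generating the correctionDNA sequence
Improving the fidelity of data storage, and avoiding loss or distortion of the data during storage, is an essential precondition for biological data storage. Because the conversion algorithm makes adjacent bases of the dataDNA sequence highly interdependent, a mutation at a single position can affect the decoding of the whole dataDNA sequence. An algorithm was therefore designed to generate a correctionDNA sequence, from which it can be judged whether the data DNA sequence has mutated and with which a single-base mutation at one position can be repaired.
The correctionDNA consists of two main parts: a 4-nt preliminary judgement sequence and a 10-nt deep judgement sequence. The preliminary judgement sequence identifies the type of single-base mutation (substitution, deletion or insertion) and the bases involved (which two bases were exchanged, or which base was inserted or lost). The deep judgement sequence then builds on that result to determine the mutated position and the exact mutation. Once the mutation is corrected, the original sequence is restored.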
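The preliminary judgement sequence is generated from the function:
X(i) = (-1)^{N(i)}
where i = A, T, C, G, and N(i) is the number of occurrences of base i in the indexDNA and dataDNA sequences.
The four bases at one end of the correctionDNA sequence store, in order, the values X(i) for i = A, T, C, G. Since X(i) can only be 1 or -1, base C stores the value -1 and base G stores the value 1. This yields the preliminary judgement sequence: a 4-nt sequence at the end of the correctionDNA sequence consisting only of G and C.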
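Taking the sequence -ATGCTTCGACGTCGAG- as an example, the preliminary judgement sequence is generated as follows. First compute:
X(A) = (-1)^{N(A)} = (-1)^3 = -1;
X(T) = (-1)^{N(T)} = (-1)^4 = 1;
X(C) = (-1)^{N(C)} = (-1)^4 = 1;
X(G) = (-1)^{N(G)} = (-1)^5 = -1;
The preliminary judgement sequence is therefore CGGC.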
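The parity calculation above is easy to check with a short sketch; the function name `preliminary_judgement` is hypothetical.

```python
def preliminary_judgement(seq: str) -> str:
    """Return the 4-nt parity sequence for bases A, T, C, G (G = +1, C = -1)."""
    out = []
    for base in "ATCG":                       # fixed order used by the scheme
        x = (-1) ** seq.count(base)           # X(i) = (-1)^N(i)
        out.append("G" if x == 1 else "C")
    return "".join(out)


print(preliminary_judgement("ATGCTTCGACGTCGAG"))
```

For the example sequence the counts are A=3, T=4, C=4, G=5, which gives CGGC as in the text.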
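The deep judgement sequence is generated from the function:
sum = \sum_{position(i)=1}^{N} val(i) \times position(i)
where i = A, T, C, G; val(i) is the value assigned to base i, as shown in Table 5; position(i) is the position coordinate of base i; and N is the total length of the indexDNA and dataDNA sequences.
Table 5. Values assigned to each base in the error-correction mechanism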
Figure PCTCN2016097398-appb-000024
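Each data DNA sequence yields a decimal result sum. This decimal number is converted to ternary and written into a 10-digit ternary sequence, which is then converted into the 10-nt deep judgement sequence using the indexDNA conversion algorithm (the conversion algorithm between ternary sequences and DNA sequences, Table 1). To prevent a start-codon sequence from forming at the junction of the two parts, a protective base C is inserted between them. The result is a 15-nt correction sequence, which is attached to the end of the data DNA sequence, giving a complete data DNA sequence with three parts: indexDNA, dataDNA and correctionDNA.
Taking the sequence -ATGCTTCGACGTCGAG- as an example, the deep judgement sequence is generated as follows. First, with val(A)=1, val(T)=2, val(C)=3, val(G)=4, compute:
sum = 1×1 + 2×2 + 4×3 + 3×4 + 2×5 + 2×6 + 3×7 + 4×8 + 1×9 + 3×10 + 4×11 + 2×12 + 3×13 + 4×14 + 1×15 + 4×16 = 385
This is converted into the 10-digit ternary sequence 0000112021, which is then converted, using the ternary-to-DNA conversion algorithm of the indexDNA generation module, into the 10-nt deep judgement sequence GGCGAATCCT.
Inserting the protective base C at the junction between the preliminary and deep judgement sequences gives the correctionDNA sequence CGGCcGGCGAATCCT.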
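The weighted sum and its 10-digit ternary form can be reproduced with a short sketch. The base values A=1, T=2, C=3, G=4 are those stated in the claims; the names `weighted_sum` and `to_ternary` are hypothetical. The final ternary-to-DNA step depends on Table 1 and is not reproduced here.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}


def weighted_sum(seq: str) -> int:
    """Sum of val(base) * position over the sequence, positions starting at 1."""
    return sum(VAL[base] * pos for pos, base in enumerate(seq, start=1))


def to_ternary(n: int, width: int = 10) -> str:
    """Fixed-width base-3 representation of a non-negative integer."""
    if n < 0 or n >= 3 ** width:
        raise ValueError("value does not fit in the requested width")
    digits = []
    while n:
        digits.append(str(n % 3))
        n //= 3
    return "".join(reversed(digits)).rjust(width, "0")


total = weighted_sum("ATGCTTCGACGTCGAG")
print(total, to_ternary(total))
```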
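(2) Algorithm for restoring the correctionDNA sequence
The module starts when a data DNA sequence is passed into the program. It first extracts the correctionDNA sequence at the end of the data DNA sequence. The preliminary judgement sequence is restored to a judgement sequence of 1 and -1 values, also four positions long, holding the base-count judgement value of each base in the original data DNA sequence. At the same time the 10-nt deep judgement sequence is restored to a decimal number (the algorithm is entirely analogous to indexDNA restoration and is not repeated here); this number is the position-weighted sum of the bases of the original data DNA sequence.
In parallel, the preliminary and deep judgement functions are applied to the indexDNA and dataDNA parts of the received data DNA, giving the base-count judgement values and position-weighted sum of the current data DNA sequence. Comparing these results with those restored from the correctionDNA sequence for the original data DNA reveals whether a mutation has occurred, which base underwent which type of mutation, and at which position. The mutated base is then restored, giving a sequence identical to the original data DNA sequence from which the data can be restored accurately.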
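The storage of -ATGCTTCGACGTCGAG- is used below as an example: deletion, insertion and substitution mutations are introduced in turn to illustrate how the error-correction mechanism works. The correctionDNA sequence has already been generated and attached to the end of this sequence, so the stored sequence is -ATGCTTCGACGTCGAGgcCGGCcGGCGAATCCT.
1) Base substitution: -ATCCTTCGACGTCGAGgcCGGCcGGCGAATCCT (the third base mutated from G to C during storage).
Sequencing gives the mutated sequence ATCCTTCGACGTCGAGgcCGGCcGGCGAATCCT. The correctionDNA sequence is restored first, giving:
X(A)=-1; X(T)=1; X(C)=1; X(G)=-1; ∑=385
The preliminary and deep judgements are then computed on the dataDNA part, giving:
X'(A)=-1; X'(T)=1; X'(C)=-1; X'(G)=1; ∑'=382
Since the values of both X(C) and X(G) have changed, the preliminary judgement shows that a substitution occurred between bases C and G.
Then, from the formula:
position = |∑' - ∑| / |val(i_1) - val(i_2)|, where i_1 and i_2 are the two bases involved in the substitution,
the mutated position is |382-385|/(4-3)=3. Since ∑'<∑, the mutation was from G to C. It is thus established that the third base of the dataDNA sequence mutated from G to C; restoring this position gives the original sequence.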
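The substitution case can be sketched as below. It assumes at most one mutation, as the scheme requires, and takes the stored parities and stored sum as already restored from the correctionDNA; the names `parities`, `weighted_sum` and `fix_substitution` are hypothetical.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}


def weighted_sum(seq: str) -> int:
    return sum(VAL[b] * p for p, b in enumerate(seq, start=1))


def parities(seq: str) -> dict:
    return {b: (-1) ** seq.count(b) for b in "ATCG"}


def fix_substitution(read: str, stored_x: dict, stored_sum: int) -> str:
    """Undo a single base substitution using the stored parities and sum."""
    current = parities(read)
    changed = [b for b in "ATCG" if current[b] != stored_x[b]]
    if len(changed) != 2:
        raise ValueError("parities do not indicate a substitution")
    low, high = sorted(changed, key=VAL.get)
    delta = weighted_sum(read) - stored_sum
    pos, rem = divmod(abs(delta), VAL[high] - VAL[low])
    if rem or not 1 <= pos <= len(read):
        raise ValueError("sum difference does not point to a valid position")
    # sum' < sum: the higher-valued base was replaced by the lower-valued one.
    seen, original = (low, high) if delta < 0 else (high, low)
    if read[pos - 1] != seen:
        raise ValueError("base at the computed position is inconsistent")
    return read[:pos - 1] + original + read[pos:]


stored = {"A": -1, "T": 1, "C": 1, "G": -1}
print(fix_substitution("ATCCTTCGACGTCGAG", stored, 385))
```

The extra consistency checks are not part of the description; they only make the sketch fail loudly when the single-mutation assumption does not hold.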
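2) Base insertion: -ATGCTATCGACGTCGAGgcCGGCcGGCGAATCCT (an A was inserted after the fifth base of the sequence).
Sequencing gives the mutated sequence -ATGCTATCGACGTCGAGgcCGGCcGGCGAATCCT. The correctionDNA sequence is restored first, giving:
X(A)=-1; X(T)=1; X(C)=1; X(G)=-1; ∑=385
The preliminary and deep judgements are then computed on the dataDNA part, giving:
X'(A)=1; X'(T)=1; X'(C)=1; X'(G)=-1; ∑'=422
Since only the value of X(A) has changed, the preliminary judgement indicates an insertion or deletion of base A. The deep judgement result ∑'>∑ further shows that a base A was inserted. Starting from the first A in the mutated sequence, each A is deleted in turn and ∑' is recomputed; when deleting the A at some position gives a sum equal to 385, the inserted position has been found, and removing that base gives the original sequence.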
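The search for the inserted base can be sketched as follows, assuming the preliminary judgement has already identified the base and the stored sum has been restored; `fix_insertion` is a hypothetical name.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}


def weighted_sum(seq: str) -> int:
    return sum(VAL[b] * p for p, b in enumerate(seq, start=1))


def fix_insertion(read: str, base: str, stored_sum: int) -> str:
    """Remove the one occurrence of `base` whose deletion restores the stored sum."""
    for k, b in enumerate(read):
        if b != base:
            continue
        candidate = read[:k] + read[k + 1:]     # try deleting this occurrence
        if weighted_sum(candidate) == stored_sum:
            return candidate
    raise ValueError("no single deletion reproduces the stored sum")


print(fix_insertion("ATGCTATCGACGTCGAG", "A", 385))
```

In the worked example, deleting the first A gives 378, while deleting the A at position 6 gives 385, so the second candidate is accepted.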
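3) Base deletion: -ATGCTT-GACGTCGAGgcCGGCcGGCGAATCCT (a base C was lost between the sixth and seventh bases of the sequence shown).
Sequencing gives the mutated sequence -ATGCTT-GACGTCGAGgcCGGCcGGCGAATCCT. The correctionDNA sequence is restored first, giving:
X(A)=-1; X(T)=1; X(C)=1; X(G)=-1; ∑=385
The preliminary and deep judgements are then computed on the dataDNA part, giving:
X'(A)=-1; X'(T)=1; X'(C)=-1; X'(G)=-1; ∑'=338
Since only the value of X(C) has changed, the preliminary judgement indicates an insertion or deletion of base C. The deep judgement result ∑'<∑ further shows that a base C was deleted. Starting from the first position of the mutated sequence, a base C is therefore inserted after each position in turn and ∑' is recomputed; when inserting a C at some position gives a sum equal to 385, the deleted position has been found, and adding C at that position gives the original sequence.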
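The deletion case mirrors the insertion search. In this sketch the candidate positions also include the very start of the read, which the text does not mention explicitly; `fix_deletion` is a hypothetical name.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}


def weighted_sum(seq: str) -> int:
    return sum(VAL[b] * p for p, b in enumerate(seq, start=1))


def fix_deletion(read: str, base: str, stored_sum: int) -> str:
    """Re-insert `base` at the position that restores the stored sum."""
    for k in range(len(read) + 1):              # k = 0 covers a lost first base
        candidate = read[:k] + base + read[k:]
        if weighted_sum(candidate) == stored_sum:
            return candidate
    raise ValueError("no single insertion reproduces the stored sum")


print(fix_deletion("ATGCTTGACGTCGAG", "C", 385))
```

When the lost base sat inside a run of identical bases, several insertion points give the same sequence, so the first match is sufficient.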
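Generation and restoration of the complete data DNA sequence
(1) Generation of the complete data DNA sequence
Before entering the conversion program, data of different types are pre-processed: whether image, text or audio, the data are first converted into "string text" format and saved as a txt file, and this txt file is what the biological converter operates on. The generation of the complete data DNA sequence is illustrated in Figure 7.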
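The data text is converted into data DNA sequences in conversion units of 20 characters. The unit first enters the indexDNA generation module, which produces the indexDNA sequence identifying its serial number; at the same time the string enters the dataDNA generation module, which produces the dataDNA sequence storing the string information of the unit. The indexDNA and dataDNA sequences are then joined to form the index+dataDNA sequence, which enters the correctionDNA generation module to produce the correctionDNA sequence. Finally the indexDNA, dataDNA and correctionDNA sequences are joined end to end into one complete data DNA sequence. The program then accepts the next 20-byte string conversion unit, and the cycle repeats until the whole txt file has been converted, giving a data DNA sequence library that stores all the information of the original data.
When the three module sequences are joined into one data DNA sequence, a 2-nt protective sequence is inserted at each of the two junctions so that the last bases of one module and the first bases of the next cannot form a start-codon sequence. After examining all elements of the start-codon set, it was found that the sequence CG never produces a start codon whatever bases precede or follow it, and it was therefore chosen as the protective sequence. The complete data DNA sequence thus consists of a 15-nt indexDNA fragment, a 15-nt correctionDNA fragment, a dataDNA fragment of about 100 nt, and two 2-nt protective sequences.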
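The layout of a complete record can be sketched as below, using the lengths and the CG protective sequence given above; the names `assemble` and `split_record` are hypothetical, and the sample fragments are arbitrary placeholders rather than valid encoded sequences.

```python
GUARD = "CG"          # 2-nt protective sequence placed at both junctions
INDEX_LEN = 15
CORRECTION_LEN = 15


def assemble(index_dna: str, data_dna: str, correction_dna: str) -> str:
    """Join the three parts into one data DNA record."""
    if len(index_dna) != INDEX_LEN or len(correction_dna) != CORRECTION_LEN:
        raise ValueError("index and correction parts must be 15 nt each")
    return index_dna + GUARD + data_dna + GUARD + correction_dna


def split_record(record: str) -> tuple:
    """Recover (index, data, correction); the data part is record[17:-17]."""
    return record[:INDEX_LEN], record[17:-17], record[-CORRECTION_LEN:]


record = assemble("G" * 15, "CTACG", "CGGCCGGCGAATCCT")
print(record)
print(split_record(record))
```

The slice `[17:-17]` is the one named in the restoration flow: 15 nt of index or correction plus the 2-nt guard on each side.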
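(2) Restoration of the complete data DNA sequence
The restoration of the complete data DNA sequence is illustrated in Figure 8. The data DNA sequence library stored in the data DNA cell bank is sequenced and saved in txt format; each line of the file represents one data DNA sequence, and the sequences are in arbitrary order. To restore the data, the conversion software starts from the first line of the txt file. The complete data DNA sequence first passes through the correction module, where the error-correction mechanism evaluates and restores its indexDNA and dataDNA sequences. From the corrected data DNA sequence the program extracts the indexDNA and dataDNA sequences and passes them to the index module and the data module respectively: the former restores the serial number of this data DNA, and the latter restores the data stored in it, i.e. a 20-byte string. The string is then stored at the position corresponding to that serial number in the generated data text, and the converter takes the next line of the txt file; the cycle repeats. The final result is text data made up of strings of ASCII characters, which is then converted to the appropriate data format to give the restored data.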
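The final reordering step can be sketched as follows, assuming each record has already been corrected and decoded into a (serial number, string) pair; the name `reassemble` and the duplicate check are additions of this sketch.

```python
def reassemble(decoded_units) -> str:
    """Join decoded units, given as (serial_number, text) pairs in any order."""
    by_index = {}
    for index, text in decoded_units:
        if index in by_index and by_index[index] != text:
            raise ValueError(f"conflicting copies for unit {index}")
        by_index[index] = text                  # place text at its serial number
    return "".join(by_index[k] for k in sorted(by_index))


shuffled = [(2, "World!"), (0, "Hello"), (1, ","), (0, "Hello")]
print(reassemble(shuffled))
```

Keying the units by serial number means identical copies of the same record, which sequencing a cell population can be expected to produce, collapse naturally.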
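Example 2. Algorithm tests and results
A simple biological converter was written around the above algorithms and design, and its performance was tested.
(1) Storage of small-scale text data
The first-generation converter had no index or correction module and could therefore convert only very short texts. For short texts, omitting the indexDNA and correctionDNA sequences shortens the data DNA sequence and raises efficiency, which at the application level means lower cost. Moreover, in the near term, biological storage of short texts is likely to be the more common use. The test text "Dai Lab,Tsinghua University,Synthetic Yeast,Synthetic Biology" was converted into the dataDNA sequence shown in Table 6:
Table 6. Storage test results for small-scale text data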
Figure PCTCN2016097398-appb-000027
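The above dataDNA sequence was transformed into yeast and tested in two storage modes, as a plasmid and integrated into the genome, with serial passaging. After 100 generations the fragments were extracted and sequenced. The sequenced dataDNA sequences were essentially identical to the initial state; only in the genome-integrated group did one copy show a single-base deletion, as shown in Figure 9. This also confirmed the need for the error-correction mechanism added later.
(2) Test of the encryption mechanism
An encryption mechanism was introduced in the second-generation converter and tested with the text "Hello,World!". As shown in Table 7, the same text yields different dataDNA sequences under different usernames and passwords, and the dataDNA can be decoded only when the correct username and password are both supplied, giving the user's data greater security and confidentiality.
Table 7. Test text and results for the encryption mechanism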
Figure PCTCN2016097398-appb-000028
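(3) Conversion test on larger-scale data (KB level)
The third-generation biological conversion software is aimed at larger data storage tasks; the index module and the correction module were added in this generation. To test its performance, the 24 kB Tsinghua University emblem (shown in Figure 10) and the lyrics of the Tsinghua University anthem were converted.
After the third-generation converter had turned the image and lyrics into a data DNA library of 1084 data DNA sequences, the order of the sequences in the library was shuffled by hand and single-base mutations were introduced at random into some of the data DNA sequences, to simulate a real biological storage process, as shown in Figure 11. Restoring this data DNA sequence library recovered the original image and text data.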

Claims (42)

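  1. A method for converting data into data DNA sequences, comprising dividing the data into one or more data conversion units, providing a binary sequence for each data conversion unit, and converting each data conversion unit into one data DNA sequence according to the following step, thereby obtaining a data DNA sequence library; the data DNA sequence library contains one or more data DNA sequences, each converted from one data conversion unit;
    the step comprises:
    converting the binary sequence of each data conversion unit into one dataDNA sequence according to the dataDNA sequence conversion rule, the dataDNA sequence being the data DNA sequence;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100002
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below
    Figure PCTCN2016097398-appb-100003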
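  2. A method for converting data into data DNA sequences, comprising dividing the data into one or more data conversion units, providing a binary sequence for each data conversion unit, and converting each data conversion unit into one data DNA sequence according to the following steps, thereby obtaining a data DNA sequence library; the data DNA sequence library contains one or more data DNA sequences, each converted from one data conversion unit; the steps comprise:
    (1) converting the position number of the data conversion unit within the data into a ternary sequence of fixed length, and converting the ternary sequence, according to the indexDNA sequence conversion rule, into an indexDNA sequence with as many bases as the ternary sequence has digits;
    the indexDNA sequence conversion rule is:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100005
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    (2) converting the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100007
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Bases         | AC | TC | CG | GA | GT | GC
    Binary digits | 0  | 1  | 00 | 01 | 10 | 11
    (3) joining the indexDNA sequence of the data conversion unit to the dataDNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain an index+dataDNA sequence, which is one data DNA sequence.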
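  3. A method for converting data into data DNA sequences containing a mutation correction sequence, comprising dividing the data into one or more data conversion units, providing a binary sequence for each data conversion unit, and converting each data conversion unit into one data DNA sequence containing a mutation correction sequence according to the following steps, thereby obtaining a data DNA sequence library; the data DNA sequence library contains one or more data DNA sequences, each converted from one data conversion unit; the steps comprise:
    (1) converting the binary sequence of the data conversion unit into a preliminary data DNA sequence that does not contain a mutation correction sequence, the preliminary data DNA sequence containing the data content information of the data conversion unit;
    (2) first generating a 4-base preliminary judgement sequence from the preliminary data DNA sequence: the base-count judgement value X(i) for i = A, T, C, G is calculated from:
    X(i) = (-1)^{N(i)}
    where i = A, T, C, G; N(i) is the number of occurrences of base i in the preliminary data DNA sequence;
    the 4 bases of the preliminary judgement sequence store the base-count judgement values X(i) for i = A, T, C, G respectively, with bases C and G storing -1 and 1 respectively, to generate the preliminary judgement sequence;
    then generating a 10-base deep judgement sequence from the preliminary data DNA sequence: the position-weighted sum of the bases of the preliminary data DNA sequence, sum, is calculated from:
    sum = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence;
    converting the value of sum into a 10-digit ternary sequence to generate the deep judgement sequence;
    joining the preliminary judgement sequence to the deep judgement sequence, with a protective base C inserted at the junction, to obtain the correctionDNA sequence;
    (3) joining the preliminary data DNA sequence to the correctionDNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain the data DNA sequence containing a mutation correction sequence.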
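  4. The method according to claim 3, wherein step (1) comprises:
    converting the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule, the dataDNA sequence serving as the preliminary data DNA sequence that does not contain a mutation correction sequence;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100010
    Figure PCTCN2016097398-appb-100011
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below
    Figure PCTCN2016097398-appb-100012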
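  5. The method according to claim 3, wherein step (1) comprises:
    (1-1) converting the position number of the data conversion unit within the data into a ternary sequence of fixed length, and converting the ternary sequence, according to the indexDNA sequence conversion rule, into an indexDNA sequence with as many bases as the ternary sequence has digits;
    the indexDNA sequence conversion rule is:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100014
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    (1-2) converting the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100016
    Figure PCTCN2016097398-appb-100017
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Bases         | AC | TC | CG | GA | GT | GC
    Binary digits | 0  | 1  | 00 | 01 | 10 | 11
    (1-3) joining the indexDNA sequence of the data conversion unit to the dataDNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain an index+dataDNA sequence, the index+dataDNA sequence serving as the preliminary data DNA sequence that does not contain a mutation correction sequence.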
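  6. The method according to claim 5, wherein in step (1-3) the correctionDNA is attached to the dataDNA end of the index+dataDNA sequence.
  7. An encrypted method for converting data into data DNA sequences, comprising:
    (1) providing a username and password, and randomly generating, according to the username and password, the mapping between specific binary digits and specific bases within each set of correspondences of the dataDNA sequence conversion rule;
    (2) converting data into data DNA sequences by the method of any one of claims 1-6, wherein, when the binary sequence of a data conversion unit is converted into a dataDNA sequence according to the dataDNA sequence conversion rule, specific binary digits are converted into the corresponding specific bases according to the mapping generated in step (1).
  8. The method according to any one of claims 1-7, wherein the method is carried out on a computer.
  9. A method for storing data using DNA sequences, comprising: converting data into data DNA sequences by the method of any one of claims 1-8, synthesising the DNA sequences, and storing the synthesised DNA sequences.
  10. The method according to claim 9, wherein storing the synthesised DNA sequences means storing the DNA sequences in cells in plasmid form, or integrating the DNA sequences into the cell genome.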
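  11. A method for restoring DNA sequences obtained by sequencing to data, comprising:
    (1) providing a DNA sequence obtained by sequencing, wherein the DNA sequence includes a dataDNA sequence representing the data content information of a data conversion unit;
    (2) restoring the dataDNA sequence to data according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100019
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below
    Figure PCTCN2016097398-appb-100020
  12. The method according to claim 11, wherein in step (2) the dataDNA sequence is restored to data in binary form, or the data in binary form is further restored to the original data.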
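  13. A method for restoring DNA sequences obtained by sequencing to data, comprising:
    (1) providing DNA sequences obtained by sequencing, the DNA sequences being a plurality of data DNA sequences, each of which includes an indexDNA sequence representing the position information of a data conversion unit and a dataDNA sequence representing the data content information of the data conversion unit;
    (2) restoring the indexDNA sequence of each data DNA sequence to a ternary sequence according to the indexDNA sequence conversion rule, and then restoring the ternary sequence to the position number of the conversion unit within the data;
    the indexDNA sequence conversion rule is:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100022
    Figure PCTCN2016097398-appb-100023
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    (3) restoring the dataDNA sequence of each data DNA sequence to data according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100025
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Bases         | AC | TC | CG | GA | GT | GC
    Binary digits | 0  | 1  | 00 | 01 | 10 | 11
    (4) joining the data restored from the dataDNA sequence of each data DNA sequence in the order of their position numbers, to obtain the restored data.
  14. The method according to claim 13, wherein in step (3) the dataDNA sequence is restored to data in binary form, or the data in binary form is further restored to a character string; and the restored data in step (4) is data in binary form, or original data further restored from the data in binary form, or string data obtained by joining the character strings obtained in step (3) in the order of their position numbers, or data further restored from that string data.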
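  15. A method for correcting DNA sequences obtained by sequencing and restoring them to data, comprising:
    (1) providing a DNA sequence obtained by sequencing, the DNA sequence comprising a preliminary data DNA sequence and a mutation correction sequence, wherein the preliminary data DNA sequence contains the data content information of a data conversion unit, and the preliminary data DNA sequence in the sequenced DNA sequence has a mutation of at most one base;
    (2) from the sequencing read of the preliminary data DNA sequence, calculating its base-count judgement values X'(i) according to the following formula:
    X'(i) = (-1)^{N(i)}
    where i = A, T, C, G; N(i) is the number of occurrences of base i in the sequencing read of the preliminary data DNA sequence;
    comparing the base-count judgement values X'(i) of the sequencing read of the preliminary data DNA sequence with the base-count judgement values X(i) restored, by the same rule, from the preliminary judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence:
    if the base-count judgement values of two bases have changed, the sequencing read of the preliminary data DNA sequence has undergone a base substitution relative to the unmutated preliminary data DNA sequence, and the substitution is one of these two bases being replaced by the other;
    if the base-count judgement value of only one base has changed, the sequencing read of the preliminary data DNA sequence has undergone an insertion or deletion of that base relative to the unmutated preliminary data DNA sequence;
    if no base-count judgement value has changed, the sequencing read of the preliminary data DNA sequence has not mutated;
    (3) from the sequencing read of the preliminary data DNA sequence, calculating its position-weighted base sum, sum', according to the following formula:
    sum' = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the sequencing read of the preliminary data DNA sequence;
    comparing the position-weighted base sum sum' of the sequencing read of the preliminary data DNA sequence with the position-weighted base sum, sum, restored by the same rule from the deep judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence;
    where the sequencing read of the preliminary data DNA sequence has undergone a base substitution relative to the unmutated preliminary data DNA sequence: if sum'>sum, the substitution replaced the base with the smaller val(i) by the base with the larger val(i); if sum'<sum, the substitution replaced the base with the larger val(i) by the base with the smaller val(i); the position coordinate of the substitution is the absolute value of the quotient obtained by dividing the difference between sum' and sum by the difference between the val(i) values of the two bases; the base at that position is replaced by the other of the two bases, correcting the sequencing read to the unmutated preliminary data DNA sequence;
    where the sequencing read of the preliminary data DNA sequence has undergone an insertion or deletion of one base relative to the unmutated preliminary data DNA sequence:
    if sum'>sum, a base insertion has occurred, and the insertion position is determined as follows: starting from the first occurrence of said base in the sequencing read of the preliminary data DNA sequence, said base is deleted at each position where it occurs, one position at a time, and after each deletion the position-weighted base sum of the resulting preliminary data DNA sequence, sum'', is calculated according to the following formula:
    sum'' = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence after deletion of said base;
    when the position-weighted base sum sum'' calculated after deleting said base at some position equals the position-weighted base sum, sum, restored by the same rule from the deep judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence, that position is the insertion site of said base; said base at that position is deleted, correcting the sequencing read to the unmutated preliminary data DNA sequence;
    if sum'<sum, a base deletion has occurred, and the deletion position is determined as follows: starting from the first position of the sequencing read of the preliminary data DNA sequence, said base is inserted at each position, one position at a time, and after each insertion the position-weighted base sum of the resulting preliminary data DNA sequence, sum''', is calculated according to the following formula:
    sum''' = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence after insertion of said base;
    when the position-weighted base sum sum''' calculated after inserting said base at some position equals the position-weighted base sum, sum, restored by the same rule from the deep judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence, that position is the deletion site of said base; said base is inserted at that position, correcting the sequencing read to the unmutated preliminary data DNA sequence;
    (4) restoring the unmutated preliminary data DNA sequence to data.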
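  16. The method according to claim 15, wherein the preliminary data DNA sequence contains a dataDNA sequence representing the data content information of the data conversion unit, and step (4) comprises restoring the dataDNA sequence contained in the unmutated preliminary data DNA sequence to data according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100030
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Figure PCTCN2016097398-appb-100031
  17. The method according to claim 16, wherein in step (4) the dataDNA sequence contained in the unmutated preliminary data DNA sequence is restored to data in binary form, or the data in binary form is further restored to the original data.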
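  18. The method according to claim 15, wherein the DNA sequences obtained by sequencing are a plurality of data DNA sequences, the preliminary data DNA sequence of each data DNA sequence contains an indexDNA sequence representing the position information of a data conversion unit and a dataDNA sequence representing the data content information of the data conversion unit, and step (4) comprises:
    (4-1) restoring the indexDNA sequence of each data DNA sequence to a ternary sequence according to the indexDNA sequence conversion rule, and then restoring the ternary sequence to the position number of the conversion unit within the data, the indexDNA sequence conversion rule being:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100033
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    (4-2) restoring the dataDNA sequence of each data DNA sequence to data according to the dataDNA sequence conversion rule, the dataDNA sequence conversion rule being:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100035
    Figure PCTCN2016097398-appb-100036
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Bases         | AC | TC | CG | GA | GT | GC
    Binary digits | 0  | 1  | 00 | 01 | 10 | 11
    (4-3) joining the data restored from the dataDNA sequence of each data DNA sequence in the order of their position numbers, to obtain the restored data.
  19. The method according to claim 18, wherein in step (4-2) the dataDNA sequence is restored to data in binary form, or the data in binary form is further restored to a character string; and the restored data in step (4-3) is data in binary form, or original data further restored from the data in binary form, or string data obtained by joining the character strings restored from the dataDNA sequences in the order of their position numbers, or data further restored from that string data.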
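  20. A method for restoring encrypted DNA sequences obtained by sequencing to data, comprising:
    (1) providing a username and password to obtain the mapping between specific binary digits and specific bases within each set of correspondences of the dataDNA sequence conversion rule, the mapping being the one set for the same username and password when the data were converted into the encrypted DNA sequences;
    (2) restoring the encrypted DNA sequences obtained by sequencing to data by the method of any one of claims 11-19, wherein, when the dataDNA sequence of each DNA sequence is restored to data according to the dataDNA sequence conversion rule, specific bases are restored to the corresponding specific binary digits according to the mapping obtained in step (1).
  21. The method according to any one of claims 11-20, wherein the method is carried out on a computer.
  22. A method for obtaining data from cells, comprising: extracting from cells the DNA sequences in which data information is stored, sequencing them, and then restoring the DNA sequences obtained by sequencing to the original data by the method of any one of claims 11-21.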
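  23. A system for converting data into data DNA sequences, comprising an input device and a dataDNA sequence conversion device;
    wherein the input device is used to provide the binary sequence of a data conversion unit;
    wherein the dataDNA sequence conversion device is used to convert the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100038
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below
    Figure PCTCN2016097398-appb-100039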
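  24. The system according to claim 23, further comprising an indexDNA generation device and a first integration device;
    wherein the indexDNA generation device is used to convert the position number of the data conversion unit within the data into a ternary sequence of fixed length, and to convert the ternary sequence, according to the indexDNA sequence conversion rule, into an indexDNA sequence with as many bases as the ternary sequence has digits;
    the indexDNA sequence conversion rule is:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100041
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    wherein the first integration device is used to join the indexDNA sequence of the data conversion unit to the dataDNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain an index+dataDNA sequence.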
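  25. A system for converting data into data DNA sequences containing a mutation correction sequence, the system comprising an input device, a preliminary data DNA conversion device, a correctionDNA sequence generation device and a second integration device;
    wherein the input device is used to provide the binary sequence of a data conversion unit;
    wherein the preliminary data DNA conversion device is used to convert the binary sequence of the data conversion unit into a preliminary data DNA sequence that does not contain a mutation correction sequence, the preliminary data DNA sequence containing the data content information of the data conversion unit;
    wherein the correctionDNA sequence generation device is used to generate the correctionDNA sequence as follows:
    first generating a 4-base preliminary judgement sequence from the preliminary data DNA sequence: the base-count judgement value X(i) for i = A, T, C, G is calculated from:
    X(i) = (-1)^{N(i)}
    where i = A, T, C, G; N(i) is the number of occurrences of base i in the preliminary data DNA sequence;
    the 4 bases of the preliminary judgement sequence store the base-count judgement values X(i) for i = A, T, C, G respectively, with bases C and G storing -1 and 1 respectively, to generate the preliminary judgement sequence;
    then generating a 10-base deep judgement sequence from the preliminary data DNA sequence: the position-weighted sum of the bases of the preliminary data DNA sequence, sum, is calculated from:
    sum = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence;
    converting the value of sum into a 10-digit ternary sequence to generate the deep judgement sequence;
    joining the preliminary judgement sequence to the deep judgement sequence, with a protective base C inserted at the junction, to obtain the correctionDNA sequence;
    wherein the second integration device is used to join the preliminary data DNA sequence to the correctionDNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain the data DNA sequence containing a mutation correction sequence.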
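  26. The system according to claim 25, wherein the preliminary data DNA conversion device is a dataDNA sequence conversion device, used to convert the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule, the dataDNA sequence serving as the preliminary data DNA sequence that does not contain a mutation correction sequence;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100044
    Figure PCTCN2016097398-appb-100045
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below
    Figure PCTCN2016097398-appb-100046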
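  27. The system according to claim 25, wherein the preliminary data DNA conversion device comprises an indexDNA sequence generation device, a dataDNA sequence conversion device and a third integration device;
    wherein the indexDNA sequence generation device is used to convert the position number of the data conversion unit within the data into a ternary sequence of fixed length, and to convert the ternary sequence, according to the indexDNA sequence conversion rule, into an indexDNA sequence with as many bases as the ternary sequence has digits;
    the indexDNA sequence conversion rule is:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100048
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    wherein the dataDNA sequence conversion device is used to convert the binary sequence of the data conversion unit into a dataDNA sequence according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100050
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Bases         | AC | TC | CG | GA | GT | GC
    Binary digits | 0  | 1  | 00 | 01 | 10 | 11
    wherein the third integration device is used to join the indexDNA sequence of the data conversion unit to the dataDNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain an index+dataDNA sequence, the index+dataDNA sequence serving as the preliminary data DNA sequence that does not contain a mutation correction sequence.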
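  28. The system according to claim 27, wherein the second integration device is used to attach the correctionDNA sequence to the dataDNA end of the preliminary data DNA sequence, with a protective sequence 2 bases long inserted at the junction, to obtain the data DNA sequence containing a mutation correction sequence.
  29. The system according to any one of claims 23-28, further comprising an encryption device, the encryption device comprising a username and password input device and a device for randomly generating the dataDNA sequence conversion rule;
    wherein the username and password input device is used to provide a username and password;
    wherein the device for randomly generating the dataDNA sequence conversion rule is used to randomly generate, according to the username and password, the mapping between specific binary digits and specific bases within each set of correspondences of the dataDNA sequence conversion rule;
    wherein the dataDNA sequence conversion device is used to convert the binary sequence of the data conversion unit into an encrypted dataDNA sequence according to the dataDNA sequence conversion rule, wherein specific bases are converted into the corresponding specific binary digits according to the mapping generated by the device for randomly generating the dataDNA sequence conversion rule.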
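  30. A system for restoring DNA sequences obtained by sequencing to data, comprising an input device and a dataDNA sequence restoration device;
    wherein the input device is used to provide a DNA sequence obtained by sequencing, wherein the DNA sequence includes a dataDNA sequence representing the data content information of a data conversion unit;
    wherein the dataDNA sequence restoration device is used to restore the dataDNA sequence to data according to the dataDNA sequence conversion rule;
    the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100052
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below
    Figure PCTCN2016097398-appb-100053
  31. The system according to claim 30, wherein the dataDNA sequence restoration device is used to restore the dataDNA sequence to data in binary form, or further to restore the data in binary form to the original data.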
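  32. A system for restoring DNA sequences obtained by sequencing to data, comprising an input device, an indexDNA sequence restoration device and a fourth integration device;
    wherein the input device is used to provide DNA sequences obtained by sequencing, the DNA sequences being a plurality of data DNA sequences, each of which includes an indexDNA sequence representing the position information of a data conversion unit and a dataDNA sequence representing the data content information of the data conversion unit;
    wherein the indexDNA sequence restoration device is used to restore the indexDNA sequence of each data DNA sequence to a ternary sequence according to the indexDNA sequence conversion rule, and then to restore the ternary sequence to the position number of the conversion unit within the data, the indexDNA sequence conversion rule being:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100055
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    wherein the dataDNA sequence restoration device is used to restore the dataDNA sequence of each data DNA sequence to data according to the dataDNA sequence conversion rule, the dataDNA sequence conversion rule being:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100057
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Bases         | AC | TC | CG | GA | GT | GC
    Binary digits | 0  | 1  | 00 | 01 | 10 | 11
    wherein the fourth integration device is used to join the data restored from the dataDNA sequence of each data DNA sequence in the order of their position numbers, to obtain the restored data.
  33. The system according to claim 32, wherein the dataDNA sequence restoration device is used to restore the dataDNA sequence to data in binary form, or further to restore the data in binary form to a character string; and the fourth integration device is used to obtain data in binary form, or further to restore the original data from the data in binary form, or to join the character strings restored by the dataDNA sequence restoration device in the order of their position numbers to obtain string data, or further to restore the original data from that string data.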
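  34. A system for correcting DNA sequences obtained by sequencing and restoring them to data, comprising an input device, an error-correction device and a preliminary data DNA sequence restoration device;
    wherein the input device is used to provide a DNA sequence obtained by sequencing, the DNA sequence comprising a preliminary data DNA sequence and a mutation correction sequence, wherein the preliminary data DNA sequence contains the data content information of a data conversion unit, and the preliminary data DNA sequence in the sequenced DNA sequence has a mutation of at most one base;
    wherein the error-correction device is used to restore the sequencing read of the preliminary data DNA sequence to the unmutated preliminary data DNA sequence as follows:
    (a) from the sequencing read of the preliminary data DNA sequence, calculating its base-count judgement values X'(i) according to the following formula:
    X'(i) = (-1)^{N(i)}
    where i = A, T, C, G; N(i) is the number of occurrences of base i in the sequencing read of the preliminary data DNA sequence;
    comparing the base-count judgement values X'(i) of the sequencing read of the preliminary data DNA sequence with the base-count judgement values X(i) restored, by the same rule, from the preliminary judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence:
    if the base-count judgement values of two bases have changed, the sequencing read of the preliminary data DNA sequence has undergone a base substitution relative to the unmutated preliminary data DNA sequence, and the substitution is one of these two bases being replaced by the other;
    if the base-count judgement value of only one base has changed, the sequencing read of the preliminary data DNA sequence has undergone an insertion or deletion of that base relative to the unmutated preliminary data DNA sequence;
    if no base-count judgement value has changed, the sequencing read of the preliminary data DNA sequence has not mutated;
    (b) from the sequencing read of the preliminary data DNA sequence, calculating its position-weighted base sum, sum', according to the following formula:
    sum' = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the sequencing read of the preliminary data DNA sequence;
    comparing the position-weighted base sum sum' of the sequencing read of the preliminary data DNA sequence with the position-weighted base sum, sum, restored by the same rule from the deep judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence;
    where the sequencing read of the preliminary data DNA sequence has undergone a substitution between two bases relative to the unmutated preliminary data DNA sequence: if sum'>sum, the substitution replaced the base with the smaller val(i) by the base with the larger val(i); if sum'<sum, the substitution replaced the base with the larger val(i) by the base with the smaller val(i); the position coordinate of the substitution is the absolute value of the quotient obtained by dividing the difference between sum' and sum by the difference between the val(i) values of the two bases; the base at that position is replaced by the other of the two bases, correcting the sequencing read to the unmutated preliminary data DNA sequence;
    where the sequencing read of the preliminary data DNA sequence has undergone an insertion or deletion of one base relative to the unmutated preliminary data DNA sequence:
    if sum'>sum, a base insertion has occurred, and the insertion position is determined as follows: starting from the first occurrence of said base in the sequencing read of the preliminary data DNA sequence, said base is deleted at each position where it occurs, one position at a time, and after each deletion the position-weighted base sum of the resulting preliminary data DNA sequence, sum'', is calculated according to the following formula:
    sum'' = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence after deletion of said base;
    when the position-weighted base sum sum'' calculated after deleting said base at some position equals the position-weighted base sum, sum, restored by the same rule from the deep judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence, that position is the insertion site of said base; said base at that position is deleted, correcting the sequencing read to the unmutated preliminary data DNA sequence;
    if sum'<sum, a base deletion has occurred, and the deletion position is determined as follows: starting from the first position of the sequencing read of the preliminary data DNA sequence, said base is inserted at each position, one position at a time, and after each insertion the position-weighted base sum of the resulting preliminary data DNA sequence, sum''', is calculated according to the following formula:
    sum''' = \sum_{position(i)=1}^{N} val(i) \times position(i)
    where i = A, T, C, G; val(i) is the value of base i, with val(A), val(T), val(C), val(G) equal to 1, 2, 3, 4 respectively; position(i) is the position coordinate of base i; N is the total length of the preliminary data DNA sequence after insertion of said base;
    when the position-weighted base sum sum''' calculated after inserting said base at some position equals the position-weighted base sum, sum, restored by the same rule from the deep judgement sequence in the mutation correction sequence contained in the sequenced DNA sequence, that position is the deletion site of said base; inserting said base at that position corrects the sequencing read to the unmutated preliminary data DNA sequence;
    wherein the preliminary data DNA sequence restoration device is used to restore the unmutated preliminary data DNA sequence to data.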
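  35. The system according to claim 34, wherein the preliminary data DNA sequence contains a dataDNA sequence representing the data content information of the data conversion unit, and the preliminary data DNA sequence restoration device is a dataDNA sequence restoration device, used to restore the dataDNA sequence contained in the unmutated preliminary data DNA sequence to data according to the dataDNA sequence conversion rule; the dataDNA sequence conversion rule is:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100062
    Figure PCTCN2016097398-appb-100063
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below
    Figure PCTCN2016097398-appb-100064
  36. The system according to claim 35, wherein the dataDNA sequence restoration device is used to restore the dataDNA sequence contained in the unmutated preliminary data DNA sequence to data in binary form, or further to restore the data in binary form to the original data.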
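  37. The system according to claim 34, wherein the DNA sequences obtained by sequencing are a plurality of data DNA sequences, the preliminary data DNA sequence of each data DNA sequence contains an indexDNA sequence representing the position information of a data conversion unit and a dataDNA sequence representing the data content information of the data conversion unit, and the preliminary data DNA sequence restoration device comprises an indexDNA restoration device, a dataDNA sequence restoration device and a fifth integration device;
    wherein the indexDNA restoration device is used to restore the indexDNA sequence of each data DNA sequence to a ternary sequence according to the indexDNA sequence conversion rule, and then to restore the ternary sequence to the position number of the conversion unit within the data, the indexDNA sequence conversion rule being:
    (a) for position i of the indexDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the indexDNA sequence, ternary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,CC,GG};
    Figure PCTCN2016097398-appb-100066
    (c) from the third position of the indexDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which set of conditions in the table position i satisfies, then interconvert the ternary digit and the base at position i according to the correspondence for that condition;
    wherein the dataDNA sequence restoration device is used to restore the dataDNA sequence of each data DNA sequence to data according to the dataDNA sequence conversion rule, the dataDNA sequence conversion rule being:
    (a) for position i of the dataDNA sequence, the two bases preceding that position are denoted d=[i-2,i-1];
    (b) for the first two positions of the dataDNA sequence, binary digits and bases are interconverted according to the correspondence in the table below for the condition d ∉ {AT,CT,TT,CA,AA,GG,CC};
    Figure PCTCN2016097398-appb-100068
    * where, when d=[C,A], position i holds base C, and this base C corresponds to no binary digit
    (c) from the third position of the dataDNA sequence onward, conversion proceeds in order according to the rules in the table above: first determine which condition in the table position i satisfies, then interconvert the binary digits and the base at position i according to the correspondence for that condition;
    (d) when 1 or 2 bits of the binary sequence remain, binary digits and bases are interconverted according to the rules in the table below;
    Bases         | AC | TC | CG | GA | GT | GC
    Binary digits | 0  | 1  | 00 | 01 | 10 | 11
    wherein the fifth integration device is used to join the data restored from the dataDNA sequence of each data DNA sequence in the order of their position numbers, to obtain the restored data.
  38. The system according to claim 37, wherein the dataDNA sequence restoration device is used to restore the dataDNA sequence to data in binary form, or further to restore the data in binary form to a character string; and the restored data obtained by the fifth integration device is data in binary form, or original data further restored from the data in binary form, or string data obtained by joining the character strings restored by the dataDNA sequence restoration device in the order of their position numbers, or data further restored from that string data.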
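  39. The system according to any one of claims 30-38, further comprising a decryption device, the decryption device comprising an input device and a device for determining the dataDNA sequence conversion rule;
    wherein the input device is used to provide a username and password;
    wherein the device for determining the dataDNA sequence conversion rule is used to obtain, according to the username and password, the mapping between specific binary digits and specific bases within each set of correspondences of the dataDNA sequence conversion rule, the mapping being the one set for the same username and password when the data were converted into the encrypted DNA sequences;
    wherein the dataDNA sequence restoration device is used to convert the dataDNA sequence in the encrypted DNA sequence obtained by sequencing into data according to the dataDNA sequence conversion rule, wherein specific bases are restored to the corresponding specific binary digits according to the mapping determined by the device for determining the dataDNA sequence conversion rule.
  40. An executable software product containing program instructions, stored on a computer-readable storage medium, which when executed by a computer converts data into data DNA sequences, the software product comprising program instructions for carrying out the method of any one of claims 1-8.
  41. An executable software product containing program instructions, stored on a computer-readable storage medium, which when executed by a computer restores DNA sequences obtained by sequencing to data, the software product comprising program instructions for carrying out the method of any one of claims 11-21.
  42. A computer-readable storage medium storing the software product of claim 40 or 41.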
PCT/CN2016/097398 2016-08-30 2016-08-30 将数据进行生物存储并还原的方法 WO2018039938A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP16914505.9A EP3509018B1 (en) 2016-08-30 2016-08-30 Method for biologically storing and restoring data
PCT/CN2016/097398 WO2018039938A1 (zh) 2016-08-30 2016-08-30 将数据进行生物存储并还原的方法
US16/328,745 US11177019B2 (en) 2016-08-30 2016-08-30 Method for biologically storing and restoring data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/097398 WO2018039938A1 (zh) 2016-08-30 2016-08-30 将数据进行生物存储并还原的方法

Publications (1)

Publication Number Publication Date
WO2018039938A1 true WO2018039938A1 (zh) 2018-03-08

Family

ID=61299573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/097398 WO2018039938A1 (zh) 2016-08-30 2016-08-30 将数据进行生物存储并还原的方法

Country Status (3)

Country Link
US (1) US11177019B2 (zh)
EP (1) EP3509018B1 (zh)
WO (1) WO2018039938A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020239806A1 (en) * 2019-05-27 2020-12-03 Vib Vzw A method of storing digital information in pools of nucleic acid molecules
CN114958828A (zh) * 2022-06-14 2022-08-30 深圳先进技术研究院 基于dna分子介质的数据信息存储方法

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11106633B2 (en) * 2018-04-24 2021-08-31 EMC IP Holding Company, LLC DNA-based data center with deduplication capability
US11456759B2 (en) * 2018-05-25 2022-09-27 Erlich Lab Llc Optimized encoding for storage of data on polymers in asynchronous synthesis
US11017170B2 (en) * 2018-09-27 2021-05-25 At&T Intellectual Property I, L.P. Encoding and storing text using DNA sequences
WO2020075282A1 (ja) * 2018-10-11 2020-04-16 富士通株式会社 変換方法、変換装置および変換プログラム
CN111368132B (zh) * 2020-02-28 2023-04-14 元码基因科技(北京)股份有限公司 基于dna序列存储音频或视频文件的方法及存储介质
CN111489791B (zh) * 2020-04-07 2023-05-26 中国科学院重庆绿色智能技术研究院 固态纳米孔高密度编码dna数字存储读取方法
WO2023015550A1 (zh) * 2021-08-13 2023-02-16 深圳先进技术研究院 Dna数据的存储方法、装置、设备及可读存储介质
CN114374504A (zh) * 2021-08-16 2022-04-19 中电长城网际系统应用有限公司 数据加密方法、解密方法、装置、服务器
CN114356222B (zh) * 2021-12-13 2022-08-19 深圳先进技术研究院 数据存储方法、装置、终端设备及计算机可读存储介质
CN114758703B (zh) * 2022-06-14 2022-09-13 深圳先进技术研究院 基于重组质粒dna分子的数据信息存储方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2376686A (en) * 2001-02-10 2002-12-24 Nat Inst Of Agricultural Botan Storage of encoded information within biological macromolecules
US20050053968A1 (en) * 2003-03-31 2005-03-10 Council Of Scientific And Industrial Research Method for storing information in DNA
CN104520864A (zh) * 2012-06-01 2015-04-15 欧洲分子生物学实验室 Dna中数字信息的高容量存储
CN104850760A (zh) * 2015-03-27 2015-08-19 苏州泓迅生物科技有限公司 带有编码信息的人工合成dna存储介质及信息的存储读取方法和应用
CN105022935A (zh) * 2014-04-22 2015-11-04 中国科学院青岛生物能源与过程研究所 一种利用dna进行信息存储的编码方法和解码方法
CN105119717A (zh) * 2015-07-21 2015-12-02 郑州轻工业学院 一种基于dna编码的加密系统及加密方法

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040086861A1 (en) * 2000-04-19 2004-05-06 Satoshi Omori Method and device for recording sequence information on nucleotides and amino acids
US8412462B1 (en) * 2010-06-25 2013-04-02 Annai Systems, Inc. Methods and systems for processing genomic data
US10068054B2 (en) * 2013-01-17 2018-09-04 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US9792405B2 (en) * 2013-01-17 2017-10-17 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform
US20170141793A1 (en) * 2015-11-13 2017-05-18 Microsoft Technology Licensing, Llc Error correction for nucleotide data stores
US10566077B1 (en) * 2015-11-19 2020-02-18 The Board Of Trustees Of The University Of Illinois Re-writable DNA-based digital storage with random access
US10640822B2 (en) * 2016-02-29 2020-05-05 Iridia, Inc. Systems and methods for writing, reading, and controlling data stored in a polymer
US20180211001A1 (en) * 2016-04-29 2018-07-26 Microsoft Technology Licensing, Llc Trace reconstruction from noisy polynucleotide sequencer reads
WO2018132457A1 (en) * 2017-01-10 2018-07-19 Roswell Biotechnologies, Inc. Methods and systems for DNA data storage
US11550939B2 (en) * 2017-02-22 2023-01-10 Twist Bioscience Corporation Nucleic acid based data storage using enzymatic bioencryption
US10930370B2 (en) * 2017-03-03 2021-02-23 Microsoft Technology Licensing, Llc Polynucleotide sequencer tuned to artificial polynucleotides
US10742233B2 (en) * 2017-07-11 2020-08-11 Erlich Lab Llc Efficient encoding of data for storage in polymers such as DNA
CN111373049A (zh) * 2017-08-30 2020-07-03 罗斯威尔生命技术公司 Processive enzyme molecular electronic sensors for DNA data storage
US10810495B2 (en) * 2017-09-20 2020-10-20 University Of Wyoming Methods for data encoding in DNA and genetically modified organism authentication
WO2019075100A1 (en) * 2017-10-10 2019-04-18 Roswell Biotechnologies, Inc. Methods, apparatus and systems for storing DNA data without amplification
KR20210053292A (ko) * 2018-08-03 2021-05-11 카탈로그 테크놀로지스, 인크. Systems and methods for storing and reading nucleic acid-based data with error protection
US11249941B2 (en) * 2018-12-21 2022-02-15 Palo Alto Research Center Incorporated Exabyte-scale data storage using sequence-controlled polymers
US20210074380A1 (en) * 2019-09-05 2021-03-11 Microsoft Technology Licensing, Llc Reverse concatenation of error-correcting codes in dna data storage

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020239806A1 (en) * 2019-05-27 2020-12-03 Vib Vzw A method of storing digital information in pools of nucleic acid molecules
CN114958828A (zh) * 2022-06-14 2022-08-30 深圳先进技术研究院 Data information storage method based on DNA molecular media
CN114958828B (zh) * 2022-06-14 2024-04-19 深圳先进技术研究院 Data information storage method based on DNA molecular media

Also Published As

Publication number Publication date
EP3509018B1 (en) 2023-10-18
US11177019B2 (en) 2021-11-16
EP3509018A1 (en) 2019-07-10
EP3509018A4 (en) 2020-06-03
US20190311782A1 (en) 2019-10-10

Similar Documents

Publication Publication Date Title
WO2018039938A1 (zh) Method for biologically storing and restoring data
CN107798219B (zh) Method for biologically storing and restoring data
Venter et al. Synthetic chromosomes, genomes, viruses, and cells
De Silva et al. New trends of digital data storage in DNA
Rowe et al. Museum genomics: low‐cost and high‐accuracy genetic data from historical specimens
US8554492B2 (en) Method and apparatus for searching nucleic acid sequence
JP2020515243A (ja) Nucleic acid-based data storage
Forslund et al. Evolution of protein domain architectures
Akram et al. Trends to store digital data in DNA: an overview
CN112802549B (zh) Encoding and decoding method for DNA sequence integrity checking and error correction
Song et al. Orthogonal information encoding in living cells with high error-tolerance, safety, and fidelity
Wang et al. Hidden addressing encoding for DNA storage
Dey et al. Complete mitogenome of endemic plum-headed parakeet Psittacula cyanocephala–characterization and phylogenetic analysis
Lee et al. Reversible DNA data hiding using multiple difference expansions for DNA authentication and storage
Zhang et al. EasyCGTree: a pipeline for prokaryotic phylogenomic analysis based on core gene sets
Weissman et al. Benchmarking community-wide estimates of growth potential from metagenomes using codon usage statistics
CN111095423B (zh) Encoding/decoding method, apparatus, and data processing apparatus
Ping et al. Towards practical and robust DNA-based data archiving by codec system named ‘Yin-Yang’
Dagher et al. Data storage in cellular DNA: contextualizing diverse encoding schemes
Abdullah et al. New data hiding approach based on biological functionality of DNA sequence
Balado On the embedding capacity of DNA strands under substitution, insertion, and deletion mutations
Haughton et al. A modified watermark synchronisation code for robust embedding of data in DNA
JP2003517664A (ja) Information encryption and retrieval using synthetic genes
US20230032409A1 (en) Method for Information Encoding and Decoding, and Method for Information Storage and Interpretation
Bi et al. Extended XOR Algorithm with Biotechnology Constraints for Data Security in DNA Storage

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16914505; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2016914505; Country of ref document: EP; Effective date: 20190401)