WO2004107243A1 - Method for designing DNA codes as information carriers - Google Patents
- Publication number
- WO2004107243A1 (PCT/JP2004/007271)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dna
- code
- sequence
- sequences
- template
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B82—NANOTECHNOLOGY
- B82Y—SPECIFIC USES OR APPLICATIONS OF NANOSTRUCTURES; MEASUREMENT OR ANALYSIS OF NANOSTRUCTURES; MANUFACTURE OR TREATMENT OF NANOSTRUCTURES
- B82Y10/00—Nanotechnology for information processing, storage or transmission, e.g. quantum computing or single electron logic
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
Definitions
- The present invention relates to a DNA code design method that avoids errors that may occur when artificially designed DNA is used as an information carrier and that yields a simple, general-purpose information carrier for writing information into a biopolymer; to a DNA code obtained by such a design method; and to a method of writing arbitrary information to DNA by embedding such DNA codewords in an arbitrary non-coding region containing no genetic information.
- DNA has a structure in which four types of bases, adenine (A), cytosine (C), guanine (G), and thymine (T), are linked in a chain. A pairs with T and C pairs with G by hydrogen bonding, so A-T and C-G are said to be complementary, and two complementary DNA strands form a double helix. When the temperature rises, the double helix dissociates into single strands, and when the temperature falls, each strand rejoins its complementary strand. This binding to the complementary strand is called hybridization, and it is well known that the temperature at which DNA strands dissociate and hybridize depends on the GC content of the sequence.
- A adenine
- C cytosine
- G guanine
- T thymine
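The base-pairing rules and the GC content mentioned above can be illustrated with a minimal Python sketch. The function names `reverse_complement` and `gc_content` are chosen here for illustration and do not come from the patent text.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that hybridizes perfectly with `seq` (read 5' to 3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def gc_content(seq: str) -> float:
    """Return the fraction of positions that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)
```

A sequence and its reverse complement always have the same GC content, which is one reason GC content is a convenient first proxy for melting behaviour.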
- To describe information using DNA, a plurality of oligonucleotide sequences corresponding to characters is prepared (see Non-Patent Document 1). Such a collection of fixed-length artificial oligonucleotide sequences is used in many fields of application, as described below.
- A completely new type of computer called a "DNA computer" represents a computational paradigm different from that of current computers (for example, see Non-Patent Document 2).
- In a DNA computer, symbolic processing is realized by expressing logical variables or components of graphs as DNA base sequences in order to solve mathematical problems, and by applying experimental methods of molecular biology to those base sequences. Here too, an artificially designed set of fixed-length oligonucleotide sequences is used.
- DNA code is a set of mutually different base sequences having the same length.
- The constraints that a DNA code designed in this way must satisfy are that physical properties such as melting temperature are uniform across all codewords (base sequences) and that undesired hybridization (mishybridization) does not occur. The design method therefore has much in common with the classical design of error-correcting codes.
- Nevertheless, the design of DNA codes differs from that of error-correcting codes, and there is no standard design method.
- The following describes three basic approaches that have been used in the design of DNA codes: (1) the template-map strategy, (2) the De Bruijn construction, and (3) the stochastic method.
- the DNA code designed by this method can only satisfy the properties that have been studied in the conventional binary code.
- Unlike codes used electronically, DNA has no means of marking the delimiter (comma) between codewords. Therefore, a mechanism is needed that can detect a shift in the reading frame of a codeword.
- This property is called comma-free, in the sense that no commas are required.
- A code in which, whenever the reading frame is shifted, the window spanning the junction of concatenated codewords always has at least d mismatches against every codeword is called a comma-free code with index d.
- the template map strategy cannot make the DNA code comma-free.
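The comma-free index defined above can be computed by brute force for a small code. The following Python sketch assumes the definition as stated in this section: every shifted window across the junction of any two codewords (not necessarily distinct) is compared against every codeword. The function names are illustrative only.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def comma_free_index(code: list[str]) -> int:
    """Smallest mismatch count between any codeword and any frame-shifted
    window taken from the concatenation of two codewords."""
    n = len(code[0])
    best = n
    for x, y in product(code, repeat=2):
        joined = x + y
        for shift in range(1, n):          # shifts 1..n-1 are out of frame
            window = joined[shift:shift + n]
            for z in code:
                best = min(best, hamming(window, z))
    return best
```

A result of 0 means some shifted window reproduces a codeword exactly, so the code is not comma-free; a result of d means the code is comma-free with index d under the assumption above.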
- A (binary) De Bruijn sequence of order k is a circular sequence of length 2^k in which every string of length k occurs exactly once, and a linear-time algorithm for constructing De Bruijn sequences is known.
- Oligonucleotide sequences selected from a De Bruijn sequence of order k have no consecutive match of length k or more. Therefore, if the length of the DNA codewords is 2k or more, a window spanning the junction of concatenated codewords can be prevented from matching a codeword perfectly (a comma-free code with index 1). In fact, Brenner applied a comma-free code with index 1 to the design of oligonucleotide tags (see, for example, Non-Patent Documents 16 and 17). With De Bruijn sequences, however, it is difficult to obtain comma-free codes with index 2 or more, and it is also difficult to guarantee the number of mismatches between the designed codewords. It is therefore very difficult to design a DNA code that is comma-free with a high index and also has a large number of mismatches between codewords.
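As an illustration of the De Bruijn construction, the sketch below uses the widely known Lyndon-word concatenation approach. It is not necessarily the algorithm the source has in mind, and the function name `de_bruijn` is an assumption made for this example. For an alphabet of s symbols the resulting circular sequence has length s^k.

```python
def de_bruijn(alphabet: str, k: int) -> str:
    """Return a De Bruijn sequence of order k over `alphabet`
    (read circularly, every length-k string occurs exactly once)."""
    n_sym = len(alphabet)
    a = [0] * (n_sym * k)
    seq: list[int] = []

    def db(t: int, p: int) -> None:
        if t > k:
            # Emit the Lyndon word only if its length divides k.
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n_sym):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)
```

Candidate codewords can then be cut from the circular sequence; by construction no two distinct positions share a common substring of length k.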
- Stochastic methods are the most widely used approach to code design. Deaton et al. use a genetic algorithm to find codewords that satisfy an "extended" Hamming constraint, that is, a constraint that also takes shifted mismatches into account, and that have a consistent melting temperature (for example, see Non-Patent Document 18). According to their report, owing to the complexity of the problem, genetic algorithms can only be applied to the design of codewords up to length 25 (see, for example, Non-Patent Document 19).
- Landweber et al. used a random codeword generator to design two sets of 10 codewords of length 15. The resulting sequences satisfy the following conditions: (1) no consecutive match of 5 or more bases however the codewords are concatenated, (2) melting temperatures aligned at 45 °C, (3) no secondary structure, and (4) no consecutive pairing of more than 7 base pairs (if the first condition is satisfied the fourth is unnecessary, but the conditions are listed here as given in the original text). They realized these restrictions using only three types of bases (for example, see Non-Patent Document 20). Similarly, other groups that designed codewords with only three bases used random code generation for their designs (for example, see Non-Patent Documents 21-23).
- The disadvantages of the stochastic method are that the designed codewords differ each time (because of its stochastic nature), that the number of codewords that can be designed cannot be estimated, and that the characteristics of the designed codewords (for example, the number of mismatches) cannot be estimated in advance.
- The designed DNA code must maintain a large Hamming distance between codewords. What makes DNA code design more difficult than error-correcting code theory is that the number of mismatches must be considered not only between the codewords themselves but also in hybridization with their complementary sequences.
- Comma-freeness is a property that guarantees a predetermined number of mismatches not only when the reading frames of codewords are aligned but also when the reading frames are shifted. Since DNA does not have a fixed reading frame, it is desirable that the designed code be comma-free. By definition, if every frame-shifted window of the concatenation of any two (not necessarily different) codewords has at least d mismatches against every codeword, the code is comma-free with index d (for example, see Non-Patent Documents 25 and 26). The DNA code must therefore be comma-free with a high index. Note that this cannot be compensated for by inserting a "spacer" codeword between codewords. Although such a spacer facilitates decoding, it does not help to avoid mishybridization. In addition, a spacer inserts an extra DNA sequence between codewords and thereby reduces the information density.
- A uniform melting temperature across the DNA code is necessary to ensure unbiased reactions in experiments.
- Patent Document 1 JP 2001-352980 A
- Patent Document 2 European Patent No. 97302313
- Patent Document 3 US Patent No. 5604097
- Non-Patent Document 1 Biochemistry 37, 26, 9435-9444, 1998
- Non-Patent Document 2 Science 266, 5187, 1021-1024, 1994
- Non-Patent Document 3 Proceedings of the National Academy of Sciences of USA 89, 12, 5381-5383, 1992
- Non-Patent Document 4 Proceedings of the National Academy of Sciences of USA 97, 4, 1665-1670, 2000
- Non-Patent Document 5 Journal of Computational Biology 7, 3-4, 503-519, 2000
- Non-Patent Document 6 10th Foresight Conference on Molecular Nanotechnology (Bethesda,
- Non-Patent Document 7 Nucleic Acids Research 25, 23, 4748-4757, 1997
- Non-Patent Document 9 Langmuir 18, 3, 805-812, 2002
- Non-Patent Document 10 Journal of Computational Biology 8, 3, 201-219, 2001
- Non-Patent Document 11 Journal of Computational Biology 7, 3-4, 503-519, 2000
- Non-Patent Document 12 Genome Research 10, 6, 853-860, 2000
- Non-Patent Document 13 Judson, H.F .: The Eighth Day of Creation: Makers of the
- Non-Patent Document 14 IEEE Transactions on Information Theory, IT-11, 107-112, 1965
- Non-Patent Document 15 Stiffler, J. J.: Theory of Synchronous Communication. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1971
- Non-Patent Document 16 Proceedings of the National Academy of Sciences of USA 89, 12, 5381-5383, 1992
- Non-Patent Document 17 Proceedings of the National Academy of Sciences of USA 97, 4, 1665-1670, 2000
- Non-Patent Document 18 DNA Based Computers II, DIMACS Series in Discrete Mathematics and Theoretical Computer Science 44, 247-258, 1998
- Non-Patent Document 19 Proceedings of the 3rd Annual Genetic Programming Conference, Morgan Kaufmann, 684-690, 1998
- Non-Patent Document 20 Proceedings of the National Academy of Sciences of USA 97, 4, 1385-1389, 2000
- Non-Patent Document 21 DNA Computing: 6th International Workshop on DNA-Based
- Non-Patent Document 22 LNCS 2054, 17-26, 2001
- Non-Patent Document 23 Science 296, 5567, 499-502, 2002
- Non-Patent Document 24 Proceedings of 8th International Meeting on DNA-Based Computers (DNA 2002; Sapporo, Japan), 311-323, 2002
- Non-Patent Document 25 Canadian Journal of Mathematics 10, 202-209, 1958
- Non-Patent Document 26 Canadian Journal of Mathematics 39, 3, 513-526, 1987
- Non-Patent Document 27 Proceedings of the National Academy of sciences of USA 83, 11,
- Non-Patent Document 28 Biochemistry 37, 26, 9435-9444, 1998
- Non-Patent Document 29 Critical Reviews in Biochemistry and Molecular Biology 26, 3-4, 227-259, 1991
- When reading DNA, a specific sequence called a primer is required.
- Primer sequences are placed at both ends of the information-carrying sequence so that only the region sandwiched between them (the information sequence) is amplified.
- Conventional DNA coding technology assumes that written information can be read from DNA "as is" and does not consider the presence of read errors. It also does not consider primers, that is, the preparation of specific sequences at both ends of the information to be embedded in DNA. In addition, because the conventional methods do not describe a specific means of writing information into DNA, they do not show how to align physical characteristics or how to prevent the appearance of specific sequences. The replication of genetic information is subject to many experimental restrictions, and even with advanced technology it is impossible to replicate genetic information without errors. Even if errors are eliminated at the replication stage, when an information sequence is written into the DNA of a living organism, mutation of the sequence caused by in vivo molecules or radiation must be taken into consideration.
- An object of the present invention is to provide a code that serves as an information carrier for reading and writing arbitrary information in an arbitrary non-coding region that does not contain genetic information of DNA, that is, a set of base sequences for a set of symbols given an artificial meaning (such as letters of the alphabet); in other words, to provide a method for designing a DNA code.
- The codewords of such a DNA code can be associated with the coding systems used by computers, and, however the characters are concatenated, decoding can be performed with very high reliability.
- These DNA codewords have characteristics that differ sufficiently from those of natural DNA, and they can be embedded in any region of DNA that contains no genetic information. Furthermore, the DNA codewords produced by the design method of the present invention can be used as an information storage medium.
- The present inventor has already proposed (Japanese Patent Application 2001-331732) a systematic method for designing a set S1 of oligonucleotide sequences of a predetermined length n (n is an integer of 3 or more, preferably 6 or more). Each oligonucleotide sequence in S1 contains at least a predetermined number of mismatches against the sequences complementary to the other oligonucleotide sequences in S1, against shifted versions of these, and against sequences obtained by concatenating oligonucleotide sequences, complementary sequences, or both, so that mishybridization among them can be avoided. The proposal also covers a systematic method for designing a set S1 that avoids mishybridization against reverse sequences as well as complementary sequences.
- As a result of intensive studies to solve the above problems, the present inventor found a DNA code design method that satisfies all of these conditions: a template having a subword constraint of length m is selected and combined with codewords of a predetermined error-correcting code that also has a subword constraint of length m, yielding a set S2 of base sequences that can be used as characters when describing information.
- The present invention was completed by establishing a correspondence between existing character encoding systems and an encoding system based on the base sequences of DNA.
- That is, the present invention relates to a method for designing a DNA code (Claim 1), characterized in that: an oligonucleotide sequence having a predetermined length n (n is an integer of 6 or more) is expressed as a bit string (GC template) of a predetermined length L (L is an integer of 6 or more) consisting of 0 and 1, according to whether each position is G or C ([GC]) or A or T ([AT]); GC templates are selected for which the Hamming distance between GC templates, the Hamming distance to the reverse sequence of each GC template, the Hamming distance between shifted versions of these, and the Hamming distance to sequences obtained by concatenating GC templates, their reverse sequences, or a GC template and its reverse sequence are all equal to or greater than a predetermined value k; from the selected GC templates, a set having a subword constraint of length m is selected as templates; and the templates are combined with codewords of a predetermined error-correcting code that also has a subword constraint of length m to prepare a set S1 of oligonucleotide sequences.
- The same design method in which, instead of a GC template, a bit string (AG template) of a predetermined length L (L is an integer of 6 or more) consisting of 0 and 1 is used, according to whether each position is A or G ([AG]) or T or C ([CT]) (Claim 2).
- The design method in which the set S1 of oligonucleotide sequences maintaining the Hamming distance k contains at least a predetermined number of mismatches between the sequences, against the complementary sequences of the other sequences, against shifted versions of these, and against sequences obtained by concatenating the sequences and the complementary sequences, so that mishybridization among them is avoided and decoding of the information is facilitated (Claim 3).
- The design method in which the set S1 of oligonucleotide sequences of the predetermined length n is a set of oligonucleotide sequences having a length of 32 or less (Claim 4).
- The design method in which the predetermined value k of the Hamming distance is 1/4 or more of L (Claim 5).
- The design method in which the value m of the subword constraint of length m is 1/2 or more of L (Claim 6).
- The design method in which the set S1 of oligonucleotide sequences is a set of oligonucleotide sequences that contain, or do not contain, a specific partial sequence (Claim 7).
- A method for designing a DNA code characterized in that the codewords of the predetermined error-correcting code are chosen from codes ...
- The present invention also relates to the following DNA codes:
- a DNA code consisting of a set of base sequences corresponding to symbol units, for writing arbitrary information, using a computer-readable coding system, into a non-coding region that contains no genetic information of DNA (Claim 11);
- a DNA code consisting of a set of base sequences in which errors such as the skipping or substitution of several bases can easily be detected (Claim 12);
- a DNA code having an error-correcting function, which can be decoded with high reliability even in the presence of errors such as reading-frame shifts and substitutions of multiple bases in the base sequences corresponding to symbol units (Claim 13);
- a DNA code in which no stable secondary structure forms between the base sequences corresponding to symbol units and in which, however the characters are linked, no physical inhibition arises that impedes amplification by primers (Claim 14);
- a DNA code consisting of a set of base sequences corresponding to symbol units that can easily be distinguished from natural DNA (Claim 15);
- a DNA code in which it can easily be verified whether a specific subsequence appears in the base sequences (Claim 16);
- a DNA code in which mismatches occur in at least four positions in any hybridization ...
- The present invention further relates to: a method for writing arbitrary information into DNA in which the DNA is vector DNA (Claim 20); a method for writing arbitrary information into DNA in which the DNA is genomic DNA (Claim 21); a method for writing arbitrary information into DNA in which the creator of the DNA can be identified by the DNA code (Claim 22); a labeled vector in which these DNA codes are embedded in any non-coding region that contains no genetic information of DNA (Claim 23); labeled cells in which these DNA codes are embedded in any non-coding region that contains no genetic information of DNA (Claim 24); and DNA tags having these DNA codes (Claim 25).
- a DNA code having the following characteristics can be designed.
- In the GC template method, an oligonucleotide sequence having a predetermined length n (n is an integer of 6 or more) is expressed as a bit string (GC template) of a predetermined length L (L is an integer of 6 or more) consisting of 0 and 1, according to whether each position is G or C ([GC]) or A or T ([AT]). GC templates are selected for which the Hamming distance between GC templates, the Hamming distance to the reverse sequence of each GC template, the Hamming distance between shifted versions of these, and the Hamming distance to sequences obtained by concatenating GC templates, their reverse sequences, or a GC template and its reverse sequence are all equal to or greater than a predetermined value k, and from the set of selected GC templates a set having a subword constraint of length m is selected as templates.
- In the AG template method, the same procedure is applied to a bit string (AG template) of a predetermined length L (L is an integer of 6 or more) consisting of 0 and 1, according to whether each position is A or G ([AG]) or T or C ([CT]).
- In either case, a set of oligonucleotide sequences corresponding to unit signals in information transmission is obtained by selecting a set having a subword constraint of length m as templates and combining it with codewords of a predetermined error-correcting code that also has a subword constraint of length m.
- The oligonucleotide sequence includes both DNA sequences and RNA sequences, and "designing a DNA code" in the present invention also covers designing an RNA code.
- The term "encoding" refers to associating a specific base sequence with a character or symbol so that the character or symbol can be handled by a computer.
- The term "DNA code" refers to a code that uses DNA as its medium.
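The idea of encoding, associating a fixed base sequence with each character, can be sketched as a simple lookup table. The four-letter codebook below is purely hypothetical and is not a codeword set from the patent; it only shows the mechanics of writing and reading fixed-length codewords.

```python
# Hypothetical toy codebook (illustration only, not the patent's codewords).
CODEBOOK = {"A": "ACGT", "B": "CATG", "C": "GTAC", "D": "TGCA"}
DECODEBOOK = {word: char for char, word in CODEBOOK.items()}
CODEWORD_LENGTH = 4


def encode(text: str) -> str:
    """Concatenate the codeword of each character."""
    return "".join(CODEBOOK[char] for char in text)


def decode(dna: str) -> str:
    """Split the sequence into fixed-length codewords and look each one up."""
    chunks = (dna[i:i + CODEWORD_LENGTH]
              for i in range(0, len(dna), CODEWORD_LENGTH))
    return "".join(DECODEBOOK[chunk] for chunk in chunks)
```

This toy decoder assumes an aligned reading frame and error-free reads; the design constraints discussed in this document exist precisely because neither assumption holds for real DNA.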
- The DNA code obtained by the design method of the present invention can be used advantageously to write arbitrary information into any non-coding region that contains no genetic information of DNA, such as an intron, a 5'-non-coding region, or a 3'-non-coding region.
- the upper limit of the predetermined length n (n is an integer of 6 or more) of the oligonucleotide sequence is not limited, but is usually 100 bases, and preferably 32 bases.
- a subset of the set S1 is also included.
- In the following, the case where the oligonucleotide sequence is a DNA sequence is taken as the main example, and the description focuses on the design, using a GC template, of a DNA code consisting of a set of base sequences corresponding to unit signals such as letters of the alphabet, based on a set S1 whose sequences contain mismatches against complementary sequences as well.
- Each sequence in the set S1 designed using the template contains at least a predetermined number of mismatches against itself and against the other sequences in S1, whether or not the sequences are shifted relative to each other, so that mishybridization can be avoided.
- A mismatch refers to a pairing with a base other than the complementary base during hybridization. The predetermined number of mismatches is not particularly limited as long as it is large enough for mishybridization to be avoided.
- The oligonucleotide sequences constituting the set S1 can be handled as a sequence set in which the positions where a specific partial sequence occurs can easily be identified.
- Examples of specific subsequences include arbitrary DNA sequence signals such as a restriction enzyme recognition site, the poly-A portion of RNA, a translation initiation codon such as ATG, a stop codon such as TAA, TAG, or TGA, an expression signal sequence, a consensus sequence recognized by a transcription factor (GCCAATCT, ATGCAAAT), or a nucleotide sequence encoding a variable domain of an antibody.
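Checking whether such specific subsequences occur in a candidate sequence amounts to a substring scan. The sketch below is one simple way to do it; the function name `find_motifs` and the motif labels are illustrative assumptions, and GAATTC is used as a familiar example of a restriction site (EcoRI).

```python
def find_motifs(seq: str, motifs: dict[str, str]) -> list[tuple[str, int]]:
    """Return (motif name, 0-based start) for every occurrence, by position."""
    hits = []
    for name, motif in motifs.items():
        start = seq.find(motif)
        while start != -1:
            hits.append((name, start))
            start = seq.find(motif, start + 1)   # allow overlapping hits
    return sorted(hits, key=lambda hit: hit[1])


# Example motif table: stop codons from the text plus an EcoRI site.
EXAMPLE_MOTIFS = {
    "EcoRI site": "GAATTC",
    "start codon": "ATG",
    "stop TAA": "TAA",
    "stop TAG": "TAG",
    "stop TGA": "TGA",
}
```

A code whose concatenations return an empty list for the unwanted motifs can be embedded without accidentally creating, for example, a new cleavage site or translation signal.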
- the above-mentioned set S1 of oligonucleotide sequences can usually be designed in two stages.
- The first stage is the design of GC templates using the Hamming distance, and the second stage is the design of the target set S1 of oligonucleotide sequences of the present invention, using error-correcting code theory, from the set of oligonucleotide sequences represented by the designed GC templates.
- The first step is to determine whether each position in the sequence is [GC] or [AT]. These positions are expressed as a GC template consisting of 0 and 1, b_1 b_2 ... b_L (b_i ∈ {0, 1}), where 1 denotes an [AT] position and 0 denotes a [GC] position.
- A GC template of length L therefore represents 2^L sequences rather than 4^L.
- The base sequence is determined by specifically substituting each 1 in the GC template with A or T and each 0 with G or C (or the reverse combination).
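How a template and a binary codeword jointly fix a base sequence can be sketched as follows. The template convention (1 for [AT], 0 for [GC]) follows the text; the particular assignment of codeword bits to bases within each class (0 gives A or C, 1 gives T or G) is an assumption made for this example, as are the function names.

```python
from itertools import product

# (template bit, codeword bit) -> base.  The template side follows the text;
# the codeword side is an assumed convention for illustration.
BASES = {(1, 0): "A", (1, 1): "T", (0, 0): "C", (0, 1): "G"}


def apply_template(template: str, codeword: str) -> str:
    """Combine a GC template and a binary codeword into one base sequence."""
    if len(template) != len(codeword):
        raise ValueError("template and codeword must have the same length")
    return "".join(BASES[(int(t), int(c))] for t, c in zip(template, codeword))


def sequences_for_template(template: str) -> list[str]:
    """All 2^L base sequences that a template of length L can represent."""
    length = len(template)
    return [apply_template(template, "".join(bits))
            for bits in product("01", repeat=length)]


def gc_count(seq: str) -> int:
    return sum(base in "GC" for base in seq)
```

Because the template alone decides which positions are G or C, every sequence generated from one template has the same GC count, which is what keeps melting temperatures close across the code.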
- The Hamming distance described above is used as the measure of similarity between sequences.
- For a GC template t with reverse sequence t^R, the Hamming distances considered are those between t and the concatenation of t with itself, between t and the concatenation of t^R with itself, and between t and the concatenations of t and t^R.
- MD: abbreviation for minimum distance
- The GC template design method described above is used in the first step of preparing the set S1 of oligonucleotide sequences.
- In the GC template design method, an oligonucleotide sequence having a predetermined length n is expressed as a bit string (GC template) consisting of 0 and 1, according to whether each position is [GC] or [AT].
- GC templates are then selected for which the Hamming distance between GC templates, the Hamming distance to the reverse sequence of each GC template, the Hamming distance between shifted versions of these, and the Hamming distance to sequences obtained by concatenating GC templates, their reverse sequences, or a GC template and its reverse sequence, that is, the minimum distance MD(t), are all equal to or greater than a predetermined value k.
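One possible reading of the minimum distance MD(t) of a template is sketched below. It assumes that MD(t) is the smallest Hamming distance between t and (a) its reverse t^R and (b) every out-of-frame window of the concatenations tt, t^R t^R, t t^R, and t^R t. The exact set of comparisons used in the patent may differ, so this is an interpretation rather than the definitive formula.

```python
def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def md(template: str) -> int:
    """Assumed minimum distance of a template against its reverse and
    against shifted windows of concatenations of itself and its reverse."""
    length = len(template)
    rev = template[::-1]
    best = hamming(template, rev)
    for joined in (template + template, rev + rev,
                   template + rev, rev + template):
        for shift in range(1, length):       # skip the in-frame positions
            window = joined[shift:shift + length]
            best = min(best, hamming(template, window))
    return best


def invert(template: str) -> str:
    """Swap 0 and 1; the inverted template has the same MD value."""
    return template.translate(str.maketrans("01", "10"))
```

Templates with `md(t) >= k` would then be kept as candidates under this reading.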
- The length L of the GC template is 6 or more, preferably 6-100, more preferably 6-32, and particularly preferably around 20, a length often used in molecular biology experiments. With a length of 5 or less, no template having the desired Hamming distance is obtained.
- The predetermined value k is not particularly limited as long as the oligonucleotide sequences of the present invention produced from such a GC template can avoid mishybridization.
- The value k is preferably 1/5 or more of the length L of the GC template, more preferably 1/4 or more, and particularly preferably 1/3 or more.
- Many GC templates may exist, but those with the largest k value (MD value) for a given length are particularly important.
- the shortest GC template that satisfies a specific MD value (k value) is shown in [Table 2].
- the number excluding the GC template is indicated as "item”.
- The GC template sequences listed in [Table 1]-[Table 4] above can be selected by those skilled in the art by exhaustively searching all patterns from the all-0 sequence to the all-1 sequence. However, it is not necessary to search all 2^L patterns to find the GC templates of length L. Because a GC template with 0 and 1 inverted has the same properties, only templates containing L/2 or fewer 1 bits need to be considered. In addition, the constraint on the number of mismatches implies that, when the minimum distance is d, a template contains at least (L - sqrt(L^2 - 2dL))/2 bits equal to 1 (sqrt denotes the square root). GC templates can be obtained efficiently by using these constraints to prune the search.
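The two pruning rules above (at most L/2 ones, and at least (L - sqrt(L^2 - 2dL))/2 ones) can be turned into a candidate generator. This is a minimal sketch: the function names are illustrative, and the generator only enumerates candidates; each would still need to be tested against the distance condition.

```python
import math
from itertools import combinations
from typing import Iterator


def min_ones(length: int, d: int) -> int:
    """Lower bound on the number of 1 bits for minimum distance d."""
    if 2 * d > length:
        raise ValueError("bound is only defined for d <= L/2")
    return math.ceil((length - math.sqrt(length * length - 2 * d * length)) / 2)


def candidate_templates(length: int, d: int) -> Iterator[str]:
    """Yield bit strings whose weight lies between the bound and L/2."""
    for weight in range(min_ones(length, d), length // 2 + 1):
        for ones in combinations(range(length), weight):
            bits = ["0"] * length
            for i in ones:
                bits[i] = "1"
            yield "".join(bits)
```

For L = 12 and d = 4 the bound gives at least 3 ones, so 2431 candidates remain instead of all 4096 patterns.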
- Designing the set S1 of oligonucleotide sequences generated from the GC template so that its sequences contain, or do not contain, specific partial sequences such as the restriction enzyme recognition sites described above makes the design easier, because it corresponds to narrowing the space of the exhaustive search.
- Following the GC template design step using the Hamming distance, the set S1 of oligonucleotide sequences can be designed from the set of oligonucleotide sequences represented by the designed GC template by using error-correcting code theory, that is, by combining the template with codewords of an error-correcting code.
- Codewords of any known error-correcting code can be used, for example a Hamming code, BCH code, maximum-length sequence code, Golay code, Reed-Muller code, Reed-Solomon code, Hadamard code, Preparata code, reversible code, constant-weight code, or nonlinear code.
- An error-correcting code is a set of codewords in which the number of mismatches between any two codewords is equal to or greater than a certain value. However, if mishybridization between the set S1 and its reverse sequences is to be prevented, it suffices to apply a set of codewords in which the number of matches (rather than mismatches) between any two codewords is equal to or greater than a certain number.
- The information of the codeword is reflected in the sequence together with the information of the GC template. Therefore, to guarantee k mismatches against the complementary sequences, a code that keeps at least k matches between codewords may be used.
- In the theory of error-correcting codes, codes have been developed in which redundant bits for error detection and correction, called "check bits", are added to given information bits so that the Hamming distance between any two codewords is equal to or greater than a certain value. The minimum value of the Hamming distance between codewords is called the minimum distance. Since the goal of coding theory is to design codes with many codewords while keeping the minimum distance large, many codes meet the purpose of the present invention. For example, the Golay code with code length 23 and minimum distance 7 has 4096 codewords. Using this code, 4096 oligonucleotides can be designed for one GC template of length 23 (MD value up to 9).
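The relation between codeword count and minimum distance can be checked directly on a small code. The sketch below uses the (7, 4) Hamming code, chosen here only because it is small enough to enumerate; it is a stand-in for the longer codes (such as the Golay code) named in the text, and the generator matrix shown is one standard systematic form.

```python
from itertools import combinations, product

# Systematic generator matrix of a (7, 4) Hamming code: [I_4 | P].
GENERATOR = [
    (1, 0, 0, 0, 1, 1, 0),
    (0, 1, 0, 0, 1, 0, 1),
    (0, 0, 1, 0, 0, 1, 1),
    (0, 0, 0, 1, 1, 1, 1),
]


def codewords(generator):
    """Enumerate all codewords of the binary linear code spanned by the rows."""
    k = len(generator)
    words = []
    for message in product((0, 1), repeat=k):
        word = tuple(sum(m * g for m, g in zip(message, column)) % 2
                     for column in zip(*generator))
        words.append(word)
    return words


def minimum_distance(words) -> int:
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(sum(a != b for a, b in zip(x, y))
               for x, y in combinations(words, 2))
```

Here 4 information bits give 2^4 = 16 codewords with minimum distance 3, in the same way that the 12 information bits of the length-23 Golay code give 2^12 = 4096 codewords.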
- A subword constraint of length m must be taken into consideration when selecting the templates to be used for the set S1.
- To select such a set, the templates that generate S1 are chosen so that no run of m or more consecutive 0/1 bits matches between templates, and from the codewords of the error-correcting code a maximum clique is chosen in which no run of m or more consecutive bits matches between codewords.
- the m value in such a subword constraint of length m is preferably a value of 10 or less from the viewpoint that mismatch can be sufficiently dispersed. For example, when L is 12, the m value can be 7.
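A simple check for the subword constraint is sketched below. It assumes the constraint means that no two distinct words share a run of m or more consecutive matching positions when compared in the same frame; shifted comparisons, which the full design would also consider, are left out to keep the example short. The function names are illustrative.

```python
from itertools import combinations


def longest_match_run(a: str, b: str) -> int:
    """Length of the longest run of consecutive positions where a and b agree."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


def satisfies_subword_constraint(words: list[str], m: int) -> bool:
    """True if every pair of distinct words agrees on fewer than m
    consecutive positions (in-frame comparison only)."""
    return all(longest_match_run(a, b) < m
               for a, b in combinations(words, 2))
```

Limiting the longest matching run spreads the mismatches along the sequence, which is the stated reason for preferring small m.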
- Examples of DNA codes used to write arbitrary information into an arbitrary non-coding region that contains no genetic information of DNA, using a computer-readable code system such as a binary code, include:
- a DNA code consisting of a set of base sequences encoded so that the melting temperature calculated by the nearest-neighbor base-pair method falls within a predetermined range;
- a DNA code consisting of a set of base sequences encoded so that errors such as the skipping or substitution of several bases can easily be detected;
- a DNA code with an error-correcting function that can be decoded with high reliability even in the presence of errors such as a shift of the reading frame of the encoded base sequences or the substitution of multiple bases;
- a DNA code in which no stable secondary structure forms between encoded base sequences and in which, however the codewords are linked, no physical inhibition arises that prevents amplification by primers;
- a DNA code consisting of a set of encoded base sequences, corresponding to characters, that can easily be distinguished from natural DNA;
- a DNA code in which the base arrangement is restricted so that the appearance of specific partial sequences can easily be verified.
- DNA codes with these preferred characteristics can be obtained by the DNA code design method of the present invention. A specific example is a DNA code consisting of 112 codewords of length 12 in which, even when the codewords are linked so as to include their complementary sequences, there are mismatches in at least four positions and at most six consecutive bases match, so that mishybridization is unlikely, and in which the melting temperature is uniform under the nearest-neighbor base-pair approximation.
- the method of writing information in the present invention is not particularly limited as long as the above-mentioned DNA code of the present invention, comprising a set of base sequences corresponding to characters such as letters of the alphabet, is embedded in any non-coding region that contains no genetic information, such as an intron, a 5'-non-coding region, or a 3'-non-coding region.
- Examples of the DNA in which the DNA code is embedded include vector DNA such as plasmid vector DNA and virus vector DNA, and genomic DNA of animal and plant cells or microbial cells.
- DNA signature can be performed.
- the present invention also relates to a labeled vector or a labeled cell in which the DNA code of the present invention is embedded in any non-coding region that contains no genetic information, so that the creator can be identified.
- the sequences are unlikely to cause mishybridization, so the set of encoded base sequences of the present invention can be advantageously used for a DNA or RNA chip or as a DNA or RNA tag.
- the set of encoded base sequences of the present invention is also useful as a primer in PCR and the like.
- the members of the set of encoded base sequences of the present invention are unlikely to mishybridize with one another, and the set can easily be shown to contain no specific sequence portion such as a restriction enzyme recognition site.
- the set can therefore be used to advantage in a DNA computation system, in which symbol processing operations on structures such as logical formulas and graph structures are carried out with DNA and the sequence obtained at the end of the experiment is taken as the "calculation result".
- The nonlinear (12, 144, 4) code is a short error-correcting code having at least 128 codewords (Sloane, N. J. A. and MacWilliams, F. J.: The Theory of Error-Correcting Codes. Elsevier, 1977)
- the notation (12, 144, 4) above means a code of length 12 with 144 codewords and a minimum distance of 4 (single-error correction, double-error detection).
- the clique problem solver http://rtm.science.unitn.it/intertools/
- the code represented by (12, 144, 4) is shown in Table 7; among these 144 codewords, the 56 codewords marked with a dagger satisfy the subword constraint of length 7
- GC templates having a length of 12 and a minimum distance of 4; among them, Table 31 shows the 31 templates that remain when a template, its reverse sequence, and its 0/1-inverted sequence are regarded as the same.
- a template pair is chosen because 128 codewords cannot be obtained from one template due to subword constraints.
- Such a pair of templates, no matter how the templates are linked, contains four or more mismatches and does not share a partial sequence of length seven or more.
- Table 9 shows eight such template pairs.
- the DNA codewords generated from this template pair have an even distribution of GC bases when concatenated. Under this condition, the DNA codewords generated from these templates have close melting temperatures (New Generation Computing 20, 3, 263-277, 2002).
- the number of codewords that can be designed in this way is 112, which falls short of the 128 ASCII characters. However, some ASCII characters are not used. For example, the values &#14; to &#31; are not used in HTML; excluding these 18 values leaves 110 characters, which fits within 112 codewords. Thus, these 112 codewords are sufficient to represent ASCII characters in DNA. This compromise is preferable to relaxing the constraints to obtain 128 codewords.
- the present state of the information description method using DNA was examined, and the necessity and problems in configuring a DNA code were described.
- the DNA code designing method of the present invention can provide 112 DNA code words having a length of 12 and a comma-free index of 4.
- the DNA code of the present invention allows for any linkage between the codewords, including the complementary strand, and no such DNA code has been known to date.
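The constraints described above (a minimum of four mismatches between codewords, no shared partial sequence of length seven or more, and a common GC pattern set by a template) can be illustrated with a minimal Python sketch. It is not the patented procedure: the bit-to-base mapping in `BASE`, the example template, and the example codewords are assumptions chosen for illustration, and the check compares aligned codewords only, ignoring concatenations and complementary strands.

```python
from itertools import combinations

# Assumed mapping (not stated in the text): template bit 1 marks a G/C
# position, template bit 0 marks an A/T position; the codeword bit then
# selects one base of that pair.
BASE = {(0, 0): "A", (0, 1): "T", (1, 0): "C", (1, 1): "G"}


def combine(template: str, codeword: str) -> str:
    """Merge a binary GC template with a binary codeword into a DNA word."""
    if len(template) != len(codeword):
        raise ValueError("template and codeword must have the same length")
    return "".join(BASE[(int(t), int(c))] for t, c in zip(template, codeword))


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def shares_subword(a: str, b: str, m: int) -> bool:
    """Return True if a and b contain a common contiguous run of length m."""
    subwords = {a[i:i + m] for i in range(len(a) - m + 1)}
    return any(b[j:j + m] in subwords for j in range(len(b) - m + 1))


def gc_count(word: str) -> int:
    """Number of G or C bases in a DNA word."""
    return sum(base in "GC" for base in word)


def check_code(words, min_distance=4, m=7):
    """List every pair of words that violates a constraint, with the reason."""
    problems = []
    for a, b in combinations(words, 2):
        if hamming(a, b) < min_distance:
            problems.append((a, b, "distance"))
        if shares_subword(a, b, m):
            problems.append((a, b, "subword"))
    return problems


# Hypothetical template (six G/C positions) and two binary codewords.
template = "110100110100"
w1 = combine(template, "000000000000")
w2 = combine(template, "111100000000")
print(w1, w2, hamming(w1, w2), gc_count(w1), gc_count(w2))
print(check_code([w1, w2]))
```

In this example the two words differ in four positions and have the same GC count, yet they share an eight-base run, so the pair is reported as a subword violation. This shows why the minimum-distance condition alone is not enough and the subword constraint is checked separately.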
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/558,502 US20070042372A1 (en) | 2003-05-29 | 2004-05-27 | Method for designing dna codes used as information carrier |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2003-151738 | 2003-05-29 | ||
JP2003151738A JP2004355294A (ja) | 2003-05-29 | 2003-05-29 | 情報担体としてのdna符号の設計方法 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2004107243A1 true WO2004107243A1 (ja) | 2004-12-09 |
Family
ID=33487236
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2004/007271 WO2004107243A1 (ja) | 2003-05-29 | 2004-05-27 | 情報担体としてのdna符号の設計方法 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20070042372A1 (ja) |
JP (1) | JP2004355294A (ja) |
CN (1) | CN1791875A (ja) |
WO (1) | WO2004107243A1 (ja) |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7882464B1 (en) * | 2005-02-14 | 2011-02-01 | Cadence Design Systems, Inc. | Method and system for power distribution analysis |
JP4853898B2 (ja) * | 2005-08-30 | 2012-01-11 | 独立行政法人産業技術総合研究所 | Dna標準物質 |
CA2692575A1 (en) * | 2006-06-30 | 2008-01-10 | Jpl Llc | Embedded data dna sequence security system |
US8407554B2 (en) * | 2009-02-03 | 2013-03-26 | Complete Genomics, Inc. | Method and apparatus for quantification of DNA sequencing quality and construction of a characterizable model system using Reed-Solomon codes |
US8053744B2 (en) | 2009-04-13 | 2011-11-08 | Src, Inc. | Location analysis using nucleic acid-labeled tags |
US20110269119A1 (en) * | 2009-10-30 | 2011-11-03 | Synthetic Genomics, Inc. | Encoding text into nucleic acid sequences |
JP2011186632A (ja) * | 2010-03-05 | 2011-09-22 | Nec Software Kyushu Ltd | 塩基配列集合算出装置、塩基配列集合算出方法およびコンピュータプログラム |
US8703493B2 (en) | 2010-06-15 | 2014-04-22 | Src, Inc. | Location analysis using fire retardant-protected nucleic acid-labeled tags |
US8716027B2 (en) | 2010-08-03 | 2014-05-06 | Src, Inc. | Nucleic acid-labeled tags associated with odorant |
EP2603607B1 (en) | 2010-08-11 | 2016-04-06 | Celula, Inc. | Genotyping dna |
WO2012031031A2 (en) | 2010-08-31 | 2012-03-08 | Lawrence Ganeshalingam | Method and systems for processing polymeric sequence data and related information |
WO2012122547A2 (en) | 2011-03-09 | 2012-09-13 | Lawrence Ganeshalingam | Biological data networks and methods therefor |
AU2013277986B2 (en) | 2012-06-22 | 2016-12-01 | Annai Systems Inc. | System and method for secure, high-speed transfer of very large files |
CN104182236B (zh) * | 2014-08-28 | 2017-12-12 | 北京航空航天大学 | 一种基于遗传密码的软件通路编解码方法 |
WO2017101112A1 (zh) * | 2015-12-18 | 2017-06-22 | 云舟生物科技(广州)有限公司 | 载体设计方法及载体设计装置 |
WO2017190297A1 (zh) * | 2016-05-04 | 2017-11-09 | 深圳华大基因研究院 | 利用dna存储文本信息的方法、其解码方法及应用 |
US9929813B1 (en) * | 2017-03-06 | 2018-03-27 | Tyco Electronics Subsea Communications Llc | Optical communication system and method using a nonlinear reversible code for probablistic constellation shaping |
RU2659025C1 (ru) * | 2017-06-14 | 2018-06-26 | Общество с ограниченной ответственностью "ЛЭНДИГРАД" | Способы кодирования и декодирования информации |
WO2020243074A1 (en) * | 2019-05-31 | 2020-12-03 | Illumina, Inc. | Obtaining information from a biological sample in a flow cell |
RU2756641C2 (ru) * | 2019-10-29 | 2021-10-04 | Хиллол Дас | Способ сохранения информации с использованием ДНК и устройство хранения информации |
CN113539370B (zh) * | 2021-06-29 | 2024-02-20 | 中国科学院深圳先进技术研究院 | 编码方法、解码方法、装置、终端设备及可读存储介质 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10507357A (ja) * | 1994-10-13 | 1998-07-21 | リンクス セラピューティクス, インコーポレイテッド | 分子タグ化システム |
WO2003038091A1 (fr) * | 2001-10-29 | 2003-05-08 | Japan Science And Technology Agency | Sequences oligonucleotidiques exemptes d'erreurs d'hybridation et procedes de conception correspondants |
2003
- 2003-05-29 JP JP2003151738A patent/JP2004355294A/ja active Pending
2004
- 2004-05-27 US US10/558,502 patent/US20070042372A1/en not_active Abandoned
- 2004-05-27 CN CNA200480013917XA patent/CN1791875A/zh active Pending
- 2004-05-27 WO PCT/JP2004/007271 patent/WO2004107243A1/ja active Application Filing
Non-Patent Citations (3)
Title |
---|
ARITA M. ET AL: "DNA Sequence Design Using Templates", NEW GENERATION COMPUTING, vol. 20, no. 3, 2002, pages 263 - 277, XP002980865 * |
FAULHAMMER D. ET AL: "Molecular computation: RNA solutions to chess problems", PNAS USA, vol. 97, no. 4, 15 February 2000 (2000-02-15), pages 1385 - 1389, XP002979510 * |
FRUTOS A. ET AL: "Demonstration of a word design strategy for DNA computing on surfaces", NUCLEIC ACIDS RESEARCH, vol. 25, no. 23, 1 December 1997 (1997-12-01), pages 4748 - 4757, XP002980866 * |
Also Published As
Publication number | Publication date |
---|---|
US20070042372A1 (en) | 2007-02-22 |
JP2004355294A (ja) | 2004-12-16 |
CN1791875A (zh) | 2006-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1 Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
WWE | Wipo information: entry into national phase | Ref document number: 2004813917X Country of ref document: CN |
WWE | Wipo information: entry into national phase | Ref document number: 5538/DELNP/2005 Country of ref document: IN |
122 | Ep: pct application non-entry in european phase | ||
WWE | Wipo information: entry into national phase | Ref document number: 2007042372 Country of ref document: US Ref document number: 10558502 Country of ref document: US |
WWP | Wipo information: published in national office | Ref document number: 10558502 Country of ref document: US |