WO2021243605A1 - Method and device for generating DNA storage coding and decoding rules, and DNA storage coding and decoding method and device - Google Patents

Publication number
WO2021243605A1
WO2021243605A1 PCT/CN2020/094192 CN2020094192W WO2021243605A1 WO 2021243605 A1 WO2021243605 A1 WO 2021243605A1 CN 2020094192 W CN2020094192 W CN 2020094192W WO 2021243605 A1 WO2021243605 A1 WO 2021243605A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
sequence
degree
binary
nucleic acid
Prior art date
Application number
PCT/CN2020/094192
Other languages
English (en)
French (fr)
Inventor
张颢龄
平质
陈世宏
沈玥
Original Assignee
深圳华大生命科学研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大生命科学研究院
Priority to PCT/CN2020/094192 (WO2021243605A1)
Priority to CN202080101762.4A (CN115699189A)
Publication of WO2021243605A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the present invention relates to the field of information storage technology, in particular to a method and device for generating DNA storage coding and decoding rules and a DNA storage coding and decoding method and device.
  • DNA as a medium for information storage offers a long storage time (more than several thousand years, over a hundred times that of existing tape and optical disc media), a high storage density (up to about 10⁹ Gb/mm³, more than ten million times that of existing tape and optical disc media), and good storage security.
  • DNA storage usually includes the following steps: 1) Encoding: convert the binary 0/1 code of computer information into A/T/C/G DNA sequence information; 2) Synthesis: use DNA synthesis technology to synthesize the corresponding DNA molecules; 3) Sequencing: use sequencing technology to read the DNA sequence of the stored DNA molecules; 4) Decoding: by the process corresponding to the encoding in step 1, convert the DNA sequence obtained by sequencing into binary 0/1 code, which is further converted into computer information. To achieve effective DNA data storage, technology must be developed for each of these steps. Among them, the encoding and decoding technologies involved in steps 1 and 4 are the most critical for DNA data storage.
  • the most critical part of this technology is: 1) How to maximize the density of the 0/1 binary information of the DNA encoding.
  • the increase in DNA storage density is essential to save the cost of DNA synthesis for storing information in step 2.
  • continuous single-base repeats, high GC or high AT in the DNA sequence will make it difficult to read sequence information during the sequencing process.
  • the way the binary 0/1 data is converted to the DNA sequence of A/T/C/G directly determines the difficulty of reading the DNA sequence during the sequencing process, and thus determines the fidelity of the data during the reading process.
  • the purpose of the present invention is to provide a method and device for generating DNA storage coding and decoding rules and a DNA storage coding and decoding method and device, which can solve the problem of extreme GC or special motifs that cannot be completely avoided by the existing fixed rules.
  • the present invention provides a method for generating DNA storage codec rules, including:
  • the complete set sequence is obtained, where the complete set sequence is the set of all base sequences formed by combining every possible base at each base position within the length of the sliding window; restriction conditions are then used to filter out, from the complete set sequence, the set of qualified sequences that meet the restriction conditions, where the restriction conditions are set based on the sequence characteristics in the complete set sequence;
  • the above restriction conditions include at least one of GC base content, single base repetition, simple sequence repetition, palindrome sequence repetition, complementary palindrome sequence repetition, and elimination of special sequences.
  • the above restriction conditions include at least one of the following:
  • GC base content is 40%-60%
  • single base repetition is not more than 3 consecutive identical bases
  • simple sequence repetition is not less than 4 bases
  • palindrome sequence repetition is not less than 4 bases
  • complementary palindrome sequence repetition is not less than 4 bases
  • the elimination of special sequences means the elimination of sequences containing AGA, GAG, CTC, and TCT.
  • the above set limit of the number of out-degrees is the number of out-degrees required for coding efficiency.
  • the above coding efficiency is e, and when e ∈ (0, 2], the limit of the number of out-degrees of the k-th layer of each node is [formula missing in source]
  • the limit of the number of degrees out of each node is 2.
  • the deletion of the redundant out-degrees of each node in the directed graph includes: if the total number of out-degrees of a node exceeds the set limit of the number of out-degrees, output the bases of that node in reverse order and, following that order, delete in turn the out-degrees pointing to the corresponding bases.
  • step (4) the above method further includes:
  • the above method further includes:
  • after performing step (4'), return to step (4) and repeat steps (4)-(4') until the number of out-degrees of every node in the above-mentioned directed graph is not less than the set limit of the number of out-degrees and no node in the directed graph has an in-degree of 0.
  • the above method further includes between step (4) and step (4'):
  • the above method further includes:
  • after performing step (4'), return to step (3) and repeat steps (3)-(4)-(4")-(4') until the number of out-degrees of every node in the above-mentioned directed graph is not less than the set limit of the number of out-degrees and no node in the directed graph has an in-degree of 0.
  • the present invention provides an apparatus for generating DNA storage codec rules, including:
  • the sliding window setting unit is used to set the sliding window (n, k) of the DNA storage codec rules, where n represents the length of the sliding window and k represents the length of the base characters for each slide, n and k being positive integers with n ≥ k;
  • the qualified sequence screening unit is used to obtain the complete set sequence based on the length n of the sliding window, where the complete set sequence is the set of all base sequences formed by combining every possible base at each base position within the length of the sliding window, and to use restriction conditions to filter out, from the complete set sequence, the set of qualified sequences that meet the restriction conditions, where the restriction conditions are set based on the sequence characteristics in the complete set sequence;
  • the directed graph connection unit is used to connect the sequences in the above qualified sequence set through a directed graph, and each node in the above directed graph represents each sequence;
  • the out-degree discrepancy deletion unit is used to delete the nodes whose out-degree number is less than the set out-degree number limit in the above-mentioned directed graph;
  • the redundant out-degree deleting unit is used to delete the redundant out-degree of each node in the above-mentioned directed graph, and the above-mentioned redundant out-degree is the out-degree that exceeds the set limit of the number of out-degrees;
  • the algorithm chart obtaining unit is used to obtain an algorithm chart, and the algorithm chart includes DNA storage coding and decoding rules.
  • the present invention provides a DNA storage coding method, including:
  • the aforementioned method slices the aforementioned binary sequence to be encoded according to a length of 2k-1, where k represents the length of the base character of each sliding of the sliding window.
  • the above-mentioned method further comprises: synthesizing the above-mentioned DNA sequence, and then storing it in an isolated medium or living cell.
  • the present invention provides a DNA storage coding device, including:
  • the codec rule acquisition unit is used to obtain the DNA storage codec rule generated by the method of the first aspect, and set the initial node, and set the initial node as the current node;
  • the binary sequence slicing and conversion unit is used to obtain the binary sequence to be encoded and slice it to generate binary slices, and to convert the binary value corresponding to each slice into an out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a pair of mapping relationships;
  • the nucleic acid fragment output unit is used to input a binary slice, output the nucleic acid fragment mapped by the out-degree node or multi-layer out-degree nodes, and update the out-degree node to be the current node, looping over the binary slices in order, inputting binary slices and outputting nucleic acid fragments, until all the binary slices have been input;
  • the nucleic acid fragment connecting unit is used to connect the nucleic acid fragments in sequence according to the output order and output the complete DNA sequence.
  • the aforementioned binary sequence to be encoded is sliced according to the length of 2k-1, where k represents the length of the base character of each sliding of the sliding window.
  • the present invention provides a DNA storage decoding method, including:
  • the out-degree node or multi-layer out-degree node connected to the above-mentioned current node is found, where Each out-of-degree node describes a piece of nucleic acid information, and the above-mentioned nucleic acid slice and the corresponding binary value or binary slice form a pair of mapping relationships;
  • the aforementioned method slices the aforementioned DNA sequence to be decoded according to the length of k, where k represents the length of the base character of each sliding of the sliding window.
  • the above-mentioned DNA sequence to be decoded is generated by encoding by the method of the third aspect or the apparatus of the fourth aspect.
  • the present invention provides a DNA storage and decoding device, including:
  • the codec rule acquisition unit is used to obtain the DNA storage codec rule generated by the method of the first aspect, and set the initial node, and set the initial node as the current node;
  • the DNA slicing and conversion unit is used to obtain the DNA sequence to be decoded and slice it to generate nucleic acid slices, and, according to the DNA storage coding and decoding rules and the nucleic acid information of each slice, to find the out-degree node or multi-level out-degree nodes connected to the current node, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a pair of mapping relationships;
  • the binary value output unit is used to obtain, according to the mapping relationship, the binary value or binary slice between the current node and the out-degree node or multi-level out-degree nodes, and to update the out-degree node to be the current node, looping over the nucleic acid slices in order, inputting nucleic acid slices and outputting binary values or binary slices, until all the nucleic acid slices have been input;
  • the binary value connecting unit is used to connect the above binary values in sequence according to the output sequence and output a complete binary sequence.
  • the above-mentioned DNA sequence to be decoded is sliced according to the length of k, where k represents the base character length of each sliding of the sliding window.
  • the present invention provides a computer-readable storage medium including a program that can be executed by a processor to implement the method of the first aspect or the method of the third aspect or the method of the fifth aspect.
  • All current coding and decoding rules can be generated by the method of generating DNA storage coding and decoding rules of the present invention, so there is no need to set corresponding coding and decoding rules for each restriction condition and coding efficiency, which saves costs.
  • the analysis method based on graph theory can make further theoretical analysis of the generated implicit codec rules, such as the stability analysis of the algorithm.
  • the codec rules generated by the present invention have higher efficiency, because the implicit codec rules generated by the present invention are an end-to-end direct mapping relationship between binary and base, so the time complexity of encoding and decoding is O(n).
  • the method of the present invention is suitable for sequencing and decoding under any conditions, and can be particularly used for third-generation sequencing and decoding, while other existing algorithms do not involve third-generation sequencing and decoding.
  • Figure 1 is a schematic diagram of the encoding and decoding process of DNA storage in an embodiment of the present invention
  • Figure 2 is a flowchart of a method for generating DNA storage encoding and decoding rules in an embodiment of the present invention
  • Fig. 3 is a schematic diagram of a method for generating a DNA storage codec rule in an embodiment of the present invention
  • FIG. 4 is a block diagram of the structure of an apparatus for generating DNA storage codec rules in an embodiment of the present invention
  • Figure 5 is a flowchart of a DNA storage coding method in an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of the principle of encoding rules displayed in the form of adjacency matrix or graph in the DNA storage encoding and decoding method in the embodiment of the present invention
  • FIG. 7 is a schematic diagram of the coding and decoding steps of the DNA storage coding and decoding method in the embodiment of the present invention.
  • FIG. 8 is a structural block diagram of a DNA storage and encoding device in an embodiment of the present invention.
  • Figure 9 is a flowchart of a DNA storage and decoding method in an embodiment of the present invention.
  • FIG. 10 is a structural block diagram of a DNA storage and decoding device in an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of a part of information of a configuration file of encoding and decoding rules in an embodiment of the present invention.
  • Coding method refers to a mapping relationship between binary and base. Generally speaking, the traditional fixed-rule encoding method will perform multiple steps of optimization processing, and finally obtain the final mapping relationship.
  • the encoding method is implemented by encoding and decoding rules.
  • the coding and decoding rules of the present invention are generated by the method for generating DNA storage coding and decoding rules of the present invention.
  • the generator, also called the "method for generating DNA storage codec rules" in the present invention, derives, for different combinations and through graph theory, a potential mapping relationship between binary and base, that is, the codec rules of the present invention.
  • Algorithm stability means that the algorithm can stably meet the restriction conditions for any input electronic file and the output DNA sequence. Usually, in “arbitrary” situations, flood-like input is used to observe the stability of the algorithm under extreme conditions.
  • Time complexity: the time complexity of an algorithm is a function that describes the running time of the algorithm in terms of the length of the string representing its input. Time complexity is often expressed in big O notation, which excludes the low-order terms and the leading coefficient of this function. Used this way, the time complexity is said to be asymptotic, that is, it describes the behavior as the input size approaches infinity.
  • the present invention proposes an optimal codec generator based on the restriction conditions, that is, the method for generating DNA storage codec rules of the present invention.
  • the generator (or method) can solve the problem that the existing fixed rules cannot completely avoid extreme GC or special motifs.
  • the special motif here refers to a sequence that is difficult to analyze using fixed rules.
  • the coding method generated by the generator does not require a screen process, so there is no hidden danger of not being able to accept all inputs.
  • the encoding and decoding time complexity of the encoding method generated by the generator is O(n). Compared with most encoding and decoding methods, which require many optimization passes, the encoding and decoding method of the present invention is much faster and will be more efficient for large-scale DNA storage transcoding in the future.
  • a method for generating DNA storage codec rules that is, a DNA storage codec generator, is created based on graph theory and combinatorics, and its steps include :
  • S210 Set a sliding window (n, k) for DNA storage coding and decoding rules.
  • S220 Obtain the complete set sequence based on the length n of the sliding window, where the complete set sequence is the set of all base sequences formed by combining every possible base at each base position within the length of the sliding window; then use the restriction conditions to filter out, from the complete set sequence, the set of qualified sequences that meet the restriction conditions, where the restriction conditions are set based on the sequence features in the complete set sequence.
  • the restriction conditions may include at least one of GC base content, single base repetition, simple sequence repetition, palindrome sequence repetition, complementary palindrome sequence repetition, and elimination of special sequences.
  • the restriction conditions include at least one of the following: GC base content is 40%-60%, single base repetition does not exceed 3 consecutive identical bases, simple sequence repetition is not less than 4 bases, palindrome sequence repetition is not less than 4 bases, and complementary palindrome sequence repetition is not less than 4 bases.
  • the elimination of special sequences means the elimination of sequences containing AGA, GAG, CTC, and TCT. It should be noted that the "repetition" in simple sequence repetition, palindrome sequence repetition, and complementary palindrome sequence repetition refers to the "repetitive base length". For example: the base sequence ACGTACGTACGT, which is a repeat of "ACGT”, has a repeat of 4.
  • likewise, the base sequence ACGTAAACGTAAACGTAA, which is a repeat of "ACGTAA", has a repeat of 6. Because the A and G bases, and the C and T bases, have similar chemical structures, adjacent bases with similar structures are easily confused during base recognition when a third-generation sequencer such as a nanopore sequencer is used, which in turn leads to sequencing errors. Such sequences should therefore be avoided as far as possible.
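The "repetitive base length" in the two examples above can be illustrated with a minimal sketch. The function name `repeat_unit_length` is hypothetical, and the sketch assumes the sequence consists of whole copies of the repeated unit; a sequence with no shorter repeat unit simply returns its own length.

```python
def repeat_unit_length(seq: str) -> int:
    """Length of the shortest unit whose whole copies spell out seq."""
    n = len(seq)
    for p in range(1, n + 1):
        # p must divide the length, and repeating the first p bases must rebuild seq
        if n % p == 0 and seq[:p] * (n // p) == seq:
            return p
    return n  # only reached for an empty sequence


print(repeat_unit_length("ACGTACGTACGT"))        # repeat of "ACGT"
print(repeat_unit_length("ACGTAAACGTAAACGTAA"))  # repeat of "ACGTAA"
```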
  • the specific operation for screening out the set of qualified sequences is as follows: (1) since each sequence is composed of the bases A, C, G, and T, the method of the present invention generates all 4^n sequences in advance (i.e., the complete set sequence); (2) each sequence is tested against the restriction conditions, and if it meets them, it is saved to the qualified sequence set.
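The two-step screening described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function names are hypothetical, and only three of the listed restriction conditions are checked (GC content of 40%-60%, at most 3 consecutive identical bases, and exclusion of the special sequences AGA, GAG, CTC, and TCT).

```python
from itertools import product

BASES = "ACGT"
SPECIAL_MOTIFS = ("AGA", "GAG", "CTC", "TCT")


def gc_fraction(seq):
    """Fraction of G and C bases in seq."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq):
    """Length of the longest run of one repeated base."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def is_qualified(seq, gc_min=0.4, gc_max=0.6, max_run=3):
    """Test one sequence against the (partial) restriction conditions."""
    return (gc_min <= gc_fraction(seq) <= gc_max
            and longest_run(seq) <= max_run
            and not any(motif in seq for motif in SPECIAL_MOTIFS))


def qualified_sequences(n):
    """Step (1): enumerate all 4^n sequences; step (2): keep those that pass."""
    full_set = ("".join(p) for p in product(BASES, repeat=n))
    return [seq for seq in full_set if is_qualified(seq)]


print(len(qualified_sequences(4)), "of", 4 ** 4, "sequences qualify for n = 4")
```

Enumerating 4^n sequences grows quickly with n, so a sketch like this is practical only for short windows.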
  • a directed graph is a graph composed of a number of given nodes and lines connecting two nodes.
  • a directed graph means that the line between two nodes is directional.
  • each sequence is treated as a node in a directed graph. Assuming that the length of the sequence represented by a node is n, if the string formed by the 2nd to the nth characters of the sequence of node A is exactly the same as the string formed by the 1st to the (n-1)th characters of the sequence of another node B, the connection runs from A to B.
  • for example, the length of the sequence represented by a node is 9, node 1 is the sequence ATAGTGGTC, and node 2 is the sequence TAGTGGTCA. The sequence formed by the 2nd to 9th bases of node 1 is "TAGTGGTC", and the sequence formed by the 1st to 8th bases of node 2 is also "TAGTGGTC"; the two are exactly the same, so the connection runs from node 1 to node 2.
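The connection rule in this example can be sketched in a few lines. This assumes a slide of one base per step (k = 1), so two nodes are joined when they overlap in n-1 bases; `connect` is a hypothetical name.

```python
from collections import defaultdict


def connect(sequences):
    """Map each node to the nodes it points to.

    Node A points to node B when bases 2..n of A equal bases 1..(n-1) of B.
    """
    by_prefix = defaultdict(list)
    for seq in sequences:
        by_prefix[seq[:-1]].append(seq)  # index every node by its first n-1 bases
    return {seq: sorted(by_prefix.get(seq[1:], [])) for seq in sequences}


graph = connect(["ATAGTGGTC", "TAGTGGTCA"])
print(graph["ATAGTGGTC"])  # node 1 points to node 2
```

Indexing nodes by their (n-1)-base prefix avoids comparing every pair of sequences.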
  • the set limit of the number of out-degrees is the number of out-degrees required by the coding efficiency. As shown in Figure 3, based on the coding efficiency, the number of out-degrees of all nodes is checked. If the number of out-degrees of a node is less than the number of out-degrees required for coding efficiency, the node is deleted. Until all nodes meet the conditions, the loop is terminated.
  • the number of out-degrees of a node refers to the number of edges from a given node to other nodes in a directed graph.
  • the coding efficiency is e, and when e ∈ (0, 2], the limit of the number of out-degrees of the k-th layer of each node is [formula missing in source], where k represents the length of the base characters of each slide of the sliding window. Since the base sequence within the length of one sliding window constitutes a node, the k base characters of each slide correspond to the k-th layer of the node.
  • the limit of the number of degrees out of each node is 2.
  • a redundant out-degree is an out-degree in excess of the set limit of the number of out-degrees. For example, in one embodiment, if the set limit of the number of out-degrees for a certain node is 2 but the node has 4 out-degrees, the out-degrees in excess of the limit are redundant, that is, 2 out-degrees need to be deleted. The purpose of deleting redundant out-degrees is to maintain the stability of the algorithm.
  • the redundant out-degrees are the out-degrees by which the node's total number of out-degrees exceeds [formula missing in source], where e is the coding efficiency and e ∈ (0, 2].
  • the redundant out-degree of each node is deleted.
  • the specific deletion method is as follows: if the total number of out degrees of a node exceeds the set limit on the number of out degrees, the bases of the node are output in reverse order, and the out degrees pointing to the corresponding bases are sequentially deleted according to the order of the bases output in the reverse order.
  • the "out-degree pointing to the corresponding base” means that in the out-degree formed by the previous node pointing to the next node, the last base of the base sequence of the next node is the same as that of the previous node If the bases output in the reverse order are the same, the out-degree is the "out-degree pointing to the corresponding base".
  • for example, the previous node (L) has the sequence AACACGACT, and the next nodes connected to it are node (P1) with the sequence ACACGACTA, node (P2) with ACACGACTC, node (P3) with ACACGACTG, and node (P4) with ACACGACTT; node (L) is connected to nodes (P1), (P2), (P3), and (P4) respectively, forming 4 out-degrees.
  • if the set limit of the number of out-degrees is 2, then 2 redundant out-degrees need to be deleted. The bases of node (L) are output in reverse order: T, C, A, G, C, A, C, A, A. Following that order, the out-degrees pointing to the T base and the C base are deleted in turn, that is, the out-degree from node (L) to node (P4), whose last base is T, and the out-degree from node (L) to node (P2), whose last base is C.
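The deletion order in this example can be reproduced with a short sketch. The name `trim_out_edges` is hypothetical. The text does not say what happens if the reversed bases run out before the limit is reached, so this sketch simply leaves any remaining surplus in place, which is an assumption.

```python
def trim_out_edges(node, successors, limit):
    """Delete redundant out-edges of `node`, following its bases in reverse order."""
    kept = list(successors)
    for base in reversed(node):
        if len(kept) <= limit:
            break
        # Drop the out-edge whose target ends with this base, if one is still present.
        kept = [succ for succ in kept if succ[-1] != base]
    return kept


successors = ["ACACGACTA", "ACACGACTC", "ACACGACTG", "ACACGACTT"]
print(trim_out_edges("AACACGACT", successors, limit=2))
```

For node (L), the edges to (P4) and (P2) are removed, leaving the edges to (P1) and (P3), as in the worked example.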
  • after step S240, the method further includes:
  • Step S240' Delete the nodes whose in-degree number is 0 in the above-mentioned directed graph to reduce the scope of the directed graph.
  • the method further includes: after step S240' is executed, returning to step S240 and executing steps S240-S240' cyclically until the number of out-degrees of every node in the directed graph is not less than the set limit of the number of out-degrees and no node in the directed graph has an in-degree of 0.
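The S240-S240' cycle can be sketched as a loop that runs until nothing more is deleted. The graph representation (a dict mapping each node to the set of nodes it points to) and the names `drop` and `prune` are assumptions made for illustration.

```python
def drop(graph, doomed):
    """Remove the doomed nodes and every edge that points to them."""
    return {v: outs - doomed for v, outs in graph.items() if v not in doomed}


def prune(graph, limit):
    """Repeat S240 and S240' until no node is deleted in a full cycle."""
    graph = {v: set(outs) for v, outs in graph.items()}
    while True:
        # S240: delete nodes whose out-degree is below the limit.
        weak = {v for v, outs in graph.items() if len(outs) < limit}
        graph = drop(graph, weak)
        # S240': delete nodes with an in-degree of 0.
        targets = {w for outs in graph.values() for w in outs}
        orphans = set(graph) - targets
        graph = drop(graph, orphans)
        if not weak and not orphans:
            return graph


toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
       "d": {"a", "b"},   # in-degree 0
       "e": {"a"}}        # out-degree below the limit
print(sorted(prune(toy, limit=2)))
```

Each cycle either deletes at least one node or stops, so the loop always terminates.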
  • between step S240 and step S240', the method further includes:
  • Step S240" Delete the redundant out-degrees of each node in the directed graph, where a redundant out-degree is an out-degree in excess of the set limit of the number of out-degrees.
  • the redundant out-degree is defined as above and will not be repeated here.
  • the method further includes: after performing step (S240'), returning to step (S230) and repeating steps (S230)-(S240)-(S240")-(S240') until the number of out-degrees of every node in the directed graph is not less than the set limit of the number of out-degrees and no node in the directed graph has an in-degree of 0. It should be noted that when a new cycle starts after the end of the previous one, all nodes are reconnected according to the above-mentioned connection principle to form a new directed graph, which is then pruned according to the above-mentioned deletion principle. Removing redundant out-degrees before deleting nodes with an in-degree of 0 exposes more nodes with an in-degree of 0 in the current cycle, which reduces the total number of cycles and shortens the program running time.
  • S260 Obtain an algorithm chart, and the algorithm chart includes DNA storage coding and decoding rules.
  • the present invention also provides a device for generating DNA storage codec rules, as shown in FIG. 4, including: a sliding window setting unit 410, used to set the sliding window (n, k) of the DNA storage codec rules, where n represents the length of the sliding window and k represents the length of the base characters for each slide, n and k being positive integers with n ≥ k; a qualified sequence screening unit 420, used to obtain the complete set sequence based on the length n of the sliding window, where the complete set sequence is the set of all base sequences formed by combining every possible base at each base position within the length of the sliding window, and to use restriction conditions to screen out of the complete set sequence the set of qualified sequences that meet the restriction conditions, where the restriction conditions are set based on the sequence features in the complete set sequence; a directed graph connecting unit 430, used to connect the sequences in the set of qualified sequences through a directed graph, each node in the directed graph representing one sequence; the out-
  • the program can be stored in a computer-readable storage medium.
  • the storage medium can include: read-only memory, random access memory, magnetic disk, optical disk, hard disk, etc.
  • the computer executes the program to realize the above-mentioned functions.
  • the program is stored in the memory of the device, and when the program in the memory is executed by the processor, all or part of the above-mentioned functions can be realized.
  • the program can also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk, or a mobile hard disk, and saved by downloading or copying.
  • an embodiment of the present invention provides a computer-readable storage medium, including a program, which can be executed by a processor to implement the method for generating a DNA storage codec rule of the present invention.
  • an embodiment of the present invention also provides a DNA storage coding method, that is, a method for using the generated DNA storage coding and decoding rules in the coding stage, including the following steps:
  • S510 Obtain the DNA storage coding and decoding rules, and set the initial node, and set the initial node as the current node. It can be understood that any node can be set as the initial node, and the ID of the initial node is usually set to 0.
  • the DNA storage coding and decoding rules are generated by the method for generating DNA storage coding and decoding rules of the present invention under a given sliding window (n, k) and restriction conditions.
  • the parameters and restriction conditions of the sliding window can be set according to specific needs.
  • S520 Obtain the binary sequence to be encoded and slice it to generate a binary slice, and convert the binary value corresponding to the slice into an out-degree node or a multi-layer out-degree node connected to the current node.
  • each out-degree node describes a nucleic acid fragment
  • the binary slice and the corresponding nucleic acid fragment form a pair of mapping relationships.
  • the aforementioned binary sequence to be encoded is sliced according to the length of 2k-1, where k represents the length of the base character of each sliding of the sliding window.
  • an adjacency matrix is used to demonstrate the principle of the encoding method, as shown in FIG. 6.
  • white letters on a black background indicate the designated nucleotides under the current ID.
  • the color of the node from light to dark indicates the number of layers corresponding to the node, the number in the node indicates the ID of the node, and the character closest to the node indicates the designated nucleotide in the node.
  • Each arrow represents the bit obtained from that node to the next node.
  • an adjacency matrix diagram (DNA Spider-Web) is used to show the specific encoding and decoding process, as shown in FIG. 7.
  • the figure shows the process in which the node in the figure jumps to the next node after reading a bit in the encoding process, and the process of obtaining the corresponding nucleotide in this process.
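The node-jumping process described above can be sketched on a toy rule table. Everything here is an illustrative assumption rather than the patent's actual rules: nodes are 2-base sequences with no repeated base, each node keeps two out-edges, one bit is consumed per jump, and bit 0 or 1 selects the first or second out-edge in sorted order.

```python
from itertools import product


def build_toy_rules():
    """Toy codec rules: node -> [target for bit 0, target for bit 1]."""
    nodes = sorted(a + b for a, b in product("ACGT", repeat=2) if a != b)
    # A node XY may jump to YZ; keep only the first two targets in sorted order.
    return {v: [w for w in nodes if w[0] == v[1]][:2] for v in nodes}


def encode(bits, rules, start):
    """Read one bit per step, jump to the selected node, emit its last base."""
    current, out = start, []
    for bit in bits:
        current = rules[current][int(bit)]
        out.append(current[-1])
    return "".join(out)


rules = build_toy_rules()
print(encode("0110", rules, start="AC"))
```

Each bit costs one dictionary lookup, so the running time of this sketch is linear in the number of bits, in line with the O(n) time complexity stated earlier.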
  • the DNA storage coding method of the present invention after outputting the complete DNA sequence, synthesizes the above-mentioned DNA sequence, and then stores it in an isolated medium or living cell.
  • an embodiment of the present invention also provides a DNA storage and encoding device, as shown in FIG. 8, including: a codec rule acquisition unit, used to obtain the DNA storage coding and decoding rules generated by the method for generating DNA storage coding and decoding rules, set the initial node, and set the initial node as the current node;
  • the binary sequence slicing and conversion unit 820 is used to obtain the binary sequence to be encoded and slice it to generate binary slices, and to convert the binary value corresponding to each slice into an out-degree node or multiple out-degree nodes connected to the current node, where each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a pair of mapping relationships;
  • the nucleic acid fragment output unit 830 is configured to input the binary slice, output the nucleic acid fragment mapped by the out-degree node or the multi-layer out-degree node, and update the out-degree node to the current node in accordance with the aforementioned DNA
  • an embodiment of the present invention provides a computer-readable storage medium, including a program, which can be executed by a processor to implement the DNA storage encoding method of the present invention.
  • An embodiment of the present invention also provides a DNA storage decoding method, that is, a method of using the generated DNA storage encoding and decoding rules in the decoding stage, including the following steps:
  • S910: Obtain the DNA storage encoding and decoding rules, set an initial node, and take the initial node as the current node. It will be understood that any node may be set as the initial node; usually the ID of the initial node is set to 0.
  • The DNA storage encoding and decoding rules are generated by the method for generating DNA storage encoding and decoding rules of the present invention under a given sliding window (n, k) and given constraints.
  • The sliding window parameters and the constraints may be set as needed.
  • S920: Obtain the DNA sequence to be decoded and slice it to generate nucleic acid slices.
  • According to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to each slice, the out-degree node or multi-layer out-degree nodes connected to the current node are found, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair.
  • The DNA sequence to be decoded is sliced into segments of length k, where k is the number of base characters the sliding window advances per slide.
  • S930: From the current node and the out-degree node or multi-layer out-degree nodes, obtain the binary value or binary slice between the nodes according to the mapping relationship, and update the out-degree node to be the current node; nucleic acid slices are input and binary values or binary slices are output repeatedly, in the order of the nucleic acid slices, until all nucleic acid slices have been input.
  • An adjacency matrix diagram (DNA Spider-Web) is used to show the specific encoding and decoding process, as shown in FIG. 7.
  • The figure shows how, during decoding, the current node jumps to the next node after a nucleotide is read, and how the corresponding bit is obtained in the process.
  • An embodiment of the present invention also provides a DNA storage decoding device, as shown in FIG. 10.
  • The codec rule acquisition unit 1010 is used to obtain the DNA storage encoding and decoding rules generated by the method for generating DNA storage encoding and decoding rules of the present invention, to set an initial node, and to take the initial node as the current node;
  • The DNA slicing and conversion unit 1020 is used to obtain the DNA sequence to be decoded and slice it to generate nucleic acid slices, and, according to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to the slices, to find the out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair;
  • The binary value output unit 1030 is used to obtain the binary values or binary slices between nodes from the current node and the out-degree node or multi-layer out-degree nodes according to the mapping relationship, and to update the out-degree node to be the current node.
  • an embodiment of the present invention provides a computer-readable storage medium, including a program, which can be executed by a processor to implement the DNA storage decoding method of the present invention.

Landscapes

  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

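A method and device for generating DNA storage encoding and decoding rules, and a DNA storage encoding and decoding method and device. The method for generating DNA storage encoding and decoding rules comprises: setting a sliding window for the DNA storage encoding and decoding rules (S210); screening out, from the full set of sequences, the set of qualified sequences that satisfy the constraints (S220); connecting the sequences in the qualified sequence set as a directed graph (S230); deleting nodes in the directed graph whose out-degree is less than a set out-degree limit (S240); deleting the excess out-edges of each node in the directed graph (S250); and obtaining an algorithm chart that includes the DNA storage encoding and decoding rules (S260). The method can solve the problem that existing fixed rules cannot completely avoid extreme GC content or special motifs.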

Description

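Method and device for generating DNA storage encoding and decoding rules, and DNA storage encoding and decoding method and device
Technical Field
The present invention relates to the field of information storage technology, and in particular to a method and device for generating DNA storage encoding and decoding rules and a DNA storage encoding and decoding method and device.
Background
With the development of modern science and technology, especially the Internet, the volume of global data is growing exponentially. The ever-increasing amount of data places ever higher demands on storage technology. Traditional storage technologies, such as magnetic tape and optical disc storage, are increasingly unable to meet current data needs because of their limited storage density and retention time. DNA storage technology, developed in recent years, offers a new way to solve these problems. Compared with traditional storage media, DNA as an information storage medium has the advantages of long storage time (several thousand years or more, over a hundred times that of existing tape and optical disc media), high storage density (about 10^9 Gb/mm^3, over ten million times that of existing tape and optical disc media), and good storage security.
As shown in FIG. 1, DNA storage generally includes the following steps: 1) Encoding: converting the binary 0/1 code of computer information into A/T/C/G DNA sequence information; 2) Synthesis: synthesizing the corresponding DNA sequence using DNA synthesis technology, and preserving the resulting DNA molecules in an in vitro medium or in living cells; 3) Sequencing: reading the DNA sequence of the stored DNA molecules using sequencing technology; 4) Decoding: converting the DNA sequence obtained by sequencing back into binary 0/1 code in a manner corresponding to the encoding process of step 1, and further into computer information. To achieve effective DNA data storage, technologies for each of these steps need to be developed. Among them, the encoding and decoding technologies involved in steps 1 and 4 are the most critical for DNA data storage. The key issues are: 1) how to maximize the density with which DNA encodes 0/1 binary information, since increasing the DNA storage density is essential for reducing the cost of synthesizing the information-bearing DNA in step 2; and 2) when converting 0/1 binary information into A/T/C/G sequences, how to avoid, as far as possible, single-base repeats, high GC content, and high AT content in the sequences. Generally speaking, consecutive single-base repeats, high GC content, or high AT content in a DNA sequence all make it difficult to read the sequence information during sequencing. The way binary 0/1 data is converted to and from A/T/C/G DNA sequences directly determines how hard the DNA sequence is to read during sequencing, and thus the fidelity of the data during reading.
At present, third-generation sequencing, i.e., single-molecule sequencing, is increasingly favored by the sequencing industry, with PacBio's SMRT and ONT's Nanopore being the mature third-generation technologies. Although third-generation sequencing is faster than second-generation high-throughput sequencing, its high error rate is one of the major bottlenecks limiting its wide application. Artificially designing sequences, for example by controlling GC content and removing special motifs, can greatly improve the accuracy of third-generation sequencing. For DNA storage to achieve immediate and fast reading, third-generation sequencing is bound to be used. Therefore, an encoding technique is needed that can produce DNA sequences satisfying arbitrary constraints.
Existing DNA storage encoding and decoding algorithms include classic methods proposed by Church, Goldman, Grass, Erlich, and others. Existing algorithms focus on increasing coding density, on avoiding extreme GC content as far as possible, or on avoiding consecutive single-base repeats as far as possible. However, because their rules are fixed, these algorithms cannot completely avoid extreme GC content or special motifs. When third-generation sequencing is used for decoding, a large amount of computation time must be spent on error correction.
Summary of the Invention
The object of the present invention is to provide a method and device for generating DNA storage encoding and decoding rules, and a DNA storage encoding and decoding method and device, which can solve the problem that existing fixed rules cannot completely avoid extreme GC content or special motifs.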
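According to a first aspect of the present invention, there is provided a method for generating DNA storage encoding and decoding rules, comprising:
(1) setting a sliding window (n, k) for the DNA storage encoding and decoding rules, where n is the length of the sliding window and k is the number of base characters advanced per slide, n and k being positive integers with n ≥ k;
(2) obtaining a full set of sequences based on the sliding window length n, where the full set of sequences is the set of all base sequences formed by all combinations of the possible bases at each base position within the sliding window length, and using constraints to screen out, from the full set of sequences, the set of qualified sequences that satisfy the constraints, where the constraints are set based on sequence features of the sequences in the full set;
(3) connecting the sequences in the qualified sequence set as a directed graph, each node in the directed graph representing one sequence;
(4) deleting nodes in the directed graph whose out-degree is less than a set out-degree limit;
(5) deleting the excess out-edges of each node in the directed graph, where the excess out-edges are those exceeding the set out-degree limit;
(6) obtaining an algorithm chart, the algorithm chart including the DNA storage encoding and decoding rules.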
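In a preferred embodiment, the constraints include at least one of GC content, single-base repeats, simple sequence repeats, palindromic sequence repeats, complementary palindromic sequence repeats, and elimination of special sequences.
In a preferred embodiment, the constraints include at least one of the following:
a GC content of 40%-60%; single-base repeats of no more than 3 consecutive identical bases; simple sequence repeats of no fewer than 4 bases; palindromic sequence repeats of no fewer than 4 bases; complementary palindromic sequence repeats of no fewer than 4 bases; and elimination of special sequences, namely sequences containing AGA, GAG, CTC, or TCT.
In a preferred embodiment, the set out-degree limit is the out-degree required by the coding efficiency.
In a preferred embodiment, when the coding efficiency is e, with e ∈ (0, 2], the out-degree limit of layer [Figure PCTCN2020094192-appb-000001] of each node is [Figure PCTCN2020094192-appb-000002].
In a preferred embodiment, when the coding efficiency is 1, the out-degree limit of each node is 2.
In a preferred embodiment, deleting the excess out-edges of each node in the directed graph comprises: if the total out-degree of the node exceeds the set out-degree limit, outputting the bases of the node in reverse order, and deleting, one by one in that reverse order, the out-edges pointing to the corresponding bases.
In a preferred embodiment, after step (4), the method further comprises:
(4') deleting nodes in the directed graph whose in-degree is 0.
In a preferred embodiment, the method further comprises:
after performing step (4'), returning to step (4) and repeating steps (4)-(4') until the out-degree of every node in the directed graph is greater than the set out-degree limit and no node with an in-degree of 0 remains in the directed graph.
In a preferred embodiment, between step (4) and step (4'), the method further comprises:
(4'') deleting the excess out-edges of each node in the directed graph, where the excess out-edges are those exceeding the set out-degree limit.
In a preferred embodiment, the method further comprises:
after performing step (4'), returning to step (3) and repeating steps (3)-(4)-(4'')-(4') until the out-degree of every node in the directed graph is greater than the set out-degree limit and no node with an in-degree of 0 remains in the directed graph.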
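According to a second aspect of the present invention, there is provided a device for generating DNA storage encoding and decoding rules, comprising:
a sliding window setting unit, configured to set a sliding window (n, k) for the DNA storage encoding and decoding rules, where n is the length of the sliding window and k is the number of base characters advanced per slide, n and k being positive integers with n ≥ k;
a qualified sequence screening unit, configured to obtain a full set of sequences based on the sliding window length n, where the full set of sequences is the set of all base sequences formed by all combinations of the possible bases at each base position within the sliding window length, and to use constraints to screen out, from the full set of sequences, the set of qualified sequences that satisfy the constraints, where the constraints are set based on sequence features of the sequences in the full set;
a directed graph connection unit, configured to connect the sequences in the qualified sequence set as a directed graph, each node in the directed graph representing one sequence;
an insufficient out-degree deletion unit, configured to delete nodes in the directed graph whose out-degree is less than a set out-degree limit;
an excess out-degree deletion unit, configured to delete the excess out-edges of each node in the directed graph, the excess out-edges being those exceeding the set out-degree limit;
an algorithm chart obtaining unit, configured to obtain an algorithm chart, the algorithm chart including the DNA storage encoding and decoding rules.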
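According to a third aspect of the present invention, there is provided a DNA storage encoding method, comprising:
obtaining the DNA storage encoding and decoding rules generated by the method of the first aspect, setting an initial node, and taking the initial node as the current node;
obtaining the binary sequence to be encoded and slicing it to generate binary slices, and converting the binary value corresponding to each slice into an out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a mapping pair;
according to the DNA storage encoding and decoding rules, inputting the binary slice, outputting the nucleic acid fragment mapped by the out-degree node or multi-layer out-degree nodes, and updating the out-degree node to be the current node, with binary slices input and nucleic acid fragments output repeatedly, in the order of the binary slices, until all binary slices have been input;
concatenating the nucleic acid fragments in output order and outputting the complete DNA sequence.
In a preferred embodiment, the method slices the binary sequence to be encoded into segments of length 2k-1, where k is the number of base characters the sliding window advances per slide.
In a preferred embodiment, the method further comprises: synthesizing the DNA sequence and then preserving it in an in vitro medium or in living cells.
According to a fourth aspect of the present invention, there is provided a DNA storage encoding device, comprising:
a codec rule acquisition unit, configured to obtain the DNA storage encoding and decoding rules generated by the method of the first aspect, to set an initial node, and to take the initial node as the current node;
a binary sequence slicing and conversion unit, configured to obtain the binary sequence to be encoded and slice it to generate binary slices, and to convert the binary value corresponding to each slice into an out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a mapping pair;
a nucleic acid fragment output unit, configured, according to the DNA storage encoding and decoding rules, to input the binary slice, output the nucleic acid fragment mapped by the out-degree node or multi-layer out-degree nodes, and update the out-degree node to be the current node, with binary slices input and nucleic acid fragments output repeatedly, in the order of the binary slices, until all binary slices have been input;
a nucleic acid fragment connection unit, configured to concatenate the nucleic acid fragments in output order and output the complete DNA sequence.
In a preferred embodiment, the binary sequence to be encoded is sliced into segments of length 2k-1, where k is the number of base characters the sliding window advances per slide.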
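According to a fifth aspect of the present invention, there is provided a DNA storage decoding method, comprising:
obtaining the DNA storage encoding and decoding rules generated by the method of the first aspect, setting an initial node, and taking the initial node as the current node;
obtaining the DNA sequence to be decoded and slicing it to generate nucleic acid slices, and, according to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to the slices, finding the out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair;
from the current node and the out-degree node or multi-layer out-degree nodes, obtaining the binary value or binary slice between the nodes according to the mapping relationship, and updating the out-degree node to be the current node, with nucleic acid slices input and binary values or binary slices output repeatedly, in the order of the nucleic acid slices, until all nucleic acid slices have been input;
concatenating the binary values or binary slices in output order and outputting the complete binary sequence.
In a preferred embodiment, the method slices the DNA sequence to be decoded into segments of length k, where k is the number of base characters the sliding window advances per slide.
In a preferred embodiment, the DNA sequence to be decoded is generated by encoding with the method of the third aspect or the device of the fourth aspect.
According to a sixth aspect of the present invention, there is provided a DNA storage decoding device, comprising:
a codec rule acquisition unit, configured to obtain the DNA storage encoding and decoding rules generated by the method of the first aspect, to set an initial node, and to take the initial node as the current node;
a DNA slicing and conversion unit, configured to obtain the DNA sequence to be decoded and slice it to generate nucleic acid slices, and, according to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to the slices, to find the current node and the out-degree node or multi-layer out-degree nodes connected to it, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair;
a binary value output unit, configured to obtain the binary value or binary slice between the nodes from the current node and the out-degree node or multi-layer out-degree nodes according to the mapping relationship, and to update the out-degree node to be the current node, with nucleic acid slices input and binary values or binary slices output repeatedly, in the order of the nucleic acid slices, until all nucleic acid slices have been input;
a binary value connection unit, configured to concatenate the binary values in output order and output the complete binary sequence.
In a preferred embodiment, the DNA sequence to be decoded is sliced into segments of length k, where k is the number of base characters the sliding window advances per slide.
According to a seventh aspect of the present invention, there is provided a computer-readable storage medium including a program that can be executed by a processor to implement the method of the first aspect, the method of the third aspect, or the method of the fifth aspect.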
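All current encoding and decoding rules can be generated by the method for generating DNA storage encoding and decoding rules of the present invention, so there is no need to design a dedicated set of rules for each combination of constraints and coding efficiency, which saves cost.
In addition, graph-theoretic analysis can be used for further theoretical analysis of the generated implicit encoding and decoding rules, for example stability analysis of the algorithm. Compared with existing encoding and decoding rules, the rules generated by the present invention are more efficient, because they are a direct end-to-end mapping between binary data and bases, and the time complexity of both encoding and decoding is O(n). The method of the present invention is applicable to sequencing-based decoding under any conditions, and in particular can be used for decoding with third-generation sequencing, which other existing algorithms do not address.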
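Brief Description of the Drawings
FIG. 1 is a schematic diagram of the encoding and decoding process of DNA storage in an embodiment of the present invention;
FIG. 2 is a flowchart of the method for generating DNA storage encoding and decoding rules in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the principle of the method for generating DNA storage encoding and decoding rules in an embodiment of the present invention;
FIG. 4 is a structural block diagram of the device for generating DNA storage encoding and decoding rules in an embodiment of the present invention;
FIG. 5 is a flowchart of the DNA storage encoding method in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the principle of the encoding rules, shown as an adjacency matrix or as a graph, in the DNA storage encoding and decoding method in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the principle of the encoding and decoding steps of the DNA storage encoding and decoding method in an embodiment of the present invention;
FIG. 8 is a structural block diagram of the DNA storage encoding device in an embodiment of the present invention;
FIG. 9 is a flowchart of the DNA storage decoding method in an embodiment of the present invention;
FIG. 10 is a structural block diagram of the DNA storage decoding device in an embodiment of the present invention;
FIG. 11 is a schematic diagram of part of the information in a configuration file of the encoding and decoding rules in an embodiment of the present invention.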
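Detailed Description
The present invention is described in further detail below through specific embodiments in conjunction with the drawings. In the following embodiments, many details are described so that the present invention can be better understood. However, those skilled in the art will readily recognize that some of these features may be omitted in different situations, or may be replaced by other materials or methods.
In addition, the features, operations, or characteristics described in the specification may be combined in any appropriate manner to form various embodiments. Likewise, the steps or actions in the method descriptions may be reordered or adjusted in ways apparent to those skilled in the art. Therefore, the various orders in the specification and drawings are only for clearly describing a particular embodiment and do not imply a required order, unless it is otherwise stated that a certain order must be followed.
Terms used in the present invention:
Encoding method: a mapping between binary data and bases. Generally, a traditional fixed-rule encoding method goes through several optimization steps before arriving at the final mapping. In the present invention, the encoding method is implemented by encoding and decoding rules, which are generated by the method for generating DNA storage encoding and decoding rules of the present invention.
Generator: also referred to in the present invention as the "method for generating DNA storage encoding and decoding rules". Based on different combinations, it uses graph theory to obtain a latent mapping between binary data and bases, i.e., the encoding and decoding rules of the present invention.
Algorithm stability: the property that, for any input electronic file, the DNA sequences output by the algorithm stably satisfy the constraints. To test the "any input" case, flood-attack-like inputs are typically used to observe the stability of the algorithm under extreme conditions.
End-to-end: from raw data input to result output, the mapping from the input end to the output end is handled as one self-contained whole.
Time complexity: the time complexity of an algorithm is a function that qualitatively describes the running time of the algorithm. It is a function of the length of the string representing the algorithm's input. Time complexity is usually expressed in big-O notation, which omits the lower-order terms and the leading coefficient of this function. Expressed this way, the time complexity is said to be asymptotic, i.e., it describes the behavior as the input size approaches infinity.
To address the sequence constraints of different sequencing or synthesis instruments, the present invention proposes a constraint-based optimal codec generator, i.e., the method for generating DNA storage encoding and decoding rules of the present invention. This generator (or method) can solve the problem that existing fixed rules cannot completely avoid extreme GC content or special motifs. A special motif here means a sequence that is difficult to analyze using fixed rules.
In addition, the encoding method produced by the generator needs no screening process, so there is no risk of being unable to accept every input. Furthermore, the time complexity of encoding and decoding with the generated method is O(n). Compared with the great majority of encoding and decoding methods, which require many optimization passes, the method of the present invention is much faster and will be more efficient for future large-scale DNA storage transcoding.
The technical components of the present invention are described in detail below. It should be understood that these descriptions are exemplary, and those skilled in the art can make many variations on the basis of the technical content of the present invention.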
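As shown in FIG. 2 and FIG. 3, in one embodiment of the present invention, a method for generating DNA storage encoding and decoding rules, i.e., a DNA storage codec generator, is built on graph theory and combinatorics and comprises the following steps:
S210: Set the sliding window (n, k) for the DNA storage encoding and decoding rules.
As shown in FIG. 3, the sliding window is a window model of fixed length n. After each slide of a fixed number k of base characters (usually k = 1), all character data within the window at its current position are observed, where n and k are positive integers and n ≥ k.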
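A minimal Python sketch of the sliding window (n, k) described above. The function name `sliding_windows` is an illustrative assumption, not something taken from the patent; it simply yields the n bases visible at each window position.

```python
from typing import Iterator


def sliding_windows(seq: str, n: int, k: int = 1) -> Iterator[str]:
    """Yield every length-n window of seq, advancing k bases per slide."""
    if not (n >= k >= 1):
        raise ValueError("n and k must be positive integers with n >= k")
    for start in range(0, len(seq) - n + 1, k):
        yield seq[start:start + n]


# With n = 9 and k = 1, a 10-base sequence exposes two overlapping windows.
windows = list(sliding_windows("ATAGTGGTCA", n=9, k=1))
```

Consecutive windows overlap in n - k bases, which is the property the directed graph of step S230 relies on.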
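S220: Obtain the full set of sequences based on the sliding window length n, where the full set of sequences is the set of all base sequences formed by all combinations of the possible bases at each base position within the sliding window length, and use constraints to screen out, from the full set of sequences, the set of qualified sequences that satisfy the constraints, where the constraints are set based on sequence features of the sequences in the full set.
As shown in FIG. 3, the constraints may include at least one of GC content, single-base repeats, simple sequence repeats, palindromic sequence repeats, complementary palindromic sequence repeats, and elimination of special sequences.
In one embodiment, the constraints include at least one of the following: a GC content of 40%-60%; single-base repeats of no more than 3 consecutive identical bases; simple sequence repeats of no fewer than 4 bases; palindromic sequence repeats of no fewer than 4 bases; complementary palindromic sequence repeats of no fewer than 4 bases; and elimination of special sequences, namely sequences containing AGA, GAG, CTC, or TCT. It should be noted that "repeat" in simple sequence repeat, palindromic sequence repeat, and complementary palindromic sequence repeat refers to the length in bases of the repeated unit. For example, the base sequence ACGTACGTACGT is a repeat of "ACGT", so the repeat is 4; the base sequence ACGTAAACGTAAACGTAA is a repeat of "ACGTAA", so the repeat is 6. Because bases A and G, and bases C and T, have similar chemical structures, when a third-generation sequencer such as nanopore is used, adjacent bases of similar chemical structure are easily confused during base calling, leading to errors in the sequenced reads. Such sequences therefore need to be avoided as far as possible.
As shown in FIG. 3, the set of qualified sequences is screened out as follows: (1) since sequences are composed of the bases A, C, G, and T, the method first generates 4^n sequences (the full set of sequences); (2) each sequence is checked against the constraints, and if it satisfies them, it is saved to the qualified sequence set.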
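The screening of S220 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: only three of the listed constraints are checked (GC content, single-base repeat length, and the four special sequences), because the exact tests for simple, palindromic, and complementary palindromic repeats are not spelled out here. All function names and default thresholds are hypothetical.

```python
from itertools import product

# Special sequences to eliminate, as listed in the embodiment.
FORBIDDEN_MOTIFS = ("AGA", "GAG", "CTC", "TCT")


def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq: str) -> int:
    """Length of the longest run of one repeated base."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def is_qualified(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                 max_run: int = 2) -> bool:
    """Check the subset of constraints covered by this sketch."""
    if not gc_min <= gc_fraction(seq) <= gc_max:
        return False
    if longest_run(seq) > max_run:
        return False
    return not any(motif in seq for motif in FORBIDDEN_MOTIFS)


def qualified_sequences(n: int) -> list[str]:
    """Enumerate all 4**n sequences and keep those that pass the checks."""
    full_set = ("".join(p) for p in product("ACGT", repeat=n))
    return [seq for seq in full_set if is_qualified(seq)]
```

For n = 9 the full set has 4^9 = 262144 members, which is small enough to enumerate by brute force. Because this sketch omits three of the constraints, its output for n = 9 should not be expected to match the counts reported in the worked example.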
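S230: Connect the sequences in the qualified sequence set as a directed graph, each node in the directed graph representing one sequence.
As shown in FIG. 3, a graph is a figure made up of a number of given nodes and the lines connecting pairs of nodes; in a directed graph, the lines between nodes have a direction. Here each sequence is treated as a node of the directed graph. Suppose the sequences represented by the nodes have length n. If the string formed by the 2nd through n-th characters of the sequence of a node A is identical to the string formed by the 1st through (n-1)-th characters of the sequence of another node B, then there is a connection from A to B. For example, with sequences of length 9, let node 1 be the sequence ATAGTGGTC and node 2 be the sequence TAGTGGTCA. The 2nd through 9th bases of node 1 form "TAGTGGTC", and the 1st through 8th bases of node 2 form "TAGTGGTC". The two are identical, so node 1 is connected to node 2.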
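The connection rule of S230 can be sketched as a small Python function. The name `build_graph` and the dictionary-of-sets representation are assumptions made for illustration. The text describes the overlap rule for a shift of one base; the `k` parameter generalizes it to a shift of k bases, which is also an assumption.

```python
from itertools import product


def build_graph(sequences: list[str], k: int = 1) -> dict[str, set[str]]:
    """Map each sequence to the set of sequences it points to.

    There is an edge A -> B when dropping the first k bases of A and
    appending k new bases gives B (for k = 1: A[1:] == B[:-1]).
    """
    nodes = set(sequences)
    extensions = ["".join(p) for p in product("ACGT", repeat=k)]
    return {
        seq: {seq[k:] + ext for ext in extensions if seq[k:] + ext in nodes}
        for seq in nodes
    }


# The two-node example from the text: node 1 connects to node 2.
graph = build_graph(["ATAGTGGTC", "TAGTGGTCA"])
```

Generating the at most 4^k candidate successors of each node and testing membership keeps construction linear in the number of nodes, instead of comparing every pair of sequences.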
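S240: Delete nodes in the directed graph whose out-degree is less than the set out-degree limit.
In one embodiment of the present invention, the set out-degree limit is the out-degree required by the coding efficiency. As shown in FIG. 3, the out-degree of every node is checked against the coding efficiency. If the out-degree of a node is less than that required by the coding efficiency, the node is deleted. The loop terminates once all nodes satisfy the condition. The out-degree of a node is the number of edges in the directed graph that point from the given node to other nodes.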
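A hedged Python sketch of the pruning loop of S240 follows. The function name and graph representation (a dictionary mapping each node to its set of successors) are assumptions. Deleting a node also removes the edges pointing to it, which can push other nodes below the limit, so the check is repeated until nothing changes.

```python
def prune_low_out_degree(graph: dict[str, set[str]],
                         limit: int) -> dict[str, set[str]]:
    """Repeatedly delete nodes whose out-degree is below `limit`."""
    # Work on a copy so the caller's graph is left untouched.
    graph = {node: set(succ) for node, succ in graph.items()}
    while True:
        weak = {node for node, succ in graph.items() if len(succ) < limit}
        if not weak:
            return graph
        for node in weak:
            del graph[node]
        # Drop edges that pointed at the deleted nodes.
        for succ in graph.values():
            succ -= weak


toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
pruned = prune_low_out_degree(toy, limit=2)
```

In the toy graph only `d` is removed. If `c` had depended on `d` for its second out-edge, the deletion would cascade, which is why the loop runs to a fixed point.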
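In one embodiment, when the coding efficiency is e, with e ∈ (0, 2], the out-degree limit of layer [Figure PCTCN2020094192-appb-000003] of each node is [Figure PCTCN2020094192-appb-000004], where k is the number of base characters the sliding window advances per slide. Because the base sequence within one sliding window length constitutes a node, each slide of k base characters corresponds to the k-th layer of the node.
In a more preferred embodiment, when the coding efficiency is 1, the out-degree limit of each node is 2.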
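S250: Delete the excess out-edges of each node in the directed graph.
In the present invention, excess out-edges are those exceeding the set out-degree limit. For example, in one embodiment, if the set out-degree limit is 2 but a node has 4 out-edges, the out-edges beyond the limit are excess, i.e., 2 out-edges need to be deleted. The purpose of deleting excess out-edges is to maintain the stability of the algorithm.
In one embodiment, the excess out-edges are those by which the total out-degree of layer [Figure PCTCN2020094192-appb-000005] of a node exceeds [Figure PCTCN2020094192-appb-000006], where e is the coding efficiency and e ∈ (0, 2].
As shown in FIG. 3, the excess out-edges of each node are deleted according to the coding efficiency. Specifically, if the total out-degree of a node exceeds the set out-degree limit, the bases of the node are output in reverse order, and the out-edges pointing to the corresponding bases are deleted one by one in that order. In the present invention, an "out-edge pointing to the corresponding base" means the following: among the out-edges from a node to its successor nodes, if the last base of the successor's sequence is the same as a base output in reverse order from the preceding node, that out-edge is the "out-edge pointing to the corresponding base". For example, let the preceding node (L) have the sequence AACACGACT, and let its successors be node (P1) with sequence ACACGACTA, node (P2) with sequence ACACGACTC, node (P3) with sequence ACACGACTG, and node (P4) with sequence ACACGACTT. Node (L) is connected to nodes (P1), (P2), (P3), and (P4), giving 4 out-edges. If the set out-degree limit is 2, there are 2 excess out-edges to delete. The bases of node (L) in reverse order are T, C, A, G, C, A, C, A, A. Following this order, the out-edges pointing to base T and base C are deleted in turn, i.e., the out-edge from node (L) to node (P4), whose last base is T, and the out-edge from node (L) to node (P2), whose last base is C.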
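The reverse-order trimming rule can be sketched in Python as below, using the node (L) example from the text. The function name is hypothetical and the sketch assumes k = 1, so each successor is identified by its last base. What should happen when the node's own bases cannot bring the out-degree down to the limit is not stated in the text, so the sketch simply stops in that case.

```python
def trim_excess_out_edges(node: str, successors: set[str],
                          limit: int) -> set[str]:
    """Delete out-edges, in reverse base order of `node`, down to `limit`."""
    kept = set(successors)
    for base in reversed(node):
        if len(kept) <= limit:
            break
        # Delete the out-edge whose successor ends with this base, if any.
        kept = {succ for succ in kept if succ[-1] != base}
    return kept


# Node (L) and its four successors (P1)-(P4) from the example above.
successors_of_l = {"ACACGACTA", "ACACGACTC", "ACACGACTG", "ACACGACTT"}
kept = trim_excess_out_edges("AACACGACT", successors_of_l, limit=2)
```

Here the reversed bases start T, C, so the edges to (P4) and (P2) are dropped and the edges to (P1) and (P3) remain. The appeal of the rule is that the choice of which edges to drop is deterministic and derived from the node itself, so encoder and decoder agree without extra bookkeeping.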
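In some preferred embodiments, after step S240, the method further comprises:
Step S240': delete nodes in the directed graph whose in-degree is 0, so as to shrink the directed graph. The benefit is that, for a generation algorithm with relatively loose constraints, this tightens the constraints from another angle.
In some preferred embodiments, the method further comprises: after performing step S240', returning to step S240 and repeating steps S240-S240' until the out-degree of every node in the directed graph is greater than the set out-degree limit and no node with an in-degree of 0 remains in the directed graph.
In some preferred embodiments, between step S240 and step S240', the method further comprises:
Step S240'': delete the excess out-edges of each node in the directed graph, where the excess out-edges are those exceeding the set out-degree limit. Excess out-edges are as defined above and are not described again here.
In some preferred embodiments, the method further comprises: after performing step (S240'), returning to step (S230) and repeating steps (S230)-(S240)-(S240'')-(S240') until the out-degree of every node in the directed graph is greater than the set out-degree limit and no node with an in-degree of 0 remains in the directed graph. It should be noted that, because nodes and out-edges are deleted in each cycle, at the start of a new cycle all remaining nodes are reconnected according to the connection rule above to form a new directed graph, which is then pruned according to the deletion rules above. Moreover, removing excess out-edges before deleting nodes with an in-degree of 0 exposes more zero-in-degree nodes in the current cycle, which reduces the total number of cycles and thus shortens the running time of the program.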
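One possible reading of the (S230)-(S240)-(S240'')-(S240') cycle, sketched in Python for k = 1. The function name `refine`, the stopping test (no node was removed during a full cycle), and the use of "out-degree below the limit" as the deletion criterion of S240 are assumptions of this sketch rather than details confirmed by the text.

```python
def refine(sequences: set[str], limit: int = 2) -> dict[str, set[str]]:
    """Reconnect, prune, trim, and drop orphans until the node set is stable."""
    nodes = set(sequences)
    while True:
        # S230: reconnect the surviving nodes (one-base overlap rule).
        graph = {s: {s[1:] + b for b in "ACGT" if s[1:] + b in nodes}
                 for s in nodes}
        # S240: delete nodes with too few out-edges.
        weak = {s for s, succ in graph.items() if len(succ) < limit}
        graph = {s: succ - weak for s, succ in graph.items() if s not in weak}
        # S240'': trim excess out-edges in reverse base order of each node.
        for s in graph:
            for base in reversed(s):
                if len(graph[s]) <= limit:
                    break
                graph[s] = {t for t in graph[s] if t[-1] != base}
        # S240': delete nodes that no remaining edge points to.
        reachable = set().union(*graph.values())
        survivors = set(graph) & reachable
        if survivors == nodes:
            return graph
        nodes = survivors


# Toy input: the 12 two-base sequences without a repeated base.
dinucleotides = {x + y for x in "ACGT" for y in "ACGT" if x != y}
rules_graph = refine(dinucleotides, limit=2)
```

On this toy input every node starts with three out-edges, loses one in the trimming step, and keeps an in-degree of two, so the loop finishes after a single cycle with 12 nodes of out-degree 2.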
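S260: Obtain the algorithm chart, which includes the DNA storage encoding and decoding rules.
Corresponding to the method for generating DNA storage encoding and decoding rules of the present invention, the present invention also provides a device for generating DNA storage encoding and decoding rules, as shown in FIG. 4, comprising: a sliding window setting unit 410, configured to set a sliding window (n, k) for the DNA storage encoding and decoding rules, where n is the length of the sliding window and k is the number of base characters advanced per slide, n and k being positive integers with n ≥ k; a qualified sequence screening unit 420, configured to obtain a full set of sequences based on the sliding window length n, where the full set of sequences is the set of all base sequences formed by all combinations of the possible bases at each base position within the sliding window length, and to use constraints to screen out, from the full set of sequences, the set of qualified sequences that satisfy the constraints, where the constraints are set based on sequence features of the sequences in the full set; a directed graph connection unit 430, configured to connect the sequences in the qualified sequence set as a directed graph, each node in the directed graph representing one sequence; an insufficient out-degree deletion unit 440, configured to delete nodes in the directed graph whose out-degree is less than the set out-degree limit; an excess out-degree deletion unit 450, configured to delete the excess out-edges of each node in the directed graph, where the excess out-edges are those exceeding the set out-degree limit; and an algorithm chart obtaining unit 460, configured to obtain an algorithm chart, the algorithm chart including the DNA storage encoding and decoding rules.
Those skilled in the art will understand that all or part of the functions of the methods in the above embodiments may be implemented in hardware or by a computer program. When all or part of the functions in the above embodiments are implemented by a computer program, the program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, a hard disk, and the like, and the functions are realized when the program is executed by a computer. For example, the program is stored in the memory of a device, and all or part of the above functions are realized when the processor executes the program in the memory. In addition, when all or part of the functions in the above embodiments are implemented by a computer program, the program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disc, a flash drive, or a removable hard disk, and saved to the memory of a local device by downloading or copying, or used to update the system of the local device; all or part of the functions in the above embodiments are realized when the processor executes the program in the memory.
Accordingly, an embodiment of the present invention provides a computer-readable storage medium including a program that can be executed by a processor to implement the method for generating DNA storage encoding and decoding rules of the present invention.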
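As shown in FIG. 5, FIG. 6, and FIG. 7, an embodiment of the present invention also provides a DNA storage encoding method, i.e., a method of using the generated DNA storage encoding and decoding rules in the encoding stage, comprising the following steps:
S510: Obtain the DNA storage encoding and decoding rules, set an initial node, and take the initial node as the current node. It will be understood that any node may be set as the initial node; usually the ID of the initial node is set to 0.
The DNA storage encoding and decoding rules are generated by the method for generating DNA storage encoding and decoding rules of the present invention under a given sliding window (n, k) and given constraints.
In one embodiment of the present invention, the sliding window parameters are (n = 9, k = 1), and the given constraints are: single-base repeats of no more than 2; simple sequence repeats of no fewer than 4 bases; palindromic sequence repeats of no fewer than 4 bases; complementary palindromic sequence repeats of no fewer than 4 bases; a GC content between 40% and 60%; and elimination of the 4 special sequences "AGA", "GAG", "CTC", and "TCT" targeted at nanopore sequencing. In other embodiments, the sliding window parameters and the constraints may be set as needed.
S520: Obtain the binary sequence to be encoded and slice it to generate binary slices, and convert the binary value corresponding to each slice into an out-degree node or multi-layer out-degree nodes connected to the current node. Each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a mapping pair.
In one embodiment of the present invention, the binary sequence to be encoded is sliced into segments of length 2k-1, where k is the number of base characters the sliding window advances per slide.
S530: According to the DNA storage encoding and decoding rules, input the binary slice, output the nucleic acid fragment mapped by the out-degree node or multi-layer out-degree nodes, and update the out-degree node to be the current node; binary slices are input and nucleic acid fragments are output repeatedly, in the order of the binary slices, until all binary slices have been input.
In one embodiment of the present invention, an adjacency matrix is used to illustrate the principle of the encoding method, as shown in FIG. 6. In the adjacency matrix, white text on a black background marks the nucleotide designated under the current ID. In the graph, node color from light to dark indicates the layer of the node, the number in a node is the node's ID, and the character closest to a node is the nucleotide designated for that node. Each arrow represents the bit obtained when moving from that node to the next node.
In one embodiment of the present invention, an adjacency matrix diagram (DNA Spider-Web) is used to show the specific encoding and decoding process, as shown in FIG. 7. The figure shows how, during encoding, the current node jumps to the next node after a bit is read, and how the corresponding nucleotide is obtained in the process.
S540: Concatenate the nucleic acid fragments in output order and output the complete DNA sequence.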
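Steps S510 to S540 amount to walking the rule graph one bit at a time. The Python sketch below uses a hypothetical toy rule table (`RULES`) over two-base nodes in place of the patent's generated rules, which are only shown in FIG. 11. Assigning bit 0 and bit 1 to the successors in sorted order, and emitting only the newly appended base at each step, are assumptions of this sketch.

```python
# Hypothetical toy rules: node -> {bit: next node}, two out-edges per node.
# Each node XY (X != Y) keeps the successors YZ with Z not in {X, Y}.
RULES = {
    x + y: dict(zip("01", sorted(y + z for z in "ACGT" if z not in (x, y))))
    for x in "ACGT" for y in "ACGT" if x != y
}


def encode(bits: str, rules: dict[str, dict[str, str]], start: str) -> str:
    """Walk the rule graph, emitting one base per input bit (k = 1)."""
    current, bases = start, []
    for bit in bits:                   # S520: slices of length 1
        current = rules[current][bit]  # S530: jump to the mapped successor
        bases.append(current[-1])      # the base contributed by this step
    return "".join(bases)              # S540: concatenate in output order


dna = encode("0110", RULES, start="AC")
```

Each input bit costs one dictionary lookup, so the running time is linear in the length of the input, in line with the O(n) complexity stated for the method.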
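In one embodiment of the present invention, after outputting the complete DNA sequence, the DNA storage encoding method of the present invention synthesizes the DNA sequence and then preserves it in an in vitro medium or in living cells.
Corresponding to the DNA storage encoding method of the present invention, an embodiment of the present invention also provides a DNA storage encoding device, as shown in FIG. 8, comprising: a codec rule acquisition unit 810, configured to obtain the DNA storage encoding and decoding rules generated by the method for generating DNA storage encoding and decoding rules of the present invention, to set an initial node, and to take the initial node as the current node; a binary sequence slicing and conversion unit 820, configured to obtain the binary sequence to be encoded and slice it to generate binary slices, and to convert the binary value corresponding to each slice into an out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a mapping pair; a nucleic acid fragment output unit 830, configured, according to the DNA storage encoding and decoding rules, to input the binary slice, output the nucleic acid fragment mapped by the out-degree node or multi-layer out-degree nodes, and update the out-degree node to be the current node, with binary slices input and nucleic acid fragments output repeatedly, in the order of the binary slices, until all binary slices have been input; and a nucleic acid fragment connection unit 840, configured to concatenate the nucleic acid fragments in output order and output the complete DNA sequence.
In addition, an embodiment of the present invention provides a computer-readable storage medium including a program that can be executed by a processor to implement the DNA storage encoding method of the present invention.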
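As shown in FIG. 7 and FIG. 9, an embodiment of the present invention also provides a DNA storage decoding method, i.e., a method of using the generated DNA storage encoding and decoding rules in the decoding stage, comprising the following steps:
S910: Obtain the DNA storage encoding and decoding rules, set an initial node, and take the initial node as the current node. It will be understood that any node may be set as the initial node; usually the ID of the initial node is set to 0.
The DNA storage encoding and decoding rules are generated by the method for generating DNA storage encoding and decoding rules of the present invention under a given sliding window (n, k) and given constraints.
In one embodiment of the present invention, the sliding window parameters are (n = 9, k = 1), and the given constraints are: single-base repeats of no more than 2; simple sequence repeats of no fewer than 4 bases; palindromic sequence repeats of no fewer than 4 bases; complementary palindromic sequence repeats of no fewer than 4 bases; a GC content between 40% and 60%; and elimination of the 4 special sequences "AGA", "GAG", "CTC", and "TCT" targeted at nanopore sequencing. In other embodiments, the sliding window parameters and the constraints may be set as needed.
S920: Obtain the DNA sequence to be decoded and slice it to generate nucleic acid slices, and, according to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to the slices, find the out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair.
In one embodiment of the present invention, the DNA sequence to be decoded is sliced into segments of length k, where k is the number of base characters the sliding window advances per slide.
S930: From the current node and the out-degree node or multi-layer out-degree nodes, obtain the binary value or binary slice between the nodes according to the mapping relationship, and update the out-degree node to be the current node; nucleic acid slices are input and binary values or binary slices are output repeatedly, in the order of the nucleic acid slices, until all nucleic acid slices have been input.
In one embodiment of the present invention, an adjacency matrix diagram (DNA Spider-Web) is used to show the specific encoding and decoding process, as shown in FIG. 7. The figure shows how, during decoding, the current node jumps to the next node after a nucleotide is read, and how the corresponding bit is obtained in the process.
S940: Concatenate the binary values or binary slices in output order and output the complete binary sequence.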
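Decoding (S910 to S940) walks the same graph in the other direction: each base read selects the out-edge whose successor ends in that base, and the label of that edge is the recovered bit. The Python sketch below uses a hypothetical toy rule table (`RULES`) over two-base nodes, with bits assigned to successors in sorted order; both are assumptions for illustration and not the patent's generated rules.

```python
# Hypothetical toy rules: node -> {bit: next node}, two out-edges per node.
RULES = {
    x + y: dict(zip("01", sorted(y + z for z in "ACGT" if z not in (x, y))))
    for x in "ACGT" for y in "ACGT" if x != y
}


def decode(dna: str, rules: dict[str, dict[str, str]], start: str) -> str:
    """Recover the bit string by following the edge that matches each base."""
    current, bits = start, []
    for base in dna:                       # S920: slices of length k = 1
        for bit, successor in rules[current].items():
            if successor[-1] == base:      # S930: edge matching this base
                bits.append(bit)
                current = successor
                break
        else:
            # No out-edge of the current node produces this base.
            raise ValueError(f"base {base!r} is not reachable from {current}")
    return "".join(bits)                   # S940: concatenate in output order


bits = decode("GTCA", RULES, start="AC")
```

A base that matches no out-edge of the current node cannot have been produced by the encoder, so the sketch raises an error at that point. How the patent's method handles such sequencing errors is not described in this section.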
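Corresponding to the DNA storage decoding method of the present invention, an embodiment of the present invention also provides a DNA storage decoding device, as shown in FIG. 10, comprising: a codec rule acquisition unit 1010, configured to obtain the DNA storage encoding and decoding rules generated by the method for generating DNA storage encoding and decoding rules of the present invention, to set an initial node, and to take the initial node as the current node; a DNA slicing and conversion unit 1020, configured to obtain the DNA sequence to be decoded and slice it to generate nucleic acid slices, and, according to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to the slices, to find the out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair; a binary value output unit 1030, configured to obtain the binary value or binary slice between the nodes from the current node and the out-degree node or multi-layer out-degree nodes according to the mapping relationship, and to update the out-degree node to be the current node, with nucleic acid slices input and binary values or binary slices output repeatedly, in the order of the nucleic acid slices, until all nucleic acid slices have been input; and a binary value connection unit 1040, configured to concatenate the binary values or binary slices in output order and output the complete binary sequence.
In addition, an embodiment of the present invention provides a computer-readable storage medium including a program that can be executed by a processor to implement the DNA storage decoding method of the present invention.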
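The technical solution and effects of the present invention are described in detail below through an example. It should be understood that the example is only illustrative and is not to be construed as limiting the present invention.
In this example, the sliding window parameters are (n = 9, k = 1).
The procedure for obtaining the implicit encoding and decoding rules with the DNA storage codec generator is as follows:
(1) Based on the current constraints (single-base repeats of no more than 2; simple sequence repeats of no fewer than 4 bases; palindromic sequence repeats of no fewer than 4 bases; complementary palindromic sequence repeats of no fewer than 4 bases; a GC content between 40% and 60%; and elimination of the 4 special sequences "AGA", "GAG", "CTC", and "TCT" targeted at nanopore sequencing), all combinations are obtained. Of the original 4^n = 262144 combinations of DNA sequence fragments, those that do not satisfy the above constraints are discarded and the rest are retained, finally giving a qualified sequence set of 48460 combinations.
(2) The sequences in the qualified sequence set are connected as a directed graph. The out-degree of each node is checked, and nodes that do not meet the requirement, i.e., whose out-degree does not exceed 2^k = 2, are deleted until all remaining nodes meet the requirement. After 9 rounds of screening, 14000 combinations remain.
(3) Nodes with an in-degree of 0 are eliminated, and the out-degrees are checked further. After 10 in-degree elimination operations, 5264 combinations of DNA sequence fragments remain.
(4) Among the 5264 combinations of DNA sequence fragments, all nodes with an out-degree greater than 2 are found. The sequences of these nodes are output in reverse order, and any excess out-edges are deleted one by one, following the reversed sequence order, by removing the out-edges pointing to the corresponding bases. The adjacency matrix of the graph is saved, producing an encoding and decoding method that contains the implicit encoding and decoding rules.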
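A specific example of using the generated encoding and decoding method for encoding and decoding:
(1) Prerequisite: obtain the encoding and decoding method containing the implicit encoding and decoding rules generated in this example; part of the information in the configuration file of the rules is shown in FIG. 11.
(2) Storage process:
[1] Extract the binary code corresponding to the sentence "Hello world!":
000100101010011000110110001101101111011000000100111011101111011001001110001101100010011010000100
[2] Slice the above binary code into segments of length 1 and encode the binary information according to the encoding rules to obtain DNA fragments; concatenate the DNA fragments in slice order to obtain the following full-length DNA sequence:
Figure PCTCN2020094192-appb-000007
[3] Synthesize the above full-length DNA sequence by chemical synthesis.
[4] Preserve the synthesized DNA molecules to achieve information storage.
(3) Reading process:
[1] Sequence the stored DNA molecules to obtain their sequence, as follows:
Figure PCTCN2020094192-appb-000008
[2] Slice the above DNA sequence into segments of length 1 and decode the sequence information according to the encoding rules to obtain binary values; concatenate the binary values in slice order to obtain the following binary sequence:
000100101010011000110110001101101111011000000100111011101111011001001110001101100010011010000100
[3] Restore the binary information and read it as "Hello world!".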
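The text does not say how "Hello world!" is turned into bits. The 96-bit string in the example is consistent with taking the ASCII byte of each character and writing its bits least-significant bit first, so the Python sketch below assumes that convention; the function names are illustrative.

```python
def text_to_bits(text: str) -> str:
    """ASCII bytes, each written as 8 bits, least-significant bit first."""
    return "".join(format(byte, "08b")[::-1] for byte in text.encode("ascii"))


def bits_to_text(bits: str) -> str:
    """Inverse of text_to_bits: regroup into bytes and undo the bit reversal."""
    data = bytes(int(bits[i:i + 8][::-1], 2) for i in range(0, len(bits), 8))
    return data.decode("ascii")


message_bits = text_to_bits("Hello world!")
restored = bits_to_text(message_bits)
```

For instance, "H" is 0x48, or 01001000 in the usual most-significant-bit-first notation; reversed, it becomes 00010010, which matches the first eight bits of the example string.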
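The present invention has been described above using specific examples, which serve only to aid understanding of the present invention and are not intended to limit it. Those skilled in the art to which the present invention pertains may make a number of simple deductions, variations, or substitutions based on the ideas of the present invention.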

Claims (23)

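  1. A method for generating DNA storage encoding and decoding rules, characterized in that the method comprises:
    (1) setting a sliding window (n, k) for the DNA storage encoding and decoding rules, where n is the length of the sliding window and k is the number of base characters advanced per slide, n and k being positive integers with n ≥ k;
    (2) obtaining a full set of sequences based on the sliding window length n, where the full set of sequences is the set of all base sequences formed by all combinations of the possible bases at each base position within the sliding window length, and using constraints to screen out, from the full set of sequences, the set of qualified sequences that satisfy the constraints, where the constraints are set based on sequence features of the sequences in the full set;
    (3) connecting the sequences in the qualified sequence set as a directed graph, each node in the directed graph representing one sequence;
    (4) deleting nodes in the directed graph whose out-degree is less than a set out-degree limit;
    (5) deleting the excess out-edges of each node in the directed graph, where the excess out-edges are those exceeding the set out-degree limit;
    (6) obtaining an algorithm chart, the algorithm chart including the DNA storage encoding and decoding rules.
  2. The method according to claim 1, characterized in that the constraints include at least one of GC content, single-base repeats, simple sequence repeats, palindromic sequence repeats, complementary palindromic sequence repeats, and elimination of special sequences.
  3. The method according to claim 2, characterized in that the constraints include at least one of the following:
    a GC content of 40%-60%; single-base repeats of no more than 3 consecutive identical bases; simple sequence repeats of no fewer than 4 bases; palindromic sequence repeats of no fewer than 4 bases; complementary palindromic sequence repeats of no fewer than 4 bases; and elimination of special sequences, namely sequences containing AGA, GAG, CTC, or TCT.
  4. The method according to claim 1, characterized in that the set out-degree limit is the out-degree required by the coding efficiency.
  5. The method according to claim 4, characterized in that, when the coding efficiency is e, with e ∈ (0, 2], the out-degree limit of layer [Figure PCTCN2020094192-appb-100001] of each node is [Figure PCTCN2020094192-appb-100002].
  6. The method according to claim 5, characterized in that, when the coding efficiency is 1, the out-degree limit of each node is 2.
  7. The method according to claim 1, characterized in that deleting the excess out-edges of each node in the directed graph comprises: if the total out-degree of the node exceeds the set out-degree limit, outputting the bases of the node in reverse order, and deleting, one by one in that reverse order, the out-edges pointing to the corresponding bases.
  8. The method according to claim 1, characterized in that, after step (4), the method further comprises:
    (4') deleting nodes in the directed graph whose in-degree is 0.
  9. The method according to claim 8, characterized in that the method further comprises:
    after performing step (4'), returning to step (4) and repeating steps (4)-(4') until the out-degree of every node in the directed graph is greater than the set out-degree limit and no node with an in-degree of 0 remains in the directed graph.
  10. The method according to claim 8 or 9, characterized in that, between step (4) and step (4'), the method further comprises:
    (4'') deleting the excess out-edges of each node in the directed graph, where the excess out-edges are those exceeding the set out-degree limit.
  11. The method according to claim 10, characterized in that the method further comprises:
    after performing step (4'), returning to step (3) and repeating steps (3)-(4)-(4'')-(4') until the out-degree of every node in the directed graph is greater than the set out-degree limit and no node with an in-degree of 0 remains in the directed graph.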
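  12. A device for generating DNA storage encoding and decoding rules, characterized in that the device comprises:
    a sliding window setting unit, configured to set a sliding window (n, k) for the DNA storage encoding and decoding rules, where n is the length of the sliding window and k is the number of base characters advanced per slide, n and k being positive integers with n ≥ k;
    a qualified sequence screening unit, configured to obtain a full set of sequences based on the sliding window length n, where the full set of sequences is the set of all base sequences formed by all combinations of the possible bases at each base position within the sliding window length, and to use constraints to screen out, from the full set of sequences, the set of qualified sequences that satisfy the constraints, where the constraints are set based on sequence features of the sequences in the full set;
    a directed graph connection unit, configured to connect the sequences in the qualified sequence set as a directed graph, each node in the directed graph representing one sequence;
    an insufficient out-degree deletion unit, configured to delete nodes in the directed graph whose out-degree is less than a set out-degree limit;
    an excess out-degree deletion unit, configured to delete the excess out-edges of each node in the directed graph, the excess out-edges being those exceeding the set out-degree limit;
    an algorithm chart obtaining unit, configured to obtain an algorithm chart, the algorithm chart including the DNA storage encoding and decoding rules.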
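  13. A DNA storage encoding method, characterized in that the method comprises:
    obtaining the DNA storage encoding and decoding rules generated by the method according to any one of claims 1 to 11, setting an initial node, and taking the initial node as the current node;
    obtaining the binary sequence to be encoded and slicing it to generate binary slices, and converting the binary value corresponding to each slice into an out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a mapping pair;
    according to the DNA storage encoding and decoding rules, inputting the binary slice, outputting the nucleic acid fragment mapped by the out-degree node or multi-layer out-degree nodes, and updating the out-degree node to be the current node, with binary slices input and nucleic acid fragments output repeatedly, in the order of the binary slices, until all binary slices have been input;
    concatenating the nucleic acid fragments in output order and outputting the complete DNA sequence.
  14. The method according to claim 13, characterized in that the method slices the binary sequence to be encoded into segments of length 2k-1, where k is the number of base characters the sliding window advances per slide.
  15. The method according to claim 13, characterized in that the method further comprises: synthesizing the DNA sequence and then preserving it in an in vitro medium or in living cells.
  16. A DNA storage encoding device, characterized in that the device comprises:
    a codec rule acquisition unit, configured to obtain the DNA storage encoding and decoding rules generated by the method according to any one of claims 1 to 11, to set an initial node, and to take the initial node as the current node;
    a binary sequence slicing and conversion unit, configured to obtain the binary sequence to be encoded and slice it to generate binary slices, and to convert the binary value corresponding to each slice into an out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a nucleic acid fragment, and the binary slice and the corresponding nucleic acid fragment form a mapping pair;
    a nucleic acid fragment output unit, configured, according to the DNA storage encoding and decoding rules, to input the binary slice, output the nucleic acid fragment mapped by the out-degree node or multi-layer out-degree nodes, and update the out-degree node to be the current node, with binary slices input and nucleic acid fragments output repeatedly, in the order of the binary slices, until all binary slices have been input;
    a nucleic acid fragment connection unit, configured to concatenate the nucleic acid fragments in output order and output the complete DNA sequence.
  17. The device according to claim 16, characterized in that the binary sequence to be encoded is sliced into segments of length 2k-1, where k is the number of base characters the sliding window advances per slide.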
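  18. A DNA storage decoding method, characterized in that the method comprises:
    obtaining the DNA storage encoding and decoding rules generated by the method according to any one of claims 1 to 11, setting an initial node, and taking the initial node as the current node;
    obtaining the DNA sequence to be decoded and slicing it to generate nucleic acid slices, and, according to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to the slices, finding the out-degree node or multi-layer out-degree nodes connected to the current node, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair;
    from the current node and the out-degree node or multi-layer out-degree nodes, obtaining the binary value or binary slice between the nodes according to the mapping relationship, and updating the out-degree node to be the current node, with nucleic acid slices input and binary values or binary slices output repeatedly, in the order of the nucleic acid slices, until all nucleic acid slices have been input;
    concatenating the binary values or binary slices in output order and outputting the complete binary sequence.
  19. The method according to claim 18, characterized in that the method slices the DNA sequence to be decoded into segments of length k, where k is the number of base characters the sliding window advances per slide.
  20. The method according to claim 18, characterized in that the DNA sequence to be decoded is generated by encoding with the method according to any one of claims 13 to 14 or the device according to claims 16-17.
  21. A DNA storage decoding device, characterized in that the device comprises:
    a codec rule acquisition unit, configured to obtain the DNA storage encoding and decoding rules generated by the method according to any one of claims 1 to 11, to set an initial node, and to take the initial node as the current node;
    a DNA slicing and conversion unit, configured to obtain the DNA sequence to be decoded and slice it to generate nucleic acid slices, and, according to the DNA storage encoding and decoding rules and the nucleic acid information corresponding to the slices, to find the current node and the out-degree node or multi-layer out-degree nodes connected to it, where each out-degree node describes a piece of nucleic acid information, and the nucleic acid slice and the corresponding binary value or binary slice form a mapping pair;
    a binary value output unit, configured to obtain the binary value or binary slice between the nodes from the current node and the out-degree node or multi-layer out-degree nodes according to the mapping relationship, and to update the out-degree node to be the current node, with nucleic acid slices input and binary values or binary slices output repeatedly, in the order of the nucleic acid slices, until all nucleic acid slices have been input;
    a binary value connection unit, configured to concatenate the binary values in output order and output the complete binary sequence.
  22. The device according to claim 21, characterized in that the DNA sequence to be decoded is sliced into segments of length k, where k is the number of base characters the sliding window advances per slide.
  23. A computer-readable storage medium, characterized in that it includes a program that can be executed by a processor to implement the method according to any one of claims 1 to 11, the method according to any one of claims 13 to 15, or the method according to any one of claims 18 to 20.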
PCT/CN2020/094192 2020-06-03 2020-06-03 Method and device for generating DNA storage encoding and decoding rules, and DNA storage encoding and decoding method and device WO2021243605A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
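PCT/CN2020/094192 WO2021243605A1 (zh) 2020-06-03 2020-06-03 Method and device for generating DNA storage encoding and decoding rules, and DNA storage encoding and decoding method and device
CN202080101762.4A CN115699189A (zh) 2020-06-03 2020-06-03 Method and device for generating DNA storage encoding and decoding rules, and DNA storage encoding and decoding method and device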

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/094192 WO2021243605A1 (zh) 2020-06-03 2020-06-03 Method and device for generating DNA storage encoding and decoding rules, and DNA storage encoding and decoding method and device

Publications (1)

Publication Number Publication Date
WO2021243605A1 true WO2021243605A1 (zh) 2021-12-09

Family

ID=78830036

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/094192 WO2021243605A1 (zh) 2020-06-03 2020-06-03 Method and device for generating DNA storage encoding and decoding rules, and DNA storage encoding and decoding method and device

Country Status (2)

Country Link
CN (1) CN115699189A (zh)
WO (1) WO2021243605A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040153255A1 (en) * 2003-02-03 2004-08-05 Ahn Tae-Jin Apparatus and method for encoding DNA sequence, and computer readable medium
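CN105022935A (zh) * 2014-04-22 2015-11-04 中国科学院青岛生物能源与过程研究所 Encoding method and decoding method for information storage using DNA
CN105119717A (zh) * 2015-07-21 2015-12-02 郑州轻工业学院 Encryption system and encryption method based on DNA coding
CN109300508A (zh) * 2017-07-25 2019-02-01 南京金斯瑞生物科技有限公司 DNA data storage encoding and decoding method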

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
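WO2017190297A1 (zh) * 2016-05-04 2017-11-09 深圳华大基因研究院 Method for storing text information using DNA, decoding method thereof, and application thereof
KR20240025702A (ko) * 2016-06-07 2024-02-27 일루미나, 인코포레이티드 Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
CN107103206B (zh) * 2017-04-27 2019-10-18 福建师范大学 DNA sequence clustering using locality-sensitive hashing based on standard entropy
CN109830263B (zh) * 2019-01-30 2023-04-07 东南大学 DNA storage method based on oligonucleotide sequence coding and storage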

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
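CN114822695A (zh) * 2022-04-25 2022-07-29 中国科学院深圳先进技术研究院 Encoding method and encoding device for DNA storage
WO2023206023A1 (zh) * 2022-04-25 2023-11-02 中国科学院深圳先进技术研究院 Encoding method and encoding device for DNA storage
CN114822695B (zh) * 2022-04-25 2024-04-16 中国科学院深圳先进技术研究院 Encoding method and encoding device for DNA storage
CN116187435B (zh) * 2022-12-19 2024-01-05 武汉大学 Method and system for storing information in DNA based on large and small fountain codes and the MRC algorithm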

Also Published As

Publication number Publication date
CN115699189A (zh) 2023-02-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20939218

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20939218

Country of ref document: EP

Kind code of ref document: A1