CN114023374A - DNA channel simulation and coding optimization method and device - Google Patents
- Publication number
- CN114023374A (CN202111307148.6A)
- Authority
- CN
- China
- Prior art keywords
- simulation
- channel
- coding
- sequencing result
- dna
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The application relates to the technical field of information storage, in particular to a DNA channel simulation and coding optimization method and a device, wherein the method comprises the following steps: establishing a channel simulation model aiming at a storage condition based on a given storage link and parameters; inputting the coded DNA sequence into a channel simulation model to obtain a simulation sequencing result, and determining a decoding condition according to the simulation sequencing result; and obtaining coding optimization data through the simulation sequencing result and the corresponding decoding condition statistical analysis, and optimizing channel coding design and channel parameter design by using the coding optimization data. The embodiment of the application allows a user to quickly build a DNA channel model, verifies the feasibility of a specific coding scheme with extremely low experimental cost, and obtains the optimal redundancy design for a specific channel by a systematic adjustment method, thereby being beneficial to reducing the possibility of the problems of decoding failure, storage space waste and the like.
Description
Technical Field
The present application relates to the field of information storage technologies, and in particular, to a DNA (deoxyribonucleic acid) channel simulation and coding optimization method and apparatus.
Background
In the era of data explosion, traditional information storage can hardly meet the rapidly growing demand for data storage. The amount of information generated on Earth each year is growing exponentially; by 2040, an estimated one million tons of silicon-based chips would be needed worldwide to store the data generated in that year. As the molecule that stores biological genetic information, DNA has many advantages for information storage: high storage density, low energy consumption, long storage lifetime, and so on. In particular, its storage density can reach 10^19 bit/cm^3; in theory, only one kilogram of DNA would be needed to store all of the world's current information, which makes DNA an attractive information storage medium.
DNA consists of deoxyribonucleotides containing four different bases (A, T, G, C), so in theory one nucleotide site can store two bits of data. In practice, the data to be stored is encoded into a plurality of DNA strands, the information-carrying DNA is synthesized by a DNA synthesizer, and after a series of steps such as sampling and PCR amplification, the data is read out by a DNA sequencer and the stored data is decoded and recovered. Because this channel is not perfect, various errors may be introduced during storage, including intra-strand errors such as base substitutions, deletions, and insertions, as well as the loss of whole strands. To recover the stored data exactly from sequencing information that contains errors, a suitable error-correcting code must be used, adding a certain amount of redundancy during encoding to combat the noise in the channel.
In the related art, researchers have proposed using RS codes, fountain codes, LDPC codes, and the like for DNA coding, but how to design a coding scheme suited to DNA storage, and how to systematically determine an appropriate amount of redundancy, remain technical difficulties in the field. According to Shannon's coding theory, finding the optimal coding scheme and redundancy requires introducing redundancy in a way that matches the particular noise structure of the specific channel. However, under current technical conditions DNA storage experiments are still expensive and slow, so it is difficult to iteratively optimize a coding design through repeated experiments.
Therefore, coding design for DNA information storage still lacks a fast, low-cost verification method and a systematic method for adjustment and optimization, and this problem urgently needs to be solved.
Summary of the Application
The application provides a DNA channel simulation and coding optimization method and device, which aim to solve the problem that coding design for DNA information storage still lacks a fast, low-cost verification method and a systematic method for adjustment and optimization.
The embodiment of the first aspect of the application provides a DNA channel simulation and coding optimization method, which comprises the following steps: establishing a channel simulation model aiming at a storage condition based on a given storage link and parameters; inputting the coded DNA sequence into the channel simulation model to obtain a simulation sequencing result, and determining the decoding condition according to the simulation sequencing result; and obtaining coding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition, and optimizing the channel coding design and the channel parameter design by using the coding optimization data.
Optionally, in an embodiment of the present application, the inputting the encoded DNA sequence into the channel simulation model to obtain a simulation sequencing result includes: inputting the coded DNA sequence into the channel simulation model to obtain a sequencing sequence and the sequence existence state of each intermediate stage; and acquiring the simulation sequencing result according to the sequencing sequence and the sequence existence state of each intermediate stage.
Optionally, in an embodiment of the present application, after obtaining the simulated sequencing result, the method further includes: and extracting channel error characteristics based on the simulation sequencing result, and adjusting the coding optimization data by utilizing the channel error characteristics to obtain the optimal coding optimization data.
Optionally, in an embodiment of the present application, the obtaining of the encoding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition includes: obtaining, based on the simulation sequencing result, one or more of the distribution of errors within strands, the distribution of the copy number of each strand, the number of sequences that are lost or contain errors, and the proportion of data recovered during decoding; and obtaining the coding optimization data from the one or more items so obtained.
Optionally, in an embodiment of the present application, the obtaining of the encoding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition further includes: based on the principle of redundancy and error matching, the optimal balance point of the information storage density and the successful recovery probability is determined by calculating the relationship among the redundancy, the information storage density and the successful recovery probability.
The embodiment of the second aspect of the present application provides a DNA channel simulation and coding optimization device, including: an establishing module, configured to establish a channel simulation model for the storage conditions based on given storage links and parameters; a simulation module, configured to input the coded DNA sequence into the channel simulation model to obtain a simulation sequencing result, and to determine the decoding condition according to the simulation sequencing result; and an optimization module, configured to obtain coding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition, and to optimize the channel coding design and the channel parameter design by using the coding optimization data.
Optionally, in an embodiment of the present application, the simulation module includes: a generating unit, configured to input the coded DNA sequence into the channel simulation model to obtain a sequencing sequence and the sequence existence state of each intermediate stage; and a first acquisition unit, configured to acquire the simulation sequencing result according to the sequencing sequence and the sequence existence state of each intermediate stage.
Optionally, in an embodiment of the present application, the optimization module includes: a second acquisition unit, configured to obtain, based on the simulation sequencing result, one or more of the distribution of errors within strands, the distribution of the copy number of each strand, the number of sequences that are lost or contain errors, and the proportion of data recovered during decoding; and a third acquisition unit, configured to obtain the coding optimization data from the one or more items so obtained.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the DNA channel simulation and coding optimization method as described in the above embodiments.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium, which stores computer instructions for causing a computer to execute the DNA channel simulation and coding optimization method according to the above embodiments.
The embodiments of the application allow a user to quickly create a simulation model of a specific DNA channel, analyze channel error characteristics and verify the feasibility of a specific coding scheme at extremely low experimental cost, and thereby save a large amount of time and money. They also provide a systematic redundancy adjustment scheme that yields the optimal redundancy design for a specific coding system, which helps reduce the likelihood of problems such as decoding failure and wasted storage space, and effectively meets the coding design requirements of DNA information storage. The problem that coding design for DNA information storage lacks a fast, low-cost verification method and a systematic method for adjustment and optimization is thus solved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a DNA channel simulation and coding optimization method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a DNA channel simulation and code optimization method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a simulation model construction method of a DNA channel simulation and coding optimization method according to an embodiment of the present application;
FIG. 4 is a diagram illustrating simulation results of a DNA channel simulation and coding optimization method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a parameter optimization process of a DNA channel simulation and coding optimization method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a user interface for a DNA channel simulation and code optimization method according to an embodiment of the present application;
FIG. 7 is a diagram of an example of a DNA channel simulation and code optimization apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes a DNA channel simulation and coding optimization method, apparatus, electronic device, and storage medium according to embodiments of the present application with reference to the drawings. To address the problem noted in the Background section, namely that coding design for DNA information storage still lacks a fast, low-cost verification method and a systematic method for adjustment and optimization, the application provides a DNA channel simulation and coding optimization method. The method allows a user to quickly create a simulation model of a specific DNA channel, analyze channel error characteristics and verify the feasibility of a specific coding scheme at extremely low experimental cost, and thereby save a large amount of time and money. It also provides a systematic redundancy adjustment scheme that yields the optimal redundancy design for a specific coding system, which helps reduce the likelihood of problems such as decoding failure and wasted storage space, and effectively meets the coding design requirements of DNA information storage.
Specifically, fig. 1 is a schematic flow chart of a DNA channel simulation and coding optimization method provided in the embodiment of the present application.
As shown in fig. 1, the DNA channel simulation and coding optimization method includes the following steps:
In step S101, a channel simulation model for the storage conditions is established based on the given storage links and parameters.
It is to be understood that, as shown in fig. 2, first, the embodiment of the present application establishes a channel simulation model for a storage condition based on a given storage link and parameters. The channel simulation model may be constructed in a modular cascade manner, so that the channel simulation model may be used to generate a simulation result in a random simulation or analysis manner in the following steps, which will be described in detail below.
For example, as shown in FIG. 3, in the embodiment of the present application, a channel simulation model is quickly established by cascading modules. Two basic modules, an error increase module E and a distribution transformation module D, are used to simulate the two basic kinds of change: E simulates newly introduced intra-strand errors, and D simulates changes in strand copy number, including the loss of an entire strand. By combining these two basic modules with additional special modules, a simulation model is constructed for each main experimental link, such as DNA synthesis, decay, PCR amplification, sampling, and DNA sequencing. The pre-constructed link models are then combined in the order of the experimental steps actually used, yielding a complete channel model for the specific channel. With this modular construction, a user can quickly customize a specific channel model and can add new modules as experiments require.
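The modular cascade described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the application: the names `error_module`, `distribution_module`, and `cascade`, the representation of a DNA pool as a `Counter` of sequences, and all rates are hypothetical, and the E-type module here models substitutions only.

```python
import random
from collections import Counter

BASES = "ACGT"

def error_module(sub_rate, rng):
    """E-type module (hypothetical): adds new substitution errors to every strand copy."""
    def apply(pool):
        out = Counter()
        for seq, copies in pool.items():
            for _ in range(copies):
                mutated = "".join(
                    rng.choice([b for b in BASES if b != base])
                    if rng.random() < sub_rate else base
                    for base in seq
                )
                out[mutated] += 1
        return out
    return apply

def distribution_module(keep_prob, rng):
    """D-type module (hypothetical): thins copy numbers; a strand left with 0 copies is lost."""
    def apply(pool):
        out = Counter()
        for seq, copies in pool.items():
            kept = sum(rng.random() < keep_prob for _ in range(copies))
            if kept:
                out[seq] = kept
        return out
    return apply

def cascade(*modules):
    """Chain modules in experimental order to form one channel model."""
    def channel(pool):
        for module in modules:
            pool = module(pool)
        return pool
    return channel

rng = random.Random(0)
channel = cascade(
    error_module(0.01, rng),        # stand-in for synthesis errors (assumed rate)
    distribution_module(0.5, rng),  # stand-in for sampling (assumed keep probability)
    error_module(0.005, rng),       # stand-in for sequencing errors (assumed rate)
)
pool = Counter({"ACGTACGTAC": 20, "TTGGCCAATT": 20})
reads = channel(pool)  # Counter mapping each observed read to its copy number
```

Because every module maps a pool to a pool, a new link can be added by writing one more function with the same shape and placing it in the cascade.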
In addition, the experimental parameters may be determined in any of the following ways: using the system default parameters taken from the literature; using actually measured instrument parameters and experimentally measured parameters; or providing actual data obtained under a specific experimental setting and fitting the channel parameters by minimizing the difference between the simulation data and the actual data. No limitation is imposed here.
The simulation model of the embodiment of the application uses stochastic simulation to reproduce the random processes by which errors are generated and the copy-number distribution of each strand changes at each stage. Some links may instead use an analytical solution as an approximation to speed up the simulation, further meeting practical needs.
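The third option, fitting parameters by minimizing the difference between simulated and actual data, might look like the following minimal sketch. The source does not specify a fitting procedure, so everything here is an assumption: a single substitution-rate parameter, a grid search over candidate rates, a mismatch fraction as the only statistic compared, and the hypothetical names `simulate_mismatch_fraction` and `fit_sub_rate`.

```python
import random

def simulate_mismatch_fraction(sub_rate, n_bases, seed):
    """Simulate n_bases base reads and return the fraction that were substituted."""
    rng = random.Random(seed)  # same seed for every candidate: common random numbers
    return sum(rng.random() < sub_rate for _ in range(n_bases)) / n_bases

def fit_sub_rate(observed_fraction, candidates, n_bases=20000, seed=0):
    """Pick the candidate rate whose simulated statistic is closest to the measured one."""
    return min(
        candidates,
        key=lambda rate: abs(
            simulate_mismatch_fraction(rate, n_bases, seed) - observed_fraction
        ),
    )

candidates = [i / 1000 for i in range(1, 51)]   # 0.001 ... 0.050
fitted = fit_sub_rate(0.012, candidates)        # 0.012 is a made-up measured value
```

A real channel has several interacting parameters, so a practical fit would compare richer statistics (for example, copy-number histograms) and use a proper optimizer; the grid search only illustrates the idea.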
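As an illustration of replacing a stochastic step with an analytical one, consider a sampling link in which each molecule is kept independently with a fixed probability. Under that assumption (which is ours, not the source's), the chance that a strand with c copies disappears entirely is (1 - p)^c, so the expected lost fraction can be computed directly instead of simulated. The function names below are hypothetical.

```python
import random

def lost_fraction_simulated(copy_numbers, keep_prob, trials, seed=0):
    """Monte Carlo estimate: draw every molecule and count strands with no survivor."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(trials):
        for copies in copy_numbers:
            if not any(rng.random() < keep_prob for _ in range(copies)):
                lost += 1
    return lost / (trials * len(copy_numbers))

def lost_fraction_analytic(copy_numbers, keep_prob):
    """Closed form: a strand with c copies is lost with probability (1 - keep_prob) ** c."""
    return sum((1 - keep_prob) ** c for c in copy_numbers) / len(copy_numbers)

copy_numbers = [1, 2, 5, 10]
simulated = lost_fraction_simulated(copy_numbers, keep_prob=0.3, trials=2000)
analytic = lost_fraction_analytic(copy_numbers, keep_prob=0.3)
```

The analytic version costs one power per strand regardless of copy number, whereas the simulated version scales with the total number of molecules and trials.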
In step S102, the encoded DNA sequence is input into the channel simulation model to obtain a simulation sequencing result, and the decoding condition is determined from the simulation sequencing result.
It can be appreciated that, as shown in fig. 2, secondly, the embodiment of the present application may input the encoded data into an established channel simulation model, obtain a simulation sequencing result, and attempt to decode to verify the feasibility of the encoding design.
Optionally, in an embodiment of the present application, inputting the encoded data into a channel simulation model to obtain a simulation sequencing result, including: inputting the coded DNA sequence into a channel simulation model to obtain a sequencing sequence and sequence existing states of each intermediate stage; and acquiring a simulation sequencing result according to the sequencing sequence and the sequence existing state of each intermediate stage.
For example, the data input into the channel simulation model in the embodiment of the present application are N DNA sequences, and the data output by the simulation model is simulated sequencing data in the form of actual sequencing data. Adopting input and output formats similar to those of an actual experiment ensures the fidelity and generality of the model.
In addition, in an embodiment of the present application, after obtaining the simulation sequencing result, the method further includes: and extracting channel error characteristics based on the simulation sequencing result, and adjusting the coding optimization data by utilizing channel error characteristic analysis to obtain the optimal coding optimization data.
That is, the channel simulation model of the embodiment of the present application accepts the encoded DNA sequence as input, and provides the sequence existence status of each intermediate stage for further analysis of the channel error characteristics while providing the simulated final sequencing sequence as simulation output. That is, in addition to the final sequencing data, the simulation model may also provide information such as the existence status of DNA intermediaries for enhancing the understanding of the target channel.
In step S103, coding optimization data is obtained through statistical analysis of the simulation sequencing result and the corresponding decoding condition, and the coding optimization data is used to optimize the channel coding design and the channel parameter design.
It is understood that, as shown in FIG. 2, the embodiment of the present application finally optimizes the coding design by matching the amount of redundancy to the amount of error, with the goal of balancing information storage density against the probability of successful recovery.
Optionally, in an embodiment of the present application, the obtaining of the encoding optimization data by statistical analysis of the simulation sequencing result and the corresponding decoding condition includes: obtaining, based on the simulation sequencing result, one or more of the distribution of errors within strands, the distribution of the copy number of each strand, the number of sequences that are lost or contain errors, and the proportion of data recovered during decoding; and obtaining the coding optimization data from the one or more items so obtained.
In actual implementation, the content of the statistical analysis of the embodiment of the present application may include, but is not limited to, the distribution of errors within strands, the distribution of the copy number of each strand, the number of sequences that are lost or contain errors, and the proportion of data recovered during decoding.
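A minimal sketch of how such statistics could be gathered from simulated reads is shown below. It assumes, purely for illustration, that each read is tagged with the index of the reference strand it came from (a simulator can record this, a real sequencer cannot) and that intra-strand errors are located by position-wise comparison, which is only meaningful for substitutions. The name `channel_statistics` and the dictionary keys are hypothetical.

```python
from collections import Counter

def channel_statistics(references, reads):
    """Summarise simulated reads given as (origin_index, sequence) pairs."""
    copies = Counter(origin for origin, _ in reads)
    copy_numbers = [copies.get(i, 0) for i in range(len(references))]
    error_positions = Counter()
    erroneous_reads = 0
    for origin, seq in reads:
        ref = references[origin]
        mismatches = [p for p, (a, b) in enumerate(zip(ref, seq)) if a != b]
        if mismatches or len(seq) != len(ref):
            erroneous_reads += 1
        error_positions.update(mismatches)
    return {
        "copy_number_histogram": Counter(copy_numbers),  # copies -> number of strands
        "lost_sequences": copy_numbers.count(0),         # strands with no read at all
        "erroneous_reads": erroneous_reads,              # reads differing from their origin
        "error_positions": error_positions,              # position -> mismatch count
    }

references = ["ACGTAC", "TTGGCC", "GATTAC"]
reads = [(0, "ACGTAC"), (0, "ACGAAC"), (1, "TTGGCC"), (0, "ACGTAC")]
stats = channel_statistics(references, reads)
```

In this toy input the third reference never appears among the reads, so it is counted as lost, and the single mismatching read contributes one error at position 3.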
Further, in an embodiment of the present application, the obtaining of the encoding optimization data by statistical analysis of the simulation sequencing result and the corresponding decoding condition further includes: based on the principle of redundancy and error matching, the optimal balance point of the information storage density and the successful recovery probability is determined by calculating the relationship among the redundancy, the information storage density and the successful recovery probability.
It should be understood by those skilled in the art that the encoding optimization method of the embodiment of the present application determines the optimal balance point of the information storage density and the successful restoration probability by calculating the relationship between the redundancy, the information storage density and the successful restoration probability based on the principle of redundancy and error matching.
For example, step S103 in the embodiment of the present application includes:
Step S1031: Run the simulation and decoding process multiple times and collect statistics to obtain the error distributions (including the distribution of the copy number of each strand, the distribution of errors within strands, the distribution of the number of lost sequences, and the distribution of the number of erroneous sequences after voting) and the decoding redundancy requirement distribution (the distribution of the redundancy required for decoding under a given error condition).
Step S1032: From the distributions obtained in step S1031, calculate the relationship among the information storage density, the successful recovery probability, and the redundancy used, and recommend a suitable redundancy design according to actual requirements.
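Steps S1031 and S1032 can be illustrated with a small Monte Carlo sketch. The model below is a deliberate simplification and every detail is an assumption: each droplet survives the channel independently with probability `survive_prob`, decoding succeeds when the survivors number at least (1 + `overhead`) times the number of segments, and 1 / (1 + alpha) is used as a relative proxy for information density. The function names are hypothetical.

```python
import random

def success_probability(n_segments, alpha, survive_prob,
                        overhead=0.05, trials=200, seed=0):
    """Estimate how often enough droplets survive for decoding to succeed."""
    rng = random.Random(seed)
    n_droplets = round((1 + alpha) * n_segments)
    needed = round((1 + overhead) * n_segments)  # assumed decoding threshold
    successes = 0
    for _ in range(trials):
        surviving = sum(rng.random() < survive_prob for _ in range(n_droplets))
        successes += surviving >= needed
    return successes / trials

def recommend_alpha(n_segments, survive_prob, alphas, target=0.99):
    """Tabulate (alpha, relative density, success probability) and pick the densest feasible row."""
    table = [
        (alpha, 1 / (1 + alpha), success_probability(n_segments, alpha, survive_prob))
        for alpha in alphas
    ]
    feasible = [row for row in table if row[2] >= target]
    best = max(feasible, key=lambda row: row[1]) if feasible else None
    return table, best

alphas = [0.1, 0.2, 0.3, 0.4, 0.5]
table, best = recommend_alpha(n_segments=200, survive_prob=0.9, alphas=alphas)
```

The table makes the trade-off explicit: raising alpha lowers the relative density but raises the estimated success probability, and the recommendation is the lowest redundancy that still meets the target.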
In short, given a storage link and parameters, the embodiment of the application establishes a channel simulation model for the storage conditions, inputs the encoded data into the established model to obtain a simulation sequencing result, attempts decoding to verify the feasibility of the coding design, and statistically analyzes the simulation data and the decoding condition to obtain a systematic coding optimization scheme. A user can therefore quickly build a DNA channel model, verify the feasibility of a specific coding scheme at extremely low experimental cost, and obtain the optimal redundancy design for a specific channel through a systematic adjustment method, which reduces the likelihood of problems such as decoding failure and wasted storage space.
The following examples are presented to schematically illustrate the DNA channel simulation and coding optimization methods of the examples of the present application.
In one embodiment of the present application, as shown in fig. 1 and fig. 2, a simulation model is established for a specific DNA channel under a certain experimental condition, and coding verification and redundancy optimization of a DNA fountain code are implemented based on the simulation model. It should be particularly noted that the embodiments of the present application are only exemplary, and besides the target channel and the coding method used in the examples, the present application can be applied to simulation of various channel conditions and optimization of various coding methods, and the present embodiment is not to be construed as limiting the present application.
Specifically, the DNA fountain code is a common code in the field of DNA information storage, and the coding principle is as follows: during encoding, binary data is divided into N fragments, the fragments are linearly combined by using a fountain algorithm to generate (1+ alpha) N 'droplets', and each droplet is added with the length LRSThe RS code of (a) is then converted into a DNA sequence. After passing through the DNA channel, a part of the DNA sequence is lost, and errors such as base deletion, addition, substitution and the like may occur in the DNA sequence. During decoding, each DNA sequence is converted into a binary system, an RS code is used for correcting errors in a chain, and if too many errors cannot be corrected, the chain is directly discarded; the original data is solved by using the residual error-free 'droplets', and when the number of the residual 'droplets' is slightly larger than N, the original data can be recovered. Alpha and LRSThe higher the setting, the more redundancy is added, the stronger the ability to combat channel noise, but the information density will also decrease accordingly, and it is necessary to select a suitable value according to the noise characteristics of the channel. In this embodiment, a process of establishing a channel simulation model, simulating an actual storage experiment to verify a coding scheme, and performing system adjustment on redundancy by using the method of the embodiment of the present application will be shown.
Step S1: Given the storage links and parameters, establish a channel simulation model for the storage condition.
In some embodiments, the experimenter plans to use a primer pool chip synthesis technique for DNA synthesis and to amplify the synthesized DNA pool by PCR; the data is stored in the DNA pool for a certain time; to read the data, a small amount of solution is taken from the DNA pool, amplified by PCR, and sequenced on the Illumina next-generation sequencing platform. According to this experimental procedure, the corresponding DNA channel model is constructed by combining, in order, the modules of DNA synthesis, PCR amplification, DNA decay, sampling, PCR amplification and DNA sequencing. The parameters of the synthesis and sequencing links refer to publicly available measurement data for the corresponding platforms, and the parameters of links such as PCR are consistent with the actual experiment.
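The sequential combination of modules can be sketched as a chain of functions that each transform a pool of molecules. This is only an illustrative toy, not the simulation model of the present application: every rate, copy number and function name below is a hypothetical placeholder, and only substitution errors at the sequencing stage are modeled.

```python
import random
from collections import Counter


def synthesize(seqs, mean_copies, rng):
    """Give each designed sequence a noisy initial copy number."""
    return Counter(
        {s: max(0, round(rng.gauss(mean_copies, 0.3 * mean_copies))) for s in seqs}
    )


def pcr(pool, cycles, efficiency, rng):
    """In each cycle every molecule is duplicated with probability `efficiency`."""
    out = Counter()
    for seq, n in pool.items():
        for _ in range(cycles):
            n += sum(rng.random() < efficiency for _ in range(n))
        out[seq] = n
    return out


def decay(pool, survival, rng):
    """Each molecule independently survives storage with probability `survival`."""
    return Counter(
        {s: sum(rng.random() < survival for _ in range(n)) for s, n in pool.items()}
    )


def sample(pool, k, rng):
    """Draw up to k molecules from the pool without replacement."""
    molecules = list(pool.elements())
    return Counter(rng.sample(molecules, min(k, len(molecules))))


def sequence(pool, depth, sub_rate, rng):
    """Read `depth` molecules, substituting each base with probability `sub_rate`."""
    molecules = list(pool.elements())
    if not molecules:
        return []
    reads = []
    for _ in range(depth):
        template = rng.choice(molecules)
        reads.append("".join(
            rng.choice("ACGT".replace(b, "")) if rng.random() < sub_rate else b
            for b in template
        ))
    return reads


def run_channel(seqs, seed=0):
    """Synthesis -> PCR -> decay -> sampling -> PCR -> sequencing."""
    rng = random.Random(seed)
    pool = synthesize(seqs, 10, rng)
    pool = pcr(pool, 3, 0.8, rng)
    pool = decay(pool, 0.9, rng)
    pool = sample(pool, 200, rng)
    pool = pcr(pool, 3, 0.8, rng)
    return sequence(pool, 100, 0.01, rng)
```

Because each stage takes a pool and returns a pool, stages can be reordered, repeated or replaced to match a different experimental procedure without changing the others.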
Step S2: Input the encoded data into the established channel simulation model to obtain a simulated sequencing result, and attempt decoding to verify the feasibility of the coding design.
In some embodiments, the lena.jpg file is encoded with the fountain code. The length L of a single DNA sequence is set to 104 bp according to the limits that the synthesis and sequencing platforms place on DNA length, and the redundancy is set to α = 0.5 and L_RS = 4, yielding 2076 encoded DNA sequences. Simulation with the given channel produces a simulated sequencing result; decoding the resulting DNA sequences yields 2612 "droplets" that can be corrected by the RS code, which indicates that the configured redundancy is clearly higher than the noise level of the channel, and lena.jpg is successfully decoded and recovered.
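The decoding attempt in this step can be illustrated with a peeling decoder, a standard way of solving XOR-combined droplets. The present application does not specify which solver is used, so the algorithm choice and the names below are assumptions; each droplet is taken to be a pair of (fragment indices, payload) that has already passed RS correction.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def peel_decode(droplets, n):
    """Recover n fragments from (indices, payload) droplets, or return None."""
    pending = [[set(idxs), bytes(payload)] for idxs, payload in droplets]
    recovered = {}
    progress = True
    while progress and len(recovered) < n:
        progress = False
        for item in pending:
            idxs, payload = item
            # Subtract every fragment that is already known from this droplet.
            for i in list(idxs):
                if i in recovered:
                    payload = xor_bytes(payload, recovered[i])
                    idxs.discard(i)
            item[1] = payload
            # A droplet covering exactly one unknown fragment reveals it.
            if len(idxs) == 1:
                (i,) = idxs
                recovered[i] = payload
                idxs.clear()
                progress = True
    if len(recovered) < n:
        return None
    return [recovered[i] for i in range(n)]
```

A return value of `None` corresponds to a failed decoding attempt, which in the workflow above would signal that the chosen redundancy is insufficient for the simulated channel.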
Besides verifying the feasibility of the decoding scheme, the simulation model in this method can also be used to obtain the changes in the copy number distribution and the intra-chain error distribution at each stage (fig. 4.a), the changes in the form of the copies of each DNA sequence at each stage (fig. 4.b), the result of voting among multiple sequencing reads of a single DNA sequence (fig. 4.c), and the trend of the number of errors after parameters such as the sequencing depth and sampling depth are adjusted. This provides a comprehensive and systematic understanding of the error characteristics of the target channel and helps guide the design of new coding methods.
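Voting among multiple reads of the same DNA sequence can be sketched as a per-position majority vote. This sketch assumes that all reads have the same length, that is, that only substitution errors are present; reads with insertions or deletions would first need to be aligned. The function name is hypothetical.

```python
from collections import Counter


def majority_vote(reads):
    """Return the consensus sequence of equal-length reads, position by position."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("reads must have equal length (no indels assumed)")
    # Ties are resolved in favor of the base that appears first among the reads.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))
```

For example, the reads "ACGT", "ACGA" and "TCGT" each contain at most one substitution, and the vote restores "ACGT".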
Step S3: Perform statistical analysis on the simulation data and the decoding results to obtain a systematic coding optimization scheme.
In some embodiments, the redundancy of the fountain code is optimized so that the desired decoding success rate is achieved with as little redundancy as possible, thereby achieving a high information storage density.
First, the length L_RS of the intra-chain RS code is optimized. Based on the distribution D(k) of chains containing k errors obtained by simulation, the information density under different values of L_RS is estimated according to the following formula (not reproduced in this text), and the L_RS that yields the highest information density is selected:
According to D(k), the calculation shows that L_RS = 2 is optimal and can achieve an estimated information density of 76%, so L_RS = 2 is chosen.
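The trade-off behind this selection can be illustrated with one plausible form of such an estimate. It is not the formula of the present application: the sketch assumes that an RS code with L_RS parity symbols corrects up to floor(L_RS / 2) symbol errors, that a chain is usable only if its error count is within that limit, and that density is the payload fraction multiplied by the usable fraction. The distribution D and the chain length of 26 symbols are hypothetical values.

```python
def estimated_density(d, chain_len, l_rs):
    """Payload fraction times the fraction of chains the RS code can still correct."""
    correctable = l_rs // 2  # RS with l_rs parity symbols corrects floor(l_rs / 2) errors
    usable = sum(p for k, p in d.items() if k <= correctable)
    return (chain_len - l_rs) / chain_len * usable


# Hypothetical distribution of chains containing k errors (sums to 1).
D = {0: 0.80, 1: 0.15, 2: 0.04, 3: 0.01}
CHAIN_LEN = 26  # hypothetical chain length in symbols

candidates = [0, 2, 4, 6]
best = max(candidates, key=lambda l_rs: estimated_density(D, CHAIN_LEN, l_rs))
```

Under these assumed values, adding parity first raises the estimate because more chains become correctable, and then lowers it once the extra parity costs more payload than the additional chains recover.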
Then, α is optimized. In the embodiment of the present application, the functional relationship between the decoding failure probability p_fail and α is calculated, and a suitable α is selected according to the decoding success probability required by the experiment. p_fail(α) can be obtained from the relation between two distributions, namely the distribution of the number of lost chains and the distribution of the number of chains required for decoding (the formula is not reproduced in this text). The two distributions can be obtained by running the simulation and decoding processes multiple times, respectively, and fitting the resulting data to a specific prior distribution, which can be determined by theoretical derivation or by experiment. As shown in fig. 5, in this example the number of lost chains is modeled by a Poisson distribution, and the number of chains required for decoding follows a Gumbel distribution (fig. 5.b). According to the resulting p_fail(α) curve, when a decoding success rate of 99% is desired, α may be set to 0.25-0.28 (fig. 5.c).
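One way to combine the two fitted distributions is sketched below. It is an assumption rather than the relation used in the present application: decoding is taken to fail when the number of chains sent minus the number lost is smaller than the number required, the two quantities are treated as independent, and the Poisson mean is taken to be proportional to the number of chains sent. All parameter values are hypothetical.

```python
import math


def poisson_pmf(lam, k_max):
    """Return [P(K = 0), ..., P(K = k_max)] for a Poisson(lam) variable."""
    p = math.exp(-lam)
    out = [p]
    for k in range(1, k_max + 1):
        p *= lam / k
        out.append(p)
    return out


def gumbel_sf(x, mu, beta):
    """P(X > x) for a Gumbel(mu, beta) variable, i.e. 1 - exp(-exp(-(x - mu) / beta))."""
    z = (x - mu) / beta
    if z < -30:  # the CDF is numerically zero here
        return 1.0
    return -math.expm1(-math.exp(-z))


def p_fail(alpha, n, loss_rate, mu, beta):
    """Probability that fewer chains survive than decoding requires."""
    sent = math.ceil((1 + alpha) * n)
    lam = loss_rate * sent
    k_max = min(sent, int(lam + 12 * math.sqrt(lam) + 20))  # truncate the Poisson tail
    return sum(
        p * gumbel_sf(sent - lost, mu, beta)
        for lost, p in enumerate(poisson_pmf(lam, k_max))
    )


def smallest_alpha(target, n, loss_rate, mu, beta):
    """Scan alpha = 0.00, 0.01, ..., 1.00 and return the first value meeting the target."""
    for step in range(101):
        alpha = step / 100
        if p_fail(alpha, n, loss_rate, mu, beta) <= target:
            return alpha
    return None
```

Calling `smallest_alpha(0.01, n, loss_rate, mu, beta)` with parameters fitted from repeated simulation runs gives the least redundancy that meets a 99% success target under these assumptions.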
In this embodiment, the error characteristics of the target channel are analyzed and the coding scheme is verified by low-cost computer simulation, which can save substantial experimental cost; the optimal redundancy design is obtained by systematic optimization, which avoids problems such as decoding failure and wasted storage space.
The DNA channel simulation and coding optimization method provided by the embodiment of the present application allows a user to quickly create a simulation model of a specific DNA channel, analyze channel error characteristics at very low experimental cost, and verify the feasibility of a specific coding scheme, which can save a large amount of time and money. It also provides a systematic redundancy adjustment scheme that yields the optimal redundancy design for a specific coding system, reduces the likelihood of problems such as decoding failure and wasted storage space, and effectively meets the coding design requirements of DNA information storage.
Next, a DNA channel simulation and coding optimization apparatus according to an embodiment of the present application will be described with reference to the drawings.
FIG. 7 is a block diagram of a DNA channel simulation and coding optimization apparatus according to an embodiment of the present application.
As shown in fig. 7, the DNA channel simulation and coding optimization apparatus 10 includes: an establishing module 100, a simulation module 200 and an optimization module 300.
Specifically, the establishing module 100 is configured to establish a channel simulation model for the storage condition based on the given storage links and parameters.
The simulation module 200 is configured to input the encoded DNA sequence into the channel simulation model to obtain a simulation sequencing result, and to determine a decoding condition according to the simulation sequencing result.
The optimization module 300 is configured to obtain coding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition, and to optimize the channel coding design and the channel parameter design by using the coding optimization data.
Optionally, in an embodiment of the present application, the simulation module 200 includes: a generating unit and a first acquisition unit.
The generating unit is configured to input the encoded DNA sequence into the channel simulation model to obtain a sequencing sequence and the sequence existing states of each intermediate stage.
The first acquisition unit is configured to acquire the simulation sequencing result according to the sequencing sequence and the sequence existing states of each intermediate stage.
Optionally, in an embodiment of the present application, the optimization module 300 includes: a second acquisition unit and a third acquisition unit.
The second acquisition unit is configured to acquire, based on the simulation sequencing result, one or more of the distribution of intra-chain errors, the copy number distribution of each chain, the number of lost sequences and of sequences containing errors, and the proportion of data recovered during decoding.
The third acquisition unit is configured to obtain the coding optimization data from one or more of the distribution of intra-chain errors, the copy number distribution of each chain, the number of lost sequences and of sequences containing errors, and the proportion of data recovered during decoding.
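The statistics gathered by the second acquisition unit can be illustrated with a short sketch. It assumes that the simulation records which designed sequence each read originated from, and that errors are counted as Hamming distance, which covers substitutions only; the function and key names are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def summarize(designed, reads):
    """Summarize simulated reads given as (origin_index, read_sequence) pairs."""
    copies = Counter(origin for origin, _ in reads)
    lost = [i for i in range(len(designed)) if copies[i] == 0]
    error_counts = Counter(hamming(designed[o], r) for o, r in reads)
    total = len(reads)
    return {
        # indices of designed sequences that produced no read at all
        "lost": lost,
        # how many designed sequences have a given copy number
        "copy_number_hist": dict(Counter(copies[i] for i in range(len(designed)))),
        # fraction of reads containing k errors
        "D": {k: c / total for k, c in sorted(error_counts.items())} if total else {},
    }
```

The resulting error distribution and loss count are the kind of inputs on which the redundancy optimization described above operates.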
Optionally, in an embodiment of the present application, the simulation module 200 is further configured to, after obtaining the simulation sequencing result, extract channel error characteristics based on the simulation sequencing result, and adjust the coding optimization data by using the channel error characteristics to obtain the optimal coding optimization data.
Optionally, in an embodiment of the present application, the optimization module 300 is further configured to determine, based on the principle of matching redundancy to errors, an optimal balance point between the information storage density and the successful recovery probability by calculating the relationship among the redundancy, the information storage density and the successful recovery probability.
It should be noted that the above explanation of the embodiment of the DNA channel simulation and coding optimization method is also applicable to the DNA channel simulation and coding optimization apparatus of this embodiment, and is not repeated here.
The DNA channel simulation and coding optimization apparatus provided by the embodiment of the present application allows a user to quickly create a simulation model of a specific DNA channel, analyze channel error characteristics at very low experimental cost, and verify the feasibility of a specific coding scheme, which can save a large amount of time and money. It also provides a systematic redundancy adjustment scheme that yields the optimal redundancy design for a specific coding system, reduces the likelihood of problems such as decoding failure and wasted storage space, and effectively meets the coding design requirements of DNA information storage.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
a memory 801, a processor 802, and a computer program stored on the memory 801 and executable on the processor 802.
The processor 802, when executing the program, implements the DNA channel simulation and coding optimization method provided in the embodiments described above.
Further, the electronic device further includes:
a communication interface 803 for communicating between the memory 801 and the processor 802.
The memory 801 is configured to store computer programs executable on the processor 802.
The memory 801 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
If the memory 801, the processor 802 and the communication interface 803 are implemented independently, the communication interface 803, the memory 801 and the processor 802 may be connected to each other via a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
Alternatively, in practical implementation, if the memory 801, the processor 802 and the communication interface 803 are integrated into one chip, the memory 801, the processor 802 and the communication interface 803 may communicate with each other through an internal interface.
The processor 802 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.
The present embodiment also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the DNA channel simulation and coding optimization method as above.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided that they do not contradict each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process, and the scope of the preferred embodiments of the present application includes alternative implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, a plurality of steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application specific integrated circuit having appropriate combinational logic gate circuits, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
Claims (10)
1. A DNA channel simulation and coding optimization method is characterized by comprising the following steps:
establishing a channel simulation model for a storage condition based on given storage links and parameters;
inputting an encoded deoxyribonucleic acid (DNA) sequence into the channel simulation model to obtain a simulation sequencing result, and determining the decoding condition according to the simulation sequencing result; and
obtaining coding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition, and optimizing the channel coding design and the channel parameter design by using the coding optimization data.
2. The method of claim 1, wherein inputting the encoded DNA sequence into the channel simulation model to obtain the simulation sequencing result comprises:
inputting the encoded DNA sequence into the channel simulation model to obtain a sequencing sequence and the sequence existing states of each intermediate stage;
and acquiring the simulation sequencing result according to the sequencing sequence and the sequence existing states of each intermediate stage.
3. The method of claim 2, further comprising, after obtaining the simulated sequencing result:
extracting channel error characteristics based on the simulation sequencing result, and adjusting the coding optimization data by utilizing the channel error characteristics to obtain the optimal coding optimization data.
4. The method of claim 1, wherein obtaining the coding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition comprises:
obtaining, based on the simulation sequencing result, one or more of the distribution of intra-chain errors, the copy number distribution of each chain, the number of lost sequences and of sequences containing errors, and the proportion of data recovered during decoding;
and obtaining the coding optimization data from one or more of the distribution of intra-chain errors, the copy number distribution of each chain, the number of lost sequences and of sequences containing errors, and the proportion of data recovered during decoding.
5. The method of claim 1 or 4, wherein obtaining the coding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition further comprises:
based on the principle of redundancy and error matching, the optimal balance point of the information storage density and the successful recovery probability is determined by calculating the relationship among the redundancy, the information storage density and the successful recovery probability.
6. A DNA channel simulation and coding optimization apparatus, comprising:
an establishing module, configured to establish a channel simulation model for a storage condition based on given storage links and parameters;
a simulation module, configured to input the encoded DNA sequence into the channel simulation model to obtain a simulation sequencing result, and to determine the decoding condition according to the simulation sequencing result; and
an optimization module, configured to obtain coding optimization data through statistical analysis of the simulation sequencing result and the corresponding decoding condition, and to optimize the channel coding design and the channel parameter design by using the coding optimization data.
7. The apparatus of claim 6, wherein the simulation module comprises:
a generating unit, configured to input the encoded DNA sequence into the channel simulation model to obtain a sequencing sequence and the sequence existing states of each intermediate stage;
and a first acquisition unit, configured to acquire the simulation sequencing result according to the sequencing sequence and the sequence existing states of each intermediate stage.
8. The apparatus of claim 6, wherein the optimization module comprises:
a second acquisition unit, configured to acquire, based on the simulation sequencing result, one or more of the distribution of intra-chain errors, the copy number distribution of each chain, the number of lost sequences and of sequences containing errors, and the proportion of data recovered during decoding;
and a third acquisition unit, configured to obtain the coding optimization data from one or more of the distribution of intra-chain errors, the copy number distribution of each chain, the number of lost sequences and of sequences containing errors, and the proportion of data recovered during decoding.
9. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the DNA channel simulation and coding optimization method according to any one of claims 1 to 5.
10. A computer-readable storage medium, on which a computer program is stored, the program being executable by a processor for implementing the DNA channel simulation and coding optimization method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111307148.6A CN114023374A (en) | 2021-11-05 | 2021-11-05 | DNA channel simulation and coding optimization method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114023374A (en) | 2022-02-08 |
Family
ID=80061643
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111307148.6A Pending CN114023374A (en) | 2021-11-05 | 2021-11-05 | DNA channel simulation and coding optimization method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114023374A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115933972A (en) * | 2022-11-24 | 2023-04-07 | 中国华能集团清洁能源技术研究院有限公司 | Distributed data storage method and system for multi-professional simulation platform |
CN115933972B (en) * | 2022-11-24 | 2024-05-31 | 中国华能集团清洁能源技术研究院有限公司 | Distributed data storage method and system of multi-specialty simulation platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||