CN116597901A

CN116597901A - DNA information encoding and decoding method based on modulation effect

Info

Publication number: CN116597901A
Application number: CN202310576945.7A
Authority: CN
Inventors: 弥胜利; 曹让利; 黄玉; 任钱伦; 梁玮峰
Original assignee: Shenzhen International Graduate School of Tsinghua University
Current assignee: Shenzhen International Graduate School of Tsinghua University
Priority date: 2023-05-22
Filing date: 2023-05-22
Publication date: 2023-08-15

Abstract

The invention provides a modulation-based DNA information encoding and decoding method. The encoding method comprises the steps of information blocking, information verification, sequence conversion, modulation optimization, mark sequence generation, addressing sequence generation, and output. During encoding, the base sequence is optimized with specially constructed modulation primers so that it meets the constraint conditions, and a mark sequence is constructed so that the correct original information can be recovered during decoding. The modulation primer, the index sequence, and the primer sequence can be combined, so that the addressing function is retained while the combined sequence also serves as a PCR primer, which greatly reduces information redundancy. The codec scheme can encode and decode any digital information stored by a computer, with an encoding density close to the theoretical limit. In addition, specific addressing sequences can be generated in numbers that match the scale of the base sequences, meeting the needs of modulation, addressing, PCR amplification, information retrieval, and other functions.

Description

DNA information encoding and decoding method based on modulation effect

Technical Field

The invention relates to the technical field of computer information encoding and decoding, in particular to a DNA information encoding and decoding method based on modulation.

Background

With the development of the internet, artificial intelligence, cloud computing, and related technologies, the world has entered an information age accompanied by explosive growth of data: everyone generates and processes data all the time. According to relevant statistics, the accumulated data volume has exceeded 20 ZB and is still growing at a rate of 50% per year, which will lead to a huge gap in information storage capacity in the near future. To address this problem, attention has turned to molecular storage, and biomolecules with specific structures and sequences, such as DNA and polypeptides, have become research hot spots. Compared with optical, electrical, and magnetic storage media such as flash memory, molecular storage media have a smaller storage volume and a higher storage density. Among the various molecular storage media, DNA is expected to become the next generation of storage media because of its high stability, its suitability for long-term storage under simple conditions, and its ease of replication.

DNA information storage stores information in the order of the four deoxyribonucleotides "A", "T", "C", and "G". In the general process, computer information is first encoded into DNA sequence information, which is saved in a text file; to store the information, the sequence is then read from the file and DNA is synthesized accordingly. Whereas traditional storage holds information in only two states, 0 and 1, DNA storage is quaternary and can hold more information per symbol, and because DNA has nanoscale dimensions, the volumetric density of DNA storage is several orders of magnitude higher than that of traditional optical, electrical, and magnetic media such as flash memory. Pure quaternary coding gives the theoretical storage density of DNA storage. In practical encoding, however, certain constraints must also be met because of the limitations of DNA synthesis and sequencing technology. There are two main constraints: 1. the G and C content of the strand should be 40%-60%; 2. the converted DNA strand should be as free of single-base repeats as possible, i.e., the same deoxyribonucleotide may not appear 4 or more times in succession. These constraints ensure that the encoded DNA strand can be synthesized and sequenced correctly, but they also limit the encoding density of DNA information storage. Therefore, to improve the practical value of DNA information storage, it is necessary to develop a high-density encoding scheme that satisfies the constraints.
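
The two constraints just described can be checked mechanically. The following is a minimal sketch, not part of the original disclosure; the function names and the default thresholds (40%-60% GC, runs of at most 3 identical bases) are illustrative choices taken from the text above, and the sequence is assumed to be non-empty.

```python
from itertools import groupby


def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (seq must be non-empty)."""
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq: str) -> int:
    """Length of the longest stretch of one repeated base."""
    return max(len(list(run)) for _, run in groupby(seq))


def satisfies_constraints(seq: str, gc_min: float = 0.40,
                          gc_max: float = 0.60, max_run: int = 3) -> bool:
    """True if GC content is within [gc_min, gc_max] and no base repeats
    more than max_run times in succession."""
    return gc_min <= gc_content(seq) <= gc_max and longest_run(seq) <= max_run


print(satisfies_constraints("ACGTACGTAC"))  # balanced GC, no long runs
print(satisfies_constraints("AAAACGCG"))    # contains a run of four A
```
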
In addition, as the amount of data held in DNA storage increases, information retrieval becomes one of the factors limiting further growth. The current mainstream method for large-scale data retrieval is based on PCR amplification, and reported data show that it can retrieve a required target file from 200 MB of stored information. However, PCR amplification requires a library of specific PCR primers, and the primers must be specific both with respect to one another and with respect to the stored information.

In general, it is necessary to create a high-density coding scheme and a high-capacity PCR primer library under the constraint condition for practical application of DNA information storage.

It should be noted that the information disclosed in the above background section is only for understanding the background of the application and thus may include information that does not form the prior art that is already known to those of ordinary skill in the art.

Disclosure of Invention

The application aims to effectively solve the problems faced by the existing DNA information storage and provides a DNA information encoding and decoding method based on modulation.

In order to achieve the above purpose, the present application adopts the following technical scheme:

A DNA information encoding method based on modulation, comprising the following steps:

S1, information blocking: dividing the file information to be encoded into a plurality of information blocks, and assigning an index sequence number to each information block;

S2, information verification: adding verification information to each information block;

S3, sequence conversion: converting the byte stream data in each information block into binary information and then into a base sequence;

S4, modulation optimization: constructing modulation primers, judging whether the base sequence meets the constraint conditions, and using the modulation primers to optimize any base sequence that does not meet them;

S5, mark sequence generation: generating a unique mark sequence for each information block according to the modulation process information;

S6, addressing sequence generation: generating the addressing sequence of each information block according to the modulation primers and the index sequence number of the information block;

S7, output: connecting each information block with its mark sequence and addressing sequences, assembling them into a complete DNA strand, and outputting the strand to a text file for storage.

Further:

Step S3 includes converting the information in the information block into base sequence form by a quasi-quaternary coding rule, wherein "00", "01", "10", and "11" correspond to the bases "A", "T", "C", and "G", respectively.

The step S4 includes: constructing a GC modulation primer and a homopolymer modulation primer for each information block, wherein the modulation primer is a base sequence with 50% GC content and no homopolymer and has a length of 4, and the GC content and the homopolymer of the information sequence meet constraint conditions through modulation of the modulation primer;

The following modulation strategy is employed: if the base sequence meets the constraint conditions, it is not modulated; if either the GC content or the homopolymers of the base sequence are a problem, the corresponding modulation primer is used for modulation; if both the GC content and the homopolymers are a problem, both modulation primers are used for modulation;

preferably, a sequence whose GC content does not meet the requirement is modulated directly as a whole; for a sequence containing an excessively long homopolymer, the position containing the homopolymer is located and only the homopolymer at that position is modulated locally;

preferably, from the sequences that meet the constraint conditions after modulation optimization, the modulation result with the shortest mark sequence is selected as the constraint sequence;

preferably, the modulation process of modulation optimization is an exclusive-or operation of 2 bits, the bases of the base sequence are sequentially grouped, each group of bases corresponds to the modulation primer, the exclusive-or operation is carried out on the bases one by one, and the operation result is the modulation result.

Step S5 includes: generating a unique mark sequence for each information block to record the modulation process information, which comprises three parts: whether overall modulation was performed using the GC modulation primer; the specific positions at which homopolymer modulation was performed; and the number of homopolymer modulations;

preferably, 1 bit of binary information indicates whether the GC modulation primer was used for overall modulation, each position at which homopolymer modulation was performed is represented by an 8-bit binary number, and the number of homopolymer modulations is stored as 7 bits of binary information, which are combined with the preceding 1-bit GC modulation information to form 8 bits of binary information;

preferably, the mark sequence is converted to a base sequence using the quasi-quaternary coding rule, and the base sequence is modulated using the GC modulation primer and/or the homopolymer modulation primer.

The step S6 comprises the following steps: the index sequence number and the modulation primer are combined into an addressing sequence, and the addressing sequence of each information block is generated by adopting a structural addressing sequence design method;

the structural addressing sequence design method combines a marker, a balance sequence, the modulation primer and the index sequence number into an addressing sequence, wherein the marker is a homopolymer for ensuring the specificity of the addressing sequence, and the balance sequence is a randomly generated multi-bit base for ensuring the balance of GC content; an addressing sequence is respectively arranged at the front end and the rear end of each DNA chain and is respectively called a front addressing sequence and a rear addressing sequence, wherein GC modulation primers are stored on the front addressing sequence, and homopolymer modulation primers are stored on the rear addressing sequence.

Step S7 includes: combining the modulation-optimized constraint sequence with the front and rear addressing sequences (which serve as the front and rear primers) and the mark sequence to form the complete structure of a DNA strand for storing information, and outputting it to a text file for storage.

A DNA information decoding method based on modulation for decoding a DNA strand encoded using said DNA information encoding method, comprising the steps of:

T1, addressing sequence recovery: reading the addressing sequences of the DNA strand and recovering the modulation primers and the index sequence number;

T2, modulation recovery: recovering the mark sequence, and decoding the base sequence into an unmodulated sequence based on the mark sequence information and the modulation primers;

T3, sequence conversion: converting the unmodulated sequence into a binary sequence and then recovering the binary sequence into byte stream data;

T4, information verification: checking and correcting the stored information according to the check algorithm and the redundant information;

T5, information reorganization: restoring the correct stored information to the original information according to the index sequence numbers.

Further:

Step T1 includes: locating the addressing sequences in the read DNA sequence, and disassembling them according to the addressing sequence structure to obtain the modulation primers and the index sequence number;

Step T2 includes: first locating the mark sequence in the DNA sequence and decoding the modulation process information, and then reverse-modulating the DNA sequence with the modulation primers parsed from the addressing sequences to obtain the unmodulated stored information sequence;

Step T3 includes: first converting the unmodulated stored information sequence from a base sequence into a binary sequence, and then converting the binary sequence into byte stream data;

Step T4 includes: the demodulated and converted byte stream data comprises the stored information and the verification information; the information is verified and errors are corrected using the check algorithm;

Step T5 includes: sorting the decoded and verified information by index sequence number using a sorting algorithm, restoring it to the original stored information, and finally outputting it as the original stored file.

A computer readable storage medium storing a computer program which, when executed by a processor, implements the DNA information encoding method and/or the DNA information decoding method.

A control apparatus comprising a processor and a storage medium for storing a computer program; wherein the processor is used for implementing the DNA information encoding method and/or the DNA information decoding method when executing the computer program.

The invention has the following beneficial effects:

The invention provides a modulation-based DNA information encoding and decoding method. During encoding, the base sequence is optimized with specially constructed modulation primers so that it meets the constraint conditions, and a mark sequence is constructed so that the correct original information can be recovered during decoding. Furthermore, the structural addressing sequence design method combines the modulation primer, the index sequence, and the primer sequence, so that the combined sequence retains the addressing function and can also be used as a PCR primer, which greatly reduces information redundancy. Compared with the prior art, the encoding and decoding scheme of the invention can encode and decode any digital information stored by a computer, and its encoding density is close to the theoretical limit, reaching 1.9 bits/nt. Meanwhile, the structural addressing sequence design method can generate specific addressing sequences in numbers that match the scale of the base sequences, meeting the needs of modulation, addressing, PCR amplification, information retrieval, and other functions.

The modulation-based DNA information encoding and decoding method and the structural addressing sequence design method are used to encode and decode computer digital information, and can convert it at high encoding density into DNA base sequences that meet the requirements of synthesis and sequencing.

In a preferred embodiment, the digital information can be encoded at a very high encoding density without adding excessive redundancy by changing the primers and screening the DNA strand quality. Meanwhile, in order to realize the subsequent information retrieval operation based on PCR, a structural addressing sequence design method is used for generating a large number of specific addressing sequences, so that the functions of addressing and PCR primers are realized while the information redundancy on a DNA chain is reduced, and the PCR primer generation process is simplified.

Drawings

Fig. 1 shows a coding flow diagram of an embodiment of the invention.

FIG. 2 illustrates a quaternary-like encoding rule diagram of an embodiment of the present invention.

Fig. 3 shows a modulation result table of an embodiment of the present invention.

Fig. 4 shows a schematic diagram of a modulation process according to an embodiment of the present invention.

FIG. 5 shows a table of modulated primers for an embodiment of the present invention.

Fig. 6 shows a flag sequence generation schematic of an embodiment of the present invention.

Fig. 7 shows a schematic diagram of the structure of the pre-addressing sequence according to an embodiment of the present invention.

Fig. 8 shows a schematic diagram of a post-addressing sequence structure according to an embodiment of the present invention.

Fig. 9 shows an addressing sequence table of an embodiment of the invention.

FIG. 10 is a diagram showing the structure of a DNA strand according to an embodiment of the present invention.

FIG. 11 shows a DNA strand base sequence table of an embodiment of the present invention, in which a TXT text file containing the first chapter of the "Tao Te Ching" is converted into DNA strand base sequences.

Fig. 12 shows a decoding flow chart of an embodiment of the present invention.

Detailed Description

The following describes embodiments of the present invention in detail. It should be emphasized that the following description is merely exemplary in nature and is in no way intended to limit the scope of the invention or its applications.

It will be understood that when an element is referred to as being "mounted" or "disposed" on another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for both a fixing action and a coupling or communication action.

It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are merely for convenience in describing embodiments of the invention and to simplify the description by referring to the figures, rather than to indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus are not to be construed as limiting the invention.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present invention, the meaning of "plurality" is two or more, unless explicitly defined otherwise.

Referring to fig. 1, an embodiment of the present invention provides a DNA information encoding method based on modulation, comprising steps S1 to S7 described above.

Referring to fig. 12, an embodiment of the invention further provides a DNA information decoding method based on modulation, for decoding a DNA strand encoded by the DNA information encoding method, comprising steps T1 to T5 described above.

In some embodiments, a method for encoding DNA information based on modulation comprises the steps of:

Step 1: information blocking, reading the file and dividing the file data into blocks in index order;

Step 2: information verification, adding a certain amount of logical redundancy information by an RS (Reed-Solomon) check algorithm;

Step 3: sequence conversion, converting the byte stream data into a binary sequence (a string of 0s and 1s such as 01001100) and then into a base sequence composed of the bases T, C, A, and G;

Step 4: modulation optimization, constructing modulation primers, judging whether the base sequence meets the constraint conditions, and using the modulation primers to optimize any base sequence that does not meet them;

Step 5: mark sequence generation, recording the modulation process information of the modulation optimization and generating the mark sequence;

Step 6: addressing sequence generation, designing the addressing sequence of each information block by the structural addressing sequence design method according to the modulation primers and the index sequence number of the information block;

Step 7: output, connecting the base sequence with the mark sequence and the front and rear PCR primers, assembling the complete DNA strand, and outputting it to a storage file.

In the information blocking of step 1, the computer file to be stored is read in binary form and converted into byte stream data.

Because of the limitations of single-strand DNA synthesis methods, the length of a single DNA strand is generally about 200-300 nt. The read data is therefore divided into information blocks of a size that can be stored in a single DNA strand.

While the information is being divided, an index sequence number is assigned to each information block in the order of division. Depending on design requirements, the index sequence number can take the form of 8-bit binary information, 16-bit binary information, and so on; an n-bit index sequence number can address at most 2ⁿ information blocks. For more information blocks, the index sequence number can be lengthened further.
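
The blocking step can be sketched as follows. This is an illustrative sketch only: the block size, the zero-based numbering, and the function name are assumptions, since the description only requires that each block fit on one strand and that the index width n bound the number of blocks by 2ⁿ.

```python
def split_into_blocks(data: bytes, block_size: int, index_bits: int = 16):
    """Split data into (index, chunk) pairs in file order.

    The last chunk may be shorter than block_size. Raises ValueError when
    the number of blocks exceeds what an index of index_bits bits can address.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    chunks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if len(chunks) > 2 ** index_bits:
        raise ValueError("index width too small for this many blocks")
    return list(enumerate(chunks))


# Hypothetical example: 10 bytes in blocks of 4 bytes with an 8-bit index.
for index, chunk in split_into_blocks(b"abcdefghij", 4, index_bits=8):
    print(index, chunk)
```
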

In the information verification of step 2, an RS (Reed-Solomon) check algorithm is used to add a certain amount of redundant information to each information block; the number of bytes whose errors can be corrected is half the size (in bytes) of the added redundant information. The RS check algorithm mainly corrects substitution errors arising during storage; insertion and deletion errors can be converted into substitution errors for correction. Other check methods, such as an LDPC (low-density parity-check) algorithm or a cyclic redundancy check algorithm, can also be used for information verification; they differ only in the size of the added redundant information and in error-correction capability, and any of them is suitable as long as the information verification function is realized.

The redundant information added by the RS (Reed-Solomon) check algorithm is still byte stream data, and each divided information block is combined with its added verification information to form the information block to be encoded.
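
The relationship between added redundancy and correctable errors can be made concrete with a small calculation. The sketch below does not implement Reed-Solomon coding itself; it only budgets a block, assuming byte-sized symbols over GF(2^8), for which a codeword holds at most 255 bytes. The block sizes in the example are hypothetical.

```python
def rs_budget(data_bytes: int, parity_bytes: int) -> dict:
    """Summarize a byte-symbol Reed-Solomon block layout.

    With r parity bytes, up to r // 2 substituted bytes can be corrected.
    """
    codeword = data_bytes + parity_bytes
    if codeword > 255:
        raise ValueError("a byte-symbol RS codeword holds at most 255 bytes")
    return {
        "codeword_bytes": codeword,
        "correctable_bytes": parity_bytes // 2,
        "redundancy_fraction": parity_bytes / codeword,
    }


# Hypothetical block: 40 data bytes protected by 8 parity bytes.
print(rs_budget(40, 8))
```
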

In the sequence conversion of step 3, the information blocks that have been divided and given verification information still exist as byte stream data. Each byte is converted to binary and uniformly padded to an 8-bit binary form, yielding a binary sequence.

The binary information block is then converted into a base sequence by the quasi-quaternary coding rule, under which "00", "01", "10", and "11" correspond to "A", "T", "C", and "G", respectively.
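
The conversion from bytes to bits to bases can be sketched as follows. The mapping is the one stated above; the function names are illustrative.

```python
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}


def bytes_to_bits(data: bytes) -> str:
    """Write every byte as exactly 8 binary digits."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bases(bits: str) -> str:
    """Map each pair of bits to one base using the quasi-quaternary rule."""
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def bytes_to_bases(data: bytes) -> str:
    return bits_to_bases(bytes_to_bits(data))


# The byte 0x4C is 01001100 in binary, i.e. the pairs 01 00 11 00.
print(bytes_to_bits(b"\x4c"), bytes_to_bases(b"\x4c"))
```

Each byte becomes four bases, so before constraints and overhead are considered the conversion carries 2 bits per nucleotide.
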

In the modulation optimization of step 4, the modulation primer is a base sequence of length 4 with 50% GC content and no homopolymer; it is obtained by random combination followed by screening. Modulation primers of other lengths may be used depending on the design of the modulation process.

Two modulation primers need to be generated for each information block; according to their functions they are called the GC modulation primer and the homopolymer modulation primer.
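
One way to obtain candidate modulation primers is to enumerate all length-4 sequences and screen them. The sketch below rests on two assumptions that the description does not state: "no homopolymer" is read as "no two adjacent bases are identical", and the two primers of a block are required to be distinct. The description obtains primers by random combination and screening; exhaustive enumeration followed by a random draw is used here only because the search space is small.

```python
import random
from itertools import product


def candidate_primers(length: int = 4) -> list[str]:
    """All sequences of the given length with exactly 50% GC content
    and no two adjacent identical bases."""
    result = []
    for combo in product("ATCG", repeat=length):
        seq = "".join(combo)
        balanced = 2 * sum(base in "GC" for base in seq) == length
        no_repeat = all(a != b for a, b in zip(seq, seq[1:]))
        if balanced and no_repeat:
            result.append(seq)
    return result


def pick_primer_pair(rng: random.Random) -> tuple[str, str]:
    """Draw a (GC modulation primer, homopolymer modulation primer) pair."""
    gc_primer, homopolymer_primer = rng.sample(candidate_primers(), 2)
    return gc_primer, homopolymer_primer


print(len(candidate_primers()))
print(pick_primer_pair(random.Random(0)))
```
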

Modulation optimization uses different modulation strategies for different base sequences, in three cases: 1. if the base sequence meets the constraint conditions, no modulation is needed; 2. if either the GC content or the homopolymers of the base sequence are a problem, the corresponding modulation primer is used for modulation; 3. if both the GC content and the homopolymers are a problem, both modulation primers are used.

The modulation process is a 2-bit exclusive-or (XOR) operation: the base sequence is divided into groups of 4 bases, each group is aligned with the modulation primer, the bases are XORed one by one, and the result of the operation is the modulation result. For example, modulating the base sequence "AAAACTAG……" with the homopolymer modulation primer "ACTG" gives "ACTGCGTA……". Other groupings, such as groups of 8 bases, can also be chosen, in which case only the length of the modulation primer needs to change accordingly; the whole modulation process still obtains the modulation result by the 2-bit XOR operation.
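
The 2-bit XOR modulation can be sketched as follows, using the base codes of the quasi-quaternary rule (A=00, T=01, C=10, G=11). The function name is illustrative; the example values are the ones given in the text.

```python
BASE_TO_CODE = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}


def modulate(seq: str, primer: str) -> str:
    """XOR every base of seq with the primer base aligned to it.

    The primer is repeated group by group along the sequence.
    """
    out = []
    for i, base in enumerate(seq):
        key = primer[i % len(primer)]
        out.append(CODE_TO_BASE[BASE_TO_CODE[base] ^ BASE_TO_CODE[key]])
    return "".join(out)


modulated = modulate("AAAACTAG", "ACTG")
print(modulated)                    # ACTGCGTA
print(modulate(modulated, "ACTG"))  # AAAACTAG
```

Because XOR is its own inverse, applying the same primer a second time restores the original sequence, which is what makes decoding possible. Under this mapping the high bit distinguishes G/C from A/T, so XOR with a G or C primer base switches a base between the two classes; a primer with 50% GC content therefore switches half of the positions in every group.
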

For a sequence whose GC content does not meet the requirement, the above modulation process is applied directly to the whole sequence; this is overall modulation.

For a sequence containing an excessively long homopolymer, the position containing the homopolymer is located and only the homopolymer at that position is modulated; this is local modulation.
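
Local modulation can be sketched as locating the groups touched by a long run and XORing only those groups. The sketch assumes that a run spanning a group boundary causes every touched group to be modulated, and it numbers groups from 0; neither detail is fixed by the description.

```python
from itertools import groupby

CODE = {"A": 0, "T": 1, "C": 2, "G": 3}
BASE = "ATCG"  # BASE[CODE[b]] == b for every base b


def homopolymer_groups(seq: str, min_run: int = 4, group: int = 4) -> list[int]:
    """0-based indices of the groups touched by a run of at least
    min_run identical bases."""
    hit = set()
    pos = 0
    for _, run in groupby(seq):
        n = len(list(run))
        if n >= min_run:
            hit.update(range(pos // group, (pos + n - 1) // group + 1))
        pos += n
    return sorted(hit)


def modulate_groups(seq: str, primer: str, groups: list[int]) -> str:
    """XOR only the listed groups with the primer; leave the rest unchanged."""
    bases = list(seq)
    size = len(primer)
    for g in groups:
        for offset in range(size):
            i = g * size + offset
            if i < len(bases):
                bases[i] = BASE[CODE[bases[i]] ^ CODE[primer[offset]]]
    return "".join(bases)


seq = "ACGTAAAACGTC"
groups = homopolymer_groups(seq)
print(groups, modulate_groups(seq, "ACTG", groups))
```
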

After modulation optimization, the quality of the optimized base sequence is evaluated to judge whether it meets the constraint conditions. If it does not, the modulation primers are regenerated and modulation optimization is performed again until the constraint conditions are met. Among the results that meet the constraint conditions, the modulation result with the shortest mark sequence is selected and saved as the constraint sequence, and the two modulation primers are saved as well.
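
The evaluate-and-retry loop can be sketched as a search over primer pairs. This is one possible reading, not the disclosed implementation: candidate primers are enumerated rather than regenerated at random, "shortest mark sequence" is taken to mean "fewest local modulations", groups are numbered from 0, and the sequence is assumed to be non-empty.

```python
from itertools import groupby, permutations, product

CODE = {"A": 0, "T": 1, "C": 2, "G": 3}
BASE = "ATCG"
# Length-4 primers with 50% GC and no adjacent repeats (assumed screening rule).
PRIMERS = ["".join(p) for p in product("ATCG", repeat=4)
           if sum(b in "GC" for b in p) == 2
           and all(a != b for a, b in zip(p, p[1:]))]


def xor_group(chunk, primer):
    return "".join(BASE[CODE[b] ^ CODE[p]] for b, p in zip(chunk, primer))


def gc_ok(seq):
    return 0.40 <= sum(b in "GC" for b in seq) / len(seq) <= 0.60


def run_groups(seq, min_run=4, size=4):
    """0-based groups touched by a run of min_run or more identical bases."""
    hit, pos = set(), 0
    for _, run in groupby(seq):
        n = len(list(run))
        if n >= min_run:
            hit.update(range(pos // size, (pos + n - 1) // size + 1))
        pos += n
    return sorted(hit)


def try_pair(seq, gc_primer, hp_primer):
    """Apply the modulation strategy once; return (result, gc_used, groups),
    or None if the result still violates a constraint."""
    gc_used = not gc_ok(seq)
    chunks = [seq[i:i + 4] for i in range(0, len(seq), 4)]
    if gc_used:  # overall modulation
        chunks = [xor_group(c, gc_primer) for c in chunks]
    groups = run_groups("".join(chunks))
    for g in groups:  # local modulation
        chunks[g] = xor_group(chunks[g], hp_primer)
    result = "".join(chunks)
    if gc_ok(result) and not run_groups(result):
        return result, gc_used, groups
    return None


def optimize(seq, primers=PRIMERS):
    """Return ((result, gc_used, groups), gc_primer, hp_primer) with the
    fewest local modulations, or None if no primer pair works."""
    best = None
    for gc_primer, hp_primer in permutations(primers, 2):
        outcome = try_pair(seq, gc_primer, hp_primer)
        if outcome and (best is None or len(outcome[2]) < len(best[0][2])):
            best = (outcome, gc_primer, hp_primer)
    return best


print(optimize("ACGTAAAACGTC"))
```
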

In the mark sequence generation of step 5, a mark sequence is generated to record the modulation process information of the modulation optimization performed in step 4.

The modulation process information recorded by the mark sequence mainly comprises three parts: 1. whether the GC modulation primer was used for overall modulation; 2. the specific positions at which homopolymer modulation was performed; 3. the number of homopolymer modulations.

One bit of binary information indicates whether the GC modulation primer was used for overall modulation: "0" means not used and "1" means used. Each homopolymer modulation position is represented directly by the 8-bit binary number of the position; for example, if homopolymer modulation occurred on the 4th group of 4 bases, the position information is "00000100". The number of homopolymer modulations is stored as 7-bit binary information and combined with the preceding 1-bit GC modulation information to form 8 bits of binary information. The structure of each part of the mark sequence can be adjusted according to actual use, for example by changing the length of the homopolymer modulation position information or the length of the homopolymer modulation count.
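
The binary layout of the mark (flag) sequence can be sketched as a header byte followed by one byte per modulated position. Placing the header before the positions is an assumption, since the exact arrangement is defined by the figure rather than the text; the function names are illustrative.

```python
def build_mark_bits(gc_used: bool, positions: list[int]) -> str:
    """1 bit for GC modulation, 7 bits for the number of homopolymer
    modulations, then 8 bits for each modulated position."""
    if len(positions) > 127:
        raise ValueError("7 bits can count at most 127 modulations")
    if any(not 0 <= p <= 255 for p in positions):
        raise ValueError("each position must fit in 8 bits")
    header = ("1" if gc_used else "0") + f"{len(positions):07b}"
    return header + "".join(f"{p:08b}" for p in positions)


def parse_mark_bits(bits: str) -> tuple[bool, list[int]]:
    """Inverse of build_mark_bits."""
    gc_used = bits[0] == "1"
    count = int(bits[1:8], 2)
    positions = [int(bits[8 + 8 * i:16 + 8 * i], 2) for i in range(count)]
    return gc_used, positions


# GC modulation used, one homopolymer modulation at position 4.
bits = build_mark_bits(True, [4])
print(bits)                   # 1000000100000100
print(parse_mark_bits(bits))  # (True, [4])
```

With this layout a block that needs no modulation costs a single byte (four bases) of mark sequence, and each local modulation adds one more byte, which is why the result with the fewest local modulations gives the shortest mark sequence.
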

The mark sequence, which at this point is entirely binary information, is converted into a base sequence using the quasi-quaternary coding rule. The resulting base sequence is then modulated with the GC modulation primer and then with the homopolymer modulation primer to obtain the final mark sequence.

In the addressing sequence generation of step 6: after the previous five steps, the index sequence number of the information block and the two modulation primers are, besides the constraint sequence and the mark sequence, also important information that must be stored. The index sequence number and the modulation primers are therefore incorporated into the addressing sequence, which is generated by the structural addressing sequence design method.

The structural addressing sequence design method divides the addressing sequence into four parts: a marker, a balance sequence, a modulation primer, and an index sequence number. The marker is a homopolymer of length 4 ("AAAA", "TTTT", "CCCC", or "GGGG") that ensures the specificity of the addressing sequence. The marker can also be adjusted according to actual use, for example by using a homopolymer of length 5. The balance sequence consists of several randomly generated bases and keeps the GC content of the whole PCR primer balanced; its length is determined by the overall length of the PCR primer. The modulation primer and the index sequence number are the information saved in the previous steps. Combining the four parts gives the addressing sequence.

One addressing sequence is placed at each end of every DNA strand; they are called the front addressing sequence and the rear addressing sequence. Both are generated in the same way, by the structural addressing sequence design method, but they differ in content: the GC modulation primer is stored in the front addressing sequence, and the homopolymer modulation primer is stored in the rear addressing sequence. The structure of the two can be adjusted according to the actual situation.
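
A structural addressing sequence can be sketched by concatenating its four parts. The part order (marker, balance sequence, modulation primer, index), the 8-base balance length, and the 8-bit index are all assumptions made for illustration; the actual layout is defined by the figures. The balance sequence here is only chosen to bring the overall GC content close to 50% and is not screened for runs of identical bases, which a practical design would also need to do.

```python
import random

BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}


def index_to_bases(index: int, index_bits: int = 8) -> str:
    """Encode the index sequence number with the quasi-quaternary rule."""
    if not 0 <= index < 2 ** index_bits:
        raise ValueError("index does not fit in the index field")
    bits = f"{index:0{index_bits}b}"
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, index_bits, 2))


def balance_sequence(fixed_parts: str, length: int, rng: random.Random) -> str:
    """Random bases chosen so that the whole addressing sequence comes as
    close as possible to 50% GC content."""
    total = len(fixed_parts) + length
    fixed_gc = sum(base in "GC" for base in fixed_parts)
    need_gc = min(length, max(0, total // 2 - fixed_gc))
    bases = [rng.choice("GC") for _ in range(need_gc)]
    bases += [rng.choice("AT") for _ in range(length - need_gc)]
    rng.shuffle(bases)
    return "".join(bases)


def build_addressing_sequence(marker: str, primer: str, index: int,
                              rng: random.Random, balance_len: int = 8,
                              index_bits: int = 8) -> str:
    index_bases = index_to_bases(index, index_bits)
    balance = balance_sequence(marker + primer + index_bases, balance_len, rng)
    return marker + balance + primer + index_bases


rng = random.Random(1)
front = build_addressing_sequence("AAAA", "ACTG", 5, rng)  # GC modulation primer
rear = build_addressing_sequence("TTTT", "TGAC", 5, rng)   # homopolymer primer
print(front, rear)
```
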

In the output of step 7, the modulated constraint sequence is combined with the mark sequence and the addressing sequences to form a complete DNA strand capable of storing information. All encoded DNA strand sequences are written to a storage file.

In the modulation-based DNA encoding method, the information to be stored is divided into blocks and given verification information, and is then converted into a base sequence according to the quasi-quaternary coding rule. Modulation primers meeting the requirements are constructed, and the converted base sequence is modulated and optimized with them until it meets the constraint conditions; the modulation primers are saved and a mark sequence is generated. With this encoding method, the original information can be converted at a very high encoding density into base sequences that meet the constraint conditions, and the corresponding addressing sequences are generated at the same time, meeting the needs of subsequent PCR amplification and other operations.

In other embodiments, a method of decoding DNA information based on modulation comprises the following steps:

Step 1: addressing sequence recovery, reading the base sequence of the DNA strand and disassembling the addressing sequences to obtain the modulation primers and the index sequence number;

Step 2: modulation recovery, first reading the mark sequence to obtain the modulation process information, and then decoding the base sequence into an unmodulated sequence;

Step 3: sequence conversion, converting the decoded information block from a base sequence into a binary sequence and then recovering the binary sequence into byte stream data;

Step 4: information verification, checking and correcting the stored information using the redundant information according to the RS (Reed-Solomon) check algorithm;

Step 5: information reorganization, sorting the information blocks by index sequence number and restoring them to the original information.

In the addressing sequence restoration, after the base sequence of a DNA strand is read, the bases at its front end and rear end are the front addressing sequence and the rear addressing sequence; the addressing sequences of each DNA strand are located according to the specific addressing sequence length and design.

In the addressing sequence restoration, the index sequence number and the two modulation primers of each DNA strand can be disassembled from the addressing sequences according to the structural addressing sequence design method.

In the modulation restoration, the DNA base sequence with the addressing sequences removed is divided into an information sequence and a flag sequence.

In the modulation restoration, the flag sequence is first restored using the GC modulation primer and then converted back into binary information, giving the modulation process information such as the homopolymer modulation positions.

In the modulation restoration, according to the modulation process information obtained from the flag sequence and the two modulation primers, the information sequence is modulated again at the same positions, which restores it to the unmodulated base sequence.

In the sequence conversion, the base sequence obtained from modulation restoration is converted back into a binary sequence using the quasi-quaternary coding rule.

In the sequence conversion, the binary sequence is then converted into byte stream data through base conversion.

In the information verification, the byte stream data is divided into two parts: stored information and check redundancy information. The RS check algorithm is used to check whether the stored information contains errors and to correct them. Other check methods, such as the LDPC (low-density parity-check) algorithm or a cyclic redundancy check, may also be used; the decoder selects the check algorithm that matches the one used during encoding, so that the information can be checked and corrected.

In the information reorganization, the corrected stored information is ordered by the index sequence numbers obtained from disassembling the addressing sequences, using a sorting algorithm such as bubble sort or bisection (binary) sort, and combined. It is thereby restored into the correct and complete original information, which is then output and stored as the corresponding computer file.
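The reorganization step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `reassemble` and the representation of a decoded block as an `(index, payload)` pair are assumptions, and Python's built-in `sorted` stands in for the bubble or bisection sort named in the text.

```python
def reassemble(blocks: list[tuple[int, bytes]]) -> bytes:
    """Order decoded blocks by index sequence number and join their payloads.

    `blocks` holds hypothetical (index, payload) pairs, one per decoded strand.
    """
    indices = [index for index, _ in blocks]
    if len(set(indices)) != len(indices):
        raise ValueError("duplicate index sequence number")
    # Built-in sort used here in place of bubble sort; the resulting order is the same.
    ordered = sorted(blocks, key=lambda item: item[0])
    return b"".join(payload for _, payload in ordered)


# Blocks may arrive in any order after sequencing.
print(reassemble([(3, b"c"), (1, b"a"), (4, b"d"), (2, b"b")]))  # b'abcd'
```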

Specific embodiments of the present invention are described further below.

In this embodiment, the content of the first chapter of the Tao Te Ching is taken as the encoding object. The specific content is: "01. The Way that can be spoken of is not the constant Way. The name that can be named is not the constant name. The nameless is the beginning of heaven and earth. The named is the mother of all things. Therefore, always be without desire in order to observe its subtlety; always have desire in order to observe its boundary. These two emerge together but differ in name; together they are called the mystery. Mystery upon mystery, the gateway to all subtleties.", for a total of 216 bytes of data.

Fig. 1 is a flowchart of the encoding scheme with the first chapter of the Tao Te Ching as the encoding object. The specific steps are as follows. First, the TXT text file containing the first chapter of the Tao Te Ching is read, partitioned into blocks according to the storage design, and assigned index sequence numbers; each information block is then checked separately and check redundancy information is added. Each information block with its check redundancy information is converted into a binary sequence by base conversion, and the binary sequence is converted into a base sequence using the quasi-quaternary coding rule. Modulation primers are generated randomly, the quality of the base sequence is evaluated, and any base sequence that does not meet the GC content requirement or that contains an overlong homopolymer is modulated with the corresponding modulation primer until it satisfies the constraint conditions. A flag sequence is generated according to the modulation process. The structural addressing sequence design method is then used to generate the front addressing sequence and the rear addressing sequence of each information block, and all parts are combined into a complete DNA strand structure and output to a TXT text file for storage.

In the above encoding process, the information block size is set to 60 bytes. Because a single DNA strand is typically 200-300 nt long, a 60-byte block is chosen so that the encoded DNA strand is about 308 nt long, which makes the fullest use of the storage capacity of a single strand while reducing the overhead of index sequence numbers.

In the above encoding process, there are 216 bytes of data in total. According to the designed information block size, the data is divided into 4 information blocks: 3 blocks of 60 bytes each and a last block of 36 bytes. In this embodiment, the index numbers allocated to the 4 information blocks in block order are "1, 2, 3, 4"; these 4 index numbers are then converted into base sequences according to the sequence conversion step, giving 4 index sequences, each 4 nt long.

In the above encoding process, this embodiment uses the RS check algorithm for information verification and appends 6 bytes of check redundancy information, that is, 10% of the 60-byte block size, to each information block. With the redundancy added, the first three information blocks are 66 bytes each and the last information block is 42 bytes, and up to 3 erroneous bytes within a block can be corrected.
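The figure of 3 correctable bytes follows from the standard Reed-Solomon bound, under the assumption (not stated explicitly in the text) that each symbol is one byte. With n the block length including redundancy and k the number of information bytes:

```latex
\[
t = \left\lfloor \frac{n - k}{2} \right\rfloor
  = \left\lfloor \frac{66 - 60}{2} \right\rfloor = 3,
\qquad
\frac{n - k}{k} = \frac{6}{60} = 10\%
\]
```

For the last block, n = 42 and k = 36, so t is again 3, while the redundancy relative to that shorter block is 6/36, or about 16.7%.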

In the above encoding process, the first step of sequence conversion is base conversion: each byte of the byte stream is converted into an 8-bit binary sequence such as "01000100", using the relationship between hexadecimal and binary. Then, using the quasi-quaternary coding rule, "00", "01", "10", and "11" are mapped to "A", "T", "C", and "G", respectively (see Fig. 2). The information blocks are thus converted into base sequences; the lengths of the 4 base sequences are 264 nt, 264 nt, 264 nt, and 168 nt, respectively.
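The conversion described above can be sketched in a few lines. The function name `bytes_to_bases` is an illustrative assumption; the mapping itself is the one given in the text.

```python
# Quasi-quaternary rule from the text: 00 -> A, 01 -> T, 10 -> C, 11 -> G.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}


def bytes_to_bases(data: bytes) -> str:
    """Convert a byte stream into a base sequence, 4 nt per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(bytes_to_bases(bytes([0b01000100])))  # 01 00 01 00 -> TATA
print(len(bytes_to_bases(bytes(66))))       # 264
print(len(bytes_to_bases(bytes(42))))       # 168
```

Each byte yields 4 bases, which is why the 66-byte and 42-byte blocks become 264 nt and 168 nt sequences.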

In the above encoding process, the modulation primers are generated randomly, but each generated modulation primer must have 50% GC content and contain no long homopolymer, so that the primer itself fully satisfies the constraint conditions.

In the above encoding process, the modulation operation is in effect a two-bit exclusive-OR (XOR). The result can be computed directly from the XOR or looked up in the modulation result table shown in Fig. 3. For example, if the stored information is "AGCC" and the modulation primer is "ATCG", modulating the first position combines "A" with "A": by XOR, "00" with "00" gives "00", that is, "A", and looking up the modulation result table gives the same result.
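A minimal sketch of this base-wise XOR follows. The function name `modulate` and the cyclic repetition of the primer over sequences longer than the primer are assumptions; the text only works through the first position of the "AGCC" example, and the remaining three positions shown here are computed from the same rule.

```python
BASE_TO_BITS = {"A": 0b00, "T": 0b01, "C": 0b10, "G": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def modulate(sequence: str, primer: str) -> str:
    """XOR each base with the primer base at the same position.

    The primer is repeated cyclically (an assumption for long sequences).
    """
    return "".join(
        BITS_TO_BASE[BASE_TO_BITS[base] ^ BASE_TO_BITS[primer[i % len(primer)]]]
        for i, base in enumerate(sequence)
    )


modulated = modulate("AGCC", "ATCG")
print(modulated)                    # ACAT
print(modulate(modulated, "ATCG"))  # AGCC
```

Because XOR is its own inverse, applying the same primer a second time restores the original sequence, which is what the decoder relies on.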

In the above encoding process, the modulation optimization is divided into two steps, as shown in the schematic diagram of the modulation process in Fig. 4. First, the GC content of the base sequence is checked; if it does not meet the requirement, the whole sequence is modulated with the GC modulation primer. Then the base sequence is screened for homopolymers; if any are found, the positions where they occur are modulated with the homopolymer modulation primer.

In the above encoding process, the modulation optimization is an iterative process: different modulation primers are tried in turn until the base sequence satisfies the constraint conditions. Among the candidates, the result with the fewest homopolymer modulations, and therefore the shortest flag sequence, is selected as the final constraint sequence. The modulation primers used in this embodiment are shown in Fig. 5.
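One pass of the two-step procedure might look like the sketch below. Several details are assumptions because this section does not fix them: the GC window of 40-60%, the maximum allowed homopolymer run of 3, the 4-base group size, and the choice to modulate every group that overlaps an overlong run. All function names are hypothetical.

```python
import re

BASES = "ATCG"  # index = 2-bit value: A=00, T=01, C=10, G=11


def xor_bases(seq: str, primer: str) -> str:
    """XOR seq with the primer, repeating the primer cyclically."""
    return "".join(
        BASES[BASES.index(b) ^ BASES.index(primer[i % len(primer)])]
        for i, b in enumerate(seq)
    )


def gc_ok(seq: str, low: float = 0.4, high: float = 0.6) -> bool:
    """Assumed GC window; the section does not state the exact limits."""
    return low <= sum(b in "GC" for b in seq) / len(seq) <= high


def long_runs(seq: str, max_run: int = 3) -> list[tuple[int, int]]:
    """Spans (start, end) of homopolymer runs longer than max_run (assumed limit)."""
    return [m.span() for m in re.finditer(r"A+|T+|C+|G+", seq)
            if m.end() - m.start() > max_run]


def modulate_once(seq: str, gc_primer: str, hp_primer: str, group: int = 4):
    """Step 1: whole-sequence GC modulation if needed. Step 2: per-group homopolymer modulation."""
    used_gc = not gc_ok(seq)
    if used_gc:
        seq = xor_bases(seq, gc_primer)
    groups = sorted({g for start, end in long_runs(seq)
                     for g in range(start // group, (end - 1) // group + 1)})
    out = list(seq)
    for g in groups:
        out[g * group:(g + 1) * group] = xor_bases(seq[g * group:(g + 1) * group], hp_primer)
    # Report 1-based group numbers, matching the position example in the text.
    return "".join(out), used_gc, [g + 1 for g in groups]


def satisfies(seq: str) -> bool:
    return gc_ok(seq) and not long_runs(seq)


result, used_gc, positions = modulate_once("ATCGAAAAAGCCTGCA", "ACGT", "ATCG")
print(result, used_gc, positions, satisfies(result))
```

A full encoder would call `modulate_once` with different randomly generated primers until `satisfies` holds, keeping the candidate with the fewest modulated groups.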

In the above encoding process, the flag sequence is generated as shown in Fig. 6. First, 1 bit of binary information indicates whether the GC modulation primer was used for overall modulation, with "0" meaning not used and "1" meaning used. Each homopolymer modulation is recorded directly as the 8-bit binary number of its position; for example, a homopolymer modulation on the first 4-base group has the position information "00000001". The number of homopolymer modulations is stored as 7 bits of binary information and combined with the preceding 1 bit of GC modulation information to form 8 bits. The binary flag information is then converted into a base sequence using the quasi-quaternary coding rule. Finally, this base sequence is modulated once with the GC modulation primer to give the final flag sequence. Fig. 6 shows the generation of a flag sequence for a case with no GC modulation but two homopolymer modulations.
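The flag sequence construction can be sketched as below. The layout is partly assumed: the text gives the 1-bit GC flag followed by the 7-bit count, but placing the 8-bit position fields after that header byte is a guess, since Fig. 6 is not reproduced here. The function names and the primer "ACGT" are hypothetical.

```python
BASES = "ATCG"  # index = 2-bit value: A=00, T=01, C=10, G=11


def bits_to_bases(bits: str) -> str:
    """Quasi-quaternary conversion of a bit string whose length is a multiple of 2."""
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))


def xor_bases(seq: str, primer: str) -> str:
    """XOR seq with the primer, repeating the primer cyclically."""
    return "".join(
        BASES[BASES.index(b) ^ BASES.index(primer[i % len(primer)])]
        for i, b in enumerate(seq)
    )


def build_flag_sequence(used_gc: bool, positions: list[int], gc_primer: str) -> str:
    """Header byte (GC bit + 7-bit count), then one byte per position (assumed order)."""
    if len(positions) > 127:
        raise ValueError("the 7-bit counter holds at most 127 modulations")
    if any(not 0 <= p <= 255 for p in positions):
        raise ValueError("each position must fit in 8 bits")
    header = f"{int(used_gc)}{len(positions):07b}"
    body = "".join(f"{p:08b}" for p in positions)
    return xor_bases(bits_to_bases(header + body), gc_primer)


# No GC modulation, no homopolymer modulation: header 00000000 -> AAAA, XOR ACGT -> ACGT.
print(build_flag_sequence(False, [], "ACGT"))       # ACGT
# No GC modulation, two homopolymer modulations on hypothetical groups 1 and 4.
print(build_flag_sequence(False, [1, 4], "ACGT"))   # ACGGACGAACCT
```

With no homopolymer modulation the flag sequence is 4 nt long, and each recorded position adds another 4 nt.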

In the above encoding process, the modulation optimization has already selected the optimal modulation primers. In this embodiment, with the four selected modulation primers, no homopolymer modulation was performed on any of the 4 information blocks, so each flag sequence has the minimum length of 4 nt. In addition, none of the first three base sequences underwent GC content modulation; only the last one did.

In the above encoding process, the addressing sequence designed in this embodiment is 20 nt long and consists of a 4 nt marker, an 8 nt GC balance sequence, a 4 nt modulation primer, and a 4 nt index sequence number. The marker is a randomly chosen homopolymer of length 4, and the GC balance sequence is also randomly generated so that the GC content of the entire addressing sequence falls within the desired range. The modulation primer is the one used in the modulation process, and the index sequence number is the base sequence obtained by converting the index number allocated during information blocking and modulating it with the modulation primer.

In the above encoding process, the addressing sequences are divided into a front addressing sequence and a rear addressing sequence. The two addressing sequences on the same DNA strand are generated in the same way, but the order of their component parts differs, as shown in the primer structure diagrams of Fig. 7 and Fig. 8. The modulation primer in the front addressing sequence is the GC modulation primer, and the modulation primer in the rear addressing sequence is the homopolymer modulation primer. The four sets of addressing sequences generated in this embodiment are shown in Fig. 9.
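A 20 nt addressing sequence of this kind could be assembled as in the sketch below. The component order (marker, balance sequence, modulation primer, index), the 40-60% GC window, and the function names are assumptions; the actual orders for the front and rear sequences are those of Fig. 7 and Fig. 8, which are not reproduced here.

```python
import random


def gc_fraction(seq: str) -> float:
    return sum(base in "GC" for base in seq) / len(seq)


def build_addressing_sequence(marker: str, primer: str, index_bases: str,
                              rng: random.Random, balance_len: int = 8,
                              gc_low: float = 0.4, gc_high: float = 0.6,
                              max_tries: int = 10_000) -> str:
    """Draw random balance sequences until the whole addressing sequence is GC balanced."""
    for _ in range(max_tries):
        balance = "".join(rng.choice("ATCG") for _ in range(balance_len))
        candidate = marker + balance + primer + index_bases  # assumed part order
        if gc_low <= gc_fraction(candidate) <= gc_high:
            return candidate
    raise RuntimeError("no suitable balance sequence found within max_tries")


# Hypothetical inputs: homopolymer marker, a 4 nt primer, a 4 nt modulated index.
addr = build_addressing_sequence("AAAA", "ACGT", "AAAT", random.Random(0))
print(len(addr))  # 20
```

A practical generator would probably also reject balance sequences that create extra homopolymers or that collide with other addressing sequences, which this sketch does not attempt.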

In the above encoding process, the complete DNA strand structure comprises a base sequence, a flag sequence, and the front and rear addressing sequences, as shown in Fig. 10.

In the above encoding process, the parts of each information block are assembled into a complete DNA strand. Each of the first three DNA strands in this embodiment consists of a 264 nt base sequence, a 4 nt flag sequence, a 20 nt front addressing sequence, and a 20 nt rear addressing sequence, for a total of 308 nt. The last strand consists of a 168 nt base sequence, a 4 nt flag sequence, a 20 nt front addressing sequence, and a 20 nt rear addressing sequence, for a total of 212 nt.

Through the above encoding process, the 216-byte TXT text file of the first chapter of the Tao Te Ching is converted into 4 DNA strands with lengths of 308 nt, 308 nt, 308 nt, and 212 nt, respectively. The specific base sequence of each DNA strand is shown in Fig. 11.
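The strand lengths in this embodiment can be reproduced with simple arithmetic. The sketch below uses the sizes stated in the text and assumes every flag sequence has its minimum length of 4 nt, as is the case in the embodiment; the constant and function names are illustrative.

```python
BLOCK_BYTES = 60   # information block size
RS_BYTES = 6       # RS check redundancy per block
NT_PER_BYTE = 4    # 2 bits per base
FLAG_NT = 4        # minimum flag sequence length (no homopolymer modulation)
ADDR_NT = 20       # length of each addressing sequence (front and rear)


def strand_lengths(total_bytes: int) -> list[int]:
    """Length in nt of each encoded strand for a file of total_bytes bytes."""
    lengths = []
    for start in range(0, total_bytes, BLOCK_BYTES):
        payload = min(BLOCK_BYTES, total_bytes - start)
        base_nt = (payload + RS_BYTES) * NT_PER_BYTE
        lengths.append(base_nt + FLAG_NT + 2 * ADDR_NT)
    return lengths


print(strand_lengths(216))  # [308, 308, 308, 212]
```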

Fig. 12 shows the decoding flow corresponding to the above encoding process. In each DNA strand, the addressing sequences are 20 nt long. According to the structure of the DNA strand, the front addressing sequence and the rear addressing sequence can be extracted from it, and according to the structural design of the addressing sequences, the two modulation primers and the index sequence number of the strand can then be obtained. Next, the base sequence and the flag sequence are separated from the DNA strand, and the base sequence is modulated once more according to the modulation process information in the flag sequence, which yields the unmodulated base sequence; sequence conversion then recovers it into byte stream data. Finally, the information from each DNA strand is checked and error-corrected, the correct information is assembled into the original stored information according to the index sequence numbers, and the result is output as a TXT text file, completing the decoding process.

In the above decoding process, the process of disassembling the DNA strand and the addressing sequence needs to be performed according to the DNA strand structure and the addressing sequence structure designed in practical use. The recovery of the tag sequence information is also performed based on the structure of the tag sequence.

In the above decoding process, the sequence conversion is again divided into two steps: the first step converts the base sequence into binary sequence information, and the second step converts the binary sequence into byte stream data.
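The two decoding steps can be sketched as the inverse of the encoding conversion. The function name `bases_to_bytes` is an illustrative assumption; the mapping is the quasi-quaternary rule from the text.

```python
# Inverse of the quasi-quaternary rule: A -> 00, T -> 01, C -> 10, G -> 11.
BASE_TO_BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}


def bases_to_bytes(sequence: str) -> bytes:
    """Step 1: bases to a bit string. Step 2: bit string to bytes (4 nt per byte)."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4 nt")
    bits = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(bases_to_bytes("TATA"))  # b'D'  (01000100 = 0x44)
```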

In the above decoding process, the RS check algorithm is used for information verification, and the DNA strands are ordered with a bubble sort algorithm. After information verification and information reorganization, the information from the DNA strands is assembled into the correct and complete original information and output as a TXT text file, namely the first chapter of the Tao Te Ching.

In summary, in the modulation-based DNA information encoding and decoding method provided by the invention, the base sequence is optimized during encoding by specially constructed modulation primers so that it satisfies the constraint conditions, and a flag sequence is constructed at the same time so that the correct original information can be recovered during decoding. Furthermore, the structural addressing sequence design method combines the modulation primer, the index sequence, and the primer sequence, so that the addressing sequence both retains its own addressing function and serves as a PCR primer, which greatly reduces information redundancy. Compared with the prior art, the encoding and decoding scheme of the invention can encode and decode digital information stored by any computer, with a coding density of 1.9 bits/nt, close to the theoretical limit. In addition, the structural addressing sequence design method can generate specific addressing sequences in a number that matches the scale of the base sequences, meeting the requirements of multiple functions such as modulation, addressing, PCR amplification, and information retrieval.

The embodiment of the invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the DNA information encoding method and/or the DNA information decoding method.

The embodiment of the invention also provides a control device, which comprises a processor and a storage medium for storing a computer program; wherein the processor is used for implementing the DNA information encoding method and/or the DNA information decoding method when executing the computer program.

The embodiments of the present invention also provide a processor executing a computer program, at least performing the method as described above.

The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM, Ferromagnetic Random Access Memory), a flash memory (Flash Memory), a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a random access memory (RAM, Random Access Memory), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM, Static Random Access Memory), synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), dynamic random access memory (DRAM, Dynamic Random Access Memory), synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), double data rate synchronous dynamic random access memory (DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), synchronous link dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and direct Rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The storage media described in the embodiments of the present invention are intended to include, without being limited to, these and any other suitable types of memory.

In the several embodiments provided by the present application, it should be understood that the disclosed systems and methods may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a division by logical function, and other divisions are possible in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the components shown or discussed may be coupled, directly coupled, or communicatively connected to each other through some interface; the indirect coupling or communicative connection between devices or units may be electrical, mechanical, or in other forms.

The units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.

Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by hardware associated with program instructions, where the foregoing program may be stored in a computer readable storage medium, and when executed, the program performs steps including the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk or an optical disk, or the like, which can store program codes.

Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.

The methods disclosed in the method embodiments provided by the application can be arbitrarily combined under the condition of no conflict to obtain a new method embodiment.

The features disclosed in the several product embodiments provided by the application can be combined arbitrarily under the condition of no conflict to obtain new product embodiments.

The features disclosed in the embodiments of the method or the apparatus provided by the application can be arbitrarily combined without conflict to obtain new embodiments of the method or the apparatus.

Claims

1. A method for encoding DNA information based on modulation, comprising the steps of:

2. The modulation-based DNA information encoding method according to claim 1, wherein step S3 comprises converting information in the information block into a base sequence form by a quasi-quaternary encoding rule, wherein "00", "01", "10", "11" correspond to bases "A", "T", "C", "G", respectively.

3. The modulation-based DNA information encoding method of claim 1, wherein step S4 comprises: constructing a GC modulation primer and a homopolymer modulation primer for each information block, wherein the modulation primer is a base sequence with 50% GC content and no homopolymer and has a length of 4, and the GC content and the homopolymer of the information sequence meet constraint conditions through modulation of the modulation primer;

4. The modulation-based DNA information encoding method of claim 1, wherein step S5 comprises: generating a unique flag sequence for each information block to record modulation process information, the modulation process information recorded in the flag sequence comprising three parts: whether or not the whole modulation is performed using the GC-modulated primer; specific locations for homopolymer modulation; the number of times the homopolymer modulation is performed;

5. The method for encoding DNA information based on modulation as claimed in claim 1, wherein the step S6 comprises: the index sequence number and the modulation primer are combined into an addressing sequence, and the addressing sequence of each information block is generated by adopting a structural addressing sequence design method;

6. The modulation-based DNA information encoding method according to claim 1, wherein step S7 comprises: combining the constraint sequence after modulation optimization with the front primer, the rear primer, and the flag sequence to form the complete structure of a DNA strand for storing information, and outputting the complete structure to a text file for storage.

7. A DNA information decoding method based on modulation for decoding a DNA strand encoded using the DNA information encoding method according to any one of claims 1 to 6, comprising the steps of:

8. The method for decoding DNA information based on modulation as claimed in claim 7, wherein the step T1 comprises: dividing the position of an addressing sequence according to the read DNA sequence, and disassembling according to the structure of the addressing sequence to obtain a modulation primer and an index sequence;

9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the DNA information encoding method according to any one of claims 1 to 6 and/or the DNA information decoding method according to any one of claims 7 to 8.

10. A control apparatus comprising a processor and a storage medium for storing a computer program; wherein the processor is adapted to implement the DNA information encoding method of any one of claims 1 to 6 and/or the DNA information decoding method of any one of claims 7 to 8 when executing the computer program.