CN112582030B - Text storage method based on DNA storage medium - Google Patents

Publication number
CN112582030B
Authority
CN
China
Prior art keywords
text
sequence
original text
dna
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011508358.7A
Other languages
Chinese (zh)
Other versions
CN112582030A (en)
Inventor
刘文斌
昝乡镇
姚祥宇
许�鹏
方刚
陈智华
石晓龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202011508358.7A priority Critical patent/CN112582030B/en
Publication of CN112582030A publication Critical patent/CN112582030A/en
Application granted granted Critical
Publication of CN112582030B publication Critical patent/CN112582030B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a text storage method based on a DNA storage medium, which comprises the following steps: acquiring an original text, and encoding the original text to obtain a DNA storage sequence; synthesizing the DNA storage sequence to obtain a DNA molecular sequence, amplifying the DNA molecular sequence, and storing the amplified DNA molecular sequence; acquiring a stored DNA molecular sequence, and transcoding to obtain an original text; transcoding to obtain an original text comprises the following steps: sequencing the stored DNA molecular sequence to obtain the reading length of the DNA molecular sequence; preprocessing the reading length, removing noise data in the reading length, and transcoding the preprocessed reading length to obtain an original text. The method directly converts the stored DNA molecular sequence through the reading length of the sequence, removes more redundant codes, improves the storage efficiency, fully uses semantic information in the original text in the process of transcoding and decoding, has strong query processing capability, and can be widely applied to the technical field of system biology research.

Description

Text storage method based on DNA storage medium
Technical Field
The application relates to the technical field of system biology research, in particular to a text storage method based on a DNA storage medium.
Background
With the development of distributed computing, cloud computing and Internet of Things technologies, the total amount of data generated by human beings every day is growing exponentially. Traditional magnetic, optical, electrical and other storage technologies cannot meet the storage requirements of this exponential growth of mass data in the future. Further, semiconductor-based general-purpose processors (CPUs) and application-specific integrated circuits (ASICs) are constrained by the limits of Moore's law in terms of power consumption, size, reliability, etc. Therefore, the search for new information storage modes has become a key fundamental problem for the sustainable development of information technology. As the carrier of the genetic information of life, DNA molecules offer high density, small volume, good storage stability and low energy consumption, and can potentially be combined with biological computing to realize a novel data processing mode that integrates storage and computation. The procedure for DNA storage is typically as follows: a binary file in the computer is encoded into a base sequence; synthesis, amplification and sequencing are then carried out; and the original information is finally recovered from the base sequence. However, most current research adds a large number of redundant codes to the original input information; for example, inner codes address base errors within a sequence and outer codes address sequence-level deletions. While the prior art has its own advantages, its disadvantages are also apparent: the storage efficiency is low, the decoding process is complex, semantic information is not utilized, and the information query processing capability is poor.
Disclosure of Invention
In view of this, in order to at least partially solve one of the above technical problems, an embodiment of the present application is to provide a text storage method based on a DNA storage medium, which can realize convenient and efficient text indifferent storage.
In a first aspect, the present application provides a text storage method based on a DNA storage medium, including the steps of:
acquiring an original text, and encoding the original text to obtain a DNA storage sequence;
synthesizing the DNA storage sequence to obtain a DNA molecular sequence, amplifying the DNA molecular sequence, and storing the amplified DNA molecular sequence;
acquiring a stored DNA molecular sequence, and transcoding to obtain the original text;
the transcoding to obtain the original text comprises the following steps:
sequencing the stored DNA molecular sequence to obtain the reading length of the DNA molecular sequence;
preprocessing the reading length, removing noise data in the reading length, and transcoding the preprocessed reading length to obtain the original text.
In a possible embodiment of the present application, the step of obtaining the original text, and the step of encoding the original text to obtain the DNA storage sequence includes:
generating a coding base sequence according to a coding rule and characters in the original text, and generating an index value according to the coding base sequence;
generating byte check codes according to characters in the original text;
and constructing the DNA storage sequence according to the index value, the byte check code and the text data formed by the coding base sequence.
In a possible embodiment of the present application, the step of generating a byte check code according to characters in the original text includes:
coding characters in the original text through the inner codes to obtain binary character strings;
and carrying out grouping base coding according to the binary character string to obtain the byte check code.
In a possible embodiment of the present application, the step of preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text includes:
acquiring the reading length, and reversely pushing according to the coding rule to obtain a decoded character line;
correcting the error of the decoded character row to obtain a decoded text character row;
and obtaining a plurality of groups according to the decoded text character line and the text content, and decoding the groups to obtain an original text.
In a possible embodiment of the present application, the step of preprocessing the read length, removing noise data in the read length, and transcoding the preprocessed read length to obtain the original text further includes:
and determining the character with the minimum Hamming distance as the decoding character of the error base according to the error base of the read length.
In a possible embodiment of the present application, the step of obtaining a plurality of packets according to the decoded text character line and the text content includes:
dividing the index values of the decoded text character lines to obtain a plurality of groups, and determining the text similarity of group members;
performing secondary division on the group members according to the text similarity, wherein the secondary division comprises at least one of the following steps:
according to a preset first threshold, adding members with the text similarity smaller than the first threshold to other groups;
determining a mean value of the text similarity, and deleting the group members according to the mean value;
and clustering the members not belonging to the group according to the text similarity to obtain a new group.
In a possible embodiment of the present application, the step of decoding the packet to obtain the original text includes:
determining weight values of characters in the decoded text character lines in the group;
determining a unique length value of the packet such that a length value of a decoded text character line in the packet is the same as the unique length value;
and determining the characters of the original text according to the decoded text character rows with consistent length values and the weight values of the characters, and combining to obtain the original text.
Advantages and benefits of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application:
the method comprises the steps of encoding an original text into a base sequence, synthesizing and amplifying the base sequence, storing a DNA molecular sequence obtained after amplification, sequencing the stored DNA molecular sequence to obtain a reading length of the sequence, deleting the noise reading length, and recovering the original text according to the reading length; the method directly converts the stored DNA molecular sequence through the reading length of the sequence, removes more redundant codes, improves the storage efficiency, fully uses semantic information in the original text in the transcoding and decoding processes, and has strong query processing capability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of a text storage method based on a DNA storage medium according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a DNA storage sequence according to an embodiment;
FIG. 3 is a flowchart illustrating steps for grouping text character lines and text content according to the decoding of the text character lines according to an embodiment;
FIG. 4 is a histogram of the accuracy of restoring English text at different error rates and sequencing depths according to the embodiment.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In a first aspect, as shown in FIG. 1, the present application provides a text storage method based on a DNA storage medium, comprising steps S01-S03:
S01, acquiring an original text, and encoding the original text to obtain a DNA storage sequence.
Taking English text as an example, the embodiment encodes the characters of the English text according to the encoding rule to form a DNA storage sequence.
In this embodiment, the step of encoding the original text to obtain the DNA storage sequence specifically includes steps S011-S013:
S011, generating a coding base sequence according to coding rules and characters in an original text, and generating an index value according to the coding base sequence;
S012, generating byte check codes according to characters in the original text;
S013, constructing a DNA storage sequence according to the index value, the byte check code and the text data formed by the coding base sequence.
Specifically, the characters appearing in the English text are encoded in sequence according to a character coding rule, and the coding base sequence of every M (M > 0) text characters forms one storage data unit. A Reed-Solomon (RS) code is used to generate a t-byte check code for the M text characters. According to the order in which the storage data units are generated, an n-digit decimal number is encoded into the corresponding base sequence to serve as the Index value (Index) of the storage data unit. Thus, a DNA storage sequence is composed of an index value part, an RS check code and a text data field.
Taking n = 5, t = 4 and M = 25 as an example, the structure of a DNA storage sequence is shown in FIG. 2.
In this example, the base sequence corresponding to each text character is shown in table 1:
TABLE 1
In this example, the first part of the DNA storage sequence is the Index, which is also a base sequence and marks the order of the DNA storage sequence in the original encoded text file. Index base sequences are organized in units of 6 bases, and each Index corresponds to a decimal number of n digits. The digits of the Index are encoded according to the number code table shown in Table 2:
TABLE 2
In this embodiment, the step S012 of generating byte check codes from characters in the original text may be further subdivided into steps S012a and S012b:
S012a, coding characters in the original text through an inner code to obtain a binary character string;
S012b, performing grouping base coding according to the binary character string to obtain the byte check code.
Specifically, the embodiment converts the characters in the original English text into a binary character string through RS checking, divides the binary string into groups of 4 bits, and base-encodes each 4-bit group according to Table 3.
TABLE 3
RS grouping Encoding RS grouping Encoding RS grouping Encoding RS grouping Encoding
0000 GTGT 0100 CACA 1000 TCAC 1100 ACTC
0001 GATG 0101 GTTC 1001 TACC 1101 AGCT
0010 AGAC 0110 TGGT 1010 GAGA 1110 TCGA
0011 CTTG 0111 CAGT 1011 GAAC 1111 TGCA
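The mapping in Table 3 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `encode_check_bytes` and `decode_check_bases` are hypothetical, and the RS check bytes are taken as given input because the RS generator parameters are not stated in the text. Each check byte is split into two 4-bit groups, so one byte yields 8 bases.

```python
# Table 3: 4-bit RS group -> 4-base code word (copied from the table above).
RS_GROUP_TO_BASES = {
    "0000": "GTGT", "0100": "CACA", "1000": "TCAC", "1100": "ACTC",
    "0001": "GATG", "0101": "GTTC", "1001": "TACC", "1101": "AGCT",
    "0010": "AGAC", "0110": "TGGT", "1010": "GAGA", "1110": "TCGA",
    "0011": "CTTG", "0111": "CAGT", "1011": "GAAC", "1111": "TGCA",
}
BASES_TO_RS_GROUP = {bases: bits for bits, bases in RS_GROUP_TO_BASES.items()}


def encode_check_bytes(check: bytes) -> str:
    """Turn check bytes into a binary string, then encode each 4-bit group."""
    bits = "".join(f"{byte:08b}" for byte in check)
    return "".join(RS_GROUP_TO_BASES[bits[i:i + 4]] for i in range(0, len(bits), 4))


def decode_check_bases(bases: str) -> bytes:
    """Inverse mapping; assumes the base string is error-free."""
    bits = "".join(BASES_TO_RS_GROUP[bases[i:i + 4]] for i in range(0, len(bases), 4))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
```

With t = 4 check bytes, the encoded check field is 32 bases long, which matches the 8×t term of the length formula.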
Under the encoding rule of this embodiment, i.e. the encoding relationships provided in Tables 1, 2 and 3, the length of the DNA storage sequence is fixed, with length value L = n×5 + 8×t + M×4. If the number of characters of encoded English text in a storage sequence, denoted here l (l > 0), is smaller than M, the remaining base sequence units of the storage sequence are filled with the base sequences corresponding to (M − l) space characters.
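As a worked example of the length formula and the space padding, the following Python sketch uses the formula exactly as stated above. The helper names `storage_sequence_length` and `split_into_units` are hypothetical and chosen only for illustration.

```python
def storage_sequence_length(n: int, t: int, M: int) -> int:
    """Length value L = n*5 + 8*t + M*4, as stated in the text."""
    return n * 5 + 8 * t + M * 4


def split_into_units(text: str, M: int) -> list:
    """Cut the text into chunks of M characters; pad the last chunk with spaces."""
    if M <= 0:
        raise ValueError("M must be positive")
    return [text[i:i + M].ljust(M) for i in range(0, len(text), M)]
```

For n = 5, t = 4 and M = 25 this gives L = 25 + 32 + 100 = 157 bases, and a 30-character text yields two storage data units, the second padded with 20 spaces.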
S02, synthesizing the DNA storage sequence to obtain a DNA molecular sequence, amplifying the DNA molecular sequence, and storing the amplified DNA molecular sequence.
Specifically, the DNA storage sequence obtained in step S01 is synthesized, amplified, and stored. In the synthesis process, the DNA storage sequence is obtained and deoxynucleotides are connected one by one through chemical reactions according to the predetermined nucleotide sequence to synthesize a DNA strand, i.e. the DNA molecular sequence. The amplification process creates multiple copies of the DNA molecular sequence; in the embodiment, the DNA molecular sequence is amplified by PCR (polymerase chain reaction). PCR amplification is a molecular biology technique for amplifying specific DNA fragments and can be regarded as specific DNA replication outside an organism; its greatest feature is that it can greatly increase trace amounts of DNA. The PCR process in the embodiment is divided into three steps: 1) DNA denaturation (90 ℃-96 ℃): under the action of heat, the hydrogen bonds of the double-stranded DNA template are broken to form single-stranded DNA; 2) annealing (60 ℃-65 ℃): the temperature of the system is reduced, and the primers bind to the DNA template to form partial double strands; 3) extension (70 ℃-75 ℃): under the action of Taq polymerase (optimal activity at about 72 ℃), with dNTPs as raw materials, a DNA strand complementary to the template is synthesized by extending from the 3' end of the primer in the 5'→3' direction. The DNA content doubles after each cycle of denaturation, annealing and extension.
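The doubling per cycle implies exponential growth in copy number. The short Python sketch below illustrates the arithmetic; the `efficiency` parameter is an assumption added for illustration (1.0 corresponds to the ideal doubling described above) and does not appear in the original text.

```python
def copies_after_cycles(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized PCR yield: each cycle multiplies the copy number by (1 + efficiency)."""
    if cycles < 0 or not 0.0 <= efficiency <= 1.0:
        raise ValueError("cycles must be >= 0 and efficiency within [0, 1]")
    return initial_copies * (1.0 + efficiency) ** cycles
```

Under ideal doubling, a single template molecule gives 2^10 = 1024 copies after 10 cycles.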
Further, the amplified DNA sequences are stored, for example, in a DNA molecular database.
S03, acquiring a stored DNA molecular sequence, and transcoding to obtain an original text. Wherein, the transcoding to obtain the original text includes steps S031-S032:
S031, sequencing the stored DNA molecular sequence to obtain the reading length of the DNA molecular sequence;
S032, preprocessing the reading length, removing noise data in the reading length, and transcoding the preprocessed reading length to obtain an original text.
Specifically, sequencing the stored DNA molecular sequence, i.e. DNA sequencing, refers to analyzing the base sequence of a specific DNA fragment, i.e. the arrangement of adenine (A), thymine (T), cytosine (C) and guanine (G). For example, sequencing is carried out with a second-generation or third-generation sequencer, and the result file output by the sequencer consists of reads, where a read is the sequencer's determination of the base composition of a DNA sequence molecule, i.e. the reading length.
In step S032, before the English text characters are decoded and restored from the reads obtained by sequencing, data preprocessing is needed. The data preprocessing mainly consists of handling noise data in the reads: deleting low-quality reads, deleting reads whose insertions or deletions cannot be corrected, and correcting reads whose insertions or deletions can be corrected. On the basis of the data preprocessing, decoding, RS error correction and multi-sequence error correction are carried out on the obtained reads, and word error correction is then applied to recover the originally encoded English text.
More specifically, in embodiments the process of preprocessing reads includes at least one of the following steps:
1) Each 'N' base in the reads is replaced by the base 'A'. An 'N' character means that the sequencer could not accurately determine the specific base at that position and reported 'N' in its place.
2) Reads of low quality are deleted, i.e. reads containing four consecutive bases with a Phred quality of less than 20. The quality value corresponding to each base in a read reflects the accuracy of the base call. Because the coding rule in step S01 determines that the length of one coding unit is 6 and the coding unit of each digit of the Index value is 5, a read in which the Phred values of 6 consecutive bases are low is determined during preprocessing to be of unqualified quality and is deleted.
3) Deleting reads with too small a number of bases, i.e., reads with a length of reads less than (L-5);
4) Reads with an excessive number of bases, i.e., reads with a length greater than (L+5), are deleted.
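Steps 1) to 4) above can be sketched as a simple read filter. This is a hedged illustration only: it assumes each read arrives as a (sequence, quality string) pair with FASTQ-style Phred+33 quality characters, and the names `preprocess_reads`, `run` and `slack` are hypothetical. Because the text mentions runs of both four and six low-quality bases, the run length is left as a parameter.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a quality string to Phred scores (Phred+33 encoding is assumed)."""
    return [ord(ch) - offset for ch in quality]


def has_low_quality_run(scores, threshold: int = 20, run: int = 4) -> bool:
    """True if at least `run` consecutive bases score below `threshold`."""
    streak = 0
    for score in scores:
        streak = streak + 1 if score < threshold else 0
        if streak >= run:
            return True
    return False


def preprocess_reads(reads, expected_len: int, threshold: int = 20,
                     run: int = 4, slack: int = 5) -> list:
    """Apply steps 1) to 4): N -> A, quality filter, and the two length filters."""
    kept = []
    for sequence, quality in reads:
        if has_low_quality_run(phred_scores(quality), threshold, run):
            continue  # step 2: low-quality read
        if len(sequence) < expected_len - slack:
            continue  # step 3: too few bases
        if len(sequence) > expected_len + slack:
            continue  # step 4: too many bases
        kept.append(sequence.replace("N", "A"))  # step 1
    return kept
```

Here `expected_len` plays the role of the fixed storage sequence length L.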
In some alternative embodiments, the read length is preprocessed, and the process of deleting the low quality read length may further include step 5):
5) Reads with insertion/deletion errors and a length between (L-2) and (L+2) are corrected. For an erroneous base unit, the character with the minimum Hamming distance is determined as its decoded character.
For example, the complete process of correcting insertion/deletion errors of reads is:
a) Set a sliding window whose size is a given number of decoding units, e.g. 2.
b) Scanning the read from left to right, take in turn the base sequences corresponding to one window of decoding units, until a complete window of coding units can no longer be extracted, and calculate the list of minimum Hamming distances between these decoding units and the coding table. If the values of the elements of the list are all greater than or equal to 2, the first decoding unit is taken to contain an insertion or deletion, and step c) is executed; otherwise, step b) is repeated.
c) Insert an appropriate character at, or delete a character from, each base position of the first coding unit in sequence. The inserted or deleted character must satisfy: condition 1) it is a character corresponding to a coding unit with the minimum Hamming distance according to the coding table; condition 2) after the character is inserted or deleted, every element of the minimum Hamming distance list of the sliding window is less than 2. Otherwise, step b) is executed.
d) Reads whose length is not equal to L after insertion/deletion correction are deleted.
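The minimum-Hamming-distance lookup used in step 5) and in steps b) and c) can be sketched as below. Since Tables 1 and 2 are not reproduced in this text, the sketch uses the 4-base code words of Table 3 as a stand-in coding table; the function names are hypothetical, and ties are resolved by taking the first entry found, which is an assumption.

```python
# Stand-in coding table: the Table 3 code words (symbol -> code word).
CODE_TABLE = {
    "0000": "GTGT", "0100": "CACA", "1000": "TCAC", "1100": "ACTC",
    "0001": "GATG", "0101": "GTTC", "1001": "TACC", "1101": "AGCT",
    "0010": "AGAC", "0110": "TGGT", "1010": "GAGA", "1110": "TCGA",
    "0011": "CTTG", "0111": "CAGT", "1011": "GAAC", "1111": "TGCA",
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def nearest_codeword(unit: str, table=CODE_TABLE):
    """Return (symbol, distance) of the code word closest to `unit`."""
    return min(((symbol, hamming(unit, word)) for symbol, word in table.items()),
               key=lambda pair: pair[1])


def min_distance_list(units, table=CODE_TABLE) -> list:
    """Minimum Hamming distance of each unit in a sliding window (step b)."""
    return [nearest_codeword(unit, table)[1] for unit in units]
```

For instance, the unit GTCT differs from GTGT in one position and from every other code word in at least two, so it decodes to the symbol 0000.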
In this embodiment, the process of transcoding to obtain the original text is performed in step S032, which may be further subdivided into steps S032a-S032c:
S032a, acquiring the read length after preprocessing, and decoding in reverse according to the coding rule to obtain a decoded character line;
S032b, correcting the error of the decoded character line to obtain a decoded text character line;
S032c, obtaining a plurality of groups according to the character line of the decoded text and the text content, and decoding according to the groups to obtain the original text.
Specifically, for the reads obtained through the preprocessing, first, according to the character encoding table, Index encoding table and RS group encoding table adopted in step S01, every 6 consecutive bases are taken as a coding unit to obtain the character corresponding to that coding unit, and thereby the decoded character line corresponding to each read. Then, RS error correction is performed on the decoded character line corresponding to each read to generate a decoded text character line, formed by splicing only the index information and the text character string resulting from error correction. The decoded text character lines are grouped according to their index values and text content. According to the majority principle, the real text line corresponding to each group, i.e. the original text line, is decoded and put into a set T, and the decoded text character lines in the set T are ordered according to index values. Finally, the index values of the decoded text character lines in the set T are removed in sequence, and the character strings of the text data area are output to the decoded character file.
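The final assembly step (ordering the set T by index value, removing the index values, and outputting the text data) can be sketched as follows. The representation of T as a list of (index, text) pairs and the name `assemble_text` are assumptions made for illustration.

```python
def assemble_text(decoded_lines) -> str:
    """Order the lines of set T by index value, drop the index, and join the text."""
    ordered = sorted(decoded_lines, key=lambda item: item[0])
    return "".join(text for _, text in ordered)
```

Any trailing space characters that were added as padding during encoding would remain in the output of this sketch.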
In this embodiment, in step S032c, the process of obtaining several packets according to the decoded text character line and the text content can be further subdivided into steps S032c1-S032c2:
S032c1, dividing the text character lines into a plurality of groups according to the index values of the decoded text character lines, and determining the text similarity of members in the groups;
S032c2, performing secondary division on members in the group according to the text similarity, wherein the secondary division comprises at least one of the steps A-C:
A. Adding members whose text similarity is smaller than a preset first threshold to other groups.
B. Determining the mean value of the text similarity, and deleting group members according to the mean value.
C. Clustering the members that do not belong to any group according to the text similarity to obtain new groups.
Specifically, as shown in FIG. 3, grouping is performed according to the index value and the text content of the decoded text character lines. In the preliminary grouping, the index value is used for grouping.
After the preliminary grouping, for each decoded text character line in a group with fewer than 3 members, the text similarity between its text data area and the center member of each other group is calculated (the center member of a group is the member with the highest mean text similarity to the other members of its group; it approximately represents the actual storage line corresponding to the group). If the similarity is greater than a certain threshold (e.g. 0.8), the decoded character line is deleted from its current group and delivered to the group whose center member is most similar to it. In an embodiment, the text similarity of two character strings is calculated as follows: the two character strings s1 and s2 are first aligned using a sequence alignment algorithm such as the Needleman-Wunsch algorithm; the number of positions at which the aligned strings have the same character is counted; and the count is divided by the maximum of the lengths of s1 and s2. The result of the division is the text similarity of the two character strings.
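The similarity measure described above can be sketched with a plain Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, gap -1) are assumptions, since the text does not state them, and the function names are hypothetical.

```python
def align(s1: str, s2: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> list:
    """Needleman-Wunsch global alignment; returns columns as (a, b), None marks a gap."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    columns, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                columns.append((s1[i - 1], s2[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            columns.append((s1[i - 1], None))
            i -= 1
        else:
            columns.append((None, s2[j - 1]))
            j -= 1
    columns.reverse()
    return columns


def text_similarity(s1: str, s2: str) -> float:
    """Identical aligned positions divided by the longer string length."""
    if not s1 and not s2:
        return 1.0  # assumption: two empty strings are treated as identical
    matches = sum(1 for a, b in align(s1, s2) if a is not None and a == b)
    return matches / max(len(s1), len(s2))
```

For example, "abcd" and "abd" align as abcd / ab-d with three identical positions, giving a similarity of 3/4 = 0.75.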
For the members of each group, members whose text similarity differs greatly from that of the other members of the group are deleted according to the text similarity. Here, the text similarity between a member and the other members of its group is the mean value of the text similarity between that member and the other members of the group to which it belongs.
For groups whose Index value is illegal, the unique decoded character line represented by the group is deleted. Whether an Index value is legal is determined as follows: the Index value is compared with the number of DNA storage sequences generated when the text file was encoded; if the Index value is smaller than that number, it is legal; otherwise it is illegal.
The decoded text character lines of undetermined group are clustered according to the text similarity, and clusters whose corresponding decoded text character lines have an illegal Index value are deleted. Each decoded text character line of undetermined group is then delivered to the group with which it has the maximum text similarity.
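The preliminary grouping by index, the Index legality check and the choice of a group's center member can be sketched as follows. To keep the sketch short, `difflib.SequenceMatcher.ratio()` from the Python standard library is swapped in as a stand-in similarity measure; it is not the alignment-based measure described above. Indices are assumed to start at 0, and the function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the alignment-based measure of the text."""
    return SequenceMatcher(None, a, b).ratio()


def group_by_index(lines, num_sequences: int) -> dict:
    """Group (index, text) pairs by index, discarding illegal index values."""
    groups = defaultdict(list)
    for index, text in lines:
        if 0 <= index < num_sequences:  # legal only if smaller than the sequence count
            groups[index].append(text)
    return dict(groups)


def center_member(members) -> str:
    """Member with the highest mean similarity to the other members of its group."""
    if len(members) == 1:
        return members[0]

    def mean_similarity(i: int) -> float:
        others = [m for k, m in enumerate(members) if k != i]
        return sum(similarity(members[i], other) for other in others) / len(others)

    return members[max(range(len(members)), key=mean_similarity)]
```

A member whose mean similarity is far below that of the center member would be a candidate for removal or reassignment under steps A to C.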
In this embodiment, in step S032c, the process of decoding according to the packet to obtain the original text may be subdivided into steps S032c3-S032c5:
S032c3, determining the weight value of characters in the decoded text character line in the group;
S032c4, determining the unique length value of the grouping, so that the length value of the decoded text character line in the grouping is the same as the unique length value;
S032c5, determining the characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining to obtain the original text.
Specifically, according to the majority principle, the real text line corresponding to each group is decoded and put into a set T, and the decoded text character lines in the set T are ordered according to index values. The specific steps for determining, by majority, the unique decoded character line represented by a group are as follows:
First, an initial weight value is calculated for each decoded text character line of the group. In an embodiment, the weight is calculated as: (the number of English letters correctly decoded in the decoded text character line) / (the number of all English letters decoded in the decoded text character line).
Next, the unique length value of the group is determined: it is the most frequently occurring length among the decoded text character lines in the group.
If the number of members of the group is less than τ (e.g. τ = 3), the word spellings in each decoded text character line of the group are checked and corrected. If the number of members of the group is smaller than τ, each decoded text character line whose character length is not equal to the length θ of the unique decoded character line of the group is compared in sequence with any decoded text character line of length θ in the group, and is then stretched or contracted accordingly.
Then, the character value of each column of the unique decoded line corresponding to the group is calculated in turn from the data of the decoded text character lines in the group. The calculation rule in the embodiment is: for each column, determine the characters occurring in that column, and for each such character sum the weight values of all rows in which it occurs; the character with the largest weight sum is selected.
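The unique length value and the column-wise weighted vote can be sketched as follows. As a simplification that departs from the text, lines whose length differs from the unique length are skipped rather than stretched or contracted; the weights are taken as given, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def unique_length(lines) -> int:
    """Most frequently occurring line length in the group."""
    return Counter(len(line) for line in lines).most_common(1)[0][0]


def weighted_vote(lines, weights) -> str:
    """For each column, pick the character with the largest sum of row weights."""
    target = unique_length(lines)
    # Simplification: rows of a different length are ignored instead of adjusted.
    rows = [(line, w) for line, w in zip(lines, weights) if len(line) == target]
    consensus = []
    for column in range(target):
        totals = defaultdict(float)
        for line, weight in rows:
            totals[line[column]] += weight
        consensus.append(max(totals, key=totals.get))
    return "".join(consensus)
```

Because weights are summed, a single high-weight line can outvote several low-weight lines in a given column.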
In summary, the implementation procedure of this embodiment is as follows: each character appearing in the English text is encoded in sequence according to the coding rule, and an index value is added to the base sequence of every M original text characters to obtain a series of DNA storage sequences. The DNA storage sequences are synthesized as base sequences, biologically stored, amplified and sequenced, and the reads in the sequencing file are then cleaned. The reads undergo decoding, RS error correction and multi-sequence error correction, and word error correction is then applied to recover the originally encoded English text.
As shown in FIG. 4, the data indicate that the English text can be completely restored at a sequencing depth of 25 when the error rate is 0.01, 0.02 or 0.05; when the error rate is 0.1, the original English text can be completely restored at a sequencing depth of 45.
From the above specific implementation process, it can be summarized that, compared with the prior art, the technical solution provided by the present application has the following advantages or benefits:
according to the technical scheme, the storage efficiency is improved, semantic information in an original text is fully utilized in the transcoding and decoding processes, and the query processing capability is high.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present application are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the application is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the functions and/or features may be integrated in a single physical device and/or software module or may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present application. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the application as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the application, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium may even be paper or other suitable medium upon which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, which are well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present application have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the application, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present application, and these equivalent modifications and substitutions are intended to be included in the scope of the present application as defined in the appended claims.

Claims (4)

1. A text storage method based on a DNA storage medium, comprising the steps of:
acquiring an original text, and encoding the original text to obtain a DNA storage sequence;
synthesizing the DNA storage sequence to obtain a DNA molecular sequence, amplifying the DNA molecular sequence, and storing the amplified DNA molecular sequence;
acquiring the stored DNA molecular sequence, and transcoding it to obtain the original text;
wherein the transcoding to obtain the original text comprises the following steps:
sequencing the stored DNA molecular sequence to obtain reads of the DNA molecular sequence;
preprocessing the reads to remove noise data from the reads, and transcoding the preprocessed reads to obtain the original text;
wherein the step of preprocessing the reads to remove noise data from the reads, and transcoding the preprocessed reads to obtain the original text, comprises:
acquiring the preprocessed reads, and deriving decoded character lines by applying the coding rule in reverse;
performing error correction on the decoded character lines to obtain decoded text character lines;
obtaining a plurality of groups according to the decoded text character lines and their text content, and decoding the groups to obtain the original text;
wherein the step of obtaining a plurality of groups according to the decoded text character lines and their text content comprises:
dividing the decoded text character lines according to their index values to obtain a plurality of groups, and determining the text similarity of the group members;
and performing a secondary division of the group members according to the text similarity, wherein the secondary division comprises at least one of the following steps:
according to a preset first threshold, adding members whose text similarity is smaller than the first threshold to other groups;
determining a mean value of the text similarity, and deleting group members according to the mean value;
clustering the members not belonging to the group according to the text similarity to obtain a new group;
wherein the step of decoding the groups to obtain the original text comprises:
determining weight values of the characters in the decoded text character lines in the group;
determining a unique length value of the group such that the length value of each decoded text character line in the group is the same as the unique length value;
and determining the characters of the original text according to the decoded text character lines with consistent length values and the weight values of the characters, and combining them to obtain the original text.
2. The text storage method based on a DNA storage medium according to claim 1, wherein the step of acquiring an original text and encoding the original text to obtain a DNA storage sequence comprises:
generating a coding base sequence according to a coding rule and characters in the original text, and generating an index value according to the coding base sequence;
generating byte check codes according to characters in the original text;
and constructing the DNA storage sequence according to the index value, the byte check code and the text data formed by the coding base sequence.
3. The method of claim 2, wherein the step of generating byte-check codes from characters in the original text comprises:
coding characters in the original text through a Reed Solomon code to obtain a binary character string;
and carrying out grouping base coding according to the binary character string to obtain the byte check code.
4. The method of claim 1, wherein the step of preprocessing the reads to remove noise data from the reads, and transcoding the preprocessed reads to obtain the original text, further comprises:
for an erroneous base in a read, determining the character with the minimum Hamming distance as the decoded character for the erroneous base.
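The minimum-Hamming-distance decoding recited in claim 4 can be illustrated with a minimal sketch. It is not the claimed implementation. It assumes a fixed-length codeword per character; the codebook shown is hypothetical and does not reproduce the coding rule of the embodiment.

```python
def hamming(a, b):
    """Number of positions at which two equal-length base strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def nearest_character(codeword, codebook):
    """Return the character whose codeword is closest to the observed one."""
    return min(codebook, key=lambda ch: hamming(codeword, codebook[ch]))


# Hypothetical codebook: any two codewords differ in all four positions.
CODEBOOK = {"a": "ACGT", "b": "CATG", "c": "GTAC", "d": "TGCA"}

if __name__ == "__main__":
    # "ACGA" is "ACGT" with its last base read incorrectly.
    print(nearest_character("ACGA", CODEBOOK))
```

When two codewords are equally close, this sketch returns whichever appears first in the codebook; a practical decoder would need an explicit tie-breaking rule.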
CN202011508358.7A 2020-12-18 2020-12-18 Text storage method based on DNA storage medium Active CN112582030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011508358.7A CN112582030B (en) 2020-12-18 2020-12-18 Text storage method based on DNA storage medium

Publications (2)

Publication Number Publication Date
CN112582030A CN112582030A (en) 2021-03-30
CN112582030B true CN112582030B (en) 2023-08-15

Family

ID=75136171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011508358.7A Active CN112582030B (en) 2020-12-18 2020-12-18 Text storage method based on DNA storage medium

Country Status (1)

Country Link
CN (1) CN112582030B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299347B (en) * 2021-05-21 2023-09-26 广州大学 DNA storage method based on modulation coding
CN113315623B (en) * 2021-05-21 2023-01-24 广州大学 Symmetric encryption method for DNA storage
CN113314187B (en) * 2021-05-27 2022-05-10 广州大学 Data storage method, decoding method, system, device and storage medium
WO2023272499A1 (en) * 2021-06-29 2023-01-05 中国科学院深圳先进技术研究院 Encoding method, decoding method, apparatus, terminal device, and readable storage medium
CN114218937B (en) * 2021-11-24 2022-12-02 中国科学院深圳先进技术研究院 Data error correction method and device and electronic equipment
CN114356220B (en) * 2021-12-10 2022-10-28 中科碳元(深圳)生物科技有限公司 Encoding method based on DNA storage, electronic device and readable storage medium
CN117254819B (en) * 2023-11-20 2024-02-27 深圳市瑞健医信科技有限公司 Medical waste intelligent supervision system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104850760A (en) * 2015-03-27 2015-08-19 苏州泓迅生物科技有限公司 Artificially synthesized DNA storage medium with coding information, storage reading method for information, and applications
CN106845158A (en) * 2017-02-17 2017-06-13 苏州泓迅生物科技股份有限公司 A kind of method that information Store is carried out using DNA
CN109416928A (en) * 2016-06-07 2019-03-01 伊路米纳有限公司 For carrying out the bioinformatics system, apparatus and method of second level and/or tertiary treatment
CN110427786A (en) * 2019-05-31 2019-11-08 西藏自治区人民政府驻成都办事处医院 A method of use DNA as text information efficient storage medium
CN110706751A (en) * 2019-09-25 2020-01-17 东南大学 DNA storage encryption coding method
CN111183233A (en) * 2017-10-02 2020-05-19 皇家飞利浦有限公司 Assessment of Notch cell signaling pathway activity using mathematical modeling of target gene expression
CN111368132A (en) * 2020-02-28 2020-07-03 元码基因科技(北京)股份有限公司 Method for storing audio or video files based on DNA sequences and storage medium
CN111600609A (en) * 2020-05-19 2020-08-28 东南大学 DNA storage coding method for optimizing Chinese storage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DNA information storage of audio and video files; 陈为刚; 黄刚; 李炳志; 尹烨; 元英进; 中国科学:生命科学 (Scientia Sinica Vitae), No. 01; pp. 1-4 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant