CN114254748A - Extended coding method, system and related device for storage channel - Google Patents

Publication number
CN114254748A
Authority
CN
China
Prior art keywords
coding
index
sequence
symbol sequence
binary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111136319.3A
Other languages
Chinese (zh)
Inventor
刘凯
任玉彬
张洪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202111136319.3A
Publication of CN114254748A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/123 DNA computing
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/13 Linear codes
    • H03M13/15 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes
    • H03M13/151 Cyclic codes, i.e. cyclic shifts of codewords produce other codewords, e.g. codes defined by a generator polynomial, Bose-Chaudhuri-Hocquenghem [BCH] codes using error location or error correction polynomials
    • H03M13/1515 Reed-Solomon codes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method

Abstract

The application provides an extended coding method for a storage channel, comprising the following steps: obtaining binary data, decomposing the binary data into binary source blocks, and coding the binary source blocks with a preset code to obtain independent data source blocks; encoding the source symbol sequence in each independent data source block with a preset encoder to obtain a coding sequence that contains no homopolymers; generating a primary data source block index and a secondary source symbol sequence index by automatic indexing, and combining them in the DNA coding symbol sequence to obtain a base index; and mapping the DNA coding symbol sequence into a binary symbol sequence, adding controlled redundancy through RS error correction coding, and performing inverse mapping to obtain the DNA sequence. The method breaks through the theoretical information density limit of 2 bits/base for DNA data coding that uses only natural bases as basic storage units, and achieves efficient, reliable and practical DNA coding. The present application also provides an extended coding system for a storage channel, a computer-readable storage medium, and an electronic device, which have the above-mentioned advantageous effects.

Description

Extended coding method, system and related device for storage channel
Technical Field
The present application relates to the field of data coding, and in particular, to a method, a system, and a related apparatus for extended coding of a storage channel.
Background
Mainstream storage media can no longer meet exponentially growing data storage requirements. The storage capacity of conventional media, such as magnetic, optical and solid-state media, is approaching its physical limit, and a novel storage medium with low power consumption, high density and long-term persistence is urgently needed. DNA, the carrier molecule of genetic information, offers long service life, high information density, low energy consumption and low maintenance cost, and is a promising data storage medium. However, current DNA data storage is limited to the 4 natural nucleotides A, T, C, G, and the coding density is constrained by this small set of basic storage units. Shuichi Hoshika et al. developed DNA- and RNA-like systems built from eight nucleotide "letters", A, T, C, G, P, Z, S, B, which form four orthogonal pairs. Extending the basic units of DNA data storage with non-natural nucleotides is an effective new way to further raise the theoretical storage density of DNA.
DNA data storage density and reliability are mainly affected by the biochemical properties and technical limitations inherent to DNA synthesis and sequencing, and by the probability distribution of nucleotides. Within an acceptable error rate, the ideal DNA synthesis length is about 250 bp, so large-scale data can only be encoded across thousands of short DNA strands. A large number of redundant bases must therefore be added for addressing, so that the data encoded in the DNA strands can be reconstructed. Another major limiting factor is the random and burst errors in DNA data storage channels, caused by the loss of whole DNA molecules and by errors within molecules (insertions, deletions, substitutions). Furthermore, homopolymers and unbalanced GC content increase the rate of erroneous and biased reads in sequencing, which is an important factor limiting the reliability of DNA data storage. Many researchers have focused on designing coding algorithms that produce nucleotide coding sequences with controlled GC content and no homopolymers. However, the desired coding density is often not achieved once homopolymers and GC content are controlled.
Disclosure of Invention
An object of the present application is to provide an extended coding method, an extended coding system, a computer-readable storage medium, and an electronic device for a storage channel, which can effectively control the GC base content and repeated base subsequences in the DNA coding process.
In order to solve the above technical problem, the present application provides an extended coding method for a storage channel, which has the following specific technical scheme:
acquiring binary data;
decomposing the binary data into binary source blocks, and coding the binary source blocks by adopting preset codes to obtain independent data source blocks;
encoding the source symbol sequence in the independent data source block by using a preset encoder to obtain an encoding sequence which does not contain a homopolymer;
generating a primary data source block index and a secondary source symbol sequence index of the coding sequence by adopting automatic indexing, and compounding the primary data source block index and the secondary source symbol sequence index in a DNA coding symbol sequence to obtain a base index;
and mapping the DNA coding symbol sequence into a binary symbol sequence, and performing inverse mapping after controlled redundancy is increased through RS error correction coding to obtain the DNA sequence.
Optionally, after obtaining the independent data source block, the method further includes:
precoding the independent data source block to generate intermediate symbols and redundant repair symbols corresponding to the intermediate symbols; the redundant repair symbols are used to recover lost coded symbol sequences.
Optionally, if the preset encoder is an LZW algorithm encoder, when the preset encoder is used to encode the source symbol sequence in the independent data source block, the method further includes:
and controlling the interval range of the LZW code word by dynamically expanding the capacity of the dictionary, and designing a constraint mapping table by taking the LZW code word as an index.
Optionally, if the preset encoder is an encoder based on the improved Base64 algorithm, and when the preset encoder is used to encode the source symbol sequence in the independent data source block, the method further includes:
encoding the source symbol sequence with a balanced code.
Optionally, designing a constraint mapping table with the LZW codeword as an index includes:
and designing a constraint mapping table by using the LZW code word as an index and adopting a pseudo-random mapping corresponding relation.
The present application also provides an extended coding system of a storage channel, comprising:
the acquisition module is used for acquiring binary data;
the decomposition module is used for decomposing the binary data into binary source blocks, and coding the binary source blocks by adopting preset codes to obtain independent data source blocks;
the encoding module is used for encoding the source symbol sequence in the independent data source block by using a preset encoder to obtain an encoding sequence which does not contain a homopolymer;
the index generating module is used for generating a primary data source block index and a secondary source symbol sequence index of the coding sequence by adopting automatic index, and compounding the primary data source block index and the secondary source symbol sequence index in a DNA coding symbol sequence to obtain a base index;
and the mapping module is used for mapping the DNA coding symbol sequence into a binary symbol sequence, and performing inverse mapping after controlled redundancy is increased through RS error correction coding to obtain the DNA sequence.
Optionally, the system further includes:
a precoding module, configured to precode the independent data source block after it is obtained, and to generate intermediate symbols and redundant repair symbols corresponding to the intermediate symbols; the redundant repair symbols are used to recover lost coded symbol sequences.
Optionally, if the preset encoder is an LZW algorithm encoder, the system further includes:
a dynamic extension module, configured to control the interval range of the LZW codewords by dynamically extending the capacity of the dictionary, and to design a constraint mapping table with the LZW codewords as indexes.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as set forth above.
The present application further provides an electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the method described above when calling the computer program in the memory.
The application provides an extended coding method of a storage channel, which comprises the following steps: acquiring binary data; decomposing the binary data into binary source blocks, and coding the binary source blocks by adopting preset codes to obtain independent data source blocks; encoding the source symbol sequence in the independent data source block by using a preset encoder to obtain an encoding sequence which does not contain a homopolymer; generating a primary data source block index and a secondary source symbol sequence index of the coding sequence by adopting automatic indexing, and compounding the primary data source block index and the secondary source symbol sequence index in a DNA coding symbol sequence to obtain a base index; and mapping the DNA coding symbol sequence into a binary symbol sequence, and performing inverse mapping after controlled redundancy is increased through RS error correction coding to obtain the DNA sequence.
On the basis, the content of GC Base and a repeated Base subsequence can be effectively controlled based on an improved LZW algorithm and a Base64 coding algorithm respectively, functions of RaptorQ erasure code, arithmetic coding, automatic index generation algorithm, RS error correction and the like are integrated, a high-efficiency, reliable and practical DNA coding mode is realized, meaningful reference is provided for the application of artificial Base in actual digital storage, and the potential of expanding other storage media is realized.
The application further provides an extended coding system for a storage channel, a computer-readable storage medium and an electronic device, which have the above beneficial effects and are not described herein again.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of an extended coding method for a storage channel according to an embodiment of the present application;
FIG. 2 is a schematic diagram of DNA data encoding based on the coexistence of natural and artificial bases;
FIG. 3 is a schematic diagram of LZW-based elimination of homopolymers and control of specific base content as provided in the embodiments of the present application;
FIG. 4 is a schematic diagram of the improved Base64 encoding provided by the embodiments of the present application;
FIG. 5 is a schematic diagram of the improved Base64 reverse encoding provided by the embodiments of the present application;
FIG. 6 is the Base64 code table provided by an embodiment of the present application;
FIG. 7 is the balance code table provided in an embodiment of the present application;
FIG. 8 is a histogram of the 4-, 6- and 8-base coding densities of the RALR and RABR systems for text, picture and video without symbol-level compression, as provided by embodiments of the present application;
FIG. 9 is a schematic structural diagram of an extended coding system for a storage channel according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The present application aims to expand the basic DNA storage elements so as to raise the theoretical coding density, and to provide a reliable high-density coding scheme for applying artificial bases to actual data storage. On this basis, GC base content and repeated base subsequences are effectively controlled by an improved LZW algorithm and an improved Base64 encoding algorithm; two efficient, reliable and practical DNA encoding prototype systems are developed, each integrating functional modules such as RaptorQ erasure codes, arithmetic coding, an automatic index generation algorithm and RS error correction; and 4-, 6- and 8-base encoding of texts, pictures and videos is realized with the two systems.
Referring to FIG. 1, FIG. 1 is a flowchart of an extended coding method for a storage channel according to an embodiment of the present application, where the method includes:
S101: acquiring binary data;
S102: decomposing the binary data into binary source blocks, and coding the binary source blocks by adopting preset codes to obtain independent data source blocks;
S103: encoding the source symbol sequence in the independent data source block by using a preset encoder to obtain a coding sequence which does not contain a homopolymer;
S104: generating a primary data source block index and a secondary source symbol sequence index of the coding sequence by adopting automatic indexing, and compounding the primary data source block index and the secondary source symbol sequence index in a DNA coding symbol sequence to obtain a base index;
S105: mapping the DNA coding symbol sequence into a binary symbol sequence, adding controlled redundancy through RS error correction coding, and performing inverse mapping to obtain the DNA sequence.
First, binary data to be encoded needs to be acquired. Of course, if the acquired data is not directly in binary form, the encoding process of the embodiment may be executed after the acquired data is converted into binary data.
Since the synthesis length of a single DNA strand is limited within an acceptable error range, large-scale data must be segmented to control the length of the coding sequence. Subject to the erasure redundancy and minimum padding requirements, the binary data is partitioned into binary source blocks, each consisting of a sequence of source symbols. The number of decomposition passes is not limited; whatever number of passes or decomposition parameters is used, it suffices that the resulting binary source blocks meet the synthesis length requirement of a single DNA strand.
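The partitioning step can be sketched as follows. This is a minimal illustration, not the procedure of the application: the function name `split_into_source_blocks`, the parameters `symbol_size` and `symbols_per_block`, and the zero-byte padding are all assumptions chosen for the example.

```python
def split_into_source_blocks(data: bytes, symbol_size: int, symbols_per_block: int):
    """Split data into source blocks, each a list of fixed-size source symbols.

    Hypothetical helper: symbol_size and symbols_per_block would in practice be
    derived from the DNA synthesis length limit and the redundancy budget.
    """
    if symbol_size <= 0 or symbols_per_block <= 0:
        raise ValueError("sizes must be positive")
    # Pad with zero bytes so that the last source symbol is full.
    # The amount of padding is returned so that a decoder can strip it again.
    padding = (-len(data)) % symbol_size
    padded = data + bytes(padding)
    symbols = [padded[i:i + symbol_size] for i in range(0, len(padded), symbol_size)]
    blocks = [symbols[i:i + symbols_per_block]
              for i in range(0, len(symbols), symbols_per_block)]
    return blocks, padding


blocks, padding = split_into_source_blocks(bytes(range(10)), symbol_size=4, symbols_per_block=2)
```

Smaller symbols keep each strand short but increase the number of strands, and therefore the indexing overhead per payload byte.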
As a preferred implementation, after the independent data source block is obtained, it may be precoded to generate intermediate symbols and redundant repair symbols corresponding to the intermediate symbols. The redundant repair symbols are used to recover lost coded symbol sequences. Specifically, a RaptorQ systematic code can be used to encode in units of independent data source blocks, generating repair symbols that recover the loss of entire DNA molecules. The number and size of the repair symbols are important factors affecting the coding density.
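The idea of a repair symbol can be shown with a much simpler stand-in. The sketch below is NOT RaptorQ: it uses a single XOR parity symbol per source block, which can restore exactly one lost symbol, whereas RaptorQ generates many repair symbols and tolerates many losses. The function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair_symbol(symbols):
    """Single XOR parity symbol over a block (simplified stand-in for RaptorQ repair symbols)."""
    return reduce(xor_bytes, symbols)


def recover_missing(received, repair):
    """Restore the one symbol marked as None, using the parity symbol."""
    missing = [i for i, s in enumerate(received) if s is None]
    if len(missing) != 1:
        raise ValueError("this toy code repairs exactly one lost symbol")
    present = [s for s in received if s is not None]
    restored = reduce(xor_bytes, present, repair)
    repaired = list(received)
    repaired[missing[0]] = restored
    return repaired


block = [b"ACGT", b"data", b"1234"]
repair = make_repair_symbol(block)
# Simulate the loss of the strand carrying the second symbol.
recovered = recover_missing([block[0], None, block[2]], repair)
```

Each repair symbol costs one additional strand, which is why the number and size of repair symbols trade directly against coding density.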
Data compression is performed at the coded symbol level using arithmetic coding to reduce redundancy in each RaptorQ coded symbol. As more and more encoded symbol sequence elements are received, the arithmetic encoder progressively reduces the size of the sub-interval in which the arithmetic codeword tag is located until the last element. The binary representation of the point in the subinterval corresponding to the last element is the arithmetic code word of the sequence.
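The interval narrowing described above can be sketched with exact rational arithmetic. This is an illustrative static-model arithmetic coder, assuming symbol probabilities are estimated from the sequence itself; the function names and the choice of the shortest-safe dyadic tag are assumptions, and a production coder would use fixed-precision integer arithmetic instead of `Fraction`.

```python
import math
from collections import Counter
from fractions import Fraction


def build_model(sequence):
    """Assign each symbol a sub-interval of [0, 1) proportional to its frequency."""
    counts = Counter(sequence)
    total = len(sequence)
    model, low = {}, Fraction(0)
    for symbol in sorted(counts):
        p = Fraction(counts[symbol], total)
        model[symbol] = (low, low + p)
        low += p
    return model


def narrow(sequence, model):
    """Shrink [low, high) once per symbol; the final interval identifies the sequence."""
    low, high = Fraction(0), Fraction(1)
    for symbol in sequence:
        width = high - low
        s_low, s_high = model[symbol]
        low, high = low + width * s_low, low + width * s_high
    return low, high


def tag_bits(low, high):
    """Binary expansion of a point inside [low, high), used as the codeword."""
    width = high - low
    k = 1
    while Fraction(1, 2 ** k) > width / 2:
        k += 1
    n = math.ceil(low * 2 ** k)  # smallest multiple of 2^-k that is >= low
    return format(n, f"0{k}b")


def decode(bits, model, length):
    """Recover the sequence by following the tag through the nested intervals."""
    tag = Fraction(int(bits, 2), 2 ** len(bits))
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        width = high - low
        for symbol, (s_low, s_high) in model.items():
            if low + width * s_low <= tag < low + width * s_high:
                out.append(symbol)
                low, high = low + width * s_low, low + width * s_high
                break
    return out


sequence = list("ABRACADABRA")
model = build_model(sequence)
low, high = narrow(sequence, model)
codeword = tag_bits(low, high)
```

The width of the final interval equals the product of the symbol probabilities, so likely sequences get wide intervals and short codewords.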
There is no limitation on how to encode the source symbol sequence in the independent data source block by using the preset encoder, and preferably, the embodiment provides two different encoding modes here:
if the preset encoder is an LZW algorithm encoder, when the preset encoder is used for encoding the source symbol sequence in the independent data source block, the method further comprises the following steps:
and controlling the interval range of the LZW code word by dynamically expanding the capacity of the dictionary, and designing a constraint mapping table by taking the LZW code word as an index. Specifically, the LZW codeword may be used as an index, and the constraint mapping table may be designed by using a pseudo-random mapping correspondence. The function of the constraint mapping table is to map the LZW code words into base pairs consisting of two bases one by one, and further map the LZW code word sequence into a DNA sequence without a homopolymer.
If the preset encoder is an encoder based on the modified Base64 algorithm, when the preset encoder is used to encode the source symbol sequence in the independent data source block, the source symbol sequence may be encoded by using a balanced code.
Eliminating homopolymers and controlling the content of specific bases are important means of reducing errors in current DNA synthesis and sequencing. A base content outside a certain range may cause sequencing bias. Because no high-throughput sequencing preference parameters are yet available for large-scale data storage in which natural and artificial nucleotides coexist, it is assumed that GC content remains a factor that leads to sequencing bias in 6- and 8-base encoded DNA data storage. According to source coding theory, an equiprobable distribution of bases favors a higher coding density, so the content of each specific nucleotide is controlled to about 1/N, where N is the number of nucleotide types. The base content can be fine-tuned according to the synthesis and sequencing preferences of DNA data storage with the specific natural and artificial bases in use.
If the improved LZW dictionary compression algorithm is adopted, homopolymers are eliminated and the GC base content is controlled by limiting the number of entries in the dynamically extended dictionary and by the mapping rule between LZW codewords and nucleotides. If the improved Base64 encoding algorithm is used, the GC base content is controlled and the probability of homopolymer formation is significantly reduced by code reshaping and by introducing 64 balanced 8-bit codes, each containing four 0s and four 1s. A pseudo-random mapping correspondence is used when designing the mapping tables of both the improved Base64 encoding algorithm and the LZW algorithm, so that the bases tend toward an equiprobable distribution.
Since the irregular distribution of DNA strands cannot support structured addressing, interpreting the DNA sequences of encoded data must rely on a rational index design. An automatic index generation algorithm generates homopolymer-free composite base indexes according to the number of data source blocks and the number of source symbols, and different composite base indexes have large Hamming distances from one another.
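The link between equiprobable bases and coding density can be checked numerically. The sketch below, with hypothetical helper names, computes the per-base information ceiling log2(N) for an N-letter alphabet and the empirical entropy of a given sequence, which reaches that ceiling only when every base occurs with frequency 1/N.

```python
import math
from collections import Counter


def max_bits_per_base(n_bases: int) -> float:
    """Information ceiling of one position drawn uniformly from n_bases letters."""
    return math.log2(n_bases)


def base_content(sequence: str) -> dict:
    """Relative frequency of each base in the sequence."""
    counts = Counter(sequence)
    return {base: count / len(sequence) for base, count in counts.items()}


def entropy_bits_per_base(sequence: str) -> float:
    """Shannon entropy of the empirical base distribution, in bits per base."""
    return -sum(p * math.log2(p) for p in base_content(sequence).values())


for n in (4, 6, 8):
    print(n, "bases:", round(max_bits_per_base(n), 3), "bits/base")

uniform = "ATCGPZSB" * 4   # every base at 1/8
skewed = "AAAT" * 8        # strongly unbalanced content
print(entropy_bits_per_base(uniform), entropy_bits_per_base(skewed))
```

A skewed base distribution carries less information per position, which is the source-coding argument for targeting a content of about 1/N per base.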
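That 64 balanced 8-bit codes exist can be verified by enumeration: there are C(8,4) = 70 bytes with exactly four 1s, so a table of 64 can be chosen from them. The sketch below enumerates them; taking the first 64 in ascending order is an assumption made for illustration, since the actual table of FIG. 7 is not reproduced in the text.

```python
from itertools import combinations


def balanced_bytes():
    """All 8-bit values with exactly four 1 bits and four 0 bits, ascending."""
    codes = []
    for one_positions in combinations(range(8), 4):
        codes.append(sum(1 << bit for bit in one_positions))
    return sorted(codes)


all_balanced = balanced_bytes()
# Hypothetical balance table: the first 64 balanced bytes (the real FIG. 7 table may differ).
balance_table = all_balanced[:64]

print(len(all_balanced))                     # number of balanced bytes available
print(format(balance_table[0], "08b"))       # smallest balanced code
```

Because every code has the same weight, any bit position fed from a balanced code is 1 exactly half of the time within that code, regardless of the data.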
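One way to obtain a homopolymer-free index is a rotation code, sketched below as an assumption; the automatic index generation algorithm of the application is not specified in the text and may differ. Each digit in base N-1 selects one of the N-1 bases that differ from the previous base, so adjacent bases are never equal. The primary (block) and secondary (symbol) indexes are packed into one number so that the junction between them is also covered. This sketch does not enforce a minimum Hamming distance between indexes.

```python
BASES = "ATCGPZSB"  # 8-letter alphabet named in the text


def int_to_index(value: int, length: int, bases: str = BASES) -> str:
    """Encode value as `length` bases with no two adjacent bases equal."""
    radix = len(bases) - 1
    if not 0 <= value < radix ** length:
        raise ValueError("value does not fit in the requested index length")
    digits = []
    for _ in range(length):
        value, digit = divmod(value, radix)
        digits.append(digit)
    digits.reverse()
    previous = bases[-1]  # virtual predecessor, never emitted
    out = []
    for digit in digits:
        choices = [b for b in bases if b != previous]
        previous = choices[digit]
        out.append(previous)
    return "".join(out)


def index_to_int(index: str, bases: str = BASES) -> int:
    """Inverse of int_to_index."""
    radix = len(bases) - 1
    previous, value = bases[-1], 0
    for base in index:
        choices = [b for b in bases if b != previous]
        value = value * radix + choices.index(base)
        previous = base
    return value


def composite_index(block_id: int, symbol_id: int, block_len: int, symbol_len: int) -> str:
    """Primary block index followed by secondary symbol index, as one base string."""
    radix = len(BASES) - 1
    if not (0 <= block_id < radix ** block_len and 0 <= symbol_id < radix ** symbol_len):
        raise ValueError("index out of range")
    return int_to_index(block_id * radix ** symbol_len + symbol_id, block_len + symbol_len)


def parse_index(index: str, symbol_len: int):
    """Split a composite index back into (block_id, symbol_id)."""
    return divmod(index_to_int(index), (len(BASES) - 1) ** symbol_len)


example = composite_index(5, 123, block_len=2, symbol_len=3)
```

With 8 bases, each index position carries log2(7) bits instead of 3, which is the price paid for excluding repeats.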
At the coded symbol sequence level, check redundancy is generated by RS encoding to correct base substitutions and dislocations in the DNA molecule. The generator polynomial is computed over a Galois field, and the message polynomial is multiplied by the generator polynomial to obtain the RS codeword polynomial and hence the corresponding RS code symbol sequence. Finally, the RS code symbols are mapped to the corresponding DNA coding symbol sequence. In summary: a homopolymer-free composite base index, integrated in the DNA coding sequence, is generated by the automatic index generation algorithm according to the number of data source blocks and the number of source symbols; the DNA coding symbols are converted into binary form and passed through an RS encoder, which adds controlled redundancy within the symbol to correct base errors in the DNA strand; and the binary DNA coding symbol sequence is mapped back to obtain the final DNA coding sequence.
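The polynomial construction can be sketched over GF(2^8). The field size, the primitive polynomial 0x11D and the roots α^0 ... α^(nsym-1) are assumptions chosen for illustration; the text does not state the RS parameters actually used. The sketch covers encoding and error detection through syndromes only, not error correction.

```python
# GF(2^8) log/antilog tables for the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1.
PRIMITIVE = 0x11D
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= PRIMITIVE
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p, q):
    """Multiply two polynomials (coefficients listed highest degree first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def poly_eval(p, x: int) -> int:
    """Evaluate a polynomial at x with Horner's rule."""
    y = 0
    for coefficient in p:
        y = gf_mul(y, x) ^ coefficient
    return y


def generator_poly(nsym: int):
    """g(x) = (x - a^0)(x - a^1)...(x - a^(nsym-1)); subtraction is XOR in GF(2^8)."""
    g = [1]
    for i in range(nsym):
        g = poly_mul(g, [1, EXP[i]])
    return g


def rs_encode(message, nsym: int):
    """Non-systematic RS codeword: message polynomial times generator polynomial."""
    if len(message) + nsym > 255:
        raise ValueError("codeword longer than 255 symbols")
    return poly_mul(list(message), generator_poly(nsym))


def syndromes(codeword, nsym: int):
    """All-zero syndromes mean the word is a valid codeword."""
    return [poly_eval(codeword, EXP[i]) for i in range(nsym)]


codeword = rs_encode(b"DNA storage", nsym=4)
corrupted = codeword[:]
corrupted[3] ^= 0x5A  # simulate a substituted symbol
print(syndromes(codeword, 4), syndromes(corrupted, 4))
```

Each pair of check symbols allows one unknown symbol error to be corrected, so `nsym` sets the trade-off between redundancy and coding density.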
On the basis, the GC Base content and the repeated Base subsequence can be effectively controlled based on an improved LZW algorithm and a Base64 coding algorithm respectively, functions of RaptorQ erasure code, arithmetic coding, automatic index generation algorithm, RS error correction and the like are integrated, an efficient, reliable and practical DNA coding mode is realized, meaningful reference is provided for application of artificial bases in actual digital storage, and the potential of expanding other storage media is achieved.
The present application is further described below with reference to the accompanying drawings:
FIG. 2 is a schematic diagram of DNA data encoding based on the coexistence of natural and artificial bases.
FIG. 2 shows the eight nucleotide "letters" A, T, C, G, P, Z, S, B, which form four orthogonal pairs. For three source files (text, picture and video) with data compression at the coded symbol sequence level, the 8-base coding densities of the RABR system reach 4.63 bits/base, 3.88 bits/base and 3.27 bits/base respectively. The encoding process of the RALR and RABR prototype systems includes: I) partitioning the digital file into binary source blocks composed of source symbols; the RaptorQ encoder encodes in units of independent data source blocks, generating intermediate symbols through precoding and then redundant repair symbols from the intermediate symbols, so that coded data lost with an entire DNA strand can be recovered; II) performing arithmetic compression with each source symbol sequence as an independent unit to reduce information redundancy in the RaptorQ coded symbol sequence; IV) encoding the arithmetic-coded symbols with the improved Base64 encoder (RABR) or the LZW encoder (RALR) to obtain DNA coding symbols that are free of homopolymers and have a GC base content of about 1/N, where N is the number of coding base types.
FIG. 3 is a schematic representation of LZW elimination of homopolymers and control of specific base content.
FIG. 3 illustrates, taking 8-base coding as an example, the principle of eliminating homopolymers and controlling the content of specific bases, which includes the following steps:
1) The 8 single bases in the LZW initial dictionary are combined pairwise to form 64 base pairs; the pairs consisting of two identical bases are then removed to eliminate homopolymers, and the remaining 56 pairs of distinct bases make up the mapping table between LZW codewords and base sequences.
2) According to the number of entries in the mapping table, the dynamically extended dictionary formed during LZW encoding is limited to at most 48 entries, namely the number of entries in the mapping table minus the number of entries in the LZW initial dictionary, so that the LZW codewords are confined to the range [0, 55].
3) Mapping LZW codewords to base sequences according to the mapping table eliminates homopolymers: no concatenation of entries from the mapping table can produce three consecutive identical bases.
4) In LZW coding, small-value codewords occur with higher probability than large-value codewords, so the content of each base can be qualitatively controlled to about 1/N by adjusting the correspondence between codewords and nucleotide pairs in the mapping table. In addition, the mapping between LZW codewords and base sequences is designed by pseudo-randomization, so that the base content can be effectively controlled to about 1/N, where N is the number of base types.
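Steps 1) to 4) can be sketched as follows for the 8-base case. Several details are assumptions made for the example: the input is taken to be a string over the eight letters (standing for 8-ary source symbols), the dictionary is frozen once it holds 56 entries, and the pseudo-random mapping is a seeded shuffle; the actual table of FIG. 3 is not given in the text.

```python
import random
from itertools import permutations

BASES = "ATCGPZSB"


def build_mapping_table(seed: int = 0):
    """56 ordered pairs of distinct bases, in a pseudo-random but reproducible order."""
    pairs = ["".join(p) for p in permutations(BASES, 2)]
    random.Random(seed).shuffle(pairs)
    return pairs


def lzw_encode(symbols: str, max_entries: int = 56):
    """LZW whose dictionary stops growing at max_entries, so codewords stay in [0, 55]."""
    dictionary = {b: i for i, b in enumerate(BASES)}
    codes, current = [], ""
    for s in symbols:
        candidate = current + s
        if candidate in dictionary:
            current = candidate
        else:
            codes.append(dictionary[current])
            if len(dictionary) < max_entries:
                dictionary[candidate] = len(dictionary)
            current = s
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes, max_entries: int = 56) -> str:
    """Inverse of lzw_encode, rebuilding the same capped dictionary."""
    if not codes:
        return ""
    dictionary = {i: b for i, b in enumerate(BASES)}
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        # A code not yet in the dictionary is the entry the encoder just created.
        entry = dictionary[code] if code in dictionary else previous + previous[0]
        out.append(entry)
        if len(dictionary) < max_entries:
            dictionary[len(dictionary)] = previous + entry[0]
        previous = entry
    return "".join(out)


def codes_to_dna(codes, table) -> str:
    return "".join(table[c] for c in codes)


def dna_to_codes(dna: str, table):
    inverse = {pair: i for i, pair in enumerate(table)}
    return [inverse[dna[i:i + 2]] for i in range(0, len(dna), 2)]


def longest_run(sequence: str) -> int:
    """Length of the longest stretch of identical consecutive characters."""
    best = run = 0
    previous = None
    for ch in sequence:
        run = run + 1 if ch == previous else 1
        best = max(best, run)
        previous = ch
    return best


table = build_mapping_table()
source = "".join(random.Random(1).choices(BASES, k=500))
codes = lzw_encode(source)
dna = codes_to_dna(codes, table)
```

Since every table entry consists of two different bases, two adjacent entries can share at most one base at their junction, so runs never exceed two identical bases.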
FIGS. 4-7 of the present application illustrate the encoding and decoding of the improved Base64 algorithm for 4-base coding.
FIG. 4 is a schematic diagram of the improved Base64 encoding, which includes: I) Every 3 bytes of binary data are re-divided into 4 groups of 6 bits. Two "0"s are prepended to each 6-bit group to form a new 8-bit binary value. Each 8-bit value is then converted to a Base64 coded symbol according to the decimal number it represents and a mapping table of 64 printable characters. II) Code reshaping and balancing. The Base64 coded symbols are divided into groups of 7 characters. The 1st, 3rd and 5th characters of each group are converted into 8-bit binary codes according to a customized balance code table (FIG. 7), in which each 8-bit balanced code contains four "1"s and four "0"s. The 2nd, 4th, 6th and 7th characters of each group are converted into 6-bit binary codes according to the Base64 code table (FIG. 6); the 6 bits of the 7th character are divided into 3 groups of 2 bits, and one 2-bit group is added to each of the 6-bit codes of the 2nd, 4th and 6th characters to form 8-bit binary values. Each regenerated 8-bit value (from the 2nd, 4th or 6th character) is paired with the 8-bit balanced code of the 1st, 3rd or 5th Base64 coded symbol, respectively, to form binary data of 2 rows and 8 columns. The two rows are mapped to a DNA sequence according to the mapping rule {00: A, 10: T, 01: C, 11: G}. Under this rule, "C" or "G" is produced only when a "1" appears in the balanced code, and the probability of a "1" appearing in the balanced code is about 50%.
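The reshaping of one 7-character group can be sketched as below. Step I is ordinary Base64 and is taken from the standard library. For step II the sketch makes three assumptions that the text leaves open: the balance table is the first 64 balanced bytes in ascending order (the real FIG. 7 table is not reproduced), the 2-bit pieces of the 7th character are appended after the 6 bits of the 2nd, 4th and 6th characters, and each of the 8 columns is read as (data bit, balanced bit) to select a base.

```python
import base64
from itertools import combinations

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
# Hypothetical balance table: first 64 bytes with four 1 bits (FIG. 7 may differ).
BALANCED = sorted(sum(1 << b for b in c) for c in combinations(range(8), 4))[:64]
BITS_TO_BASE = {"00": "A", "10": "T", "01": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_group(chars: str) -> str:
    """Map 7 Base64 characters (42 bits) to 24 bases."""
    if len(chars) != 7:
        raise ValueError("a group holds exactly 7 Base64 characters")
    values = [B64.index(c) for c in chars]
    seventh = format(values[6], "06b")
    dna = []
    for k in range(3):
        balanced = format(BALANCED[values[2 * k]], "08b")                   # chars 1, 3, 5
        data = format(values[2 * k + 1], "06b") + seventh[2 * k:2 * k + 2]  # chars 2, 4, 6
        # One base per column: top bit from the data row, bottom bit from the balanced row.
        dna.extend(BITS_TO_BASE[top + bottom] for top, bottom in zip(data, balanced))
    return "".join(dna)


def decode_group(dna: str) -> str:
    """Inverse of encode_group."""
    values, seventh = [0] * 7, ""
    for k in range(3):
        columns = [BASE_TO_BITS[b] for b in dna[8 * k:8 * k + 8]]
        data = "".join(c[0] for c in columns)
        balanced = "".join(c[1] for c in columns)
        values[2 * k] = BALANCED.index(int(balanced, 2))
        values[2 * k + 1] = int(data[:6], 2)
        seventh += data[6:]
    values[6] = int(seventh, 2)
    return "".join(B64[v] for v in values)


text = base64.b64encode(b"DNA storage!").decode("ascii")  # step I: standard Base64
group = text[:7]
dna = encode_group(group)
gc_fraction = sum(dna.count(b) for b in "CG") / len(dna)
```

Under these assumptions the bottom bit of every column comes from a balanced code, so exactly half of the bases in each group are C or G, whatever the data bits are.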
Fig. 8 of the present application is a summary histogram of 4, 6, 8-base coding densities for RALR and RABR systems for text, picture and video without symbol level compression.
In FIG. 8, graphs a and b show the RALR and RABR coding densities for text with and without erasure correction redundancy; graphs c and d show the RALR and RABR coding densities for pictures with and without erasure correction redundancy; and graphs e and f show the RALR and RABR coding densities for video with and without erasure correction redundancy.
The application integrates two practical, high-density, reliable DNA data coding prototype systems for DNA storage channels in which natural and artificial bases coexist, and realizes 4-, 6- and 8-base DNA data coding of three file types: text, pictures and video. The RALR and RABR systems designed in the application achieve a high decoding success rate and reliability through RaptorQ codes, up to the maximum theoretical fault tolerance of the RaptorQ and RS codes. In addition, the improved Base64 code and the improved Lempel-Ziv-Welch (LZW) code are used, in a new coding mode, to eliminate homopolymers and control the content of specific nucleotides, and a pseudo-random mapping correspondence is used when designing the mapping tables of the Base64 coding algorithm and the LZW algorithm, so that the bases tend toward an equiprobable distribution. This effectively reduces the DNA sequencing error probability, achieves reliable storage with only a small amount of added erasure correction and error correction redundancy, and maintains high coding efficiency. For the three file types of text, picture and video, the 4-, 6- and 8-base coding densities are 2.50 bits/base, 3.11 bits/base and 3.41 bits/base for the RALR system, and 2.97 bits/base, 4.01 bits/base and 4.63 bits/base for the RABR system. Even with a large amount of controllable erasure correction redundancy added, the 4-, 6- and 8-base coding densities still reach 2.17 bits/base, 2.70 bits/base and 2.97 bits/base for the RALR system, and 2.58 bits/base, 3.40 bits/base and 3.97 bits/base for the RABR system.
The following describes an extended coding system for a storage channel provided in an embodiment of the present application; the extended coding system described below and the extended coding method for a storage channel described above may be read with reference to each other.
Referring to Fig. 9, which is a flowchart of the extended coding method for a storage channel according to an embodiment of the present application, the present application further provides an extended coding system for a storage channel, including:
an obtaining module 100, configured to obtain binary data;
a decomposition module 200, configured to decompose the binary data into binary source blocks, and encode the binary source blocks by using a preset code to obtain independent data source blocks;
an encoding module 300, configured to encode the source symbol sequences in the independent data source blocks by using a preset encoder to obtain a coding sequence that does not contain homopolymers;
an index generating module 400, configured to generate a primary data source block index and a secondary source symbol sequence index of the coding sequence by using automatic indexing, and compound the primary data source block index and the secondary source symbol sequence index in a DNA coding symbol sequence to obtain a base index;
and a mapping module 500, configured to map the DNA coding symbol sequence into a binary symbol sequence, add controlled redundancy through RS error correction coding, and then perform inverse mapping to obtain the DNA sequence.
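As a rough illustration of the decomposition and two-level indexing performed by modules 200 and 400, the Python sketch below splits binary data into source blocks and source symbols, tags each symbol with a (block index, symbol index) pair, and reassembles the data from records received in any order. It also includes a simple check for homopolymer runs. The function names, the block and symbol sizes, and the run-length threshold max_run are assumptions made for this example; the application does not specify them here, and the sketch does not implement the actual encoders or the RS step.

```python
def split_with_two_level_index(data, block_size, symbol_size):
    """Split data into source blocks, then into source symbols.

    Returns (block_index, symbol_index, payload) records, i.e. a primary
    index for the source block and a secondary index for the symbol in it.
    """
    records = []
    for b, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        for s, offset in enumerate(range(0, len(block), symbol_size)):
            records.append((b, s, block[offset:offset + symbol_size]))
    return records


def reassemble(records):
    """Rebuild the original bytes from records received in any order."""
    return b"".join(payload for _, _, payload in sorted(records))


def has_homopolymer(seq, max_run=1):
    """Return True if any base repeats more than max_run times in a row."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return True
    return False


if __name__ == "__main__":
    data = bytes(range(10))
    recs = split_with_two_level_index(data, block_size=4, symbol_size=2)
    print(recs[:2])
    print(reassemble(list(reversed(recs))) == data)
    print(has_homopolymer("ACGT"), has_homopolymer("AACG"))
```

Carrying both indices with every symbol is what allows sequences that are read back out of order to be returned to their source block and position.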
Based on the above embodiment, as a preferred embodiment, the system further includes:
a precoding module, configured to precode the independent data source block to generate intermediate symbols and redundant repair symbols corresponding to the intermediate symbols; the redundant repair symbols are used to recover lost coded symbol sequences.
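The role of a repair symbol can be shown with a deliberately simplified stand-in. The Python sketch below does not implement RaptorQ precoding; it uses a single XOR parity symbol, which can restore exactly one lost symbol of equal length, purely to illustrate how redundancy generated before transmission lets a lost coded symbol sequence be recovered. All names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_repair_symbol(symbols):
    """XOR all source symbols together to form one parity (repair) symbol."""
    return reduce(xor_bytes, symbols)


def recover_missing(received, repair):
    """Restore the single missing symbol (marked None) using the repair symbol."""
    missing = [i for i, s in enumerate(received) if s is None]
    if len(missing) != 1:
        raise ValueError("this toy scheme repairs exactly one lost symbol")
    present = [s for s in received if s is not None]
    restored = reduce(xor_bytes, present, repair)
    out = list(received)
    out[missing[0]] = restored
    return out


if __name__ == "__main__":
    symbols = [b"ACGT", b"TTGA", b"CCAG"]
    repair = make_repair_symbol(symbols)
    print(recover_missing([symbols[0], None, symbols[2]], repair) == symbols)
```

A fountain code such as RaptorQ generalizes this idea: it can produce many repair symbols and tolerate multiple losses, which a single parity symbol cannot.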
Based on the foregoing embodiment, as a preferred embodiment, if the preset encoder is an LZW algorithm encoder, the system further includes:
a dynamic expansion module, configured to control the value range of the LZW codewords by dynamically expanding the capacity of the dictionary, and to design a constrained mapping table that uses the LZW codewords as indices.
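One way to picture this module is a textbook LZW coder whose dictionary grows only up to a fixed capacity, so that every codeword stays below that capacity, combined with a lookup table that maps each codeword to a short DNA word. The Python sketch below is an assumption-laden illustration, not the improved LZW of the application: max_dict_size, word_len, the seed and the four-letter alphabet are hypothetical parameters, and the table only avoids repeated bases inside each word, not across the boundary between consecutive words.

```python
import itertools
import random


def lzw_encode(data, max_dict_size=4096):
    """Standard LZW; the dictionary stops growing at max_dict_size (>= 256)."""
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            if len(table) < max_dict_size:
                table[wc] = len(table)
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out


def lzw_decode(codes, max_dict_size=4096):
    """Inverse of lzw_encode with the same capacity limit."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in table:
            entry = table[k]
        elif k == len(table):
            entry = w + w[:1]  # codeword refers to the entry being built
        else:
            raise ValueError("invalid LZW codeword")
        out.append(entry)
        if len(table) < max_dict_size:
            table[len(table)] = w + entry[:1]
        w = entry
    return b"".join(out)


def build_mapping_table(max_dict_size, word_len, seed=0, alphabet="ACGT"):
    """Pseudo-randomly assign one repeat-free DNA word to each codeword."""
    words = ["".join(p) for p in itertools.product(alphabet, repeat=word_len)
             if all(a != b for a, b in zip(p, p[1:]))]
    if len(words) < max_dict_size:
        raise ValueError("word_len too short for this dictionary capacity")
    random.Random(seed).shuffle(words)  # seeded, so the decoder can rebuild it
    return words[:max_dict_size]


if __name__ == "__main__":
    data = b"TOBEORNOTTOBEORTOBEORNOT" * 3
    codes = lzw_encode(data, max_dict_size=300)
    table = build_mapping_table(300, word_len=5)
    print(max(codes) < 300, lzw_decode(codes, 300) == data)
    print([table[c] for c in codes[:4]])
```

Because the shuffle is seeded, an encoder and a decoder that share the seed derive the same table without having to store it alongside the data.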
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed, may implement the steps provided by the above-described embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The application further provides an electronic device, which may include a memory and a processor, where the memory stores a computer program, and the processor may implement the steps provided by the foregoing embodiments when calling the computer program in the memory. Of course, the electronic device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system provided by the embodiment, the description is relatively simple because the system corresponds to the method provided by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. An extended coding method for a storage channel, comprising:
acquiring binary data;
decomposing the binary data into binary source blocks, and coding the binary source blocks by adopting preset codes to obtain independent data source blocks;
encoding the source symbol sequence in the independent data source block by using a preset encoder to obtain an encoding sequence which does not contain a homopolymer;
generating a primary data source block index and a secondary source symbol sequence index of the coding sequence by adopting automatic indexing, and compounding the primary data source block index and the secondary source symbol sequence index in a DNA coding symbol sequence to obtain a base index;
and mapping the DNA coding symbol sequence into a binary symbol sequence, and performing inverse mapping after controlled redundancy is increased through RS error correction coding to obtain the DNA sequence.
2. The extended coding method of claim 1, wherein after obtaining the independent data source block, the method further comprises:
precoding the independent data source block to generate intermediate symbols and redundant repair symbols corresponding to the intermediate symbols; the redundant repair symbols are used to recover lost coded symbol sequences.
3. The extended coding method of claim 1, wherein, if the preset encoder is an LZW algorithm encoder, encoding the source symbol sequence in the independent data source block by using the preset encoder further comprises:
and controlling the interval range of the LZW code word by dynamically expanding the capacity of the dictionary, and designing a constraint mapping table by taking the LZW code word as an index.
4. The extended coding method of claim 1, wherein, if the preset encoder is an encoder based on the improved Base64 algorithm, encoding the source symbol sequences in the independent data source blocks by using the preset encoder further comprises:
encoding the source symbol sequence with a balanced code.
5. The extended coding method of claim 3, wherein designing a constraint mapping table with the LZW codeword as an index comprises:
and designing a constraint mapping table by using the LZW code word as an index and adopting a pseudo-random mapping corresponding relation.
6. An extended coding system for a storage channel, comprising:
the acquisition module is used for acquiring binary data;
the decomposition module is used for decomposing the binary data into binary source blocks, and coding the binary source blocks by adopting preset codes to obtain independent data source blocks;
the encoding module is used for encoding the source symbol sequence in the independent data source block by using a preset encoder to obtain an encoding sequence which does not contain a homopolymer;
the index generating module is used for generating a primary data source block index and a secondary source symbol sequence index of the coding sequence by adopting automatic index, and compounding the primary data source block index and the secondary source symbol sequence index in a DNA coding symbol sequence to obtain a base index;
and the mapping module is used for mapping the DNA coding symbol sequence into a binary symbol sequence, and performing inverse mapping after controlled redundancy is increased through RS error correction coding to obtain the DNA sequence.
7. The extended coding system of claim 6, further comprising:
a precoding module, configured to precode the independent data source block to generate intermediate symbols and redundant repair symbols corresponding to the intermediate symbols; the redundant repair symbols are used to recover lost coded symbol sequences.
8. The extended coding system of claim 6, wherein if the preset encoder is an LZW algorithm encoder, the system further comprises:
a dynamic expansion module, configured to control the value range of the LZW codewords by dynamically expanding the capacity of the dictionary, and to design a constrained mapping table that uses the LZW codewords as indices.
9. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the extended coding method for a storage channel according to any one of claims 1 to 5.
10. An electronic device, characterized in that it comprises a memory in which a computer program is stored and a processor which, when calling the computer program in said memory, carries out the steps of the extended coding method for a storage channel according to any one of claims 1 to 6.
CN202111136319.3A 2021-09-27 2021-09-27 Extended coding method, system and related device for storage channel Pending CN114254748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111136319.3A CN114254748A (en) 2021-09-27 2021-09-27 Extended coding method, system and related device for storage channel

Publications (1)

Publication Number Publication Date
CN114254748A true CN114254748A (en) 2022-03-29

Family

ID=80790396

Country Status (1)

Country Link
CN (1) CN114254748A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023201782A1 (en) * 2022-04-23 2023-10-26 中国科学院深圳先进技术研究院 Information coding method and apparatus based on dna storage, and computer device and medium
CN115296799A (en) * 2022-07-21 2022-11-04 杭州跃马森创信息科技有限公司 Quick face recognition method for micro-service user identity authentication
CN115296799B (en) * 2022-07-21 2023-03-14 杭州跃马森创信息科技有限公司 Quick face recognition method for micro-service user identity authentication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination