EP4150114A1 - Method and device for decoding data stored in a DNA-based storage system - Google Patents

Method and device for decoding data stored in a DNA-based storage system

Info

Publication number
EP4150114A1
EP4150114A1 EP21731261.0A EP21731261A EP4150114A1 EP 4150114 A1 EP4150114 A1 EP 4150114A1 EP 21731261 A EP21731261 A EP 21731261A EP 4150114 A1 EP4150114 A1 EP 4150114A1
Authority
EP
European Patent Office
Prior art keywords
nucleotide
nucleotides
decoding
dna
probability density
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21731261.0A
Other languages
English (en)
French (fr)
Inventor
Laura Conde-Canencia
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Universite de Bretagne Sud
Original Assignee
Universite de Bretagne Sud
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universite de Bretagne Sud

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/65 Purpose and implementation aspects
    • H03M13/6597 Implementations using analogue techniques for coding or decoding, e.g. analogue Viterbi decoder
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/483 Physical analysis of biological material
    • G01N33/487 Physical analysis of biological material of liquid biological material
    • G01N33/48707 Physical analysis of biological material of liquid biological material by electrical means
    • G01N33/48721 Investigating individual macromolecules, e.g. by translocation through nanopores
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/123 DNA computing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102 Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H03M13/1105 Decoding
    • H03M13/1111 Soft-decision decoding, e.g. by means of message passing or belief propagation algorithms
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/03 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words
    • H03M13/05 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits
    • H03M13/11 Error detection or forward error correction by redundancy in data representation, i.e. code words containing more digits than the source words using block codes, i.e. a predetermined number of check bits joined to a predetermined number of information bits using multiple parity bits
    • H03M13/1102 Codes on graphs and decoding on graphs, e.g. low-density parity check [LDPC] codes
    • H03M13/1148 Structural properties of the code parity-check or generator matrix
    • H03M13/1171 Parity-check or generator matrices with non-binary elements, e.g. for non-binary LDPC codes
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M13/00 Coding, decoding or code conversion, for error detection or error correction; Coding theory basic assumptions; Coding bounds; Error probability evaluation methods; Channel models; Simulation or testing of codes
    • H03M13/61 Aspects and characteristics of methods and arrangements for error correction or error detection, not provided for otherwise
    • H03M13/611 Specific encoding aspects, e.g. encoding by means of decoding
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 Details relating to CAD techniques
    • G06F2111/08 Probabilistic or stochastic CAD

Definitions

  • TITLE method and device for decoding data stored in a DNA-based storage system
  • the present description relates to a device for decoding data stored in a storage system based on DNA (deoxyribonucleic acid) with a nanopore sequencer and a corresponding decoding method.
  • DNA sequencing: one of the methods of reading data stored in DNA-based storage systems is known as DNA sequencing. Its goal is to determine the exact nucleotides and their order in a DNA sequence that encodes digital data.
  • nanopore sequencing is based on the detection of changes in an ionic current when a DNA sequence passes through a hole at the nanoscale.
  • Each nucleic base or nucleotide causes a different magnitude of current drop due to its different atomic structure. This makes it possible to identify the nucleotide passing through the nanopore at a given time.
  • the main advantage of nanopore sequencers is that they can read long sequences in a single step, up to several tens of thousands of nucleotides.
  • micro-array based synthesis makes it possible to synthesize DNA sequences of lengths up to 200 nucleotides, at a cost of approximately $ 0.001 per nucleotide.
  • its major drawback is its high error rate. From a general point of view, current synthesis methods combine either a high cost with a high precision or, on the contrary, a low cost with a low precision, and current research tries to reduce the gap between these two extremes.
  • the main error-generating events in DNA synthesis are simple substitutions [1] [2] [10] of one nucleotide for another and the substitution error rates depend mainly on performance and the cost of the technology [13]
  • a method for decoding a sequence of binary data encoded by a sequence of nucleotides to be decoded comprising B types of DNA nucleotides, B being an integer equal to 2, 3 or 4 , the decoding method comprising
  • the probability density function is a Gaussian probability density function and the flexible decoding algorithm is based on a modeling of the current drop measurement produced by the nanopore sequencer as a variable modulated by pulse-amplitude modulation at B discrete levels, each level corresponding to an average value of the probability density function obtained for a given type of nucleotide, the modulated variable being corrupted by B channels of additive white Gaussian noise corresponding respectively to the statistical distributions obtained for the B types of nucleotides.
  • the nanopore sequencer is thus modeled as an asymmetric communication channel.
  • the error correcting code is a turbo code or an LDPC (Low-Density Parity-Check) code.
  • the flexible decoding algorithm is for example a turbo-code algorithm of the MAP, Message type
  • the flexible decoding algorithm is, for example, a Min-Sum algorithm for LDPC codes or a belief propagation (BP) algorithm for LDPC codes.
  • the number B of nucleotide types is equal to 4 and the flexible decoding algorithm with error correcting code is applied to symbols in a Galois field of order 4, each symbol in the Galois field of order 4 corresponding to one nucleotide.
  • the order in which the nucleotides are associated with the symbols in the Galois field of order 4 corresponds to the inverse order of the average values of the probability density functions of the current drop amplitudes obtained for the different types of nucleotides.
  • the reliability information for a measurement value y_k and a type of nucleotide i is calculated as follows: [Math.201] where C_i is the mean value of the probability density function and σ_i the standard deviation of the probability density function obtained for the type of nucleotide i.
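As an illustration of how reliability values can be derived from a Gaussian model, the sketch below evaluates, for one measurement y_k, the logarithm of the Gaussian density for each nucleotide type. It is a generic Gaussian log-likelihood and is not claimed to reproduce [Math.201]. The function name and the standard deviations are placeholders introduced for this example; the means are the values quoted for FIG. 3 further on in this document.

```python
import math

# Mean current drops C_i (nA) quoted in this document for FIG. 3.
MEANS = {"A": 1.25, "T": 0.68, "C": 0.65, "G": 0.30}
# ASSUMPTION: placeholder standard deviations sigma_i (nA); the measured
# values belong to FIG. 4 and are not reproduced in the text.
SIGMAS = {"A": 0.10, "T": 0.10, "C": 0.10, "G": 0.10}

def gaussian_log_likelihoods(y_k, means=MEANS, sigmas=SIGMAS):
    """Return {nucleotide: ln p(y_k | nucleotide)} under a Gaussian model."""
    out = {}
    for nuc, c_i in means.items():
        s_i = sigmas[nuc]
        out[nuc] = (-math.log(s_i * math.sqrt(2 * math.pi))
                    - (y_k - c_i) ** 2 / (2 * s_i ** 2))
    return out

ll = gaussian_log_likelihoods(1.20)
best = max(ll, key=ll.get)  # nucleotide with the highest log-likelihood
```

A decoder would typically pass the whole vector of values on, rather than only the best entry, so that the parity constraints can arbitrate between close candidates such as T and C.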
  • a decoding device comprising means for implementing the steps of a method according to the first aspect.
  • the means can be hardware and / or software means configured to implement the functions defined in this document for the decoding device.
  • the decoding device comprising at least one memory and at least one processor, the memory storing program instructions configured to cause said decoding device to execute the steps of a method according to the first aspect when the program instructions are executed by the processor.
  • a computer program comprising program instructions for the execution of the steps of a method according to the first aspect when said program is executed by a computer.
  • a recording medium readable by a computer on which is recorded a computer program comprising program instructions for the execution of the steps of a method according to the first aspect when said program is executed by a computer.
  • a DNA-based data storage system comprising a nanopore sequencer and a decoding device according to the second aspect.
  • FIG.1 illustrates aspects of the calculation of parity blocks in a coding method according to one or more embodiments
  • FIG.2 is a schematic representation of a DNA-based storage system according to one or more embodiments
  • FIG.3 represents examples of statistical distributions of current drop measurements obtained for different types of nucleotides in a nanopore sequencer
  • FIG.4 represents a table of parameters (mean and standard deviation) of the statistical distributions presented in Fig. 3;
  • FIG.5 represents a flowchart of a flexible decoding method according to one or more embodiments.
  • DNA-based data storage systems with nanopore sequencers will be described in more detail. These storage systems are based on the use of quaternary graph-based codes defined on Galois fields of order 4 and on associated decoding algorithms based on "soft" samples. These samples are the outputs of nanopore sequencers, the main characteristic of which is the passage of the DNA sequence at a controlled rate through one or more nanopores.
  • each DNA nucleotide is represented as an element of a Galois field of order 4.
  • Likelihood calculations are introduced to take into account an asymmetric DNA channel model and the ionic current drops.
  • An algorithm of the “Min-Sum” type suitable for a quasi-optimal decoding of low complexity is presented. The results of the simulations show that the error correction method proposed in this document is capable of guaranteeing an almost error-free reading of data with regard to the substitution errors and with ideal conditions for the synthesis.
  • the contributions described in this document relate specifically to the correction of substitution errors produced during nanopore sequencing.
  • the first contribution concerns the use of a quaternary encoding / decoding scheme using a representation of DNA nucleotides as elements of a Galois field of order 4 (this also includes the correspondence with the digital information to be stored in the system) and the use of non-binary correction codes defined in this Galois field. This is to avoid the quaternary to binary conversions that would be required when using a binary encoding / decoding scheme.
  • the second contribution concerns the use of the statistical distributions of the amplitudes of the ionic current signals as "flexible" intrinsic information, that is to say as reliability information.
  • likelihood calculations are performed at the graph-based decoder, taking into account these amplitudes and the specific model of a memoryless asymmetric DNA channel (i.e. the output at a given moment depends only on the input at that moment, not on the inputs at previous or later times).
  • the coding approaches known in the state of the art [8] [15] for this problem do not make it possible to exploit the "soft" information provided by the nanopore system.
  • the error correcting codes likely to use this intrinsic information can be LDPC (Low-Density Parity-Check) codes or turbo-codes.
  • a Galois field GF(q) is a finite set of q elements of which any element can be described as a function of a primitive element, denoted here α.
  • the elements (or symbols) of the Galois field GF(q) are denoted {0, α^0, α^1, ..., α^(q-2)}.
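For q = 4, the field arithmetic fits in two small tables. The sketch below is one common way of coding GF(4), given here as an assumption rather than as the representation used by the authors: elements are stored as the integers 0 to 3 in a polynomial basis with primitive polynomial x^2 + x + 1, so that 1 = α^0, 2 = α^1 and 3 = α^2 = α + 1.

```python
# Elements of GF(4) as integers 0..3: 0, 1 = alpha^0, 2 = alpha^1, 3 = alpha^2.
# ASSUMPTION: polynomial basis with primitive polynomial x^2 + x + 1.
EXP = [1, 2, 3]            # EXP[i] = alpha^i
LOG = {1: 0, 2: 1, 3: 2}   # LOG[alpha^i] = i

def gf4_add(a, b):
    """Addition in GF(4): bitwise XOR of the two-bit representations."""
    return a ^ b

def gf4_mul(a, b):
    """Multiplication in GF(4): add exponents of alpha modulo q - 1 = 3."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % 3]
```

With this coding, every element is its own additive inverse, so subtraction and addition coincide.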
  • LDPC codes are error correcting codes in the category of linear block codes, the parity matrix of which has the property of being sparse (hence "low-density parity-check matrix"), that is, it contains only a small number of non-zero elements compared to its total number of elements.
  • LDPC codes can in fact be characterized, like linear block codes, by a so-called parity matrix, generally denoted H.
  • the dimensions M x N of the parity matrix correspond, for the number of rows M, to the number of parity constraints of the code, and for the number of columns N, to the length of the code words of the code considered (i.e. the number of symbols in a code word).
  • the elements of a parity matrix of a non-binary LDPC code belong to a non-binary (q > 2) Galois field GF(q), and the matrix products of the above equations are performed using the addition and multiplication laws of the field GF(q).
  • a non-binary LDPC code is a linear code associated with an input data block and defined by a very sparse parity check matrix H whose non-zero elements belong to a finite field GF(q), where q > 2.
  • d_c denotes the number of 1s in a row of the matrix H
  • d_v denotes the number of 1s in a column of the matrix H.
  • a representation of the parity matrix of the code uses a bipartite graph called a "Tanner graph".
  • this representation connects, by means of branches, M nodes, called "parity nodes" ("Check Node" or "CN"), to N nodes, called "variable nodes" ("Variable Node" or "VN").
  • Each non-zero element of the parity matrix is represented in the corresponding Tanner graph by a branch joining the parity node corresponding to the row of the element in the matrix H to the variable node corresponding to the column of the element in the matrix H.
  • Each parity node of the graph thus represents a parity equation determined by the branches which connect it to the nodes of variables.
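The link between the parity matrix, the branches of the Tanner graph and the parity equations can be sketched as follows. The matrix H below is hypothetical (it is not the matrix of FIG. 1), and GF(4) elements are coded as integers 0 to 3 with XOR as addition, which is an assumption of this example.

```python
EXP = [1, 2, 3]            # EXP[i] = alpha^i, with 1 = alpha^0, 2 = alpha, 3 = alpha^2
LOG = {1: 0, 2: 1, 3: 2}

def gf4_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 3]

# HYPOTHETICAL 3 x 6 parity matrix over GF(4): M = 3 parity nodes, N = 6 variable nodes.
H = [
    [1, 2, 0, 3, 0, 0],
    [0, 1, 3, 0, 2, 0],
    [2, 0, 1, 0, 0, 3],
]

def tanner_edges(H):
    """One branch (parity node m, variable node n, coefficient) per non-zero H[m][n]."""
    return [(m, n, h) for m, row in enumerate(H) for n, h in enumerate(row) if h != 0]

def syndrome(H, x):
    """Evaluate each parity equation sum_n H[m][n] * x[n] in GF(4)."""
    result = []
    for row in H:
        acc = 0
        for h, x_n in zip(row, x):
            acc ^= gf4_mul(h, x_n)   # GF(4) addition is XOR in this coding
        result.append(acc)
    return result

def is_codeword(H, x):
    return all(s == 0 for s in syndrome(H, x))
```

For this H, the word [1, 0, 0, 2, 0, 3] satisfies the three parity equations, while changing its last symbol to 2 leaves the third equation unsatisfied.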
  • a parity check matrix H 100 is shown in the left part of FIG. 1 and the corresponding Tanner graph 110 is represented in the right part of FIG. 1.
  • the parity matrix H is defined in the Galois field GF(4) whose elements are denoted {0, α^0, α^1, α^2}.
  • the three parity equations corresponding respectively to the three parity nodes of this graph are:
  • the Tanner graph 110 represents the iterative processing applied to each code word to be decoded.
  • the variable node VN k receives a vector ⁇ k calculated by a calculation block 10 k from the input y k to decode the symbol x k of a code word X.
  • This reliability information is, for example, based on the LLR (Log
  • the input vector ⁇ k is thus defined as follows: [Math.14]
  • the decoded code word is generated by the variable nodes after iteration (s) in the Tanner graph.
  • a decoded symbol of a decoded code word corresponds to the first value of the output vector (the symbols of the output vector being sorted in descending order of LLR value, the first row of the output vector thus comprising the highest LLR value and the associated decoded symbol), that is to say the most probable value in GF(4) or the one with the lowest decoding error.
  • the variable node VN k thus receives a decoded vector containing the decoded value for the symbol X k of the code word X.
  • the decoded symbol identifies a type of nucleotide among the B types of DNA nucleotides and corresponds to the measured current drop value.
  • the Tanner graph representation can be exploited for the implementation of decoding algorithms whose efficiency has been shown on graph models, such as the belief propagation (BP) algorithm or message passing (MP) algorithms.
  • a message can be an intrinsic message (input data vector generated from channel information at the input of the decoder) or an extrinsic message (data vector generated during an iteration applied to an intrinsic message, these extrinsic messages are the messages exchanged between parity nodes and variable nodes).
  • Iterative LDPC code decoding algorithms based on a Tanner graph, in particular using the exchange of messages between the parity nodes and the variable nodes of the Tanner graph corresponding to the LDPC code considered, have thus been developed.
  • These decoding algorithms can more generally be implemented or adapted for the decoding of all linear block codes likely to be represented by a bipartite graph comprising a set of parity nodes and a set of variable nodes.
  • the Tanner graph of a non-binary LDPC code is generally much more sparse than the corresponding Tanner graph of a binary LDPC code having the same rate and the same code length in bits [19] [20].
  • the elements of GF(4) are thus denoted {0, α^0, α^1, α^2}, where α is the primitive element of this Galois field.
  • the elements of GF (4) are also called symbols.
  • the basic building blocks of DNA are the four nucleotides: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). Each nucleotide is represented by a symbol in GF (4).
  • the input binary data sequence is converted to a quaternary data or symbol sequence in GF (4).
  • with LDPC coding, a code word C with redundancy is obtained for each input code word, the code word C with redundancy comprising N symbols in GF(4) and consisting, on the one hand, of a block of K symbols in GF(4) corresponding respectively to the K input symbols and, on the other hand, of a parity check block of M symbols in GF(4) calculated on the basis of the coefficients of the parity matrix.
  • the LDPC coding is repeated for each input code block or word so as to obtain a succession of code words with redundancy.
  • a succession of code words with redundancy is then synthesized so as to obtain a DNA sequence encoding the input binary data blocks.
  • a nanopore device is then used to convert the DNA sequence encoding the input binary data blocks into a series of current drop amplitude measurement values.
  • a symbol X k decoded in GF (4) is obtained for each measurement y k of the current drop amplitude.
  • the decoded symbol corresponds to the most probable symbol, the one for which the decoding error is the smallest.
  • the flexible decoding algorithm can be a decoding algorithm for LDPC codes (for example, a Min-Sum algorithm or a BP, Belief Propagation, algorithm) or for turbo-codes (for example, a MAP algorithm, Message Parsing Algorithm).
  • the reliability values used are based on the statistical distributions of current drop measurements produced during the passage through a nanopore sequencer of a reference sequence composed of nucleotides of a type considered (excluding other types of nucleotides).
  • the messages exchanged comprise the symbols of the code words processed and a reliability information item associated with each symbol.
  • reliability information is calculated from a measurement value y k supplied by the nanopore device.
  • This reliability information is based on the LLR (Log Likelihood Ratio) function as explained in more detail in this document.
  • the messages exchanged comprise probability densities, namely two densities, one for the value "0" and the other for the value "1".
  • the messages or data vectors comprise, for example, pairs of binary values with which are respectively associated likelihood (or reliability) values.
  • the LDPC code considered is non-binary, and the symbols of the code words have values in the Galois field GF(4).
  • a parity node of a decoder for non-binary LDPC codes with values in the Galois field GF(q) thus receives d_c input messages and generates d_c output messages.
  • an output is constructed by selecting the q best combinations among q to the power of d_c - 1. This leads to a computational complexity of the order of O(q^2).
  • the BP decoding algorithm can also be considered in the frequency domain. We then speak of a Fourier transform-based BP algorithm. Moving to the frequency domain makes it possible to reduce the complexity of the BP algorithm to the order of O(d_c × q × log(q)). It remains that the implementation of the BP algorithm presents a very high cost in terms of computational complexity, a cost which becomes prohibitive when considering values of q greater than 16.
  • EMS Extended Min-Sum
  • a parity node can be implemented as a combination of elementary parity nodes, where each elementary parity node receives as input two sorted messages, each containing n_m pairs (symbol, reliability), from which it generates an output message containing the n_m best possible combinations of the two input messages, the total number of combinations being equal to n_m to the power 2.
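The behaviour of one elementary parity node can be sketched by brute force: form all n_m^2 combinations of the two input messages and keep the n_m best. This is only an illustration under stated assumptions: symbols are GF(4) elements coded 0 to 3 with XOR as addition, a higher reliability is better, the reliabilities of combined pairs add up, and the multiplication by the coefficients of H is supposed to have been applied beforehand. Practical implementations avoid exploring every combination by exploiting the sorted order of the inputs.

```python
from itertools import product

def elementary_check_node(msg_a, msg_b, n_m):
    """Combine two messages of (symbol, reliability) pairs into the n_m best outputs.

    ASSUMPTIONS: GF(4) symbols coded 0..3 (addition = XOR); higher reliability
    is better; the reliability of a combination is the sum of its two inputs.
    """
    best = {}
    for (s_a, r_a), (s_b, r_b) in product(msg_a, msg_b):  # all combinations
        s, r = s_a ^ s_b, r_a + r_b
        if s not in best or r > best[s]:
            best[s] = r  # keep the most reliable way of producing each symbol
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n_m]

# Two hypothetical input messages with n_m = 2 pairs each.
out = elementary_check_node([(1, 0.0), (2, -1.0)], [(3, 0.0), (0, -0.5)], n_m=2)
```

Here the four combinations produce symbol 2 (reliabilities 0.0 and -1.5) and symbol 1 (reliabilities -0.5 and -1.0), and the output keeps the best value for each.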
  • FIG. 2 shows a simplified model of a chain of components 110-190 of a DNA-based storage system 100. Note that this model does not include a component for compression.
  • This chain of components includes:
  • a component 160 for reading, that is to say DNA sequencing;
  • a component 190 for converting GF(4) symbols into binary data.
  • the component 110 is configured to convert binary data into GF(4) symbols.
  • the component 110 implements a conversion function between binary data and the GF(4) symbols defined as follows:
  • component 190 is configured to convert GF(4) symbols into binary data and uses the inverse conversion function to that of component 110.
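The conversion table itself is not reproduced in this text. As a rough illustration of what components 110 and 190 do, the sketch below packs two bits into one GF(4) symbol index and unpacks it again. The particular bit-to-symbol assignment (index = 2 × first bit + second bit) and the function names are assumptions made for this example only.

```python
def bits_to_gf4(bits):
    """Group bits in pairs; each pair (b1, b0) becomes the symbol index 2*b1 + b0.

    ASSUMPTION: this assignment is illustrative, not the document's own table.
    """
    if len(bits) % 2:
        raise ValueError("expected an even number of bits")
    return [2 * bits[i] + bits[i + 1] for i in range(0, len(bits), 2)]

def gf4_to_bits(symbols):
    """Inverse conversion: each symbol index 0..3 becomes two bits."""
    out = []
    for s in symbols:
        out.extend([s >> 1, s & 1])
    return out

symbols = bits_to_gf4([0, 0, 0, 1, 1, 0, 1, 1])
```

Working directly on such symbols is what lets the scheme avoid quaternary to binary conversions inside the encoder and decoder.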
  • the component 120 is configured to introduce, during coding, insertion / deletion error correction codes.
  • the component 180 is configured to use, during decoding, the error correction codes to correct the insertion or deletion errors introduced at the time of encoding.
  • the focus here is on substitution errors, i.e. we aim to isolate the problem of substitutions occurring during nanopore sequencing.
  • the components 120 and 180 are complementary to the components 130 and 170 which aim to correct substitution errors.
  • the component 130 implements coding functions for the data blocks from the component 120, involving the generation of parity check blocks by means of error correcting codes of the LDPC (Low-Density Parity-Check) or turbo code type, defined on a Galois field of order 4, so that each encoded symbol corresponds to one of the four basic nucleotides of DNA (i.e. 'A', 'T', 'C' and 'G').
  • LDPC coding generates parity check blocks.
  • This component 130 is applied to a succession of elements in the Galois field GF(4).
  • the component 170, here called the GF(4) LDPC decoder, uses the parity check codes to correct the errors on the data blocks at the output of the component 160.
  • the component 130 is configured to encode a sequence of quaternary data (or symbols in GF (4)) by means of an LDPC or turbo code type coding.
  • a code word C with redundancy is obtained for each input code word, the code word C with redundancy comprising N symbols in GF(4) and consisting, on the one hand, of a block of K symbols in GF(4) corresponding respectively to the K input symbols and, on the other hand, of a parity check block of M symbols in GF(4) calculated on the basis of the coefficients of the parity matrix.
  • LDPC encoding is repeated for each input code block or word so as to obtain a succession of code words with redundancy, which will be processed by the DNA synthesis component 140.
  • Component 140 is configured to perform DNA synthesis from the incoming symbol sequence in GF (4).
  • the synthesis is based on the following correspondence function:
  • nucleotide A has the highest average value, followed, in order by nucleotides T, C then G.
  • Simple insertion / deletion error corrections can also be used by integrating corrective codes during synthesis, that is to say at the level of component 140.
  • Tenengolts codes [cf. G. Tenengolts, "Nonbinary Codes, Correcting Single Deletion or Insertion", IEEE Transactions on Information Theory, vol. IT-30, pp. 766-769, 1984] are well suited to this type of error and can be directly encoded in the DNA sequence.
  • the problem of reconstructing the DNA sequence from deletions / insertions followed by PCR techniques has been examined in [9] [10]. As we focus in this document specifically on nanopore sequencing techniques, we assume that the reconstruction of the sequence is ideal or that we do not have to worry about it. Note however that sequence reconstruction has motivated a large amount of recent research [10] [11].
  • the components 120 and 180 will also not be described in more detail in this document.
  • component 160 is configured to perform DNA sequencing via a nanopore sequencer and thus the reading of a DNA sequence.
  • Component 150 is configured for DNA editing and corresponds to a process of removing and inserting DNA substrings at well-controlled locations. In addition, editing can be done by adding very specific point mutations [24] [25]. These possibilities will not be described in more detail in this document.
  • the current drop amplitude measurement operation during nanopore sequencing (component 160) is modeled as a data transmission channel so that the decoding can benefit from the reliability information to perform the associated soft decoding.
  • the measured values that is to say the amplitudes of the current drop measured for each nucleotide.
  • the current drop amplitude measurements obtained by component 160 are converted into symbols in GF (4) by means of a flexible decoding process using reliability information.
  • the parity check blocks generated by the encoding component 130 are integrated into the flexible decoding (for N inputs, there are M symbols corresponding to the parity check blocks) and thus make it possible to correct substitution errors during sequencing.
  • Nanopore sequencing (component 160) generates as output measurement values of the current drops produced by the passage of the DNA sequence through the nanopore; these values are transmitted to component 170. Details of the flexible decoding method (component 1701
  • each measurement value of a current drop corresponds to a sample.
  • a sample corresponds to the realization of a Gaussian random variable.
  • FIG. 3 represents the statistical distributions obtained for the 4 types of nucleotides 'A', 'T', 'C' and 'G' respectively.
  • Each curve represents a statistical distribution of the current drop values obtained for a type of nucleotide.
  • since a DNA chain cannot be formed with a single type of nucleotide, several known DNA chains are passed through a nanopore sequencer and the current drop values are measured many times (1000, 2000 ...) to obtain the statistical distribution for a given type of nucleotide.
  • each statistical distribution corresponds to a Gaussian probability density function represented by a Gaussian curve corresponding to a Gaussian random variable.
  • nucleotide "A" is associated with a Gaussian curve, the average of which corresponds to a current drop of 1.25 nA;
  • nucleotide "T" is associated with a Gaussian curve, the average of which corresponds to a current drop of 0.68 nA;
  • nucleotide "C" is associated with a Gaussian curve, the average of which corresponds to a current drop of 0.65 nA;
  • the nucleotide "G" is associated with a Gaussian curve, the average of which corresponds to a current drop of 0.3 nA. Assuming that the four nucleotides are equiprobable, a hard decoding which would use thresholds to identify each nucleotide would lead to high error rates because the curves strongly overlap. For example, as illustrated by FIG. 3, when the value of the current drop is between 0.3 nA and 1 nA, it is not possible to know with certainty or with sufficient probability whether it is the nucleotide G, C or T. On the other hand, if the value of the current drop is greater than 1 nA, the probability that it is nucleotide A is almost 100 %.
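The overlap described above can be checked numerically. The following minimal sketch computes, for a measured current drop, the posterior probability of each nucleotide under equiprobable priors. The mean values are those quoted above; the standard deviations are hypothetical placeholders, since the values of the table of FIG. 4 are not reproduced in this text, and the names `MEANS`, `SIGMAS` and `posteriors` are illustrative.

```python
import math

# Mean current drops (nA) quoted in the text for each nucleotide.
MEANS = {"A": 1.25, "T": 0.68, "C": 0.65, "G": 0.30}
# Hypothetical standard deviations (nA); the real values belong to FIG. 4.
SIGMAS = {"A": 0.05, "T": 0.15, "C": 0.15, "G": 0.15}


def gaussian_pdf(y, mean, sigma):
    """Gaussian probability density evaluated at y."""
    return math.exp(-((y - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def posteriors(y):
    """Posterior probability of each nucleotide given a current drop y (equiprobable priors)."""
    likelihoods = {n: gaussian_pdf(y, MEANS[n], SIGMAS[n]) for n in MEANS}
    total = sum(likelihoods.values())
    return {n: value / total for n, value in likelihoods.items()}


# A sample between the T and C means is ambiguous; a sample near 1.25 nA is not.
for y in (0.66, 1.25):
    print(y, {n: round(p, 3) for n, p in posteriors(y).items()})
```

With these assumed standard deviations, a sample near 0.66 nA leaves T and C almost equally likely, which is the situation in which a threshold decision discards useful reliability information.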
  • the current drop measured at a given instant is modeled as a variable modulated by pulse amplitude modulation at 4 levels (denoted here 4-PAM, "Pulse Amplitude Modulation"), each level corresponding to the mean value of the probability density function of the current drop amplitudes obtained for a given nucleotide.
  • the modeling takes into account these statistical distributions by adding to this modulated variable 4 channels of additive Gaussian white noise corresponding respectively to the statistical distributions obtained for the 4 nucleotides.
  • the value of the standard deviation σi depends on the type of nucleotide A, G, C, T at the origin of the current drop and is determined from the normalized probability density function of the current drops for each DNA nucleotide, for example according to the values given in the table shown in FIG. 4.
  • the mean values C1 to C4 and the standard deviations σ1 to σ4 are given for the 4 distributions corresponding to the 4 nucleotides.
  • This table is an example of possible values for a given sequencer. In practice, for each sequencer and for given experimental conditions, a statistical analysis is implemented for this sequencer, so as to obtain the mean values and standard deviations specific to the sequencer used and / or to the experimental conditions.
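The channel model described above (a 4-PAM level per nucleotide, plus Gaussian noise whose standard deviation depends on the nucleotide) can be simulated in a few lines. This is only a sketch: the means are those quoted in the text, the standard deviations are hypothetical stand-ins for the values of FIG. 4, and the function name is illustrative.

```python
import random

# 4-PAM levels: mean current drop (nA) of each nucleotide, as quoted in the text.
MEANS = {"A": 1.25, "T": 0.68, "C": 0.65, "G": 0.30}
# Hypothetical per-nucleotide standard deviations (nA), standing in for FIG. 4.
SIGMAS = {"A": 0.05, "T": 0.15, "C": 0.15, "G": 0.15}


def sequence_to_current_drops(sequence, rng):
    """Return one noisy current drop sample per nucleotide of the sequence.

    Each sample is the 4-PAM level of the nucleotide plus zero-mean Gaussian
    noise whose standard deviation is specific to that nucleotide.
    """
    return [rng.gauss(MEANS[n], SIGMAS[n]) for n in sequence]


rng = random.Random(0)  # seeded so that the run is reproducible
samples = sequence_to_current_drops("ACGTTGCA", rng)
print([round(y, 3) for y in samples])
```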
  • the first step of the Min-Sum algorithm is the calculation of the value L k (x) for each symbol x of the code word. With the assumption that the four nucleotides are equiprobable, the value L k (x) of the symbol x in the code word can be defined by:
  • Ci is the mean value of the probability density function and σi the standard deviation of the probability density function obtained for the type of nucleotide i.
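The expression referred to as Math.8 does not appear in this text. Assuming Gaussian likelihoods with mean C_i and standard deviation σ_i, equiprobable nucleotides, and the usual convention of non-binary Min-Sum decoders in which the most likely symbol receives the value 0, one form consistent with the description would be the following. It is a reconstruction under these assumptions, not a quotation of the original equation; the symbol \hat{x}_k (most likely symbol for sample y_k) is introduced here for illustration.

```latex
p(y_k \mid x = i) = \frac{1}{\sigma_i \sqrt{2\pi}}
    \exp\!\left( -\frac{(y_k - C_i)^2}{2\sigma_i^2} \right),
\qquad i \in \{1, \dots, 4\}

L_k(i) = \ln \frac{p(y_k \mid x = \hat{x}_k)}{p(y_k \mid x = i)}
       = \frac{(y_k - C_i)^2}{2\sigma_i^2} + \ln \sigma_i
         - \min_{j} \left[ \frac{(y_k - C_j)^2}{2\sigma_j^2} + \ln \sigma_j \right],
\qquad \hat{x}_k = \arg\max_{j} \; p(y_k \mid x = j)
```

With this convention every L_k(i) is non-negative and the smallest value is exactly 0, which matches the statement that the LLR values are normalized, starting with 0.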
  • the intrinsic message at the input of the decoding algorithm is formed of 4 pairs, each consisting of an LLR value ⁇ k (i) L and a symbol in GF (4); the pairs are ordered according to the LLR value obtained by the Math.8 equation.
  • the LLR values ⁇ k (i) L are normalized, starting with 0, according to the Math.8 equation given above.
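The construction of the intrinsic message (4 pairs of LLR value and symbol, sorted by LLR and normalized so that the first value is 0) can be sketched as follows. The LLR expression used here is the standard Gaussian log-likelihood difference and is an assumption, as are the standard deviations and the use of nucleotide letters as symbol labels.

```python
import math

MEANS = {"A": 1.25, "T": 0.68, "C": 0.65, "G": 0.30}   # nA, from the text
SIGMAS = {"A": 0.05, "T": 0.15, "C": 0.15, "G": 0.15}  # nA, hypothetical


def intrinsic_message(y):
    """Return the 4 (LLR, symbol) pairs for sample y, sorted by increasing LLR.

    The cost of a symbol is its negative Gaussian log-likelihood (constant
    terms dropped); subtracting the smallest cost makes the first LLR equal 0.
    """
    costs = {
        n: (y - MEANS[n]) ** 2 / (2 * SIGMAS[n] ** 2) + math.log(SIGMAS[n])
        for n in MEANS
    }
    best = min(costs.values())
    return sorted((cost - best, n) for n, cost in costs.items())


for llr, symbol in intrinsic_message(0.66):
    print(symbol, round(llr, 3))
```

In this convention a small LLR means a likely symbol, so the first pair of the sorted message is the hard decision for the sample.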
  • the flexible decoding algorithm is thus based on a modeling of the current drop measurement produced by the nanopore sequencer as a noisy variable modulated by pulse amplitude modulation at B discrete levels, each level corresponding to an average value of the Gaussian probability density function of the current drops measured for a given nucleotide among the B types of nucleotides, the modulated variable being corrupted by B channels of additive Gaussian white noise corresponding respectively to the Gaussian statistical distributions obtained for the B types of nucleotides.
  • C2V (jk (1), k) and C2V (jk (2), k) (respectively V2C (jk (1), k) and V2C (jk (2), k)) denote the two C2V (respectively V2C) messages associated with the VN k, where jk (1) and jk (2) indicate the positions of the two non-zero values of column k of the matrix H.
  • the structure of the V2C and C2V messages is identical to that of the intrinsic message ⁇ k.
  • the C2N output message of the CN contains the 4 LLR values C2N (1) L (sorted in ascending order) and the GF (4) symbols C2N (1) GF associated with them.
  • L (x), V2C (x) and C2V (x) are respectively the intrinsic LLR values, the extrinsic V2C and C2V messages associated with the symbol x.
  • the VN decoding equations can be divided into three steps.
  • Step 1: calculation of V2C (x) for each x in GF (4) [Math.11]
  • Step 2: determination of the minimum value of V2C (x)
  • Step 3: normalization [Math.13]
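The three steps can be sketched as follows for a variable node of degree 2 (two non-zero values per column of H, as stated above). The additive form of step 1 (intrinsic LLR plus the C2V message coming from the other check node) is the usual Min-Sum rule and is assumed here, since Math.11 is not reproduced in this text. The function name and the list layout (one LLR per GF(4) symbol, indexed 0 to 3) are illustrative.

```python
def variable_node_update(intrinsic, c2v_other):
    """Compute the normalized V2C message sent to one check node.

    intrinsic  -- list of 4 intrinsic LLRs L(x), indexed by GF(4) symbol
    c2v_other  -- list of 4 LLRs C2V(x) received from the other check node
    """
    # Step 1: V2C(x) for each x in GF(4).
    v2c = [l + c for l, c in zip(intrinsic, c2v_other)]
    # Step 2: minimum of the V2C values.
    smallest = min(v2c)
    # Step 3: normalization, so that the most likely symbol has LLR 0.
    return [value - smallest for value in v2c]


print(variable_node_update([0.0, 1.5, 3.0, 7.0], [2.0, 0.0, 1.0, 4.0]))
```

The normalization keeps the message values bounded across iterations without changing which symbol is the most likely.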
  • a parity check node of degree d c can be broken down into elementary check nodes, for example into 3 (d c - 2) elementary check nodes.
  • Bubble-check algorithm at the level of the elementary parity node, described for example in E. Boutillon, L. Conde-Canencia, "Simplified check node processing in nonbinary LDPC decoders", 6th International Symposium on Turbo Codes & Iterative Information Processing, Brest, France, September 2010.
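For orientation, an elementary check node in the Min-Sum sense can be written exhaustively over GF(4): for each output symbol x, it keeps the smallest sum of input LLRs over all pairs of input symbols that add up to x. The sketch below is this brute-force form, not the Bubble-check algorithm itself, which explores only part of the candidate pairs. It assumes symbols labelled 0 to 3 with GF(4) addition implemented as a bitwise XOR, and it leaves out the multiplication by the non-zero coefficients of H.

```python
def elementary_check_node(u, v):
    """Exhaustive Min-Sum elementary check node over GF(4).

    u, v -- lists of 4 LLRs indexed by GF(4) symbol (small LLR = likely symbol)
    Returns out[x] = min over (a, b) with a + b = x in GF(4) of u[a] + v[b].
    """
    out = [float("inf")] * 4
    for a in range(4):
        for b in range(4):
            x = a ^ b  # addition in GF(4) on 2-bit labels
            out[x] = min(out[x], u[a] + v[b])
    return out


print(elementary_check_node([0.0, 2.0, 5.0, 9.0], [1.0, 0.0, 4.0, 6.0]))
```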
  • a parity check matrix H is obtained.
  • the non-zero values of H can be chosen randomly from the elements of GF (4).
  • the corresponding matrices are designed to maximize the girth of the associated bipartite graph and to minimize the multiplicity of cycles of minimum length [23].
  • the Min-Sum algorithm is applied on the basis of the bipartite Tanner graph associated with the parity check matrix H obtained on this basis.
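A toy construction of such a matrix, with two non-zero GF(4) values per column chosen at random, is sketched below together with a syndrome check. It is only an illustration: no girth or cycle optimization is performed, row weights are not controlled, and the representation of GF(4) by the integers 0 to 3 (with 2 standing for a primitive element) as well as all function names are assumptions.

```python
import random

GF4_EXP = (1, 2, 3)           # powers of the primitive element: a^0, a^1, a^2
GF4_LOG = {1: 0, 2: 1, 3: 2}  # discrete logarithms of the non-zero elements


def gf4_add(a, b):
    """Addition in GF(4): bitwise XOR of the 2-bit labels."""
    return a ^ b


def gf4_mul(a, b):
    """Multiplication in GF(4) through logarithm tables."""
    if a == 0 or b == 0:
        return 0
    return GF4_EXP[(GF4_LOG[a] + GF4_LOG[b]) % 3]


def random_parity_check_matrix(num_checks, num_symbols, rng):
    """Random M x N matrix over GF(4) with exactly two non-zero values per column."""
    h = [[0] * num_symbols for _ in range(num_checks)]
    for col in range(num_symbols):
        for row in rng.sample(range(num_checks), 2):
            h[row][col] = rng.choice((1, 2, 3))
    return h


def syndrome(h, word):
    """Return H * word over GF(4); an all-zero result means all checks are satisfied."""
    result = []
    for row in h:
        acc = 0
        for coeff, symbol in zip(row, word):
            acc = gf4_add(acc, gf4_mul(coeff, symbol))
        result.append(acc)
    return result


h = random_parity_check_matrix(4, 8, random.Random(0))
print(syndrome(h, [0] * 8))
```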
  • the decoding process iterates n it times; for each iteration, the following operations are carried out: M updates of check nodes CN (M being the number of check nodes) and the updates of the variable nodes VN. During the last iteration, a decision is taken for each symbol; the decoded GF (4) symbols are then generated and constitute the decoded DNA code word.
  • steps A) to C) are repeated in the next iteration.
  • a final decision is taken to estimate the code word using the new message C2V and the intrinsic message L k .
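A possible form of this final decision for a variable node of degree 2 is sketched below: the intrinsic LLRs and the two incoming C2V messages are added per symbol, and the symbol with the smallest total is retained. The additive combination and the convention that a small LLR denotes a likely symbol are assumptions; the function name is illustrative.

```python
def decide_symbol(intrinsic, c2v_first, c2v_second):
    """Return the GF(4) symbol (0..3) with the smallest accumulated LLR.

    intrinsic   -- 4 intrinsic LLRs of the variable node
    c2v_first   -- 4 LLRs received from the first connected check node
    c2v_second  -- 4 LLRs received from the second connected check node
    """
    totals = [l + a + b for l, a, b in zip(intrinsic, c2v_first, c2v_second)]
    return min(range(4), key=totals.__getitem__)


# The channel alone favours symbol 0, but both check nodes favour symbol 1.
print(decide_symbol([0.0, 1.0, 4.0, 6.0], [3.0, 0.0, 2.0, 5.0], [2.0, 0.0, 1.0, 3.0]))
```

The example shows the code correcting a substitution: the decision follows the parity checks when their combined evidence outweighs the intrinsic information.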
  • FIG. 5 schematically represents a flowchart of a flexible decoding process.
  • B = 4 nucleotides: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
  • a Gaussian probability density function of current drop measurements is obtained for each type of nucleotide among the B types of nucleotides.
  • These probability density functions are obtained from one or more reference nucleotide sequences whose composition is known, and from measurements of the current drops produced during one or more passages of these reference nucleotide sequences through a nanopore sequencer.
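This calibration step amounts to estimating one mean and one standard deviation per nucleotide from labelled measurements. A minimal sketch is given below; the synthetic reference data, the "true" parameters used to generate them and the function name are hypothetical and only serve to exercise the estimator.

```python
import random
import statistics


def fit_gaussians(reference_sequence, measurements):
    """Estimate (mean, standard deviation) of the current drop for each nucleotide.

    reference_sequence -- known nucleotide sequence, e.g. "ACGT..."
    measurements       -- one current drop value per nucleotide, same order
    """
    by_nucleotide = {}
    for nucleotide, value in zip(reference_sequence, measurements):
        by_nucleotide.setdefault(nucleotide, []).append(value)
    return {
        n: (statistics.mean(values), statistics.stdev(values))
        for n, values in by_nucleotide.items()
    }


# Synthetic reference run; these generating parameters are hypothetical.
TRUE_PARAMS = {"A": (1.25, 0.05), "T": (0.68, 0.15), "C": (0.65, 0.15), "G": (0.30, 0.15)}
rng = random.Random(0)
reference = "ACGT" * 1000
observed = [rng.gauss(*TRUE_PARAMS[n]) for n in reference]
params = fit_gaussians(reference, observed)
print({n: (round(m, 3), round(s, 3)) for n, (m, s) in params.items()})
```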
  • in step 520, current drop amplitude measurements produced as the nucleotide sequence to be decoded passes through the nanopore device are obtained.
  • reliability information is calculated for each measurement value and for each type of nucleotide among the N types of nucleotides, on the basis of the Gaussian probability density function obtained for the type of nucleotide considered.
  • the reliability information for a measurement value y k and a type of nucleotide i is calculated according to the Math.8 equation from Ci, the mean value of the probability density function, and σi, the standard deviation of the probability density function obtained for the type of nucleotide i.
  • a decoded value is obtained for the measurement value by applying, to each current drop measurement considered and to the N reliability information values obtained for that measurement, a flexible decoding with an error correcting code; the error correcting code is a turbo code or an LDPC (Low-Density Parity-Check) code.
  • the decoding is based on a modeling of the current drop measurement produced by the nanopore device as a noisy variable modulated by pulse amplitude modulation at N discrete levels. Each level corresponds to an average value of the Gaussian probability density function of the measurements of the current drops obtained for a given nucleotide among the N types of nucleotides.
  • the modulated variable is corrupted by N channels of additive Gaussian white noise corresponding respectively to the Gaussian statistical distributions obtained for the N types of nucleotides.
  • the number N of nucleotide types is equal, for example, to 4 and the error correcting code is applied to quaternary data (or symbols) encoded in a Galois field of order 4.
  • Steps 520 to 540 are repeated for each measurement of a current drop amplitude produced when a nucleotide of the nucleotide sequence to be decoded passes through the nanopore device.
  • Monte-Carlo simulations were carried out to obtain performance curves of the DNA-based data storage chain with nanopore sequencing. To do this, we generated random binary sequences and converted them into DNA sequences, each nucleotide being represented by a GF (4) symbol. We considered different values of N and different coding rates for the LDPC code, and compared the results to those obtained with hard detection.
  • SNER sequenced nucleotide error rate
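The hard-detection baseline of such a Monte-Carlo experiment can be sketched as follows: random bits are mapped to nucleotides, passed through the Gaussian channel, detected sample by sample, and the fraction of wrongly detected nucleotides gives the SNER. The bit-to-nucleotide mapping and the standard deviations are hypothetical, and no error correcting code is applied in this sketch.

```python
import math
import random

MEANS = {"A": 1.25, "T": 0.68, "C": 0.65, "G": 0.30}   # nA, from the text
SIGMAS = {"A": 0.05, "T": 0.15, "C": 0.15, "G": 0.15}  # nA, hypothetical
# Hypothetical mapping of bit pairs to nucleotides (one GF(4) symbol each).
BITS_TO_NUCLEOTIDE = {(0, 0): "A", (0, 1): "C", (1, 0): "G", (1, 1): "T"}


def hard_detect(y):
    """Per-sample maximum-likelihood decision, without any error correction."""
    def log_likelihood(n):
        return -((y - MEANS[n]) ** 2) / (2 * SIGMAS[n] ** 2) - math.log(SIGMAS[n])
    return max(MEANS, key=log_likelihood)


def simulate_sner(num_nucleotides, rng):
    """Estimate the sequenced nucleotide error rate of hard detection."""
    errors = 0
    for _ in range(num_nucleotides):
        bits = (rng.randint(0, 1), rng.randint(0, 1))
        sent = BITS_TO_NUCLEOTIDE[bits]
        y = rng.gauss(MEANS[sent], SIGMAS[sent])
        errors += hard_detect(y) != sent
    return errors / num_nucleotides


print(simulate_sner(20000, random.Random(0)))
```

With these assumed parameters most errors come from confusions between T and C, whose means differ by only 0.03 nA.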
  • Non-binary LDPC in DNA storage applications with flexible, appropriately modeled intrinsic information used in combination with an optimized Min-Sum decoder.
  • the present description thus relates to a software or computer program, capable of being executed by a computing device (for example, a computer) serving as a decoding device, by means of one or more data processors, this software / program comprising instructions for the execution by this computer device of all or part of the steps of one or more methods described in this document.
  • These instructions are intended to be stored in a memory of a computing device, loaded and then executed by one or more processors of this computing device so as to cause this computing device to execute the process concerned.
  • This software / program can be coded by means of any programming language, and be in the form of source code, object code, or code intermediate between source code and object code, such as in a partially compiled form, or in any other desirable form.
  • the computing device can be implemented by one or more physically distinct machines.
  • the computing device can have the overall architecture of a computer, including components of such an architecture: data memory (s), processor (s), communication bus, hardware interface (s) for the connection of this computing device to a network or other equipment, user interface (s), etc.
  • all or part of the steps of the decoding method described in this document are implemented by a decoding device provided with means for implementing these steps of this method.
  • These means can comprise software means (software) (eg instructions of one or more components of a program) and / or hardware means (eg data memory (s), processor (s), communication bus, hardware interface (s), etc.).
  • Means implementing a function or a set of functions can also correspond in this document to a software component, to a hardware component or to a set of hardware and / or software components, capable of implementing the function or the set of functions, as described below for the means concerned.
  • the present description also relates to an information medium readable by a data processor, and comprising instructions of a program as mentioned above.
  • the information medium can be any material means, entity or device, capable of storing the instructions of a program as mentioned above.
  • Usable program storage media include ROM or RAM, magnetic storage media such as magnetic disks and magnetic tapes, hard disks or optically readable digital data storage media, etc., or any combination of these media.
  • the computer readable storage medium is not transient.
  • the information medium can be a transient medium (for example, a carrier wave) for the transmission of a signal (electromagnetic, electrical, radio or optical signal) carrying the program instructions.
  • This signal can be routed via an appropriate transmission means, wired or wireless: electrical or optical cable, radio or infrared link, or by other means.
  • An embodiment also relates to a computer program product comprising a computer readable storage medium on which program instructions are stored, the program instructions being configured to cause the computer device to implement all or part of the steps of one or more methods described herein when the program instructions are executed by one or more processors and / or one or more programmable hardware components.
  • all or part of the steps of the decoding method described in this document are implemented by electronic circuit, programmable or not, specific or not.

EP21731261.0A 2020-05-15 2021-05-11 Method and device for decoding data stored in a DNA-based storage system Pending EP4150114A1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FR2004819A FR3110177A1 (fr) 2020-05-15 2020-05-15 Method and device for decoding data stored in a DNA-based storage system
PCT/FR2021/050826 WO2021229184A1 (fr) 2020-05-15 2021-05-11 Method and device for decoding data stored in a DNA-based storage system

Publications (1)

Publication Number Publication Date
EP4150114A1 (de) 2023-03-22

Family

ID=72885610

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21731261.0A Pending EP4150114A1 (de) 2020-05-15 2021-05-11 Method and device for decoding data stored in a DNA-based storage system

Country Status (4)

Country Link
US (1) US20230187024A1 (de)
EP (1) EP4150114A1 (de)
FR (1) FR3110177A1 (de)
WO (1) WO2021229184A1 (de)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
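CN114328000B (zh) * 2022-01-10 2022-08-23 Tianjin University DNA storage concatenated encoding and decoding method with type-1 / type-2 segmented error-correcting inner codes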

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012088339A2 (en) * 2010-12-22 2012-06-28 Genia Technologies, Inc. Nanopore-based single dna molecule characterization using speed bumps
JP2019501641A (ja) * 2015-11-12 2019-01-24 Samuel WILLIAMS Rapid sequencing of short DNA fragments using nanopore technology
US10883140B2 (en) * 2016-04-21 2021-01-05 President And Fellows Of Harvard College Method and system of nanopore-based information encoding
WO2018132457A1 (en) * 2017-01-10 2018-07-19 Roswell Biotechnologies, Inc. Methods and systems for dna data storage
EP3419180B1 (de) * 2017-06-19 2022-11-02 Universite De Bretagne Sud Simplified, pre-sorted, syndrome-based extended min-sum (EMS) decoding of non-binary LDPC codes
WO2019075100A1 (en) * 2017-10-10 2019-04-18 Roswell Biotechnologies, Inc. METHODS, APPARATUS AND SYSTEMS FOR STORING DNA DATA WITHOUT AMPLIFICATION

Also Published As

Publication number Publication date
US20230187024A1 (en) 2023-06-15
FR3110177A1 (fr) 2021-11-19
WO2021229184A1 (fr) 2021-11-18


Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20221201

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)