LU92881B1

LU92881B1 - Methods for encoding and decoding a binary string and System therefore

Info

Publication number: LU92881B1
Application number: LU92881A
Authority: LU
Inventors: Lifu Song; An-Ping Zeng
Original assignee: Technische Univ Hamburg Harburg; Tutech Innovation Gmbh
Priority date: 2015-11-18
Filing date: 2015-11-18
Publication date: 2017-06-21
Also published as: WO2017085245A1

Abstract

The invention relates to a method and a system for encoding and, respectively, decoding a binary string (SB). The encoding method comprises the steps of: • receiving said binary bit string (SB), • converting said binary bit string (SB) into a ternary string (ST1, ST2, … STN), whereby each ternary string is uniquely represented by a sequence of two nucleobases selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), • whereby for each consecutive ternary string (ST1, ST2, … STN) an error protection nucleobase (ET1, ET2, … ETN) is selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby each ternary string and its selected error protection nucleobase form an error protected block, • whereby said error protection nucleobase is selected according to a selection scheme taking into account at least a preceding error protected block, if present. (Fig. 1)

Description

Methods for encoding and decoding a binary string and system therefore

TU Hamburg-Harburg, Hamburg

The invention relates to methods for encoding and decoding a binary string and a system therefore.

Background

Reliable information storage on a large scale is becoming an ever-increasing problem.

While, on the one hand, the need for quick access to short-lived data increases, it is also becoming apparent that archiving large sets of data for long periods is a serious problem.

The reasons for needing such reliable data storage may differ. To mention a simple example: storing data relating to cancer and later processing a large number of data sets to distil commonalities requires the storage of huge amounts of data. The storage of knowledge for future generations is likewise an ever-increasing problem, as the amount of knowledge explodes while the lifetime of modern storage systems is shorter than that of old-fashioned books.

The presently used data storage media, such as magnetic tape or hard drives, have the decisive shortcoming of limited lifetime and density, e.g. around 50 years for hard drives. Moreover, to achieve such long-term storage, the respective devices need to be stored in special rooms in a strictly controlled environment so that changes in temperature, humidity, etc. do not negatively impact the devices. Such storage is extremely expensive.

Long-term archival of big data is thus challenging and extremely expensive.

Recently synthetic DNA (Deoxyribonucleic acid) has been proposed for future digital information storage with unprecedented density and longevity over thousands of years.

Even though some progress has been made, this progress comes at the expense of additional costs and/or increased effort for adapting environmental conditions, as in previous technologies.

To turn DNA information storage into a truly cost-effective and practical technology, improvements along several lines are urgently needed. Unlike with other data storage media, “writing” and “reading” information in DNA inevitably introduces errors into the digital data, especially if fast and cheap synthesis and sequencing technologies (such as nanopore sequencers) are to be used. For instance, the content of Guanine (G) and Cytosine (C) among the nitrogenous bases forming the DNA affects the stability of DNA. While a higher GC content may lead to a somewhat higher thermo-stability (e.g. due to 3 hydrogen bonds), a GC content above a certain level may tend to autolysis. It is also known that DNA with a GC content above a certain level is hard to synthesize and hard to sequence. To address this problem, it has been proposed to use a 1 bit per two base coding, where a binary “1” was coded as A/C (Adenine/Cytosine) while a binary “0” was coded as G/T (Guanine/Thymine). However, such a coding scheme is extremely inefficient, leading again to an increased need for storage capacity.

However, so-called homopolymers may also be a problem. Homopolymers are sequences in which a plurality of identical nitrogenous bases are arranged consecutively. Synthesis and sequencing of these homopolymers constitute a problem.

To address this problem, a base-3 encoding scheme has been proposed. Within the proposed scheme, a quarter of the encoding capacity was spent in order to avoid long homopolymers, and additional complex measures introducing a very high level of redundancy were taken to ensure full coverage of every fragment during sequencing.

However, such a high redundancy (coverage) reduces the data density and raises the cost for information storage. In addition, while the problem associated with homopolymers is reduced, the problems associated with GC content may not be solved at the same time.

Other scientists have tried to introduce error-correction codes. However, their usage typically involves sophisticated encoding and decoding algorithms. The code achieved thereby introduces complex secondary structures, which again may constitute a problem when sequencing.

It is therefore an object of the invention to provide methods and systems allowing for highly reliable encoding and decoding. It is another object to provide methods and systems which are cost effective.

Short description of the Invention

The object is solved by a method and a system for encoding a binary string. The method comprises a step of receiving said binary bit string and a step of converting said binary bit string into a ternary string, whereby each ternary string is uniquely represented by a sequence of two nucleobases selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby for each consecutive ternary string an error protection nucleobase (ET1, ET2, … ETN) is selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), and whereby said consecutive ternary string and said selected error protection nucleobase form an error protected block, whereby said error protection nucleobase is selected according to a selection scheme taking into account at least a preceding error protected block, if present.

The invention also relates to a method and system for decoding a binary string, said binary string being encoded as a ternary string. The method comprises a step of receiving three consecutive nucleobases forming an error protected block, whereby each nucleobase is selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G), whereby the first and second nucleobases represent information and said third nucleobase represents an error protection nucleobase, and a step of checking, for each error protected block, whether the block is error-free on the basis of the first and second nucleobases, the error protection nucleobase expected for them, and said received error protection nucleobase, whereby said expected error protection nucleobase is selected according to a selection scheme taking into account at least a preceding error protected block, if present, and for each error-free block the first two nucleobases are converted into a binary bit string.

Further advantageous embodiments are subject to the detailed description as well as to the dependent claims.

Brief description of the drawings

In the following, reference will be made to the figures, in which

Fig. 1 shows an aspect of an exemplary embodiment of encoding a binary string according to the invention,

Fig. 2 shows another aspect of an exemplary embodiment of encoding a binary string according to the invention, and

Fig. 3 shows an aspect of an exemplary embodiment of decoding a binary string according to the invention.

Detailed Description

The present disclosure describes preferred embodiments with reference to the Figures, in which like reference signs represent the same or similar elements. Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.

The described features, structures, or characteristics of the invention may be combined in any suitable manner in one or more embodiments. In the description, numerous specific details are recited to provide a thorough understanding of embodiments of the invention. I.e., unless indicated as an alternative only, any feature of an embodiment may also be utilized in another embodiment.

In addition, even though on some occasions certain features will be described with reference to a single entity, such a description is for illustrative purposes only, and actual implementations of the invention may also comprise one or more of these entities. I.e. usage of the singular also encompasses plural entities unless indicated otherwise.

The principles of the invention will be best understood when discussing an example. DNA sequences are composed of nucleobases selected from the group comprising Adenine (A), Cytosine (C), Thymine (T) and Guanine (G). I.e. by means of 4 different (constituting) nucleobases, it is possible to address 4 different information statuses. In digital information processing, 4 different information statuses are understood as information which could be encoded by 2 bits: 00, 01, 10, 11.

Now within the invention more than 2 bits are encoded.

In the following we will present one example, where 4 bits are encoded into a sequence of 2 nucleobases. Obviously, further embodiments may use different numbers of bits. E.g. it may also be possible to use coding schemes which encode 3 bits into 2 nucleobases, whereby the remaining bit may be used for other purposes such as redundancy, error detection or error correction. As such, the invention is not limited to a particular scheme.

In the following we will nevertheless assume the simple case that 4 bits are encoded in a sequence of 2 nucleobases.

The attribution of sequences of nucleobases will be assumed according to the following table:

Table 1

In this table, for each binary string SB composed of 4 bits, a respective (unique) sequence of nucleobases ST is displayed.

Now suppose a text string as displayed in Figure 1 at the top. This text string may be represented by a respective binary string. For ease of understanding, each character of the text string is displayed as an 8 bit sequence.

For each character, said 8 bit sequence may be understood as two consecutive 4 bit sequences.

Let’s have a closer look.

The character for “space” is encoded as 0010 0000. Hence, the first 4 bits SB = 0010 may be mapped into AC while the second 4 bits SB = 0000 may be mapped into AT. The letter “w” is encoded as 0111 0111. Hence, the first 4 bits SB = 0111 may be mapped into TT and the second 4 bits SB = 0111 may be mapped into TT as well. This process may be continued for the whole string. As each nucleobase could in principle store 2 bits, the sequence of two nucleobases is in the following referred to as a ternary string, i.e. a string of ternary information elements. Obviously, in our example the string is composed of two such ternary elements. However, there may also be other cases where three or more such ternary elements constitute such a ternary string.
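The mapping of the two example characters can be sketched as follows. This is only a minimal illustration: the dictionary contains just the three Table 1 entries quoted in the paragraph above (0010, 0000 and 0111), the remaining thirteen entries are not reproduced here, and the function names are hypothetical.

```python
# Only the three Table 1 entries quoted in the text are listed;
# the other thirteen entries are deliberately left out.
KNOWN_TABLE1_ENTRIES = {"0010": "AC", "0000": "AT", "0111": "TT"}


def text_to_nibbles(text):
    """Split each 8 bit character code into two 4 bit strings."""
    nibbles = []
    for byte in text.encode("ascii"):
        bits = format(byte, "08b")
        nibbles.extend([bits[:4], bits[4:]])
    return nibbles


def nibbles_to_pairs(nibbles, table=KNOWN_TABLE1_ENTRIES):
    """Map each 4 bit string to its two-nucleobase sequence.

    Raises KeyError for 4 bit strings whose mapping is not quoted in the text.
    """
    return [table[n] for n in nibbles]


nibbles = text_to_nibbles(" w")
pairs = nibbles_to_pairs(nibbles)
print(nibbles)  # ['0010', '0000', '0111', '0111']
print(pairs)    # ['AC', 'AT', 'TT', 'TT']
```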

The sequences of nucleobases may be arranged in the coding scheme as displayed in table 1 such that GC sequences are held at a moderate level.

Now to enable error detection (in that case single error detection), an error protection nucleobase is added.

As a single nucleobase can encode two bits of information, a single base error can lead to two bit changes which could go undetected.

To enable single base error detection, the 16 nucleobase-nucleobase combinations may be arranged into four groups arranged in columns. Within each two-nucleobase column, no two of the two-nucleobase combinations share an identical base in either of the single-nucleobase positions.

Therefore, any single nucleobase change will result in a two-base combination that does not belong to the corresponding two- nucleobase column any more. I.e. suppose that the initial two nucleobases were GA (2nd column). In case the first nucleobase would display T, the nucleobases would correspond to the first column, while in case the second nucleobase would display a C, the nucleobases would correspond to the fourth column.
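The column property can be checked mechanically. The sketch below assumes a column grouping reconstructed from statements made in this description (GA in column 2, TA and AT in column 1 together with GG/CC, GC/CG in column 4 together with AA/TT, TG in column 3) combined with the rule that no two combinations in a column share a base at the same position. It is not copied from Table 1, and the names are hypothetical.

```python
BASES = "ACGT"

# Assumed grouping, reconstructed from the description (not copied from Table 1).
COLUMNS = {
    1: {"GG", "CC", "AT", "TA"},
    2: {"GA", "AG", "CT", "TC"},
    3: {"AC", "TG", "CA", "GT"},
    4: {"GC", "CG", "AA", "TT"},
}
COLUMN_OF = {pair: col for col, pairs in COLUMNS.items() for pair in pairs}


def single_base_variants(pair):
    """Yield every pair that differs from `pair` in exactly one position."""
    for pos in range(2):
        for base in BASES:
            if base != pair[pos]:
                yield pair[:pos] + base + pair[pos + 1:]


# Within a column, all four bases occur exactly once at each position.
for pairs in COLUMNS.values():
    for pos in range(2):
        assert len({p[pos] for p in pairs}) == 4

# Hence every single-base substitution lands in a different column.
assert all(
    COLUMN_OF[variant] != COLUMN_OF[pair]
    for pair in COLUMN_OF
    for variant in single_base_variants(pair)
)

# The example from the text: GA (column 2) becomes TA (column 1) or GC (column 4).
print(COLUMN_OF["GA"], COLUMN_OF["TA"], COLUMN_OF["GC"])  # 2 1 4
```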

To allow detection of an erroneous nucleobase, an error protection nucleobase is added.

The exact process may be subject to implementation.

We will first describe certain thoughts on the attribution of error protection nucleobases and later-on describe one example of implementation. We will refer to table 1 and figure 2.

Suppose the two nucleobases ST1=AC. For the respective column, the error protection nucleobase is C. Hence the error protection nucleobase ET1=C is associated to the ternary string ST1. For the second ternary string ST2=AT the error protection nucleobase is found in column 1. Hence the error protection nucleobase ET2=A is associated to the ternary string ST2. The process moves on to the third ternary string ST3=TT. The respective error protection nucleobase is found in column 4. Hence the error protection nucleobase ET3=T is associated to the ternary string ST3. The process moves on to the fourth ternary string ST4=TT. The respective error protection nucleobase is again found in column 4. Hence the error protection nucleobase ET4=T is associated to the ternary string ST4. Next, the process moves on to the fifth ternary string ST5=TG. The respective error protection nucleobase is found in column 3. Hence the error protection nucleobase ET5=C is associated to the ternary string ST5. This process is shown in the second line of Figure 2.

Now, have a closer look at the third and fourth ternary strings and their respective error protection nucleobases. They would all be T. As 6 (or more) consecutive Ts are understood as an unwanted homopolymer, we adapt the attribution scheme. In case a certain pattern would appear, like TTTTTT, the error protection nucleobase of the second ternary string is attributed according to a different scheme, indicated in table 1 as E'T. I.e. in the case of the fourth ternary string TT, the error protection base is chosen according to the last line of table 1, i.e. the error protection base is selected in the 4th column as G.

In this way, the data-encoding two-base combination and the error detecting base will not match each other if there is any single base error in any of the three bases. In other words, a single base error in any of the three bases can be detected by checking the compatibility between the two-base combination and the error detecting base during decoding. I.e. the error protection nucleobase selection takes into account at least a preceding error protected block, i.e. here the error protected block formed of ST3 and ET3 is taken into account. This process is shown in the third line of Figure 2.

Obviously, there might be several ways to implement such a processing.

One example of such an implementation is as follows:

First, error protection nucleobases are attributed according to the normal scheme. Afterwards, presence of certain patterns is examined. In case a certain pattern is present, the respective error protection nucleobase is altered according to a secondary scheme.
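A minimal sketch of such an implementation is given below. It is a single-pass variant: instead of assigning all bases first and correcting afterwards, it switches to the secondary scheme for exactly one block whenever the preceding block is “TTT”. The normal bases per column (A, G, C, T) and the secondary base G for column 4 follow the examples in this description; the column grouping is a reconstruction, and the secondary bases for columns 1 to 3 are purely hypothetical placeholders, since Table 1 is not reproduced here.

```python
# (pairs in column, normal base, secondary base)
# Secondary bases for the first three columns are hypothetical placeholders.
_COLUMNS = [
    ("GG CC AT TA", "A", "C"),
    ("GA AG CT TC", "G", "A"),
    ("AC TG CA GT", "C", "T"),
    ("GC CG AA TT", "T", "G"),
]
NORMAL = {p: n for pairs, n, _ in _COLUMNS for p in pairs.split()}
ALTERNATE = {p: a for pairs, _, a in _COLUMNS for p in pairs.split()}


def protect(pairs, trigger="TTT"):
    """Append one error protection base to each two-base ternary string.

    The secondary scheme is used only for the block directly after `trigger`.
    """
    blocks = []
    for pair in pairs:
        scheme = ALTERNATE if blocks and blocks[-1] == trigger else NORMAL
        blocks.append(pair + scheme[pair])
    return blocks


pairs = ["AC", "AT", "TT", "TT", "TG"]
normal_only = [p + NORMAL[p] for p in pairs]
adapted = protect(pairs)
print(" ".join(normal_only))  # ACC ATA TTT TTT TGC
print(" ".join(adapted))      # ACC ATA TTT TTG TGC
```

With the five ternary strings of the example, the first output line corresponds to the attribution by the normal scheme alone and the second to the adapted attribution, in which the run of Ts is interrupted by G.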

Along with the attribution scheme for error protection nucleobases, table 1 provides an attribution scheme which avoids extreme GC sequences and long homopolymers.

The error-detecting bases may be assigned in such a way that (1) “A” or “T” are assigned to “GC/GG/CG/CC” containing columns to avoid extreme GC sequences; (2) three-base homopolymers are avoided.

To fulfill requirement (1), the two-base combinations of “GC/GG/CG/CC” are kept within two columns such that “A” and “T” are available for assigning the error-detecting base.

Therefore, in the example “GC/CG” is arranged in column 4 while “GG/CC” is arranged in column 1.

Based on the same idea, the allowed two-base combinations that could be included in these two “GC/GG/CG/CC” containing columns are ”AA/TT” or “AT/TA”.

Hence, by means of the invention it is possible to encode quickly and reliably in a manner allowing for long-term storage.

Now as this scheme may still lead to TTT sequences, it may be foreseen in advantageous embodiments that when a certain pattern arises (like TTT), the next error protection base is selected according to a different process to ensure that the same pattern is avoided.

Naturally, the pattern may be different subject to the coding scheme, e.g. TTT or GGG or CCC or AAA.

The next block is then handled again according to the normal selection scheme.

During the encoding process, a normal selection scheme is used in general. When a three-base combination “TTT” is encountered, the coding scheme for assigning the subsequent error-detecting base is switched temporarily to a different selection scheme and switched back to the normal selection scheme immediately after encoding one three-base block. Furthermore, extreme GC combinations (i.e. a high GC content) are avoided, as combinations showing such content (e.g. 3 consecutive nucleobases consisting merely of G or C, such as GGG, GGC, CCG or CCC) are, in said different selection scheme, only present if the previous three-base block is “TTT”, i.e. if said different selection scheme was selected, which is rarely the case. I.e. by means of switching the coding scheme according to a pattern noticed in a preceding error protected block, one may even avoid 7 consecutive nucleobases of the same type, thereby improving reliability.

It is to be noted that when the error protection nucleobase is not added to the tail but to the header or even in between nucleobases of a ternary string, the attribution scheme may be adopted accordingly.

Now we will turn to the decoding process. The decoding process may be performed in pretty much the same scheme.

First a string of three consecutive nucleobases is received, e.g. ST1, ET1. We again assume that the first and second nucleobases represent information and said third nucleobase represents an error protection nucleobase. These three consecutive nucleobases represent an error protected block B1. Further blocks B2, … BN may be received as well.

Then each error protected block B1, B2, … BN is checked as to whether it is error-free on the basis of the first and second nucleobases and the error protection nucleobase expected for them.

This checking may be done by means of table 1.

Suppose we receive the blocks as indicated in the first line of Figure 3.

There, based on the first two nucleobases AG in the first block, we would expect G as error protection base ET1. However, the received error protection nucleobase is ET1=C. Hence, we may deduce that an error is present, i.e. the first block was not correctly received.

If the received error protection base matches the expected error protection base, the error protected block is held to be error-free. Consequently, the two nucleobases containing the stored information may be translated back into a binary string according to table 1.

Hence, by means of the invention it is possible to decode quickly and reliably in a manner allowing for long-term storage.

Again, it may be foreseen that the coding avoids long homopolymers. Then the respective process needs to be implemented with respect to detection as well.

Here we may again rely on the same scheme as for the encoding. I.e. a selection scheme takes into account at least a preceding error protected block, if present.

That is, in case of the block B4 it is taken into account that block B3 matches a certain pattern and consequently a different error protection nucleobase is to be expected. Again, as in the encoding example the pattern may be TTT.

The next block B5 is then handled again according to the normal selection scheme.
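The block check with scheme switching can be sketched as follows. As before, the column grouping is a reconstruction from this description, the secondary bases for columns 1 to 3 are hypothetical placeholders, and the received sequence is an assumed example whose first block is the AG + C case discussed above; it is not taken from Figure 3.

```python
# (pairs in column, normal base, secondary base); see the assumptions above.
_COLUMNS = [
    ("GG CC AT TA", "A", "C"),
    ("GA AG CT TC", "G", "A"),
    ("AC TG CA GT", "C", "T"),
    ("GC CG AA TT", "T", "G"),
]
NORMAL = {p: n for pairs, n, _ in _COLUMNS for p in pairs.split()}
ALTERNATE = {p: a for pairs, _, a in _COLUMNS for p in pairs.split()}


def check_blocks(blocks, trigger="TTT"):
    """Return one flag per block: True if the received error protection
    base equals the base expected for the first two nucleobases."""
    flags = []
    previous = None
    for block in blocks:
        # The block directly after `trigger` is checked with the secondary scheme.
        scheme = ALTERNATE if previous == trigger else NORMAL
        flags.append(scheme[block[:2]] == block[2])
        previous = block
    return flags


received = ["AGC", "ATA", "TTT", "TTG", "TGC"]
flags = check_blocks(received)
print(flags)  # [False, True, True, True, True]
```

One consequence of this design is that the scheme used for a block depends on the received preceding block, so an error in a block that should read “TTT” can also cause the following block to be flagged.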

During the encoding process, a normal selection scheme is used in general. When a three-base combination “TTT” is encountered, the coding scheme for assigning the error-detecting base is switched temporarily to a different selection scheme and switched back to said normal selection scheme immediately after encoding one three-base block. Furthermore, extreme GC combinations can be avoided, as three-base combinations consisting merely of G or C (e.g. GGG, GGC, CCG or CCC) are, in the different selection scheme, only present if the previous three-base block is “TTT”. I.e. by means of switching the coding scheme according to a pattern noticed in a preceding error protected block, one may even avoid 7 consecutive nucleobases of the same type, thereby improving reliability.

Now, as DNA may be reproduced in numerous copies, one may advantageously benefit from multiple copies of the same original.

Suppose that one has received multiple copies showing errors. Then one may reproduce from said plurality of copies a copy which most likely represents the original.

In Figure 3 such erroneous copies of the same sequence are displayed. For ease of understanding, the erroneous nucleobases are indicated in bold italic face. Note that, even though errors are shown only in the second nucleobase of a block, the same process applies if an error is found in a first nucleobase or in an error protection nucleobase.

Now, within each sequence we see one error. Fortunately, the errors are of a different nature, i.e. within each sequence a different block is erroneous. While in sequence 1 block B1 is erroneous, in sequence 2 block B3, in sequence 3 block B2 and in sequence 5 block B5 is erroneous. Hence, we may deduce from all sequences a sequence which is error-free. I.e. if a plurality of representations of the same block is received, erroneous representations may be replaced by error-free representations.
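Replacing erroneous representations by error-free ones can be sketched as below. The four copies are assumed examples with one second-nucleobase error each, not the sequences of Figure 3; the column grouping is reconstructed from this description and the secondary bases for columns 1 to 3 are hypothetical placeholders. Among the copies in which a block passes the check, the most frequent representation is kept.

```python
from collections import Counter

# (pairs in column, normal base, secondary base); see the assumptions above.
_COLUMNS = [
    ("GG CC AT TA", "A", "C"),
    ("GA AG CT TC", "G", "A"),
    ("AC TG CA GT", "C", "T"),
    ("GC CG AA TT", "T", "G"),
]
NORMAL = {p: n for pairs, n, _ in _COLUMNS for p in pairs.split()}
ALTERNATE = {p: a for pairs, _, a in _COLUMNS for p in pairs.split()}


def check_blocks(blocks, trigger="TTT"):
    """Flag each block as error-free (True) or erroneous (False)."""
    flags, previous = [], None
    for block in blocks:
        scheme = ALTERNATE if previous == trigger else NORMAL
        flags.append(scheme[block[:2]] == block[2])
        previous = block
    return flags


def merge_copies(copies):
    """For each block position keep the most frequent error-free version,
    or None if no copy passed the check at that position."""
    flags = [check_blocks(copy) for copy in copies]
    merged = []
    for i in range(len(copies[0])):
        good = [copy[i] for copy, f in zip(copies, flags) if f[i]]
        merged.append(Counter(good).most_common(1)[0][0] if good else None)
    return merged


copies = [
    ["AGC", "ATA", "TTT", "TTG", "TGC"],  # error in block 1
    ["ACC", "ATA", "TAT", "TTG", "TGC"],  # error in block 3
    ["ACC", "AGA", "TTT", "TTG", "TGC"],  # error in block 2
    ["ACC", "ATA", "TTT", "TTG", "TAC"],  # error in block 5
]
merged = merge_copies(copies)
print(" ".join(merged))  # ACC ATA TTT TTG TGC
```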

Without further detailing, it is apparent that the method steps of the encoding as well as the decoding may be embodied in respective synthesis systems for encoding and sequencing systems for decoding.

Although not further detailed, it may be envisaged that adding further higher-level coding schemes, like those known from the binary world, allowing for error detection and error correction, e.g. by implementing Hamming distances when mapping binary information to ternary information and/or additional redundancy, code overlapping, spreading, etc., e.g. Reed-Solomon error correction coding, may additionally improve the methods and systems.

In addition, one may provide certain arrangements to detect the proper start of a block. For that purpose, different methods may be applied. A basic process is to first read at least 5 sequences. The first sequence is held as the starting sequence, and consequently from the first and second sequence one can determine the expected error protection sequence. Then one may shift the starting sequence and perform the error protection determination again. This process may be repeated. The most likely starting sequence of a block may be identified upon detecting no error. I.e. in the case of row 2, one would examine ACC, CCA and CAT as candidate blocks. Taking Table 1 into account, one would deduce that ACC and CCA would be valid blocks while CAT would not be a valid block.

Obviously, accuracy of this procedure may be improved by taking a plurality of blocks into account.

Now we assume, in the case of row 2, that the sequences ACC ATA, CCA TAT and CAT ATT are examined. ACC ATA are two valid consecutive blocks. Within the second sample CCA TAT, TAT would not be valid, while within the third sample CAT ATT, both CAT and ATT would not be valid.
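The shifting procedure can be sketched as follows for the row 2 sequence, i.e. the five example blocks with error protection bases assigned by the normal scheme only. The pair-to-base mapping is a reconstruction from this description rather than a copy of Table 1, the function name is hypothetical, and the third window reads CAT ATT when sliced from that sequence.

```python
# Assumed normal scheme: pairs per column and their error protection base.
NORMAL = {
    p: base
    for pairs, base in [
        ("GG CC AT TA", "A"),
        ("GA AG CT TC", "G"),
        ("AC TG CA GT", "C"),
        ("GC CG AA TT", "T"),
    ]
    for p in pairs.split()
}


def candidate_frames(seq, n_blocks=2):
    """For each of the three possible block starts, cut `n_blocks` candidate
    blocks and mark each as valid (True) or invalid (False)."""
    frames = {}
    for offset in range(3):
        blocks = [seq[i:i + 3] for i in range(offset, offset + 3 * n_blocks, 3)]
        frames[offset] = [(b, NORMAL[b[:2]] == b[2]) for b in blocks]
    return frames


row2 = "ACCATATTTTTTTGC"  # ACC ATA TTT TTT TGC
frames = candidate_frames(row2)
for offset, result in frames.items():
    print(offset, result)
# 0 [('ACC', True), ('ATA', True)]
# 1 [('CCA', True), ('TAT', False)]
# 2 [('CAT', False), ('ATT', False)]

# The offset with the most valid blocks is taken as the most likely start.
best = max(frames, key=lambda o: sum(ok for _, ok in frames[o]))
print(best)  # 0
```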

Obviously there may also be other measures to indicate the proper start of a sequence. Such measures may include a predefined start sequence. Such a start sequence may also be repeated in predefined distances, e.g. every 8th block.

By means of the invention, an information density far above that of commercial technologies such as hard disks may be provided. Furthermore, the storage of information by means of (synthetic) DNA provides for long-term storage which needs little maintenance effort.

The invention envisages making use of a feature of DNA synthesis for digital information storage, namely that many copies of each DNA fragment are always synthesized. In other words, a high “natural” redundancy is already introduced by DNA synthesis.

Additionally, as DNA fragments are special molecules with potential biological activities, the encoding scheme should, with regard to biological safety, provide ways to prevent the encoded DNA fragments from forming biologically dangerous sequences (e.g. viruses).

By usage of coding schemes, biological safety may be maintained.

In the described example, the invention proposes to benefit from natural redundancy for error detection and correction.

The proposed encoding scheme is a self-error-detecting three-base block (SED3B) scheme for DNA digital data encoding and decoding.

The proposed SED3B scheme can tolerate relatively high error rates (>40%) in the synthesis (information writing) and sequencing (information retrieval) of DNA.

By using the SED3B scheme, the data storage density can be increased to about 9 zettabytes (9 x 10^21 bytes or 9 x 10^6 petabytes) per gram of DNA. DNA sequences encoded according to the proposed method can limit extreme GC content, remove homopolymers longer than 6 bp and show a much simpler secondary structure. Furthermore, the encoding scheme provides strong biological safety of data storage.

Claims

Translation of the claims

A digital lock system comprising a digital lock (LK) and a server (SRV), wherein a first registered user can lock the digital lock (LK) and request the generation of a close and open key (AP), the server (SRV), the digital lock (LK) and the communication device of the first user (SP1) each have a public key and a private key, wherein the first registered user by using a communication device of the user (SP1) and the digital lock (LK) each communicating with the server (SRV), and wherein the first registered user can communicate with the digital lock (LK) by using a communication device of the user (SP1), wherein the server (SRV) has public keys of the digital lock (LK) and the communication device of the first user (SP1) as well as challenges (CLK, CSP1) for the communication device of the user s (SP1) and the digital lock (LK1) and a response (RSpi) to the challenge for the user's communication device.

A digital lock (LK) for a vehicle within a digital lock system according to claim 1, wherein the digital lock (LK) is capable of communicating with a communication device of the user via a near field communication system, wherein a user (SP1) sets the digital lock (LK ) via communication with the digital lock via the near field communication system, the digital lock (LK) being equipped with a second communication device allowing access to a wireless communication system, the digital lock (LK) being related to its location • wherein the digital lock (LK) further comprises one or more sensors to detect theft attempts, the digital lock (LK) further comprising alarm means, the alarm means being activated when a theft attempt is detected.

The digital lock of claim 2, wherein the digital lock (LK) further comprises a light, the light having a detector to detect ambient light conditions, the light being turned on in response to a detected lack of ambient light.

4. Digital lock (LK) according to claim 2 or 3, wherein the vehicle is a land vehicle, an aircraft or a watercraft.

A digital lock (LK) for a safe within a digital lock system according to claim 1, wherein the digital lock (LK) is capable of communicating with a communication device of the user via a near field communication system, wherein the user locks the digital lock by means of communication with the digital lock can be opened via the near field communication system, thereby allowing physical access to the locker, the digital lock further being equipped with a second communication device allowing access to a wireless communication system, the digital lock being in relation to its location can be tracked by the second communication device.

The digital lock (LK) of claim 5, wherein the digital lock further comprises a shaft movable relative to the housing, the digital lock further comprising a compartment for storing physical objects

The digital lock (LK) for a door within a digital lock system according to claim 1, wherein the digital lock (LK) is capable of communicating with a communication device of the user via a near field communication system, wherein the user locks the digital lock of the door by means of communication with the user opening the digital lock via the near field communication system, thereby opening the door, • wherein the digital lock is further equipped with a second communication device, which allows access to a wireless communication system.

The digital lock (LK) of any one of claims 2 to 7, wherein the digital lock has a physical I / O interface that allows wired access to the digital lock for close / open operations.

The key for a digital lock (LK) according to claim 7, wherein the key has a display for displaying location information of doors that can be opened / closed by the key.

10. Digital lock (LK) according to one of claims 2 to 8, wherein the digital lock has an energy storage.

11. Digital lock (LK) according to claim 10, wherein the energy storage allows charging by means of an energy harvester.