AU2018204252A1

AU2018204252A1 - Methods of coding and decoding information

Info

Publication number: AU2018204252A1
Application number: AU2018204252A
Authority: AU
Inventors: Sergey Nikolayevich Smirnov
Original assignee: Landigrad LLC
Current assignee: Landigrad LLC
Priority date: 2017-06-14
Filing date: 2018-06-14
Publication date: 2019-01-17
Anticipated expiration: 2038-06-14
Also published as: CN109086890A; EP3416076A1; ZA201803894B; TW201905691A; WO2018231100A1; RU2659025C1; US20180365276A1; KR20180136407A; AU2018204252B2; KR20200014926A; KR102269463B1; IL259907A; JP2019009776A; TWI673604B; CA3008070A1

Abstract

The inventions fall into the category of means of coding/decoding all types of information (text files, images, sound files) using DNA, RNA and amino acid sequences. A coding method is suggested according to which a data array to be coded is divided into chunks, each of which is assigned at least one element of the molecular genetic system; the set of elements in the system is supplemented by relevant indexing information consisting of i data bits and k check bits. Each piece of i+k bit data shall be written as a multiplet of the molecular genetic system consisting of n nitrogenous bases or corresponding amino acids. Besides the information to be coded, one shall also write the value n and the information about the selected method of transforming the matrix into a sequence and the reading order. A decoding method is suggested according to which the machine-readable sequence is divided into chunks of information to be decoded, which include combinations of i data bits and k check bits, and each chunk is assigned at least one multiplet of the molecular genetic system consisting of n nitrogenous bases or corresponding amino acids. The molecular genetic system used for the coding/decoding consists of a base representing a four-nitrogenous-base matrix [G C; A T/U]; the nitrogenous bases are grouped according to the number of hydrogen bonds and the number of condensed rings contained in the molecular structures of nitrogenous bases; the matrix is formed in advance by raising tensors of each previous-generation matrix to the second power multiple times in order to obtain a next-generation matrix. The system binary indices that form the i data bits correspond to the selected characteristic of the nitrogenous bases. As a result, the effectiveness of noise-immune coding and decoding is enhanced due to the increase in the volume of transmitted and received information and due to the decrease in the number of elements used for coding/decoding.
2 independent claims, 33 dependent claims, 17 illustrations.

Table (the Latin and Russian alphabet columns could not be separated in this copy; their entries are given as extracted)

Decimal triplet code	Triplet for DNA	Triplet for RNA	Amino acid	Indexing information	Latin / Russian alphabet
63	GGG	GGG	Gly (G)	111111	space
62	GGC	GGC	Gly (G)	111110	capital letter
61	GCG	GCG	Ala (A)	111101	
60	GCC	GCC	Ala (A)	111100	(dot)
59	CGG	CGG	Arg (R)	111011	, (comma)
58	CGC	CGC	Arg (R)	111010	; (semicolon)
57	CCG	CCG	Pro (P)	111001	3
56	CCC	CCC	Pro (P)	111000	31
55	GGA	GGA	Gly (G)	110111	M
54	GGT	GGU	Gly (G)	110110	_L
53	GCA	GCA	Ala (A)	110101	enter; new paragraph;
52	GCT	GCU	Ala (A)	110100	_0
51	CGA	CGA	Arg (R)	110011	__
50	CGT	CGU	Arg (R)	110010	_b
49	CCA	CCA	Pro (P)	110001	b
48	CCT	CCU	Pro (P)	110000	? (question mark)
47	GAG	GAG	Glu (E)	101111	_1
46	GAC	GAC	Asp (D)	101110	_)
45	GTG	GUG	Val (V)	101101	number
44	GTC	GUC	Val (V)	101100	
43	CAG	CAG	Gln (Q)	101011	
42	CAC	CAC	His (H)	101010	
41	CTG	CUG	Leu (L)	101001	
40	CTC	CUC	Leu (L)	101000	14
39	GAA	GAA	Glu (E)	100111	<> (quotation marks)
38	GAT	GAU	Asp (D)	100110	F
37	GTA	GUA	Val (V)	100101	(D
36	GTT	GUU	Val (V)	100100	E
35	CAA	CAA	Gln (Q)	100011	
34	CAT	CAU	His (H)	100010	color
33	CTA	CUA	Leu (L)	100001	() (parentheses)
32	CTT	CUU	Leu (L)	100000	b b
31	AGG	AGG	Arg (R)	011111	- (minus/dash)
30	AGC	AGC	Ser (S)	011110	+ (plus)

Description

METHODS OF CODING AND DECODING INFORMATION

The present inventions relate to the fields of biochemistry, bio-pharmacology, bio-technology, genetic engineering, and hands-on programming of embedded physical and biochemical processes and technologies, namely to methods of coding all types of information (text files, images, sound files) using DNA, RNA and amino acid sequences for its further storage, processing, and transmission/receipt.

There is a DNA-based method of coding text information (patent # US 6,312,911, IPC C12Q 1/68, G06N 3/12, 2001) describing the process of coding text messages into DNA sequences and subsequent message extraction using the molecular genetic system, each element of which consists of three nitrogenous bases of the DNA and represents an alphanumeric symbol. Since the DNA has 4 bases (A - adenine, T - thymine, C - cytosine, G - guanine), the maximal triplet representation of the known molecular genetic system will consist of 64 unique symbols, a number equal to the number of ordered combinations of three bases drawn from the four nitrogenous bases (4 x 4 x 4 = 64).
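As a quick check of the count given above, the following minimal sketch enumerates every ordered combination of n bases. The function name `all_multiplets` is introduced here for illustration only and does not come from the cited patent.

```python
from itertools import product

BASES = "ATCG"  # the four DNA bases named in the text


def all_multiplets(n):
    """Return every ordered combination of n bases (4**n in total)."""
    return ["".join(p) for p in product(BASES, repeat=n)]


triplets = all_multiplets(3)
print(len(triplets))  # number of distinct triplets
```

For n = 3 the list holds 4³ = 64 distinct triplets, which is the upper bound on the number of symbols a triplet-based alphabet can represent.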

There is a method of coding any type of information (text files, images or sound files) described in the patent # US 2005/0053968 (IPC G06F 19/00, C12Q 1/68, G06N 3/12, G01N 33/48, G11B 20/00, G01N 33/50, G11C 13/02, 2005), which uses the molecular genetic system consisting of various combinations of the four nitrogenous bases of the DNA (G, A, C, T), where each combination represents a unique symbol. The known method is used to create a synthetic DNA molecule that will include the digital information and the encryption key. The synthetic DNA is placed into the DNA carrier for the purpose of storage. In cases where the amount of DNA is too big, the information can be divided into several chunks. The method described in the patent can restore divided chunks of DNA by means of comparing the primer of one of the chunks with the end primers of one of the possible subsequent chunks.

The application of the known methods is limited since they cannot be used for effective coding of large volumes of information and have low noise-immunity.

The prior art provides data on redundant coding of digital information with a noise-immune code, in which k check bits are added to data bits in order to increase the noise-immunity; thanks to the check bits one can identify and/or correct errors that may appear in the decoder due to the communication channel. A method of noise-immune coding that uses check bits was described, e.g., in the patent No. RU 2408979 (IPC H03M 13/19, 2011).

The method that is the closest to the one stated in the claims is the method of coding information, according to which the data array to be coded is divided into discrete components (symbols), and each selected symbol is assigned at least one triplet; then a unique DNA sequence is constructed, which is split into a set of overlapping chunks of DNA. The set of DNA chunks is then supplemented by relevant indexing information consisting of i data bits, through which the information being coded is represented as a machine-readable sequence in binary form (patent # US 61/654,295, IPC G06F 19/00, 2013).

The drawback of this method is its comparatively high redundancy which takes place due to the fact that quite a substantial number of triplets is required for the coding, which can lead to a decrease in the efficiency of coding and errors in coding; another drawback is the low information capacity of each of the nitrogenous bases which is about 1.83 bits.

The capacity of DNA to hold vast amounts of information is one of its properties currently being researched so that DNA can be used as a data carrier. DNA molecules ensure a high density of stored information; they are durable and can store information for hundreds of years if kept in proper conditions (i.e., cool, dry and dark areas). From the perspective of the theory of noise immunity in digital communication and the transmission of discrete signals, one can say that genetic information is inherently immune to noise. However, the problem that arises when using the known methods of coding and decoding information remains unsolved: the original information gets distorted during the coding process due to various external factors, such as defective DNA synthesis, degradation of DNA molecules over time and errors in the sequence structure. That is why, despite the previous developments, systematic research and classification of all specific interactions between the sequences of nitrogenous bases are still needed in order to define the conditions of effective, fast and error-free coding of vast volumes of information.

These inventions aim to solve the problem of storing large arrays of information without data loss while using the minimal number of elements of the physical storage medium.

As a result, if the suggested solution is implemented, the effectiveness of noise-immune coding and decoding will be enhanced due to the increase in the volume of transmitted and received information and due to the decrease in the number of elements used for the coding and decoding.

The specified result can be achieved through coding, in which a data array is divided into logically complete chunks, each of which is assigned at least one element of the molecular genetic system used for the coding; the set of the elements in the system is supplemented by relevant indexing information consisting of i data bits; each code combination of data bits is supplemented by a combination of k check bits defined based on the combination of i data bits; whereas, according to the invention, each piece of i+k bit data in binary form shall be written as a multiplet that shall consist of n nitrogenous bases or corresponding amino acids and be an element of the molecular genetic system, consisting of a base representing a four-nitrogenous-base matrix

[ G   C   ]
[ A   T/U ]

in which the nitrogenous bases are grouped according to the number of hydrogen bonds and the number of condensed rings contained in the molecular structures of nitrogenous bases; the matrix is formed in advance by raising tensors of each previous-generation matrix to the second power multiple times in order to obtain a next-generation matrix. The system binary indices that form the i data bits correspond to the selected characteristic of the nitrogenous bases. Together with the information being coded, one shall write the n value that defines the size of the matrix used for coding, and the information about the selected method of transforming the matrix into a sequence and the reading order.

Whereas, according to the invention, the information to be coded is lined up as a machine-readable sequence in binary form including binary indices for each multiplet.

Whereas, according to the invention, the information to be coded is lined up as a sequence of nitrogenous bases.

Whereas, according to the invention, each code combination of i data bits is supplemented by a combination of m control bits defined depending on the combination of i and k bits.

Whereas, according to the invention, judging by the position of each multiplet in the matrix of the molecular genetic system, one shall identify whether the corresponding logically complete chunk has a recessive or dominant trait.

Whereas, according to the invention, the sequence of nitrogenous bases is split into logically complete chunks, in which the information about the beginning and the end of the chunk of information shall also be coded.

Whereas, according to the invention, the information to be coded is lined up as a sequence of nitrogenous bases in order to be coded on the DNA level.

Whereas, according to the invention, the information to be coded is lined up as a sequence of nitrogenous bases in order to be coded on the RNA level.

Whereas, according to the invention, the information to be coded is lined up as a sequence of amino acids.

Whereas, according to the invention, the information to be coded is lined up as a sequence of nitrogenous bases to be coded on the protein level.

Whereas, according to the invention, the molecular genetic system has a linear matrix representation.

Whereas, according to the invention, the molecular genetic system has a square matrix representation.

Whereas, according to the invention, the molecular genetic system has a rectangular matrix representation.

Whereas, according to the invention, the molecular genetic system has a circular matrix representation.

Whereas, according to the invention, the molecular genetic system has a volume representation.

Whereas, according to the invention, the molecular genetic system has a tree structure representation.

Whereas, according to the invention, the binary indices of the molecular genetic system correspond to the selected characteristic of nitrogenous bases for the DNA.

Whereas, according to the invention, the binary indices of the molecular genetic system correspond to the selected characteristic of nitrogenous bases for the RNA.

Whereas, according to the invention, the binary indices of the molecular genetic system correspond to the selected characteristic of nitrogenous bases for amino acids.

Whereas, according to the invention, the binary indices of the molecular genetic system correspond to the selected characteristic of nitrogenous bases forming recessive and dominant traits on the genetic level.

Whereas, according to the invention, the synthetic DNA formed based on the obtained sequence of amino acids and containing the information to be coded shall be put in a storage medium.

Whereas, according to the invention, the synthetic DNA formed based on the obtained sequence of nitrogenous bases and containing the information to be coded shall be put in a medium for logical and mathematical calculations.

The specified result shall be obtained by means of decoding, in which a machine-readable sequence is divided into chunks consisting of logically complete fragments of information to be decoded; the chunks include combinations of i data bits and k check bits, and each logically complete fragment is assigned at least one multiplet - which is an element of the molecular genetic system used for decoding - of n nitrogenous bases or corresponding amino acids; whereas, according to the invention, the molecular genetic system consists of a base representing a four-nitrogenous-base matrix

[ G   C   ]
[ A   T/U ]

in which the nitrogenous bases are grouped according to the number of hydrogen bonds and the number of condensed rings contained in the molecular structures of nitrogenous bases; the matrix is formed in advance by raising tensors of each previous-generation matrix to the second power multiple times in order to obtain a next-generation matrix. The system binary indices that form the i data bits correspond to the selected characteristic of the nitrogenous bases, and the indices that form the k check bits are defined based on the combinations of i data bits.

Whereas, according to the invention, the discrete parts of the machine-readable sequence, consisting of logically complete chunks of information to be decoded, include combinations of m control bits; the indices forming the m control bits are defined based on the combinations of i data bits.

Whereas, according to the invention, on the assumption of the position of each multiplet in the matrix of the molecular genetic system, one shall identify whether the corresponding logically complete chunk has a recessive or dominant trait.

In molecular biology, the central dogma describes the flow of information from DNA to RNA and from nucleic acids to protein. The transition of genetic information from DNA to RNA and from RNA to amino acids forming protein complexes is standard for all organisms that consist of cells and is the basis of the biosynthesis of macromolecules. DNA, RNA, amino acids and proteins are linear polymers, which means that each monomer in their structure is linked up, as a rule, to two other monomers. The sequence of monomers encodes the information, whose transition is described by the central dogma.

The reproduction of DNA molecules and synthesis of RNA molecules is carried out in such a way that the DNA strand serves as a matrix (template) for the construction of a daughter molecule (template-based synthesis). This method ensures that hereditary information is copied and implemented in the amino acid (protein) synthesis. The flow of information includes three types of template-based syntheses: DNA synthesis - replication; RNA synthesis - transcription; and protein synthesis - translation. In addition, there is a template-based synthesis that corrects errors in the structure of DNA (RNA), and a variation of limited replication (SOS repair) that restores the original structure of DNA (RNA). The template nature of the nucleic acid and protein synthesis ensures the high accuracy of data reproduction.

Currently, researching the principles of genetic code noise-immunity is of great importance for the technical development, since it can help us to solve the problem of providing information systems with noise-immunity. If information is coded using nitrogenous bases of DNA (RNA) or amino acids, it will represent a very complex binary combination that is defined through the arrangement of multiplets in the system. Decoding information represented in such a way will require an elaborate mathematical approach. The hierarchical system used for coding and decoding allows us to define the stability of each multiplet, its general molecular structure and many other parameters.

The molecules of proteins, nucleic acids (DNA, RNA) and polysaccharides that build up tissues, organs, the intracellular structure (cytoskeleton) and the extracellular matrix, membrane channels, receptors and molecular machines intended for the synthesis, packing and disposal of proteins and nucleic acids relate to biological nanoobjects. The size of protein molecules can be from 1 to 1,000 nm. The diameter of the helix of DNA is 2 nm and its length can reach up to several centimeters. Protein complexes that form the fibers of the cytoskeleton are 7-25 nm thick and are up to several microns long. This feature makes it possible to keep large volumes of information using relatively small amounts of physical storage medium.

The set of the four nitrogenous bases is usually considered to be the elementary alphabet of the genetic code. The genetic information transferred by the hereditary molecules (DNA and RNA) defines the initial structure of proteins of a living organism. Each protein that is being coded represents a strand of 20-22 types of amino acids. The sequence of amino acids in the protein strand is defined by the sequence of triplets (three-letter “words”). A triplet (or a codon) is a combination of three nitrogenous bases located in a sequence on a strand of DNA (or RNA).

Computers usually store information in the form of matrices and process it using matrices of 27 unitary transformations, primarily standardized Hadamard matrices that act as logical units performing various steps after receiving various conditions.

The suggested inventions will allow us to discover structural characteristics of the genetic code based on the mathematics of matrices; the genetic code will be structured as the suggested molecular genetic system representing a hierarchical system of the elements of the genetic code. Whereas, patterns in the suggested system, which can be observed through various characteristics of the nitrogenous bases for DNA, RNA and amino acids, ensure the noise-immunity of the suggested system judging from the viewpoint of the matrix and mathematical methods of the theory of discrete signals and the digital field. Moreover, these patterns can also be observed at the lower (atomic and subatomic) and upper (protein, genetic) levels. Besides, this approach to structuring does not feature the necessary distinction between the DNA and RNA levels. In addition, structuring can be carried out using both major and minor nitrogenous bases.

In addition, one can observe a harmonious and well-adjustable fractal connection in the molecular genetic system used for coding/ decoding on the following levels:

DNA — RNA — Amino acids — Proteins — Genetic traits

Whereas, one can observe clear mathematical relationships on each structural level of the molecular genetic system, when counting the atomic and subatomic components of the nitrogenous bases.

Whereas, it is possible to conduct a color analysis of the molecular genetic system on the structural level, using a black and white chart developed based on the indexing information (1 - black, 0 - white) and color charts, i.e. a fractal color chart and a chart based on the length of the waves of the visible spectrum.

Whereas, apart from the square and circular charts of the molecular genetic system, a linear representation, a rectangular representation, a volume (cubical) representation and a tree structure representation are used.

Claimed inventions are clarified with illustrations, where:

- fig. 1 features the linear matrix representation of the matrix used for the coding, which is formed by 64 triplets, for DNA, RNA and amino acids with binary indexing information assigned to each element of the matrix;

- fig. 2 features the square matrix representation of the triplet matrix for DNA, RNA and amino acids;

- fig. 3 features the rectangular matrix representation of the triplet matrix for DNA, RNA and amino acids;

- fig. 4 features the circular matrix representation of the triplet matrix for DNA, RNA and amino acids;

- fig. 5, 5A feature the volume (cubical) representation of the triplet matrix for DNA/ RNA;

- fig. 6, 6A feature the tree structure representation of the triplet matrix for DNA;

- fig. 7 features the alteration in the number of hydrogen bonds in the linear representation of the triplet matrix for DNA;

- fig. 8 features the alteration in the number of hydrogen bonds in the rectangular representation of the triplet matrix for DNA/ RNA;

- fig. 9 features the alteration in the total number of the atoms of carbon (C), hydrogen (H), nitrogen (N), and oxygen (O) for each nitrogenous base in the linear representation of the triplet matrix for DNA;

- fig. 10 features the alteration in the total number of the atoms of carbon (C) for each nitrogenous base of DNA in the linear representation of the triplet matrix;

- fig. 11 features the alteration in the total number of the atoms of carbon (C) for each nitrogenous base of RNA in the linear representation of the triplet matrix;

- fig. 12 features the total number of the atoms of oxygen (O) for each nitrogenous base of DNA in the linear representation of the triplet matrix;

- fig. 13 features the total number of the atoms of oxygen (O) for each nitrogenous base of RNA in the linear representation of the triplet matrix;

- fig. 14 features the comparative analysis of the major and minor nitrogenous bases for DNA and RNA;

- fig. 15 features the table of relationships between information (variable) and reference (resulting) values of the logical operation of addition modulo 2, in the form of 64 triplets of DNA (RNA);

- fig. 16 features the table of relationships between the resulting vectors and their triplet interpretation;

- fig. 17 features the alphabet completed based on the molecular genetic system of triplets.

In practice, one can come across variations of digital, text, symbol, graphic and mixed information. The purpose of the inventions is to transform any type of information into a sequence of code combinations consisting of i data bits; each code combination shall be supplemented by a combination of k check bits, defined based on the combination of the i data bits, so that the information can be restored. Whereas, the combination of the k check bits shall be selected in accordance with numerical representations that either coincide with the complete orthogonal system of Walsh functions applied in noise-immune coding for discrete signal processing, or have a constant value.

The array of information to be coded shall be divided into logically complete chunks (a series of symbols, images, sounds or biological sequences at the genetic and somatic levels), each of which shall be assigned at least one element of the molecular genetic system consisting of n nitrogenous bases. The set of elements in the system is supplemented by indexing information consisting of i data bits; the indexing information shall correspond to the selected characteristic of the nitrogenous bases for DNA, RNA or amino acids. Based on the indexing information, a code check combination consisting of k check bits shall be formed, e.g., this can be a combination of the values of the Boolean function of addition of variables modulo 2, whose numerical representations coincide with the complete orthogonal system of Walsh functions [I.V. Agafonova “Cryptographic Properties of Non-Linear Boolean Functions,” 2007 http://dha.spb.ru/PDF/cryptoBOOLEAN.pdf]. For instance, the number of corresponding nitrogenous bases for one discrete element of the information to be coded may be equal to 6 (X, Y, Z is the code combination of the element and X'Y'Z' is the code check combination). By means of Z-coding, the information being coded is represented as a machine-readable sequence in binary (discrete) form or as a unique nucleotide sequence made of the n number of multiplets. The n number and the information about the selected method of transforming the matrix into a sequence shall be recorded together with the information being coded. Thanks to the fractal nature of the molecular genetic system suggested for coding, the multiplet sequence may be made up for RNA, DNA, amino acids and proteins.
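The link between addition modulo 2 and Walsh functions can be illustrated with a short sketch. It relies on a standard fact: the linear Boolean functions f_a(x) = a1·x1 + ... + an·xn (mod 2), written as sequences of +1/-1 values, are mutually orthogonal. The function names and the check-bit masks below are hypothetical; the description does not state which functions produce the k check bits, so `check_bits` only shows the general shape of such a computation.

```python
from itertools import product


def linear_function(a, x):
    """Boolean function f_a(x): sum of a_i AND x_i, taken modulo 2."""
    return sum(ai & xi for ai, xi in zip(a, x)) % 2


def walsh_rows(n):
    """Truth tables of all 2**n linear functions, mapped 0 -> +1 and 1 -> -1."""
    points = list(product((0, 1), repeat=n))
    return [[1 - 2 * linear_function(a, x) for x in points] for a in points]


def check_bits(data_bits, masks):
    """Hypothetical check combination: one modulo-2 sum per mask."""
    return tuple(linear_function(mask, data_bits) for mask in masks)


rows = walsh_rows(3)
# Distinct rows have a zero dot product; each row has squared norm 2**3.
dots = [[sum(p * q for p, q in zip(r, s)) for s in rows] for r in rows]

# Illustrative masks only (an assumption, not taken from the description).
example_masks = [(1, 1, 0), (0, 1, 1), (1, 1, 1)]
print(check_bits((1, 0, 1), example_masks))
```

The orthogonality of the rows is what makes such sequences usable for detecting corrupted code combinations in discrete signal processing.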

DNA and RNA include nucleotides that consist of sugar, a phosphate group and nitrogenous bases: cytosine (C), adenine (A), guanine (G), thymine (T) for DNA and uracil (U) for RNA. Whereas, the nitrogenous bases represent specific structures with special biochemical properties. Since the structure of the sugar and phosphate backbone remains the same, it is the characteristics of the five major nitrogenous bases that define the position of the nucleotides, when the basis of the molecular genetic system of the elements is being constructed; the system is made in the form of a rectangular matrix F¹ of the size 2x2.

F¹ =
[ G   C   ]
[ A   T/U ]

where T/U denotes the use of thymine (for DNA) or uracil (for RNA) in the construction of the system.

The basis made of four nitrogenous bases for DNA and four nitrogenous bases for RNA is the medium of a content-rich symmetrical system of distinctive and uniting features that splits the four nitrogenous bases into equivalent pairs based on one of these features. For DNA the basis (F¹) shall be made up with thymine (T), and for the RNA the basis shall be made up with uracil (U).

The term “molecular genetic system” used in this context means a complex of structures and mechanisms of transition of hereditary information (genetic material) characteristic of this specific type [Arefyev, V.A., Lisovenko, L.A. English-Russian Dictionary of Genetic Terms 1995, 407 p.]. Whereas, the system is an object, whose integrity is provided by a cluster of connections and relations between groups of elements united by structures expanded in space and time.

The arguments in support of this structure are as follows:

The double helix of the usual DNA consists of two mutually intertwined polynucleotide strands, whose nitrogenous bases are linked up through hydrogen bonds in pairs. The rigidity of the three-dimensional configuration of the DNA is primarily ensured by the great number of hydrogen bonds between the opposite bases of the strands: the adenine (A) of one of the strands is always opposite to the thymine (T) of the other one, and guanine (G) is always opposite to cytosine (C), which makes one strand of DNA complementary (supplementary) to the other one based on the positions of the nitrogenous bases. RNA represents a single-strand molecule structured just like one of the DNA strands. The RNA nucleotides are very similar to those of DNA, although they are not identical. Three of the nitrogenous bases are exactly the same as in the DNA: adenine (A), guanine (G) and cytosine (C). However, instead of thymine (T), the RNA contains another nitrogenous base similar to thymine in structure, which is uracil (U).
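The pairing rules in this paragraph can be written down as a minimal sketch. The function names are introduced here for illustration; the complement is taken base by base and strand direction is ignored, which is a simplification.

```python
# Pairing rules stated in the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand):
    """Return the base-by-base complementary DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def to_rna_alphabet(strand):
    """Rewrite a DNA sequence in the RNA alphabet (uracil replaces thymine)."""
    return strand.replace("T", "U")


print(complement_strand("GATC"))
print(to_rna_alphabet("GAT"))
```

Applying `complement_strand` twice returns the original sequence, which reflects the mutual complementarity of the two strands.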

The nitrogenous bases are horizontally arranged in the basis (F¹) presented here depending on the number of their hydrogen bonds: the first line of the matrix features a complementary pair with a strong hydrogen bond G - C: it has three hydrogen bonds; the second line features a complementary pair with a weak hydrogen bond A - T (for DNA)/ U (for RNA): it has two hydrogen bonds.

Secondly, a feature that is specific to nitrogenous bases is the fact that they are subdivided into two types: purine bases, which include adenine (A) and guanine (G), and pyrimidine bases, which include cytosine (C), thymine (T) and uracil (U). The molecules of purine and pyrimidine bases are based on the two aromatic heterocyclic compounds pyrimidine and purine. The molecule of purine consists of two condensed rings, and the molecule of pyrimidine consists of one ring. Therefore, the nitrogenous bases in the basis (F¹) presented here are vertically arranged depending on the number of their rings: the first column of the matrix features purine bases, which have a greater number of rings in their molecular structure, and the second column features pyrimidine bases, which have fewer rings.

Therefore, we obtain a basis (F¹) of a system of elements, in which nitrogenous bases are divided in pairs according to their primary characteristics horizontally - with the first line featuring elements with three hydrogen bonds, and the second line featuring elements with two hydrogen bonds - and vertically - with the first column featuring purine bases and the second column featuring pyrimidine bases.

In this subdivision into equivalent pairs based on a specific feature, each nitrogenous base can be supplemented by a binary opposition index: e.g., by one when the feature is very pronounced or by zero when it is not very pronounced. In case of applying the subdivision into binary indices based on the biochemical characteristics of nitrogenous bases, the elementary basis (F¹) shall look as follows:

F¹ =
[ G 11   C 10   ]
[ A 01   T/U 00 ]

The positioning of the nitrogenous bases in the elementary basis graphically demonstrates that according to the first feature the horizontal pairs of the bases G=C (three hydrogen bonds, the strong position corresponding to the figure 1) and A=T/U (two hydrogen bonds, the weak position corresponding to the figure 0) are equivalent, and according to the second feature the vertical pairs G=A (two rings in the molecule, the strong position corresponding to the figure 1) and C=T/U (one ring in the molecule, the weak position corresponding to the figure 0) are equivalent.
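The two-bit index of each base follows directly from the two characteristics just described. The sketch below derives it from the hydrogen-bond and ring counts given in the text; the dictionary and function names are illustrative assumptions.

```python
# Hydrogen bonds in the base pair and ring counts, as stated in the description.
HYDROGEN_BONDS = {"G": 3, "C": 3, "A": 2, "T": 2, "U": 2}
RINGS = {"G": 2, "A": 2, "C": 1, "T": 1, "U": 1}


def base_index(base):
    """Two-bit index: first bit from hydrogen bonds, second bit from rings."""
    row_bit = "1" if HYDROGEN_BONDS[base] == 3 else "0"  # strong pair -> 1
    col_bit = "1" if RINGS[base] == 2 else "0"           # purine -> 1
    return row_bit + col_bit


for base in "GCAT":
    print(base, base_index(base))
```

Thymine and uracil receive the same index, which is why the same basis can serve both the DNA and the RNA level.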

The molecular genetic system for coding was generated by taking the tensor (Kronecker) product of the base with itself. For instance, a next-generation matrix F² for DNA of size 4x4 will look as follows:

GG	GC	CG	CC
GA	GT	CA	CT
AG	AC	TG	TC
AA	AT	TA	TT

And a matrix of the F³ generation for DNA of 8x8 will look as follows:

GGG	GGC	GCG	GCC	CGG	CGC	CCG	CCC
GGA	GGT	GCA	GCT	CGA	CGT	CCA	CCT
GAG	GAC	GTG	GTC	CAG	CAC	CTG	CTC
GAA	GAT	GTA	GTT	CAA	CAT	CTA	CTT
AGG	AGC	ACG	ACC	TGG	TGC	TCG	TCC
AGA	AGT	ACA	ACT	TGA	TGT	TCA	TCT
AAG	AAC	ATG	ATC	TAG	TAC	TTG	TTC
AAA	AAT	ATA	ATT	TAA	TAT	TTA	TTT
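The construction of F² and F³ can be illustrated with a minimal sketch. It assumes that the "product" of two bases is simply the concatenation of their symbols; the function names `kron` and `generation` are hypothetical and are introduced here only for illustration.

```python
# Basis F1: first line = three hydrogen bonds, second line = two hydrogen bonds;
# first column = purines, second column = pyrimidines.
F1 = [["G", "C"],
      ["A", "T"]]

def kron(a, b):
    """Kronecker-style product of two matrices of strings.

    Assumption: multiplication of elements is replaced by string concatenation.
    """
    rows = []
    for a_row in a:
        for b_row in b:
            rows.append([x + y for x in a_row for y in b_row])
    return rows

def generation(n):
    """Return the matrix of generation n (n = 1 gives the basis itself)."""
    matrix = F1
    for _ in range(n - 1):
        matrix = kron(F1, matrix)
    return matrix

F2 = generation(2)   # 4x4 matrix of duplets
F3 = generation(3)   # 8x8 matrix of triplets

for row in F3:
    print("\t".join(row))
```

Under this assumption each quadrant of a generation reproduces the previous generation with one extra leading base, which is the property the text relies on.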

The number of elements in a matrix is determined by the number of combinations of triplets made of the four nitrogenous bases. Each of the triplets has its own specific character because it acts as an eigenvalue of the matrix and has its own corresponding vector of this matrix. For instance, each of the three triplets AAC, ACA and CAA is coupled with its own vector, and, therefore, these triplets are considerably different in this respect.

Therefore, the molecular genetic system was formed by repeating the operation F⁽ⁿ⁾ = F ⊗ F⁽ⁿ⁻¹⁾ an unlimited number of times, where each quadrant of the matrix F⁽ⁿ⁾ completely reproduces the matrix F⁽ⁿ⁻¹⁾ of the previous generation (the previous power). Each new-generation matrix contains the information about all of the previous generations (the matrices in all of the previous powers), and the matrix of the highest power, with the longest multiplets, contains the information about all of the matrices with shorter multiplets. The higher the power of the matrices gets, the bigger is the number of combinations of nitrogenous bases that define the set of the unique elements of the structured system, and this set can be infinite. Thus the suggested system of elements can be used to receive, transmit, store and reproduce a vast volume of information.

In order to make it possible to process information using technical means, this same system can be formed in the binary system based on the index values of the elements in the columns and lines, i.e., based on the combinations of the purine and pyrimidine bases and the number of hydrogen bonds.

For instance, each triplet of the triplet matrix for DNA will be represented in the binary system as a hexagram, each of which will contain, for example, one byte of information, here taken as 6 bits.

G	11	c	10	G	11	c	10	G	11	c	10	G	11	c	10
G	11	G	11	c	10	c	10	G	11	G	11	c	10	c	10
G	11	G	11	G	11	G	11	c	10	c	10	c	10	c	10
A	01	T	00	A	01	T	00	A	01	T	00	A	01	T	00
G	11	G	11	c	10	c	10	G	11	G	11	c	10	c	10
G	11	G	11	G	11	G	11	C	10	c	10	c	10	c	10
G	11	C	10	G	11	c	10	G	11	c	10	G	11	c	10
A	01	A	01	T	00	T	00	A	01	A	01	T	00	T	00
G	11	G	11	G	11	G	11	C	10	c	10	c	10	c	10
A	01	T	00	A	01	T	00	A	01	T	00	A	01	T	00
A	01	A	01	T	00	T	00	A	01	A	01	T	00	T	00
G	11	G	11	G	11	G	11	C	10	c	10	c	10	c	10
G	11	C	10	G	11	C	10	G	11	c	10	G	11	c	10
G	11	G	11	C	10	c	10	G	11	G	11	c	10	c	10
A	01	A	01	A	01	A	01	T	00	T	00	T	00	T	00
A	01	T	00	A	01	T	00	A	01	T	00	A	01	T	00
G	11	G	11	C	10	c	10	G	11	G	11	c	10	c	10
A	01	A	01	A	01	A	01	T	00	T	00	T	00	T	00
G	11	C	10	G	11	C	10	G	11	c	10	G	11	c	10
A	01	A	01	T	00	T	00	A	01	A	01	T	00	T	00
A	01	A	01	A	01	A	01	T	00	T	00	T	00	T	00
A	01	T	00	A	01	T	00	A	01	T	00	A	01	T	00
A	01	A	01	T	00	T	00	A	01	A	01	T	00	T	00
A	01	A	01	A	01	A	01	T	00	T	00	T	00	T	00

Each triplet is built in the binary system starting from the lowest nitrogenous base, since the lowest base is the element of the initial 2x2 matrix F¹, by combining the binary code first according to the feature “number of hydrogen bonds” and then according to the feature “purine-pyrimidine.” The 5’-end is attached to the lowest base, and the 3’-end is attached to the upper base. It was also noticed that the motion of the electrons, as well as the readout, goes from bottom to top, i.e., from the 5’-PO4(-) end to the 3’-OH(+) end.
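The mapping from a triplet to its hexagram can be sketched as follows. The two-bit indices are those of the elementary basis (hydrogen-bond bit first, purine/pyrimidine bit second), and the triplet is read from the 5’ (lowest) base to the 3’ (upper) base. The function names are hypothetical, and treating the 5’ base as the most significant pair of bits in the decimal value is an assumption of this sketch.

```python
# Binary indices from the elementary basis F1:
# first bit  = number of hydrogen bonds (1 = three bonds, 0 = two bonds)
# second bit = base class (1 = purine, 0 = pyrimidine)
BASE_BITS = {"G": "11", "C": "10", "A": "01", "T": "00", "U": "00"}

def hexagram(triplet):
    """Return the 6-bit string of a triplet written 5'->3' (lowest base first)."""
    return "".join(BASE_BITS[base] for base in triplet.upper())

def hexagram_value(triplet):
    """Decimal value of the hexagram (assumption: 5' base is most significant)."""
    return int(hexagram(triplet), 2)

print(hexagram("GGA"), hexagram_value("GGA"))
print(hexagram("AUG"), hexagram_value("AUG"))
```

Because T and U share the index 00, the same function serves both the DNA and the RNA form of a triplet.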

Besides, the described system of elements forms a “dominant/recessive” trait, which is important for the transfer of information. On the DNA and RNA level, a dominant (stable) base is a nitrogenous base that is present in both DNA and RNA, i.e., G, C, A. A recessive (changeable) base is a nitrogenous base that is contained either only in DNA (T) or only in RNA (U). Among the four nitrogenous bases, thymine (T) is set in opposition to the three other elements by nature, because in the process of transition from DNA to RNA thymine (T) gets replaced by another nitrogenous base, uracil (U), and is therefore recessive. Considering this feature, the presented molecular genetic system is similar to the Punnett square (1906) for polyhybrid crossing of organisms, which functions in accordance with Mendel’s laws of polyhybrid crossing; this confirms the presence of natural multi-channel noise-immune coding of hereditary information in each organism. These squares are a graphic method of defining the genotype based on the combination of male and female gametes that was suggested by the English biologist R. Punnett. However, Punnett squares feature not the eigenvalues of the matrices and their combinations but similar combinations of dominant and recessive gene alleles from the parents’ reproductive cells (gametes). In this situation, whenever information is transferred, the dominant traits will be coded with the more reliable (stable) nitrogenous bases.

GGG₁	GGC₁	GCG₁	GCC₁	CGG₁	CGC₁	CCG₁	CCC₁
GGA₁	GGT₂	GCA₁	GCT₂	CGA₁	CGT₂	CCA₁	CCT₂
GAG₁	GAC₁	GTG₃	GTC₃	CAG₁	CAC₁	CTG₃	CTC₃
GAA₁	GAT₂	GTA₃	GTT₄	CAA₁	CAT₂	CTA₃	CTT₄
AGG₁	AGC₁	ACG₁	ACC₁	TGG₅	TGC₅	TCG₅	TCC₅
AGA₁	AGT₂	ACA₁	ACT₂	TGA₅	TGT₆	TCA₅	TCT₆
AAG₁	AAC₁	ATG₃	ATC₃	TAG₅	TAC₅	TTG₇	TTC₇
AAA₁	AAT₂	ATA₃	ATT₄	TAA₅	TAT₆	TTA₇	TTT₈

Each triplet is marked by a certain color depending on the degree of nitrogenous base stability: red (lowest index: 1) features those triplets that have only stable bases, which are the same in DNA and RNA; orange (lowest index: 2) features those triplets in which only the upper (daughter) base will change and the two lower bases will remain stable; yellow (lowest index: 3) features those triplets in which the middle base will change, and the upper and lower bases will remain stable; green (lowest index: 4) features those triplets in which only the lower base will remain stable, and the upper and middle bases will change; light blue (lowest index: 5) features those triplets in which the lower (mother) base will change, and the upper and middle bases will remain stable; dark blue (lowest index: 6) features those triplets in which the lower and upper bases will change, and the middle base will remain stable; purple (lowest index: 7) features those triplets in which the lower and middle bases will change, and the upper base will remain stable; magenta (lowest index: 8) features those triplets in which all the bases will change.
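The eight stability indices listed above follow a simple pattern, which the sketch below makes explicit. It assumes that the "lower" base is the first (5’) letter of the triplet and the "upper" base is the last (3’) letter, as stated earlier in the text; the closed-form expression and the function name `stability_index` are an observation and a hypothetical helper, not part of the described method.

```python
COLORS = {1: "red", 2: "orange", 3: "yellow", 4: "green",
          5: "light blue", 6: "dark blue", 7: "purple", 8: "magenta"}

def stability_index(triplet):
    """Index 1..8 derived from which positions hold a changeable base (T or U).

    Assumption: triplet is written 5'->3', so triplet[0] is the lower (mother)
    base and triplet[2] is the upper (daughter) base.
    """
    lower, middle, upper = (base in "TU" for base in triplet.upper())
    return 1 + 4 * lower + 2 * middle + upper

for t in ("GGG", "GGT", "GTG", "GTT", "TGG", "TGT", "TTG", "TTT"):
    idx = stability_index(t)
    print(t, idx, COLORS[idx])
```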

The formed Punnett square visually reproduces the Sierpinski carpet and illustrates the fractal structure of the elements in the molecular genetic system, which repeats a part of itself. In other words, the present system has the same structure as its parts. The Sierpinski matrix, in turn, reproduces the Hadamard matrix that defines the traits of noise-immune coding. Therefore, the elementary characteristics at all levels of life define the uniqueness of each element of the molecular genetic system, whose formation principle is based on, for instance, the Walsh functions that are widely used in noise-immune coding of information.

The application of fractals, i.e., matrices each part of which represents the whole, allows us to form a molecular genetic system at various levels of life, using as elements of the matrix not only nitrogenous bases but also atoms, subatomic particles, amino acids and proteins, which in their turn form tissues, organs and construction material.

Information is transferred from DNA to RNA via a transport sequence of nitrogenous bases constructed based on the principle of their complementarity toward each other.

When the operation F⁽ⁿ⁾ = F ⊗ F⁽ⁿ⁻¹⁾ is repeated multiple times on the matrix of RNA nitrogenous bases, we will get a system of elements that is identical in structure but in which thymine (T) is replaced with uracil (U), which is also a pyrimidine base with two hydrogen bonds:


G	11	c	10	G	11	c	10	G	11	c	10	G	11	c	10
G	11	G	11	c	10	c	10	G	11	G	11	c	10	c	10
G	11	G	11	G	11	G	11	c	10	c	10	c	10	c	10
A	01	u	00	A	01	u	00	A	01	u	00	A	01	u	00
G	11	G	11	c	10	c	10	G	11	G	11	c	10	c	10
G	11	G	11	G	11	G	11	c	10	c	10	c	10	c	10
G	11	C	10	G	11	c	10	G	11	c	10	G	11	c	10
A	01	A	01	u	00	u	00	A	01	A	01	u	00	u	00
G	11	G	11	G	11	G	11	C	10	c	10	c	10	c	10
A	01	u	00	A	01	u	00	A	01	u	00	A	01	u	00
A	01	A	01	U	00	u	00	A	01	A	01	u	00	u	00
G	11	G	11	G	11	G	11	C	10	c	10	c	10	c	10
G	11	C	10	G	11	c	10	G	11	c	10	G	11	c	10
G	11	G	11	C	10	c	10	G	11	G	11	c	10	c	10
A	01	A	01	A	01	A	01	U	00	u	00	u	00	u	00
A	01	U	00	A	01	u	00	A	01	u	00	A	01	u	00
G	11	G	11	C	10	c	10	G	11	G	11	c	10	c	10
A	01	A	01	A	01	A	01	U	00	u	00	u	00	u	00
G	11	C	10	G	11	c	10	G	11	c	10	G	11	c	10
A	01	A	01	U	00	u	00	A	01	A	01	u	00	u	00
A	01	A	01	A	01	A	01	U	00	u	00	u	00	u	00
A	01	U	00	A	01	u	00	A	01	u	00	A	01	u	00
A	01	A	01	U	00	u	00	A	01	A	01	u	00	u	00
A	01	A	01	A	01	A	01	U	00	u	00	u	00	u	00

Therefore, the binary system of DNA triplets represented in the form of hexagrams will be identical to the system of RNA triplets.

Similarly, information from RNA can be transferred via a transport sequence of nitrogenous bases of tRNA to amino acids that build proteins. The triplet matrix F_RNA for RNA (as well as all of the subsequent matrices whose multiplet length is divisible by 3) allows us to create a structural system for amino acids, F_aa, by assigning each triplet (codon) a certain amino acid according to the existing classification table of amino acids (https://en.wikipedia.org/wiki/Genetic_code). There are 64 codons and 20 corresponding amino acids: Ala - alanine, Arg - arginine, Asn - asparagine, Asp - aspartic acid, Cys - cysteine, Gln - glutamine, Glu - glutamic acid, Gly - glycine, His - histidine, Ile - isoleucine, Leu - leucine, Lys - lysine, Met - methionine, Phe - phenylalanine, Pro - proline, Ser - serine, Thr - threonine, Trp - tryptophan, Tyr - tyrosine, Val - valine.

Gly (G)	Gly (G)	Ala (A)	Ala (A)	Arg (R)	Arg (R)	Pro (P)	Pro (P)
Gly (G)	Gly (G)	Ala (A)	Ala (A)	Arg (R)	Arg (R)	Pro (P)	Pro (P)
Glu (E)	Asp (D)	Val (V)	Val (V)	Gln (Q)	His (H)	Leu (L)	Leu (L)
Glu (E)	Asp (D)	Val (V)	Val (V)	Gln (Q)	His (H)	Leu (L)	Leu (L)
Arg (R)	Ser (S)	Thr (T)	Thr (T)	Trp (W)	Cys (C)	Ser (S)	Ser (S)
Arg (R)	Ser (S)	Thr (T)	Thr (T)	Opal	Cys (C)	Ser (S)	Ser (S)
Lys (K)	Asn (N)	Met (M)	Ile (I)	Amber	Tyr (Y)	Leu (L)	Phe (F)
Lys (K)	Asn (N)	Ile (I)	Ile (I)	Ochre	Tyr (Y)	Leu (L)	Phe (F)
It is possible to use not only the standard genetic code but also known variations of the latter containing the 21st (selenocysteine) and 22nd (pyrrolysine) amino acids. The sequence of codons in a gene defines the sequence of amino acids in the polypeptide chain of the protein that is coded by this gene.

Any matrix formed according to the described rules, whatever its multiplet values, is transformed into a sequence (linear representation) by means of known methods of element coding. In contrast to the hierarchical data structure, in which each element stores indicators about its previous generation, such indicators are not used in the linear representation. Among the methods of coding, the most widespread and effective ones use the Lebesgue curve (Z-order curve) and the Hilbert curve. The advantage of the Hilbert curve is its continuity, which means that neighboring elements are positioned in a consecutive order. The advantage of the Z-order curve is its simplicity and fast calculation, which is why it is applied more often. To code elements using a Z-order curve, the Morton code is applied, which is calculated by shifting and interleaving the bits of the binary coordinates of each element.
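As an illustration of the Z-order linearization mentioned above, the following minimal sketch interleaves the bits of a row and a column index into a Morton code and back. The function names and the choice of placing the row bit above the column bit are assumptions of this sketch; the text does not fix a particular bit order.

```python
def morton_encode(row, col, bits=3):
    """Interleave the bits of (row, col) into a single Morton (Z-order) code.

    Assumption: the column bit goes to the even position and the row bit to
    the odd position of each bit pair.
    """
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code

def morton_decode(code, bits=3):
    """Inverse of morton_encode: split the interleaved bits back into (row, col)."""
    row = col = 0
    for i in range(bits):
        col |= ((code >> (2 * i)) & 1) << i
        row |= ((code >> (2 * i + 1)) & 1) << i
    return row, col

# Linear order of the cells of an 8x8 matrix along the Z-order curve.
z_order = sorted(((r, c) for r in range(8) for c in range(8)),
                 key=lambda rc: morton_encode(*rc))
print(z_order[:8])
```

With `bits=3` the codes cover exactly the 64 cells of a triplet matrix, so every cell receives a unique position in the sequence.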

Fig. 1 features one of the possible examples of the embodiment of the inventions using the molecular genetic system, i.e., a triplet matrix for DNA/RNA and amino acids with binary indexing information assigned to each element of the matrix; as it is a triplet system, the indexing information consists of 64 unique hexagrams, the location of each of which is based on the characteristics of the nitrogenous bases of DNA/RNA. For better perception of the structure of the coding system, the 64 elements of the matrix are divided into eight octets that are numbered starting from the less stable elements of the system. The matrix is in the linear representation, in which the eight octets are positioned consecutively, one after another, from the eighth to the first octet.

In the system used for coding of information, one can observe numerical representations applied in the code combination of check bits that either coincide with the fully orthogonal system of Walsh functions used in noise-immune coding for discrete signal processing or have a constant value used to check the correctness of the system structure. Depending on the application of the methods of coding/decoding being described, various structural matrix representations can be used that feature the main property of the Hadamard matrices, i.e., FFᵀ = nI, where F is a matrix of order n with elements -1 and +1 and I is the identity matrix. Besides, the numerical representations described above and the property of the Hadamard matrices are implemented in various structural representations of the triplet system:

1) in the square matrix representation of the triplet matrix for DNA, RNA and amino acids (fig. 2);

2) in the rectangular matrix representation of the triplet matrix for DNA, RNA and amino acids (fig. 3);

3) in the circle matrix representation of the triplet matrix for DNA, RNA and amino acids (fig. 4);

4) in the volume (cubical) matrix representation of the triplet matrix for DNA/ RNA (fig. 5, 5A);

5) in the tree structure matrix representation of the triplet matrix for DNA (fig. 6, 6A). Besides, in this example the numerical values of nitrogenous bases in the decimal system from 0 to 63 are formed by means of standard transformation from the binary system.
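The Hadamard property FFᵀ = nI referred to above can be checked with a short sketch. It uses the Sylvester construction, which is one standard way to build Hadamard matrices of order 2ᵏ and whose rows are Walsh functions in natural (Hadamard) order; the use of this particular construction and the function names are assumptions made here for illustration.

```python
def sylvester_hadamard(k):
    """Hadamard matrix of order 2**k built by the Sylvester construction."""
    h = [[1]]
    for _ in range(k):
        top = [row + row for row in h]
        bottom = [row + [-x for x in row] for row in h]
        h = top + bottom
    return h

def times_transpose(h):
    """Return the product H * H^T as a list of lists."""
    n = len(h)
    return [[sum(h[i][t] * h[j][t] for t in range(n)) for j in range(n)]
            for i in range(n)]

H8 = sylvester_hadamard(3)       # order 8, matching the 8x8 triplet matrix
product = times_transpose(H8)    # expected: 8 on the diagonal, 0 elsewhere
for row in product:
    print(row)
```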

There can also be color and sound representations used for encoding images and sounds, respectively.

For instance, in the linear representation F³ the number of hydrogen bonds for each of the 64 triplets changes from 9 to 6, with values decreasing toward the first octet of the matrix (fig. 7); besides, the number of hydrogen bonds is the same for all the triplets in one octet. The summarized values of the hydrogen bonds shall be either 24 or 16 in a horizontal line for each of the octets and coincide with the orthogonal system of Walsh functions. The total summarized values of all the hydrogen bonds in horizontal lines are the same and equal 160.

In the square representation F³ the number of hydrogen bonds for each of the 64 triplets also changes from 9 to 6, with values decreasing toward the first octet of the matrix (fig. 8). Besides, the summarized values of the hydrogen bonds shall be either 24 or 16 in a horizontal line for each of the octets and coincide with the orthogonal system of Walsh functions. The total summarized values of all the hydrogen bonds in the columns are the same and equal 60. In the circular representation F³ the numbers of hydrogen bonds of two triplets positioned in opposite segments of the circle always add up to 15 (fig. 4). For instance, the nine hydrogen bonds of the triplets of the 8th octet added to the six hydrogen bonds of the triplets of the opposite 1st octet will be 15 in total.
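Two of the hydrogen-bond regularities stated above (column totals of 60 and opposite pairs adding up to 15) can be reproduced with a short sketch. It assumes that the square representation is the Kronecker matrix F³ built from the basis [[G, C], [A, T]] and that the "opposite" triplet is the one with the complementary hexagram (G↔T, C↔A); both are assumptions of this sketch, since the figures themselves are not reproduced in the text.

```python
H_BONDS = {"G": 3, "C": 3, "A": 2, "T": 2}
OPPOSITE = {"G": "T", "T": "G", "C": "A", "A": "C"}  # assumed: bitwise complement of the index
F1 = [["G", "C"], ["A", "T"]]

def kron(a, b):
    """Kronecker-style product with string concatenation as the element product."""
    return [[x + y for x in a_row for y in b_row] for a_row in a for b_row in b]

F3 = kron(F1, kron(F1, F1))

def bonds(triplet):
    """Total number of hydrogen bonds of the three bases of a triplet."""
    return sum(H_BONDS[base] for base in triplet)

column_sums = [sum(bonds(F3[r][c]) for r in range(8)) for c in range(8)]
pair_sums = {bonds(t) + bonds("".join(OPPOSITE[b] for b in t))
             for row in F3 for t in row}

print(column_sums)
print(pair_sums)
```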

Each nitrogenous base has a different number of atoms, among which are the different numbers of the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O):

Nitrogenous base	Symbol of nitrogenous base	C	H	N	O	Total number of atoms
Adenine	A	5	5	5	0	15
Cytosine	C	4	5	3	1	13
Guanine	G	5	5	5	1	16
Thymine	T	5	6	2	2	15
Uracil	U	4	4	2	2	12

The triplet system F³ based on the number of atoms has the pattern featured in fig. 9. After adding up the numbers of all the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of each of the octets, the value will be either 116 or 120 and coincide with the orthogonal system of Walsh functions. The total number of the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) in each horizontal line is the same and equals 944.

When considering the total number of the atoms of carbon (C) for DNA (fig. 10) in the linear representation of the triplet matrix, the total number of the atoms in one horizontal line is either 36 or 40 and coincides with the orthogonal system of Walsh functions. Whereas, when one considers the total number of the atoms of carbon (C) for RNA (fig. 11) in the linear representation of the triplet matrix, the total number of the atoms in each horizontal line is the same and equals 36.

When considering the total number of the atoms of oxygen (O) for DNA (fig. 12) and RNA (fig. 13) in the linear representation of the triplet matrix, the total numbers of the atoms in each horizontal line are the same both for DNA and RNA and equal 8.

Similarly, one can follow a pattern in the constructed system in its various representations based on the following characteristics:

1. The values of the electron and proton density in the atoms of the nitrogenous bases in general and in the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of the nitrogenous bases in particular;


2. The values of the neutron density in the atoms of the nitrogenous bases, including the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of the nitrogenous bases;

3. The difference between the values of the proton and neutron densities in the atoms of the nitrogenous bases, including the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of the nitrogenous bases;

4. The “charge” of the nitrogenous bases (“+1” is “aMino” and “-1” is “Keto”). Each element of the system has its own electric charge: aMino (the positive charge) and Keto (the negative charge). Adenine and cytosine have the positive aMino charge, and guanine, thymine and uracil have the negative Keto charge;

5. The molecular density in the atoms of the nitrogenous bases, including the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of the nitrogenous bases;

6. The density of the quarks in the protons of the atoms of the nitrogenous bases, including the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of the nitrogenous bases;

7. The density of the quarks in the neutrons of the atoms of the nitrogenous bases, including the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of the nitrogenous bases;

8. The density of the quarks in the protons and neutrons of the atoms of the nitrogenous bases, including the protons and neutrons of the atoms of carbon (C), hydrogen (H), nitrogen (N) and oxygen (O) of the nitrogenous bases;

9. The density of the electronic valence band in the atoms of the nitrogenous bases.

Thus, by using at least one of the above listed characteristics additionally, one can also obtain numerical representations that will coincide with the Walsh function and could be used to check code combinations. As a result of the combination of nitrogenous bases and due to their characteristics, one will obtain a structural system of unique (nonrecurrent) elements that shall characterize the whole system. The logical structure is attested with the specific features explained above.

This example features major (commonly encountered) nitrogenous bases; however, one can also use minor (rarely encountered) nitrogenous bases in practical application. Each minor base that corresponds to a certain major base has similar characteristics used for the construction of a molecular genetic system. Therefore, the structure of a molecular genetic system will not be disrupted in cases where a minor base is used. An example of the comparison between major and minor bases is featured in fig. 14. In cases where major and minor bases are used simultaneously, the information capacity of a molecular genetic system will increase.

Currently, the position of nucleotides in the codons at 5’-3’ of the DNA strand is commonly indicated with X, Y and Z, where X is the prefix, Y is the root and Z is the ending. Therefore, a triplet on this strand will look as follows:

5'-X-Y-Z-3'.

A triplet on the 3'-5' strand complementary to the above example will, in its turn, look as follows:

The triplet can be presented as the sum of three vectors:

$$\vec{K}_m = \vec{X}_m + \vec{Y}_m + \vec{Z}_m, \quad m = 1, \dots, n,$$ where

K - is the vector of the physical and biochemical “triplet” (codon) system;

X - is the first element of the physical and biochemical “triplet” (codon) system (nitrogenous base) presented in the form of a “prefix” unit vector;

Y - is the second element of the physical and biochemical “triplet” (codon) system (nitrogenous base) presented in the form of a “root” unit vector;

Z - is the third element of the physical and biochemical “triplet” (codon) system (nitrogenous base) presented in the form of an “ending” unit vector;

m - is the ordinal number of a triplet (or its elements) on the 5’-3’ strand;

n - is the total number of triplets on the 5’-3’ strand.

Each unit element of the vector “triplet” can, in its turn, be split into its components and written in the complex form as follows:

$$\vec{X} = x + i\,x,$$ where

X - is the first element of the physical and biochemical “triplet” system (nitrogenous base) presented in the form of a “prefix” unit vector;

x - is the real part of a complex number which corresponds to the axis Hb (axis of hydrogen bonds);

i - is the imaginary unit for the “prefix” element;

ix - is the imaginary part of a complex number which corresponds to the axis PP (class of the nitrogenous base).

The set of unit vectors can be presented in a unit circle on a complex plane. Similarly, the elements of a Y-type triplet (“root”) and a Z-type triplet (“ending”) can be presented.

Therefore, one is able to write down the triplet in the m position expressed as follows:

$$\vec{K}_m = \vec{X}_m + \vec{Y}_m + \vec{Z}_m = (x_m + i\,x_m) + (y_m + j\,y_m) + (z_m + k\,z_m),$$ where i, j, k - are the imaginary units with the following properties:

in the matrix form:

$$K_m = \begin{bmatrix} x_m & y_m & z_m \\ i\,x_m & j\,y_m & k\,z_m \end{bmatrix}$$

Since the number of arrangements with repetition of n elements taken k at a time is calculated by the formula $\bar{A}_n^k = n^k$, the four nitrogenous bases combined in triplets form 4³ = 64 different codons, which is equivalent to the logarithmic expression log₄ 64 = 3.

Example 1. Implementation, in the method of coding described above, of recording check bits by means of the logical operation “modulo 2 addition” (a ⊕ b). The following rule is applied for binary modulo 2 addition: the result equals “0” when both operands (a) and (b) are identical, and in all other cases the result equals “1”:

a	b	a ⊕ b
0	0	0
0	1	1
1	1	0
1	0	1

For the purposes of the ternary modulo 2 addition ⊕(X, Y, Z) the following rule is used: the result equals “0” when there are no operands that equal “1” or when their number is even; in all other cases the result equals “1”:

X	Y	Z	⊕(X, Y, Z)
0	0	0	0
0	0	1	1
0	1	0	1
1	1	0	0
1	0	0	1
1	0	1	0
0	1	1	0
1	1	1	1

This method of recording is based on splitting major nitrogenous bases (A, C, G, T, U) or their minor equivalents into three bits:

• “p” bit of the nitrogenous base class (purine or pyrimidine);

• “h” bit of the number of hydrogen bonds of a nitrogenous base (2 or 3);

• “e” bit of the nitrogenous base group (aMino or Keto).

Whereas, the two first data bits (p - purines or pyrimidines and h - the number of hydrogen bonds) represent variable values, and the third bit - the e check bit - represents the result of the logical operation.

The table below features the correlations of the bit values of nitrogenous bases for the implementation of the method of recording logical elements, where “p” is the class, i.e., purine or pyrimidine, “h” is the number of hydrogen bonds which can be 2 or 3, and “e” is the group, i.e., Keto or aMino.

Table 1.

#	Nitrogenous base	Class	Number of hydrogen bonds	Group	p (a)	h (b)	e (a ⊕ b)
1	A	purine	2	aMino	1	0	1
2	C	pyrimidine	3	aMino	0	1	1
3	G	purine	3	Keto	1	1	0
4	T	pyrimidine	2	Keto	0	0	0
5	U	pyrimidine	2	Keto	0	0	0

As one can see from the table, one can record 3 bits of information (two data bits and one check bit) using one nitrogenous base; moreover, when this method is used to record logical elements as triplets (codons), the binary and ternary modulo 2 additions proceed simultaneously.
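The relation between the three bits of a base can be checked with a minimal sketch built directly from the table of bit values above. The dictionary layout and the function name `base_bits` are hypothetical; the sketch only verifies that the group bit e equals the modulo 2 sum of the class bit p and the hydrogen-bond bit h.

```python
# (class, number of hydrogen bonds, group) for each major base, as in the table.
BASES = {
    "A": ("purine", 2, "aMino"),
    "C": ("pyrimidine", 3, "aMino"),
    "G": ("purine", 3, "Keto"),
    "T": ("pyrimidine", 2, "Keto"),
    "U": ("pyrimidine", 2, "Keto"),
}

def base_bits(base):
    """Return the bits (p, h, e) of a nitrogenous base."""
    cls, bonds, group = BASES[base]
    p = 1 if cls == "purine" else 0      # class bit
    h = 1 if bonds == 3 else 0           # hydrogen-bond bit
    e = 1 if group == "aMino" else 0     # group bit, used as the check bit
    return p, h, e

for base in BASES:
    p, h, e = base_bits(base)
    print(base, p, h, e, "check ok" if e == p ^ h else "check failed")
```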

	3'
Nitrogenous base	p (a)	h (b)	e (a ⊕ b)
Z	pz	hz	Σz
Y	py	hy	Σy
X	px	hx	Σx
XYZ	Σp	Σh	ΣΣ
	5'

Therefore, 9 bits of information (6 data bits of variable values and 3 check (result) bits) can be recorded with one triplet of nitrogenous bases in the above example.

When comparing the total number of the values of the Boolean function (fig. 15) for the resultant triplet vector e (Σx Σy Σz), which is 8 (from 000 to 111), with the total number of values for the resultant triplet vector XYZ (Σp Σh ΣΣ), which is 4 (000; 011; 101; 110), we obtain the total number of bit-to-bit values in the triplet expression of nitrogenous bases, which is 16 (fig. 16).

By virtue of the fact that 16 combinations are obtained in total for 64 triplets, there is an opportunity to pack the expression of the 16 sets “tighter” using duplets instead of triplets (see tables 2a and 2b).
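The count of 16 combinations can be reproduced by enumerating all 64 triplets. The sketch below assumes the bit values p and h of the major DNA bases given earlier (G = 11, A = 10, C = 01, T = 00), takes Σx, Σy, Σz as the per-base check bits and Σp, Σh, ΣΣ as the modulo 2 sums over the triplet; the function name is hypothetical.

```python
from itertools import product

# (p, h) bits of each base: p = class (purine = 1), h = hydrogen bonds (three = 1).
PH = {"G": (1, 1), "A": (1, 0), "C": (0, 1), "T": (0, 0)}

def check_vectors(triplet):
    """Return ((Σx, Σy, Σz), (Σp, Σh, ΣΣ)) for a triplet written 5'-XYZ-3'."""
    p = [PH[b][0] for b in triplet]
    h = [PH[b][1] for b in triplet]
    e = tuple(pi ^ hi for pi, hi in zip(p, h))   # binary modulo 2 addition per base
    sum_p = p[0] ^ p[1] ^ p[2]                   # ternary modulo 2 addition
    sum_h = h[0] ^ h[1] ^ h[2]
    return e, (sum_p, sum_h, sum_p ^ sum_h)      # last bit is the sum of the sums

combos = {check_vectors("".join(t)) for t in product("GACT", repeat=3)}
print(len({c[0] for c in combos}), len({c[1] for c in combos}), len(combos))
```

The total is 16 rather than 8 x 4 = 32 because the sum of the sums must agree for both vectors, which removes half of the combinations.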

Table 2a.

Forming 16 duplets (of two nitrogenous bases each)

	G	A	C	T
G	GG	GA	GC	GT
A	AG	AA	AC	AT
C	CG	CA	CC	CT
T	TG	TA	TC	TT

Table 2b.

Forming 16 duplets (the duplets are numbered in accordance with their position in the table 2a)

	G	A	C	T
G	15	14	13	12
A	11	10	9	8
C	7	6	5	4
T	3	2	1	0


As a result of this operation, we make one spot available for a nitrogenous base in the triplet (Table 3) and can use it as a control value of one of the four states: G (11); C (01); A (10) and T (00).

For instance,

G (11) - is the symbol given in bold

C (01) - is the symbol given in italics (oblique font)

A (10) - is the underlined symbol

T (00) - is the standard symbol

Table 3.

Correlation between the values of the resultant vectors (16 sets of codons) and the triplet and duplet expressions

#	Triplet X	Triplet Y	Triplet Z	Numerical code of triplet	Duplet
1	T	T	T	0	TT
2	T	C	C	1	TC
3	T	A	G	2	TA
4	T	G	A	3	TG
5	C	T	C	4	CT
6	C	C	T	5	CC
7	C	A	A	6	CA
8	C	G	G	7	CG
9	A	T	G	8	AT
10	A	C	A	9	AC
11	A	A	T	10	AA
12	A	G	C	11	AG
13	G	T	A	12	GT
14	G	C	G	13	GC
15	G	A	C	14	GA
16	G	G	T	15	GG
	-	-	>
5'	X	Y	Z	3'
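The correspondence in Table 3 can be sketched in code. The duplet numbering follows Table 2b with the state values G (11), A (10), C (01), T (00). The rule used in `third_base` to recover the third base of a check triplet from its duplet is an observation that holds for all 16 rows of Table 3; it is not stated in the text, and the function names are hypothetical.

```python
VALUE = {"G": 3, "A": 2, "C": 1, "T": 0}       # G (11), A (10), C (01), T (00)
BASE = {value: base for base, value in VALUE.items()}

def duplet_number(duplet):
    """Number of a duplet as in Table 2b (GG = 15, ..., TT = 0)."""
    return 4 * VALUE[duplet[0]] + VALUE[duplet[1]]

def third_base(duplet):
    """Third base of the check triplet that Table 3 pairs with this duplet.

    Assumption (inferred from Table 3): the first bit of Z is the XOR of the
    first bits of X and Y; the second bit is the XOR of all four bits of X and Y.
    """
    x, y = VALUE[duplet[0]], VALUE[duplet[1]]
    first = (x >> 1) ^ (y >> 1)
    second = first ^ (x & 1) ^ (y & 1)
    return BASE[2 * first + second]

for duplet in ("TT", "TA", "CT", "GA", "GG"):
    print(duplet, duplet_number(duplet), duplet + third_base(duplet))
```

Since the third base is fully determined by the first two, dropping it loses no check information, which is what frees one position in the triplet for a control value.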

Example 2. Coding the word “MHP” (Russian “МИР”, i.e., “world”) without checking for errors.

Fig. 17 features the correlations between the triplet codes of DNA (RNA) and the symbols of the Latin and Russian alphabets. The record shall proceed from the 5’-end to the 3’-end. In the featured example, three nitrogenous bases correspond to one logical element.

The code combination can be expressed in different variations:

1) linear:

5' - phe - phe - phe - 3'

5' - X - Y - Z - 3'

2) block (vertical):

3'	3'
Z	phe
Y	phe
X	phe
5'	5'

3) block (horizontal):

	Z			phe
5' -	Y	- 3'	5' -	phe	- 3'
	X			phe

The word “MHP” is split into the logical elements “M”, “H” and “P”. Each element is associated with its unique symbol of the alphabet compiled on the basis of the molecular genetic system of triplets (fig. 17) constructed in accordance with the principle described above. Each symbol is assigned an element of the system and supplemented by indexing information.

5'-ATCCTCAAG-3'

5' - 101 000 011 - 3'
5' - A - T - C - 3'
M

5' - 011 000 011 - 3'
5' - C - T - C - 3'
H

5' - 101 101 110 - 3'
5' - A - A - G - 3'
P

Then a code combination is defined based on the values of the function of the modulo 2 addition:

		p	h	e
	C	0	1	1	hz
M	T	0	0	0	hy
	A	1	0	1	hx
		1	1	0
		px	py	pz

		p	h	e
	C	0	1	1	hz
H	T	0	0	0	hy
	C	0	1	1	hx
		0	0	0
		px	py	pz

		p	h	e
	G	1	1	0	hz
P	A	1	0	1	hy
	A	1	0	1	hx
		1	1	0
		px	py	pz

Since the recording of symbols using nitrogenous bases is not subject to error checking in this example, each nitrogenous base actually corresponds to 2 bits of information. Therefore, the word “MHP” will take 9 nitrogenous bases, or 18 bits.

Example 3. Coding the word “MHP” with checking for errors. Just like in the previous example, this record shall also proceed from the 5’-end to the 3’-end. In this example, however, one logical element corresponds to six nitrogenous bases (X, Y, Z, X’, Y’, Z’): 3 nitrogenous bases (X, Y, Z) are intended for recording the information, and the other 3 nitrogenous bases (X’, Y’, Z’) are intended for recording a control (check) code.

In order for us to be able to verify the way in which a record is read accurately, we shall refer to the data specified in fig. 16 that features the values of the resultant vectors and their triplet interpretation made of 16 sets of codons.

Therefore, coding shall proceed based on the following algorithm:

1. The array of information to be coded is divided into logically complete chunks: the word specified in the example is divided into letters;

2. Each chunk is assigned a triplet DNA (RNA) code 5’-XYZ-3’ according to fig. 17, and the code combinations of data bits (variable values that encode a particular characteristic of the nitrogenous base) are arranged in the form of a matrix.

3. Each code combination of data bits is supplemented by check bits that are the resulting values for binary modulo 2 addition of the data bit code combinations, for the ternary modulo 2 addition of the data bit code combinations, and the sum of the sums (the check of checks).

4. The triplet 5’-X’Y’Z’-3’ is correlated to the check bits (fig. 16), and the sequence of the check bits of the binary and ternary addition and the final check (the sum of sums) is formed.

5. The final expression of the symbol is formed out of six nitrogenous bases; the expression shall consist of the primary triplet (5’-XYZ-3’) and the supplementary triplet (5’-X’Y’Z’-3’):

5' - phe - phe - phe - ph - ph - ph - 3'

5' - X - Y - Z - X' - Y' - Z' - 3'

X, Y, Z: character (symbol); X', Y', Z': result of check
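The five steps above can be sketched as a small encoder. The letter-to-triplet mapping contains only the three letters used in the example (M, H, P), taken from the worked example rather than from fig. 17 as a whole. The way the check bits are paired into the bases X’, Y’, Z’ (column sum as the first bit, per-base check bit as the second) is inferred from the worked example of the word “MHP” and should be read as an assumption; the function names are hypothetical.

```python
# (p, h) bits of each base: G (11), A (10), C (01), T (00).
PH = {"G": (1, 1), "A": (1, 0), "C": (0, 1), "T": (0, 0)}
BASE = {bits: base for base, bits in PH.items()}

# Only the three symbols used in the example; the full table is in fig. 17.
LETTER_TO_TRIPLET = {"M": "ATC", "H": "CTC", "P": "AAG"}

def check_triplet(triplet):
    """Build the supplementary triplet X'Y'Z' for a primary triplet XYZ."""
    p = [PH[b][0] for b in triplet]
    h = [PH[b][1] for b in triplet]
    e = [pi ^ hi for pi, hi in zip(p, h)]        # binary modulo 2 addition per base
    sum_p = p[0] ^ p[1] ^ p[2]                   # ternary modulo 2 addition
    sum_h = h[0] ^ h[1] ^ h[2]
    column = (sum_p, sum_h, sum_p ^ sum_h)       # last bit: the sum of the sums
    # Assumed pairing: (column check bit, per-base check bit) -> one base.
    return "".join(BASE[(column[i], e[i])] for i in range(3))

def encode(word):
    """Encode a word as primary + supplementary triplets, read 5'->3'."""
    triplets = (LETTER_TO_TRIPLET[letter] for letter in word)
    return "".join(t + check_triplet(t) for t in triplets)

print("5'-" + encode("MHP") + "-3'")
```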

In order to prove the noise immunity of the suggested method, which allows us to minimize the losses of information in case of errors, one shall use error detection methods. One of the most popular error detection methods is the even parity check that is done for the lines and columns of the matrix formed based on the code combinations of the data bits (example 4). If the number of units is even, one shall add “0”; if the number of units is odd, one shall add “1”. The final step of the even parity check is the combined check of all the lines and columns.

If the result of all the checks is “0”, the combination is correct. If there is an error, the result of the check will be “1”. The incorrect bit will be at the intersection of the line and the column in which “1” was discovered as a result of the check.
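The even parity check for lines and columns can be sketched as follows. The data rows are the p, h, e bits of the triplet ATC (letter “M”) listed from the upper base to the lower base, as in the matrices of the examples; the function names are hypothetical, and the sketch only handles the single-error case described above.

```python
def add_parity(rows):
    """Append an even-parity bit to every line, then a parity line for the columns."""
    with_row_parity = [row + [sum(row) % 2] for row in rows]
    column_parity = [sum(col) % 2 for col in zip(*with_row_parity)]
    return with_row_parity + [column_parity]

def locate_error(block):
    """Return None if all checks give 0, else the (line, column) of a single bad bit."""
    bad_rows = [i for i, row in enumerate(block) if sum(row) % 2]
    bad_cols = [j for j, col in enumerate(zip(*block)) if sum(col) % 2]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("more than one error: cannot be located by this check")

data = [[0, 1, 1],   # C (upper base)
        [0, 0, 0],   # T
        [1, 0, 1]]   # A (lower base)
block = add_parity(data)
print(locate_error(block))      # all checks pass

block[1][2] ^= 1                # corrupt one bit
print(locate_error(block))      # intersection of the failing line and column
```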

The word “MHP” was coded with the possibility of an error check in the following sequence:

5'-ATCGACCTCCTCAAGGGT-3'

5' - 101 000 011 - 11 10 01 - 3'
5' - A - T - C - G - A - C - 3'
M, check data

5' - 011 000 011 - 01 00 01 - 3'
5' - C - T - C - C - T - C - 3'
H, check data

5' - 101 101 110 - 11 11 00 - 3'
5' - A - A - G - G - G - T - 3'
P, check data

The matrix of data bits and check bits, as well as the matrix for an even parity check look as follows:

		p	h	e
	C	0	1	1	hz	1	0	1	h	0	1	1	0
M	T	0	0	0	hy	1	1	0	p	0	0	0	0
	A	1	0	1	hx	G	A	C		1	0	1	0
		1	1	0						1	1	0
		px	py	pz						0	0	0

		p	h	e
	C	0	1	1	hz	1	0	1	h	0	1	1	0
H	T	0	0	0	hy	0	0	0	p	0	0	0	0
	C	0	1	1	hx	C	T	C		0	1	1	0
		0	0	0						0	0	0
		px	py	pz						0	0	0

		p	h	e
	G	1	1	0	hz	1	1	0	h	1	1	0	0
P	A	1	0	1	hy	1	1	0	p	1	0	1	0
	A	1	0	1	hx	G	G	T		1	0	1	0
		1	1	0						1	1	0
		px	py	pz						0	0	0

Example 4 features a method of recording symbols using nitrogenous bases (with 4 states of the symbol and a check for errors based on a matrix code). One logical element corresponds to six nitrogenous bases (X, Y, Z, X’, Y’, Z’) containing 15 bits of information: 3 nitrogenous bases (X, Y, Z) containing 9 bits and intended for recording the information, 2 nitrogenous bases (X’, Y’) containing 4 bits and intended for recording the check code, and one nitrogenous base (Z’) containing 2 bits and intended for the control code.

The density of the information being recorded here is 15 bits/ 6 nitrogenous bases and is therefore 2.5 bits per a nitrogenous base.
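The arithmetic behind this density figure can be reproduced with a few lines. The dictionary below is only a hypothetical restatement of the bit counts given in the text (3 bits for each of X, Y, Z and 2 bits for each of X’, Y’, Z’).

```python
# Bits carried by each base of the six-base logical element (values from the text).
bits_per_base = {"X": 3, "Y": 3, "Z": 3, "X'": 2, "Y'": 2, "Z'": 2}

total_bits = sum(bits_per_base.values())   # 9 data + 4 check + 2 control bits
density = total_bits / len(bits_per_base)  # bits per nitrogenous base

print(total_bits, density)  # 15 2.5
```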

The code information and the check algorithm shall be formed according to the same method that was shown in example 3. This example is special in that the check values are formed using not a triplet code but a duplet code (fig. 16). As a result, one nitrogenous base (Z’) becomes available for the control code, which is used for supplementary information.

For instance, the control code that falls on the nitrogenous base (Z’) may be presented in the form of the following coding, as in the example 1:

G (11) - is the symbol given in bold

C (01) - is the symbol given in italics (oblique font)

A (10) - is the underlined symbol

T (00) - is the standard symbol in the text
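The assignment above can be written as a small lookup table. This is a sketch for illustration only; the names CONTROL_CODE, control_style and control_base are hypothetical, and the style labels are plain strings chosen for this example.

```python
# Control code carried by the base Z': base -> (two-bit code, font style).
CONTROL_CODE = {
    "G": ("11", "bold"),
    "C": ("01", "italics"),
    "A": ("10", "underlined"),
    "T": ("00", "standard"),
}


def control_style(z_prime: str) -> str:
    """Return the font style selected by the control base Z'."""
    return CONTROL_CODE[z_prime][1]


def control_base(style: str) -> str:
    """Return the base Z' that has to be written for the requested font style."""
    for base, (_, name) in CONTROL_CODE.items():
        if name == style:
            return base
    raise ValueError(f"unknown style: {style}")


print(control_style("C"))  # italics
```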

	p h e	p h e	p h e	p h	p h	p h
5'-	X	Y	Z	X'	Y'	Z'	-3'
	character			result of check		contr. code

Thus, the word “MHP” can be encoded with the possibility of error checking and the possibility of being written in a special font (we used italics in the example).

Word: “MHP”

5'-ATCGACCTCCTCAAGGGC-3'

	1 0 1	0 0 0	0 1 1	1 1	1 0	0 1
5'-	A	T	C	G	A	C	-3'
	M			check

	0 1 1	0 0 0	0 1 1	0 1	0 0	0 1
5'-	C	T	C	C	T	C	-3'
	H			check

	1 0 1	1 0 1	1 1 0	1 1	1 1	0 1
5'-	A	A	G	G	G	C	-3'
	P			check

		p	h	e
	C	0	1	1	hz	1	0	1	h	0	1	1	0
M	T	0	0	0	hy	1	1	0	p	0	0	0	0
	A	1	0	1	hx	G	A	C		1	0	1	0
		1	1	0						1	1	0
		px	py	pz						0	0	0

		p	h	e
	C	0	1	1	hz	1	0	1	h	0	1	1	0
H	T	0	0	0	hy	0	0	0	p	0	0	0	0
	C	0	1	1	hx	C	T	C		0	1	1	0
		0	0	0						0	0	0
		px	py	pz						0	0	0

		p	h	e
	G	1	1	0	hz	1	1	0	h	1	1	0	0
P	A	1	0	1	hy	1	1	0	p	1	0	1	0
	A	1	0	1	hx	G	G	T		1	0	1	0
		1	1	0						1	1	0
		px	py	pz						0	0	0
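As an illustration of how the sequence 5'-ATCGACCTCCTCAAGGGC-3' breaks down into logical elements, the sketch below splits it into six-base elements and prints the bits of each base. The three-bit and two-bit tables are read off the diagrams of this example; the function names and the dictionary layout are hypothetical, and the sketch does not derive the check bases, it only transcribes them.

```python
from typing import Dict, List

# Three-bit codes of the data bases X, Y, Z, as they appear in the diagrams above.
DATA_BITS = {"A": "101", "T": "000", "C": "011", "G": "110"}
# Two-bit codes of the bases X', Y', Z'.
TWO_BITS = {"G": "11", "C": "01", "A": "10", "T": "00"}


def split_elements(sequence: str, size: int = 6) -> List[str]:
    """Cut the base sequence into logical elements of `size` bases."""
    if len(sequence) % size:
        raise ValueError("sequence length is not a multiple of the element size")
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def element_to_bits(element: str) -> Dict[str, str]:
    """Translate one element X Y Z X' Y' Z' into its data, check and control bits."""
    return {
        "data": "".join(DATA_BITS[b] for b in element[:3]),
        "check": "".join(TWO_BITS[b] for b in element[3:5]),
        "control": TWO_BITS[element[5]],
    }


for element in split_elements("ATCGACCTCCTCAAGGGC"):
    print(element, element_to_bits(element))
```

Each element yields 9 + 4 + 2 = 15 bits, and the control bits “01” of the last base C correspond to italics.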

The information being coded in accordance with the described method shall be recorded, among other things, by splitting it into short fragments of DNA in which information about the beginning and the end of the chunk of data is recorded. In this case, one nucleotide, on one level of the hydrogen bonds and base, shall code at least two bits of data.

When information is being recorded, blocks shall be synthesized from separate nucleotides by means of an inkjet DNA printer. In order for specific information to be recorded, the required fragments of DNA shall be synthesized; such fragments shall first be amplified and identified by means of a sequencing device, e.g. Illumina HiSeq. Because each block has its own address, information can be stored in the form of a mix of short nucleotide sequences and not as a single strand of DNA. Such a method allows us to store an almost unlimited amount of data. In order to read synthesized fragments of DNA, one can use known sequencing technologies and special software intended for the translation of the genetic code back into binary data based on the suggested molecular genetic system.
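The idea of storing addressed blocks as an unordered mix can be illustrated with a short sketch. It is not the storage format of the invention: the address is kept here as a plain integer next to the payload, whereas the text does not specify how the address or the beginning and end markers are written in nitrogenous bases, and the function names are hypothetical.

```python
import random
from typing import List, Tuple

# (address, payload); the integer address is an assumption made for this sketch.
Fragment = Tuple[int, str]


def to_fragments(sequence: str, payload_len: int) -> List[Fragment]:
    """Split a base sequence into short fragments, each carrying its own address."""
    return [
        (address, sequence[start:start + payload_len])
        for address, start in enumerate(range(0, len(sequence), payload_len))
    ]


def reassemble(fragments: List[Fragment]) -> str:
    """Restore the original sequence from fragments read in arbitrary order."""
    return "".join(payload for _, payload in sorted(fragments))


sequence = "ATCGACCTCCTCAAGGGC"
pool = to_fragments(sequence, 6)
random.shuffle(pool)                  # the fragments are stored as a mix
assert reassemble(pool) == sequence   # the addresses restore the order
```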

The molecular genetic system described above shall also be used for the decoding of information. For the purposes of decoding, machine-readable information shall be divided into equal parts consisting of an i+k+m number of zeros and ones that would be the indices of the logically complete chunks of the information to be decoded; these parts shall consist of i data bits, k check bits and m control bits. Each part shall then be assigned a multiplet consisting of the n number of nitrogenous bases and being an element of the molecular genetic system, thus defining the sequence of logically complete chunks of the information being decoded.
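The division of the machine-readable sequence into equal parts can be sketched as follows. The function name split_parts is hypothetical, and the values i = 9, k = 4, m = 2 are taken from example 4 purely for illustration; assigning a multiplet to each part is outside the scope of this sketch.

```python
from typing import List, Tuple


def split_parts(bits: str, i: int, k: int, m: int) -> List[Tuple[str, str, str]]:
    """Divide a string of zeros and ones into parts of i data, k check and m control bits."""
    size = i + k + m
    if len(bits) % size:
        raise ValueError("sequence length is not a multiple of i + k + m")
    parts = []
    for start in range(0, len(bits), size):
        part = bits[start:start + size]
        parts.append((part[:i], part[i:i + k], part[i + k:]))
    return parts


# Two parts of 15 bits each (i = 9, k = 4, m = 2).
stream = "101000011111001" + "011000011010001"
for data, check, control in split_parts(stream, i=9, k=4, m=2):
    print(data, check, control)
```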

Therefore, when implementing the suggested method of coding and decoding information using the molecular genetic system in the form of square matrices, structured according to the characteristics of the nitrogenous bases of nucleotides, one can reach the claimed technical result, which includes handling a vast volume of information without it being distorted or changed and ensuring that a digital signal is coded and decoded in such a way that it remains immune to noise when being transmitted via software tools.

Claims

What is claimed:

METHODS OF CODING AND DECODING INFORMATION

1. Method of coding information according to which

- the array of the information to be coded shall be divided into logically complete chunks each of which is assigned at least one element of the molecular genetic system used for the coding;

- the set of the elements in the system is supplemented by relevant indexing information consisting of i data bits;

- each code combination of data bits shall be supplemented by a combination of k check bits defined based on the combination of the i data bits;

the claimed invention is different from the solutions known in the prior art in that

- each i+k bit data in binary form shall be recorded as a multiplet consisting of the n number of nitrogenous bases or corresponding amino acids; the multiplet shall be an element of the molecular genetic system,

- consisting of a base representing a four-nitrogenous-base matrix

	G	C
	A	T/U ;

the nitrogenous bases are grouped according to the number of hydrogen bonds and the number of condensed rings contained in the molecular structures of nitrogenous bases;

- the matrix is formed in advance by raising tensors of each previous-generation matrix to the second power multiple times in order to obtain a next-generation matrix;

- the system binary indices that form the i data bits correspond to the selected characteristic of the nitrogenous bases;

- the value n is recorded together with the information to be coded since the size of the matrix to be used for the coding will depend on the n value; the information about the selected method of transforming the matrix into a sequence and the reading order are recorded as well.
2. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the information to be coded shall be arranged as a machine-readable sequence in binary form, including binary indices for each multiplet.

3. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the information to be coded shall be arranged as a sequence of nitrogenous bases.
4. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that each code combination of i data bits is supplemented by a combination of m control bits based on the combination of the i and k bits.
5. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that based on the position of each multiplet in the matrix of the molecular genetic system, one can assess the recessive or dominant trait of the corresponding logically complete chunk of information.
6. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the sequence of nitrogenous bases shall be divided into logically complete chunks, in which the information about the beginning and the end of the chunk of information is coded.
7. The method of coding specified in the item 3 that is different from the solutions known in the prior art in that the information to be coded shall be arranged as a sequence of nitrogenous bases to be coded on the DNA level.
8. The method of coding specified in the item 3 that is different from the solutions known in the prior art in that the information to be coded shall be arranged as a sequence of nitrogenous bases to be coded on the RNA level.
9. The method of coding specified in the item 3 that is different from the solutions known in the prior art in that the information to be coded shall be arranged as a sequence of amino acids.
10. The method of coding specified in the item 3 that is different from the solutions known in the prior art in that the information to be coded shall be arranged as a sequence of nitrogenous bases to be coded on protein level.
11. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the molecular genetic system has a linear matrix representation.

12. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the molecular genetic system has a square matrix representation.
13. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the molecular genetic system has a rectangular matrix representation.
14. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the molecular genetic system has a circular matrix representation.
15. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the molecular genetic system has a volume matrix representation.
16. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the molecular genetic system has a tree structure matrix representation.
17. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases for DNA.
18. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases for RNA.
19. The method of coding specified in the item 1 and item 18 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases for amino acids.
20. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases forming the recessive and dominant traits on the genetic level.
21. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the synthetic DNA formed from the obtained sequence of nitrogenous bases and containing the information being coded is placed into a storage medium.

22. The method of coding specified in the item 1 that is different from the solutions known in the prior art in that the synthetic DNA formed from the obtained sequence of nitrogenous bases and containing the information being coded is placed into a medium for logical and mathematical calculations.
23. Method of decoding information according to which

- the machine-readable sequence is divided into logically complete chunks of information to be decoded; the chunks shall include combinations of i data bits and k check bits; each logically complete chunk of information shall be assigned at least one multiplet consisting of the n number of nitrogenous bases or corresponding amino acids and being an element of the molecular genetic system used for the decoding, the claimed invention is different from the solutions known in the prior art in that

- the molecular genetic system consists of a base representing a four-nitrogenous-base matrix

	G	C
	A	T/U ;

the nitrogenous bases are grouped according to the number of hydrogen bonds and the number of condensed rings contained in the molecular structures of nitrogenous bases;

- the matrix is formed in advance by raising tensors of each previous-generation matrix to the second power multiple times in order to obtain a next-generation matrix;

- the system binary indices that form the i data bits correspond to the selected characteristic of the nitrogenous bases, and the indices that form the k check bits shall be defined based on the combination of the i data bits.
24. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the discrete parts of the machine-readable sequence consisting of logically complete chunks of information to be decoded include combinations of m control bits and that indices that form the m control bits are defined based on the combinations of i data bits.
25. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that based on the position of each multiplet in the matrix of the molecular genetic system, one can assess the recessive or dominant trait of the corresponding logically complete chunk of information.

26. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the molecular genetic system has a linear matrix representation.
27. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the molecular genetic system has a square matrix representation.
28. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the molecular genetic system has a rectangular matrix representation.
29. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the molecular genetic system has a circular matrix representation.
30. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the molecular genetic system has a volume matrix representation.
31. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the molecular genetic system has a tree structure matrix representation.
32. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases for DNA.
33. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases for RNA.
34. The method of decoding specified in the item 23 and item 33 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases for amino acids.
35. The method of decoding specified in the item 23 that is different from the solutions known in the prior art in that the binary indices of the molecular genetic system correspond to the selected characteristic of the nitrogenous bases forming the recessive and dominant traits on the genetic level.