WO2020179962A1

WO2020179962A1 - DNA coding method and biomedical engineering application of same coding method

Info

Publication number: WO2020179962A1
Application number: PCT/KR2019/003570
Authority: WO
Inventors: 손인식; 김현주
Original assignee: 손인식
Priority date: 2019-03-05
Filing date: 2019-03-27
Publication date: 2020-09-10
Also published as: CN113614834B; JP7275301B2; JP2022525042A; CN113614834A; US20220139500A1; EP3937177A4; KR20200106761A; EP3937177A1; KR102252977B1

Abstract

The present invention relates to a method for code standardization of DNA, the method comprising: (a) assigning codes 00, 01, 10, and 11 to the four bases C, T, A, and G, respectively; and (b) coding base pairs between G and C and between A and T by providing 1100 for G and C, 0011 for C and G, 1001 for A and T, and 0110 for T and A in the 5' to 3' direction. The method for code standardization of DNA according to the present invention provides a method that makes it easy to detect specific patterns present in base sequences such as DNA fragments or aptamers, for example, a method that makes it easy to detect specific patterns and secondary structures in base sequences, mutations of base sequences, etc., and to predict a disease by using disease-specific sequence variations such as SNP, etc.

Description

DNA coding method and biomedical application of the coding method

The present invention relates to a DNA code standardization method and an optimized biomedical engineering application of the method.

DNA (deoxyribonucleic acid), the genetic material of living organisms, consists of genic regions that are expressed as proteins and non-genic regions. Chemically, a phosphate group is linked to the 5' carbon of the pentose sugar deoxyribose and a base is linked to the 1' carbon, forming a unit called a nucleotide.

There are two types of bases: purines, which have two rings, and pyrimidines, which have one ring. The purines are guanine (G) and adenine (A), and the pyrimidines are cytosine (C) and thymine (T). RNA differs in that an -OH group is attached to the 2' carbon of the pentose and uracil (U) replaces thymine. The purine G pairs with the pyrimidine C through hydrogen bonds, and A pairs with T. Because G and C are joined by three hydrogen bonds, their pairing is stronger than that of A and T, which are joined by two hydrogen bonds.

In DNA, the phosphate group on the 5' carbon of one nucleotide is linked to the -OH group on the 3' carbon of another nucleotide by a phosphodiester bond, forming a single strand. Two complementary single strands, each held together by phosphodiester bonds, form a double helix through hydrogen bonding between complementary bases. This double helix was proposed in 1953 by Watson and Crick. [Watson, J. D., & Crick, F. H. (1953). Molecular structure of nucleic acids. Nature, 171(4356), 737-738.]

The nucleotide sequence of a genic region plays an important role in protein synthesis, because each three-nucleotide code is translated into one amino acid of the protein. DNA is transcribed into mRNA, which is then translated into a chain of the 20 amino acids according to the nucleotide sequence. The amino acids carried by tRNA are linked to form proteins, which exist as constituents of cells and also act as enzymes that mediate various reactions in vivo.

Human DNA has about 3 billion base pairs (bp), which corresponds to a data capacity on the order of gigabytes (GB) per person. Scaled to a whole population, even petabyte (PB) units are insufficient. In practice, therefore, rather than analyzing the entire human DNA sequence, disease-specific sites such as SNPs (single nucleotide polymorphisms) are analyzed so that diseases can be predicted from the sequences of short DNA fragments, and various programs need to be developed for this analysis.

[Prior patent literature]

Republic of Korea Patent Publication 10-2016-0001455

The present invention was conceived to solve the above problems and to meet the above needs. An object of the present invention is to standardize each DNA base into a binary code (2 bits per base) that takes the molecular weight of the base into account, and thereby to provide a method optimized for identifying specific patterns.

Another object of the present invention is to provide an easy method for identifying patterns and whether complementary binding is possible by using the code sum of nucleotide sequences, and to provide an easy method for predicting the pattern and function of DNA fragments or DNA aptamers.

Another object of the present invention is to provide an easy method for determining the molecular weight ratio between sequences and the ratio of each base only by the code of the base sequence.

Another object of the present invention is to provide an easy method for identifying variations in nucleotide sequences and to provide an easy method for predicting diseases by using disease-specific sequence variations such as SNPs.

In order to achieve the above objects, the present invention provides a method for standardizing the DNA code, comprising the following steps: (a) designating C, T, A, and G as 00, 01, 10, and 11, respectively; and (b) when the bases form base pairs of G and C and of A and T, designating, in the 5' to 3' direction, 1100 for G and C, 0011 for C and G, 1001 for A and T, and 0110 for T and A.
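The designation in steps (a) and (b) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation; the names `BASE_CODE`, `encode`, and `pair_code` are hypothetical and do not appear in the specification.

```python
# Step (a): 2-bit code for each base, as designated in the text.
BASE_CODE = {"C": "00", "T": "01", "A": "10", "G": "11"}


def encode(seq):
    """Return the 2-bit codes of a sequence read in the 5' to 3' direction."""
    return [BASE_CODE[base] for base in seq.upper()]


def pair_code(base, partner):
    """Step (b): 4-bit code formed by a base followed by its paired base."""
    return BASE_CODE[base] + BASE_CODE[partner]


# Print the codes of the four bases and of the four base-pair orientations.
print(" ".join(encode("CTAG")))
for base, partner in [("G", "C"), ("C", "G"), ("A", "T"), ("T", "A")]:
    print(base, partner, pair_code(base, partner))
```

Storing the codes as strings keeps leading zeros visible; an integer representation would be equally valid for arithmetic on the codes.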

In addition, the present invention provides a method of providing information optimized to identify a specific pattern or secondary structure of a specific DNA fragment or aptamer using standardization of DNA codes, comprising the following steps: (a) naming C, T, A, and G of a specific DNA fragment sequence as 00, 01, 10, and 11, respectively; and (b) comparing the arrangement of the codes named by these numerical values with the arrangement of the code sums.

In one embodiment of the present invention, the step of comparing the arrangement of the codes with the arrangement of the code sums preferably comprises converting the binary codes 00, 01, 10, and 11 of step (a) to decimal; determining that a stem structure can be formed when two or more pairs of facing codes whose sum is 3 are arranged at both ends; and determining that a loop structure is formed when three or more bases that cannot bind complementarily, because the code sum of the facing bases is greater than or less than 3, are connected at the center. However, the method is not limited thereto.

In addition, the present invention provides a method of providing information on the presence or absence of a nucleotide sequence variation of a specific DNA fragment using the DNA code standardization, comprising the following steps: (a) naming C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, and 11, respectively; and (b) comparing the sums of the codes named by these numerical values.

In one embodiment of the present invention, the step of comparing the code sums preferably comprises converting the binary codes 00, 01, 10, and 11 of step (a) to decimal, calculating the sum, and determining that a mutation exists when the sum differs by 1 to 3 from that of the normal sequence. However, the method is not limited thereto.

In another embodiment of the present invention, the method preferably allows the position of the mutation to be identified by comparing the values of the codes obtained by naming C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, and 11, respectively. However, the method is not limited thereto.

In addition, the present invention provides a computer program, stored in a computer-readable medium, for providing information optimized for identifying a specific pattern or secondary structure of a specific DNA fragment or aptamer, the program causing a computer to perform the following steps: (a) naming C, T, A, and G of the base sequence of a specific DNA fragment as 00, 01, 10, and 11, respectively; and (b) after converting the binary codes 00, 01, 10, and 11 of step (a) to decimal, determining that a stem structure can be formed when two or more pairs of facing codes whose sum is 3 are arranged at both ends, and determining that a loop structure is formed when three or more bases that cannot bind complementarily, because the code sum of the facing bases is greater than or less than 3, are connected at the center.

In addition, the present invention provides a computer program, stored in a computer-readable medium, for providing information on the presence or absence of a nucleotide sequence mutation of a specific DNA fragment, the program causing a computer to perform the following steps: (a) naming C, T, A, and G of the nucleotide sequence of the specific DNA fragment as 00, 01, 10, and 11, respectively; and (b) converting the binary codes of step (a) to decimal, calculating the sum, comparing it with that of the normal sequence, and determining that a mutation exists when there is a difference of 1 to 3.

In addition, the present invention provides a computer program, stored in a computer-readable medium, for providing information on the position of a nucleotide sequence mutation of a specific DNA fragment, the program causing a computer to perform the following steps: (a) naming C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, and 11, respectively; and (b) identifying the position of the mutant sequence by comparing the values of the codes obtained in step (a).

The present invention will be described below.

In the present invention, the four bases C, T, A, and G, in ascending order of molecular weight, are named by the codes 00, 01, 10, and 11, respectively, and the codes are assigned so that, when the bases form base pairs of G and C and of A and T, the ratio of the code sums coincides with the ratio of the summed molecular weights.

In addition, the present invention standardizes the aptamers specific to each compound, obtained using SELEX, as codes and utilizes them as big data, thereby constructing a system capable of predicting the specific patterns that bind to the reactive groups present in each compound.

In addition, the present invention provides a method of standardizing a DNA sequence into a code, converting the value of each base to a decimal number, and deriving the sum, in order to check for the presence or absence of mutations in each sequence and to quickly determine the presence of SNPs in a specific disease.

The present invention provides an easy method for identifying a specific pattern existing in a nucleotide sequence by standardizing DNA into a code.

In the present invention, DNA sequence patterns that bind to a specific target and chemical structure are identified and used as big data to predict an aptamer that binds to the corresponding chemical structural unit, and the information needed to build a SELEX (systematic evolution of ligands by exponential enrichment) simulation program is provided.

In addition, the present invention provides a method optimized for determining the molecular weight ratio between sequences and the ratio of each base by standardizing DNA into a code reflecting the base molecular weight.

In addition, by standardizing DNA into a code reflecting the base molecular weight, the present invention provides an easy method for identifying variations within a nucleotide sequence and a method optimized for comparing the sums and arrangements of codes, thereby enabling identification of disease-specific mutations such as SNPs and providing an easy way to predict disease.

As described above, the DNA code standardization method of the present invention provides an easy method for identifying variations in nucleotide sequences, facilitates prediction of diseases by using disease-specific sequence variations such as SNPs, and provides an easy method for identifying a specific pattern present in a sequence.

FIG. 1 is a diagram showing that the code values, designated to reflect the molecular structure and binding mass ratio of DNA, are the binary numbers 00, 01, 10, and 11 for C, T, A, and G in ascending order of base molecular weight;

FIG. 2 is a diagram showing that, when the designated binary codes are paired for the bases G and C and for A and T, the ratio of the code sums is 1:1, designed to match the actual mass ratio;

FIG. 3 is a diagram showing the code conversion values of six sequences and comparing the code sum of each sequence with its molecular weight;

FIG. 4 is a diagram in which the pattern of an exemplary sequence is checked using the code of the DNA sequence: whether complementary binding is possible is confirmed from the code sums, and the pattern forming a stem-loop structure is confirmed from the number of bonds and the number of linked bases; and

FIG. 5 is a diagram showing the code standardization efficiency of the present invention by applying the code to an SNP sequence identified in breast cancer patients: the SNP sequence in which the A base at the 14th position from exon 2 is mutated to G is converted into codes and arranged as binary numbers, and the code sums of the normal and mutant sequences are calculated and compared.

Hereinafter, the present invention will be described in detail through non-limiting examples. However, the following examples are described with the intention of illustrating the present invention, and the scope of the present invention is not to be construed as being limited by the following examples.

Example 1: Code standardization according to the molecular weight of each base

Each of the four bases that determine the DNA sequence was expressed as a two-digit binary code, the language of computers, and the molecular weight of each base was analyzed and is shown in the figure. The deoxyribonucleotides in which each base G, A, T, or C is linked to one phosphate group are denoted dGMP, dAMP, dTMP, and dCMP, respectively.

The molecular weights decrease in the order G, A, T, C. Comparing the summed molecular weight of G and C, which pair through hydrogen bonds, with that of the complementary A and T gives 654.4 (= 347.2 + 307.2) and 653.4 (= 331.2 + 322.2), respectively, confirming that the two pairs have nearly equal molecular masses, a ratio of approximately 1:1. The sum for A and T is 1 less than the sum for G and C because, relative to the other pair, the G≡C pair contains one more nitrogen (N) whereas the A=T pair contains one more carbon (C) and one more hydrogen (H); the difference between the mass of N and that of C+H (14 > 12+1) accounts for the difference of 1 between the pair sums. Accordingly, A and T, which lack the additional O or N capable of hydrogen bonding, form two hydrogen bonds and bind more weakly than the G≡C pair, which forms three hydrogen bonds.
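The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch using the four molecular weights quoted in the text; the dictionary name `MW` and the variable names are hypothetical.

```python
# Molecular weights of the nucleotides as quoted in Example 1.
MW = {"dGMP": 347.2, "dAMP": 331.2, "dTMP": 322.2, "dCMP": 307.2}

# Summed molecular weight of each complementary pair.
gc_sum = round(MW["dGMP"] + MW["dCMP"], 1)
at_sum = round(MW["dAMP"] + MW["dTMP"], 1)

# Difference and ratio between the two pair sums.
difference = round(gc_sum - at_sum, 1)
ratio = gc_sum / at_sum

print(gc_sum, at_sum, difference, round(ratio, 4))
```

The rounding only guards against floating-point noise; the inputs are given to one decimal place in the text.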

Therefore, the code of each base was designated to reflect the molecular structure and binding mass ratio of DNA. C, T, A, and G were assigned the binary values 00, 01, 10, and 11, respectively, in ascending order of base molecular weight. (Figure 1)

The value of the designated code is designed so that when the bases of G and C, and A and T are paired, the sum of the codes is 1:1, which is the same as the actual mass ratio. (Figure 2)

The code sum represents the sum of each code value after converting the code of each base to a decimal number. The code sum of each of G and C, A and T is equal to '3'.
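The code sum and the resulting complementarity check can be sketched as follows. This is an illustrative sketch only; the names `CODE`, `code_sum`, and `is_complementary` are hypothetical.

```python
from itertools import combinations_with_replacement

# Decimal value of each 2-bit code: C=00, T=01, A=10, G=11.
CODE = {"C": 0b00, "T": 0b01, "A": 0b10, "G": 0b11}


def code_sum(seq):
    """Sum of the decimal code values of all bases in the sequence."""
    return sum(CODE[base] for base in seq.upper())


def is_complementary(a, b):
    """True when the two bases have a code sum of exactly 3."""
    return CODE[a] + CODE[b] == 3


# List every unordered pair of bases together with its code sum.
for a, b in combinations_with_replacement("CTAG", 2):
    print(a, b, CODE[a] + CODE[b], is_complementary(a, b))
```

Under this coding only the G and C pair and the A and T pair reach a sum of 3; every other combination gives a sum above or below 3.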

Example 2: Optimization of molecular weight ratio reflection for DNA fragments and aptamers

Because the codes were assigned in ascending order of the molecular weight of each DNA base, the total code sum of a DNA fragment reflects the ratio of the molecular weights of the sequences. (Fig. 3) To check how well the code reflects molecular weight, the code sum and molecular weight were compared for six exemplary sequences.

The exemplary sequence is a sequence exemplified with the intention of confirming the ratio of molecular weight reflection of the code, and the range is not interpreted as being limited to the sequences of SEQ ID NOs: 1 to 6.

The sequences of SEQ ID NOs: 1 to 6 are as follows.

5'AGAGCTCGCGCCGGAGTTCTCAATGCAAGAGC 3'(SEQ ID NO: 1)

5'GCGGCGGTGGCCTGAAGTCTGGCGGTGGCCCC 3'(SEQ ID NO: 2)

5'GCGGCGGTGGCCAGAAGTCTCGCGGTGGCGGC 3'(SEQ ID NO: 3)

5'GTGGAGGCGGTGGCCAGTCTCGCGGTGGCGGC 3'(SEQ ID NO: 4)

5'GTGGCGGTGGCCAGCATAGTGGCGGTGGCCAG 3'(SEQ ID NO: 5)

5'GTGGAGGCGGTGGCCGTGGAGGCGGAGGCCGC 3'(SEQ ID NO: 6)

The six exemplary sequences are 32-mer nucleotide sequences of equal length but with differing base compositions and orders, and the code conversion values of each base are shown in FIG. 3. The code sum was calculated by converting the code of each base into a decimal number and then adding the values. The molecular weight of each sequence was also calculated from its base composition.

When compared with the molecular weight (Mw) of each sequence, the smaller the molecular weight, the smaller the value of the code sum. In the case of the sequence with a higher molecular weight, the code sum was calculated as a larger value. (Fig. 3)

In this way, the code was designated by reflecting the ratio of the molecular weight and was optimized to compare the ratio of the molecular weight of each sequence by using the resultant code sum.
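The comparison described in this example can be sketched in Python for SEQ ID NOs: 1 to 6. The molecular weight here is only an approximation, assumed for illustration to be the plain sum of the nucleotide weights quoted in Example 1, with no correction for the water lost at each phosphodiester bond; the values in FIG. 3 may have been computed differently. The names `SEQUENCES`, `code_sum`, and `approx_mw` are hypothetical.

```python
# Decimal code values and the nucleotide weights quoted in Example 1.
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}
NUCLEOTIDE_MW = {"C": 307.2, "T": 322.2, "A": 331.2, "G": 347.2}

SEQUENCES = {
    1: "AGAGCTCGCGCCGGAGTTCTCAATGCAAGAGC",
    2: "GCGGCGGTGGCCTGAAGTCTGGCGGTGGCCCC",
    3: "GCGGCGGTGGCCAGAAGTCTCGCGGTGGCGGC",
    4: "GTGGAGGCGGTGGCCAGTCTCGCGGTGGCGGC",
    5: "GTGGCGGTGGCCAGCATAGTGGCGGTGGCCAG",
    6: "GTGGAGGCGGTGGCCGTGGAGGCGGAGGCCGC",
}


def code_sum(seq):
    """Sum of the decimal code values of all bases."""
    return sum(CODE[base] for base in seq)


def approx_mw(seq):
    """Assumed approximation: plain sum of nucleotide weights per base."""
    return round(sum(NUCLEOTIDE_MW[base] for base in seq), 1)


# Print length, code sum, and approximate molecular weight per sequence.
for seq_id, seq in SEQUENCES.items():
    print(seq_id, len(seq), code_sum(seq), approx_mw(seq))
```

Sorting the printed rows by code sum and by approximate molecular weight gives a quick way to compare the two orderings for sequences of equal length.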

Example 3: Optimization of pattern checking for DNA fragments and aptamers

The sequences of DNA fragments and aptamers were converted into binary base codes, and the method was optimized to identify specific patterns and secondary structures contained in a sequence by comparing the codes. To illustrate this, a DNA sequence consisting of 9 bases was used as an exemplary sequence. (Fig. 4)

The above exemplary sequence is described with the intention of illustrating the pattern of the code, and the range is not to be construed as being limited to the exemplary sequence of SEQ ID NO: 7.

An exemplary sequence of SEQ ID NO: 7 is as follows.

5'GCGGTGGCG 3'(SEQ ID NO: 7)

Converting the exemplary sequence to base codes gives the following arrangement.

11 00 11 11 01 11 11 00 11 (example sequence code 1)

Each base is designed to have a code sum of '3' with a complementary base capable of forming hydrogen bonds, and the arrangement of these sequences can form a stem structure in the DNA aptamer sequence. (Fig. 4; Stem)

In the typical pattern of a DNA stem-loop structure, two or more bases that can form a stem structure lie at both ends, and a loop structure can form when three or more bases that cannot bind complementarily, because the code sum of the facing bases is greater than or less than 3, are connected at the center.

The exemplary sequence can form two stem-loop structures, which can be confirmed simply from the arrangement of the nucleotide codes. Excluding the adjacent 00 code, the base that can bind complementarily to the first code, 11, is the eighth code, 00 (Fig. 4; ① red arrow). The bases that can bind complementarily to the second code, 00, are the sixth code, 11 (Fig. 4; ③ green arrow), and the seventh and ninth codes, both 11. In the same way, the third code, 11, is complementary to the eighth code, 00 (Fig. 4; ② blue arrow). Because the stem of a stem-loop structure forms only when two or more consecutive bases are paired, the complementary bonds of the bases connected by the red arrow or by the blue arrow in FIG. 4 can form a stem structure (Fig. 4; dotted circle), whereas the single complementary bond of the green arrow cannot. In both cases that can form a stem structure, four bases that can form a loop lie in the middle, so a stem-loop structure is predicted to be possible.

By standardizing each base into a code in this way, it is possible to predict from the code sum whether two bases can bind complementarily, and it was confirmed that the secondary structure and pattern of a DNA sequence are easy to predict from the number of complementary bonds in the sequence and the number of bases linked between them.
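The stem-loop check applied to SEQ ID NO: 7 can be sketched as a small search. This is one possible reading of the rule in this example (a stem needs two or more stacked pairs with code sum 3, enclosing three or more bases), not the applicant's program; the function name `find_stem_loops` and its parameters `min_stem` and `min_loop` are hypothetical.

```python
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}


def pairs(a, b):
    """Two bases can bind complementarily when their code sum is 3."""
    return CODE[a] + CODE[b] == 3


def find_stem_loops(seq, min_stem=2, min_loop=3):
    """Return (start, end, stem_length, loop_length) tuples, 1-based."""
    n = len(seq)
    hits = []
    for i in range(n):
        for j in range(i + 1, n):
            # Only start from the outermost pair of a stem.
            if i > 0 and j < n - 1 and pairs(seq[i - 1], seq[j + 1]):
                continue
            k = 0
            # Extend inward while bases pair and enough loop bases remain.
            while (j - k) - (i + k) - 1 >= min_loop and pairs(seq[i + k], seq[j - k]):
                k += 1
            if k >= min_stem:
                hits.append((i + 1, j + 1, k, j - i - 2 * k + 1))
    return hits


print(find_stem_loops("GCGGTGGCG"))
# → [(1, 8, 2, 4), (2, 9, 2, 4)]
```

For the exemplary sequence the sketch reports two candidates, positions 1 to 8 and positions 2 to 9, each with a two-pair stem around four unpaired bases, which matches the red-arrow and blue-arrow cases described above; the single pair between positions 2 and 6 is rejected because it has no stacked neighbor.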

Example 4: Optimization of SNP identification due to code standardization

By converting DNA sequences into codes and comparing the code sum of each sequence, the method was optimized for determining whether the nucleotide sequence of a specific DNA fragment has changed. Because an SNP sequence is a DNA fragment sequence in which one base is mutated, it was confirmed that applying the code to the SNP sequence and comparing it with the normal sequence makes it easy to identify the presence and location of the mutation. To confirm the efficiency of code standardization, the code was applied to one of various SNP sequences, an SNP of the CD44 gene that is identified in 84% of breast cancer patients. [Zhou, J., Nagarkatti, P. S., Zhong, Y., Creek, K., Zhang, J., & Nagarkatti, M. (2010). Unique SNP in CD44 intron 1 and its role in breast cancer development. Anticancer Research, 30(4), 1263-1272.]

In the SNP sequence of the breast cancer patients, the A base at the 14th position from exon 2, within the sequence of intron 1 of the gene, is mutated to G. This sequence was converted into codes and arranged as binary numbers, the code sum was calculated, and the code sums of the normal and mutant sequences were compared. (Fig. 5)

When the codes of the normal sequence and the mutant sequence were each converted to decimal numbers and summed, the normal sequence gave 39 and the mutant sequence gave 40, a value 1 greater than that of the normal sequence. Thus, whether a mutation exists in a DNA fragment can be determined from the code sum alone, and the code sum may differ by 1 to 3 depending on the type of the mutated base. In addition, the position of the mutation can be confirmed by comparing the individual code values of the two sequences.

By converting the DNA fragment sequences identified in the normal control group and the specific mutant sequence identified in the disease test group into a code and comparing the code sum, the difference between sequences can be quickly checked and the presence of SNPs can be easily searched. By applying the code sum to the sequence, it can be used for disease diagnosis.
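The code sum comparison and the position-by-position comparison can be sketched as follows. The two sequences in the usage lines are hypothetical placeholders chosen to show an A to G substitution; they are not the CD44 sequences of FIG. 5, and the names `code_sum`, `sum_difference`, and `mutation_positions` are likewise hypothetical.

```python
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}


def code_sum(seq):
    """Sum of the decimal code values of all bases."""
    return sum(CODE[base] for base in seq)


def sum_difference(normal, mutant):
    """Code sum of the mutant minus code sum of the normal sequence."""
    return code_sum(mutant) - code_sum(normal)


def mutation_positions(normal, mutant):
    """1-based positions where the codes differ, with both bases."""
    return [
        (pos, n, m)
        for pos, (n, m) in enumerate(zip(normal, mutant), start=1)
        if CODE[n] != CODE[m]
    ]


# Hypothetical 8-base fragments differing by one A -> G substitution.
normal = "CTGAACGT"
mutant = "CTGAGCGT"
print(code_sum(normal), code_sum(mutant), sum_difference(normal, mutant))
print(mutation_positions(normal, mutant))
```

The sum gives a quick screen for a single substitution, while the position-wise comparison locates the change; the latter also covers cases in which several substitutions happen to leave the code sum unchanged.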

Claims

A method for standardizing the code of DNA comprising the following steps:

(a) Four bases of C, T, A, and G are named 00, 01, 10, 11, respectively,

(b) when the bases form base pairs of G and C and of A and T, designating, in the 5' to 3' direction, 1100 for G and C, 0011 for C and G, 1001 for A and T, and 0110 for T and A.
A method of providing information optimized to identify a specific pattern or secondary structure of a specific DNA fragment or aptamer using standardization of the DNA code comprising the following steps:

(a) naming C, T, A, and G of a specific DNA fragment sequence as 00, 01, 10, 11, respectively; And

(b) comparing an array of codes named by the numerical values and an array of sums of codes.
The method of claim 2, wherein the step of comparing the array of the codes with the array of the code sums comprises converting the binary codes 00, 01, 10, and 11 of step (a) to decimal, determining that a stem structure can be formed when two or more pairs of facing codes whose sum is 3 are arranged at both ends, and determining that a loop structure is formed when three or more bases that cannot bind complementarily, because the code sum of the facing bases is greater than or less than 3, are connected at the center.
A method of providing information on the presence or absence of a nucleotide sequence variation of a specific DNA fragment using standardization of DNA codes comprising the following steps:

(a) naming C, T, A, and G of a specific DNA fragment sequence as 00, 01, 10, 11, respectively; And

(b) comparing the sum of codes named by the numerical values.
The method of claim 4, wherein the step of comparing the sum of the codes comprises converting the binary codes 00, 01, 10, and 11 of step (a) to decimal, calculating the sum, comparing it with that of a normal sequence, and determining that a mutation exists when there is a difference of 1 to 3.
The method according to claim 4, wherein the position of the mutant sequence can be confirmed by comparing the values of the codes obtained by naming C, T, A, and G of the base sequence of a specific DNA fragment as 00, 01, 10, and 11, respectively.
A computer program stored in a computer-readable medium for providing information optimized to identify a specific pattern or secondary structure of a specific DNA fragment or aptamer, the program causing a computer to perform the following steps:

(a) naming C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, 11, respectively; And

(b) after converting the binary codes 00, 01, 10, and 11 of step (a) to decimal, determining that a stem structure can be formed when two or more pairs of facing codes whose sum is 3 are arranged at both ends, and determining that a loop structure is formed when three or more bases that cannot bind complementarily, because the code sum of the facing bases is greater than or less than 3, are connected at the center.
A computer program stored in a computer-readable medium for providing information on the presence or absence of a nucleotide sequence variation of a specific DNA fragment, the program causing a computer to perform the following steps:

(a) naming C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, 11, respectively; And

(b) converting the binary codes of step (a) to decimal, calculating the sum, comparing it with that of the normal sequence, and determining that a mutation exists when there is a difference of 1 to 3.
A computer program stored in a computer-readable medium for providing information on the position of a sequence variation of a specific DNA fragment, the program causing a computer to perform the following steps:

(a) naming C, T, A, and G of the nucleotide sequence of a specific DNA fragment as 00, 01, 10, 11, respectively; And

(b) identifying the position of the mutant sequence by comparing the values of the codes obtained in step (a) by naming C, T, A, and G of the nucleotide sequence of the specific DNA fragment as 00, 01, 10, and 11, respectively.