KR20210022606A

KR20210022606A - A method coding standardization of DNA and a biotechnological use of the method

Info

Publication number: KR20210022606A
Application number: KR1020210023395A
Authority: KR
Inventors: 손인식; 김현주
Original assignee: 손인식
Priority date: 2019-03-05
Filing date: 2021-02-22
Publication date: 2021-03-03
Also published as: KR102280758B1

Abstract

The present invention relates to a method for standardizing DNA as a code, in which (a) the four bases C, T, A, and G are assigned the codes 00, 01, 10, and 11, respectively, and (b) when the bases form G-C and A-T pairs, the pairs are coded in the 5′ to 3′ direction as 1100 for G and C, 0011 for C and G, 1001 for A and T, and 0110 for T and A. The DNA code standardization method of the present invention makes it easy to identify specific patterns and secondary structures in a nucleotide sequence, as well as nucleotide sequence variations. It also makes it easy to characterize nucleotide sequences such as DNA fragments and aptamers, and it facilitates disease prediction by using disease-specific sequence variations such as SNPs.

Description

DNA coding method and its application to biomedical engineering {A METHOD CODING STANDARDIZATION OF DNA AND A BIOTECHNOLOGICAL USE OF THE METHOD}

The present invention relates to a method for standardizing DNA as a code and to an optimized biomedical engineering application of the method.

DNA (deoxyribonucleic acid), which exists as the genetic material in living organisms, is composed of genic regions that are expressed as proteins and non-genic regions. In the chemical structure of DNA, a phosphate group is linked to the 5' carbon of deoxyribose, a pentose sugar, and a base is linked to the 1' carbon, forming a unit called a nucleotide. The sequence of the DNA is determined by the type of base attached to each nucleotide.

Bases fall into two families: purines, which have two rings, and pyrimidines, which have one ring. The purines are guanine (G) and adenine (A), and the pyrimidines include cytosine (C) and thymine (T). RNA differs in that an -OH group is attached to the 2' carbon of the pentose and uracil (U) takes the place of thymine. The purine G forms a complementary pair with the pyrimidine C through hydrogen bonds, and A pairs with T. Because the complementary G-C pair is held by three hydrogen bonds, it is stronger than the A-T pair, which is held by two.

Nucleotide units of DNA form a single strand when the phosphate group on the 5' carbon of one unit is linked to the 3' carbon -OH group of another unit through a phosphodiester bond. Two complementary single strands, each linked by phosphodiester bonds, form a double helix through hydrogen bonding between complementary bases. This double helix structure was introduced by Watson and Crick in 1953. [Watson, J. D., & Crick, F. H. (1953). Molecular structure of nucleic acids. Nature, 171(4356), 737-738.]

The nucleotide sequence of the genic regions of DNA plays an important role in protein synthesis: each three-base code is translated into one amino acid, and the amino acids are linked to form a protein. DNA is transcribed into mRNA and then translated into 20 kinds of amino acids according to the order of the nucleotide sequence. As the translated amino acids are linked by way of tRNA, proteins are formed; these exist as constituents of the cell and also act as enzymes that mediate various reactions in vivo.

Human DNA has about 3 billion base pairs (bp), which amounts to gigabytes of data per person. Scaled to the whole population, even petabyte units are insufficient. For this reason, rather than analyzing the entire human DNA sequence, disease prediction is carried out on the sequences of short DNA fragments by analyzing disease-specific SNP (single nucleotide polymorphism) sites and the like. Even so, the SNP sites of all genes have not yet been analyzed, and a variety of programs need to be developed for this analysis.

[Prior patent literature]

Republic of Korea Patent Publication 10-2016-0001455

The present invention was conceived to solve the above problems and to meet the above need. An object of the present invention is to provide a method optimized for identifying specific patterns present in a nucleotide sequence by standardizing DNA bases into a binary code (2 bits per base) that takes the molecular weight of each base into account.
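As a rough illustration of what "2 bits per base" means in storage terms, the following Python sketch packs four bases into each byte. The function name `pack`, the placement of the first base in the highest two bits, and the left-alignment of a short final chunk are assumptions made for this illustration; the source specifies only the 2-bit codes themselves.

```python
# 2-bit codes from the described method: C=00, T=01, A=10, G=11.
CODE = {"C": 0b00, "T": 0b01, "A": 0b10, "G": 0b11}


def pack(seq: str) -> bytes:
    """Pack bases four per byte (hypothetical layout: first base in the top bits)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].upper()
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a final chunk shorter than four bases.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


packed = pack("GCGGTGGCG")

# Size of one strand of 3 billion bases at 2 bits per base, in bytes.
genome_bytes = 3_000_000_000 * 2 // 8
```

Under this layout a single strand of 3 billion bases would occupy 750,000,000 bytes, before any metadata or the complementary strand is considered.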

Another object of the present invention is to provide a method that makes it easy to determine complementary binding and patterns by using the code sums of a nucleotide sequence, and that makes it easy to predict the pattern and function of a DNA fragment or DNA aptamer.

Another object of the present invention is to provide a method that makes it easy to determine the molecular weight ratio between sequences, the proportion of each base, and the like from the code of the nucleotide sequence alone.

Another object of the present invention is to provide a method that makes it easy to identify variations in a nucleotide sequence and, by using disease-specific sequence variations such as SNPs, makes disease prediction easy.

To achieve the above objects, the present invention provides a method for standardizing DNA as a code, comprising: (a) assigning the codes 00, 01, 10, and 11 to the four bases C, T, A, and G, respectively; and (b) when the bases form G-C and A-T base pairs, assigning, in the 5' to 3' direction, 1100 for G and C, 0011 for C and G, 1001 for A and T, and 0110 for T and A.
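Steps (a) and (b) can be sketched in Python as two lookup tables. The names `BASE_CODE`, `PARTNER`, and `pair_code` are hypothetical and chosen only for this illustration; the code values themselves are those stated above.

```python
# Step (a): 2-bit code for each base.
BASE_CODE = {"C": "00", "T": "01", "A": "10", "G": "11"}

# Watson-Crick partner of each base.
PARTNER = {"G": "C", "C": "G", "A": "T", "T": "A"}


def pair_code(base: str) -> str:
    """Step (b): 4-bit code of a base pair, written as the base followed by its partner."""
    base = base.upper()
    return BASE_CODE[base] + BASE_CODE[PARTNER[base]]


pair_codes = {base + PARTNER[base]: pair_code(base) for base in "GCAT"}
```

Here `pair_codes` maps the pairs GC, CG, AT, and TA to the four-bit strings given in step (b).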

The present invention also provides a method of providing information optimized for identifying a specific pattern or secondary structure of a specific DNA fragment or aptamer using DNA code standardization, comprising: (a) assigning 00, 01, 10, and 11 to C, T, A, and G, respectively, of the nucleotide sequence of the specific DNA fragment; and (b) comparing the arrangement of the assigned codes with the arrangement of their code sums.

In one embodiment of the present invention, the step of comparing the arrangement of the codes with the arrangement of the code sums preferably, but not necessarily, comprises: converting the binary codes 00, 01, 10, and 11 of step (a) into decimal numbers; determining that a stem structure can be formed when two or more pairs of codes whose sum is 3 are arranged at both ends; and determining that a loop structure is formed when three or more bases that cannot pair complementarily, because the code sum of the facing bases is greater or less than 3, are connected in the center.

The present invention also provides a method of providing information on whether a nucleotide sequence variation is present in a specific DNA fragment using DNA code standardization, comprising: (a) assigning 00, 01, 10, and 11 to C, T, A, and G, respectively, of the nucleotide sequence of the specific DNA fragment; and (b) comparing the sums of the assigned codes.

In one embodiment of the present invention, the step of comparing the code sums preferably, but not necessarily, comprises converting the binary codes 00, 01, 10, and 11 of step (a) into decimal numbers, summing them, and determining that a variation is present when the sum differs from that of the normal sequence by 1 to 3.
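The code-sum comparison of this embodiment can be sketched as follows. The function names and the two five-base sequences are hypothetical and serve only to illustrate the rule; they are not taken from the source.

```python
# Decimal values of the 2-bit codes: C=00, T=01, A=10, G=11.
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}


def code_sum(seq: str) -> int:
    """Sum of the decimal code values of all bases in the sequence."""
    return sum(CODE[base] for base in seq.upper())


def variation_suspected(normal: str, sample: str) -> bool:
    """Apply the stated rule: a difference of 1 to 3 in code sum indicates a variation."""
    return 1 <= abs(code_sum(sample) - code_sum(normal)) <= 3


normal = "ACGTA"   # hypothetical reference fragment
variant = "ACGTG"  # same fragment with the last A replaced by G

flag = variation_suspected(normal, variant)
```

The rule as stated addresses a single substituted base. Equal code sums do not by themselves show that two sequences are identical (for example, AC and TT both sum to 2), which is where the position-by-position comparison of the following embodiment is useful.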

In another embodiment of the present invention, the method is preferably, but not necessarily, capable of identifying the position of the variant base by comparing, position by position, the individual values of the codes obtained by assigning 00, 01, 10, and 11 to C, T, A, and G, respectively, of the nucleotide sequence of the specific DNA fragment.
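A minimal sketch of the position-by-position comparison is shown below, assuming the two sequences have the same length and are already aligned. The function name `variant_positions` and the example sequences are hypothetical.

```python
# Decimal values of the 2-bit codes: C=00, T=01, A=10, G=11.
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}


def variant_positions(normal: str, sample: str) -> list[tuple[int, int, int]]:
    """Return (1-based position, normal code, sample code) wherever the codes differ."""
    if len(normal) != len(sample):
        raise ValueError("sequences must have the same length")
    pairs = zip(normal.upper(), sample.upper())
    return [
        (pos, CODE[ref], CODE[alt])
        for pos, (ref, alt) in enumerate(pairs, start=1)
        if CODE[ref] != CODE[alt]
    ]


# Hypothetical fragments differing only at the fifth base (A -> G).
differences = variant_positions("ACGTA", "ACGTG")
```

Each reported tuple gives the position together with the code values on both sides, so the kind of substitution can be read off directly.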

The present invention also provides a computer program stored on a computer-readable medium for providing information optimized for identifying a specific pattern or secondary structure of a specific DNA fragment or aptamer, the program causing a computer to perform steps comprising: (a) assigning 00, 01, 10, and 11 to C, T, A, and G, respectively, of the nucleotide sequence of the specific DNA fragment; and (b) converting the binary codes 00, 01, 10, and 11 of step (a) into decimal numbers, determining that a stem structure can be formed when two or more pairs of codes whose sum is 3 are arranged at both ends, and determining that a loop structure is formed when three or more bases that cannot pair complementarily, because the code sum of the facing bases is greater or less than 3, are connected in the center.

The present invention also provides a computer program stored on a computer-readable medium for providing information on whether a nucleotide sequence variation is present in a specific DNA fragment, the program causing a computer to perform steps comprising: (a) assigning 00, 01, 10, and 11 to C, T, A, and G, respectively, of the nucleotide sequence of the specific DNA fragment; and (b) converting the binary codes of step (a) into decimal numbers, summing them, and determining that a variation is present when the sum differs from that of the normal sequence by 1 to 3.

The present invention also provides a computer program stored on a computer-readable medium for providing information on the position of a variant base in the nucleotide sequence of a specific DNA fragment, the program causing a computer to perform steps comprising: (a) assigning 00, 01, 10, and 11 to C, T, A, and G, respectively, of the nucleotide sequence of the specific DNA fragment; and (b) identifying the position of the variant base by comparing the individual values of the codes obtained in step (a).

Hereinafter, the present invention will be described.

The present invention provides a method of assigning the codes 00, 01, 10, and 11 to the four DNA bases C, T, A, and G, respectively, in order of increasing molecular weight, such that when the bases form G-C and A-T base pairs, the ratio of the code sums matches the ratio of the summed molecular weights.

The present invention also builds a system in which aptamers specific to each compound, identified using SELEX, are standardized as codes so that specific patterns binding to the reactive groups present in each compound can be identified, accumulated as big data, and used for prediction.

The present invention also provides a method in which a DNA sequence is standardized as a code, the value of each base is converted to a decimal number, and the sum is derived, so that the presence or absence of a variation in each sequence can be checked and the presence of an SNP for a specific disease can be determined quickly.

By standardizing DNA as a code, the present invention provides a method that makes it easy to identify specific patterns present in a nucleotide sequence.

By identifying DNA sequence patterns that bind to specific targets and chemical structures and using them as big data, the present invention predicts aptamers that bind to a given chemical structural unit and provides the information needed to program SELEX (systematic evolution of ligands by exponential enrichment) simulations.

By standardizing DNA as a code that reflects base molecular weight, the present invention also provides a method optimized for determining the molecular weight ratio between sequences, the proportion of each base, and the like from the code of the nucleotide sequence alone.

By standardizing DNA as a code that reflects base molecular weight, the present invention further provides a method that makes it easy to identify variations within a nucleotide sequence and a method optimized for comparing code sums and code order, thereby enabling identification of disease-specific variations such as SNPs and facilitating disease prediction.

As described herein, the DNA code standardization method of the present invention makes it easy to identify specific patterns present in a nucleotide sequence: it makes it easy to identify variations in the sequence and, by using disease-specific sequence variations such as SNPs, facilitates disease prediction.

FIG. 1 shows the code values assigned to reflect the molecular structure of DNA and the principle of the paired mass ratio: C, T, A, and G, in order from the base with the lowest molecular weight to the highest, are assigned the binary values 00, 01, 10, and 11;
FIG. 2 shows that the assigned binary codes were designed so that, when G pairs with C and A pairs with T, the ratio of the code sums of the two pairs is 1:1, the same as the actual mass ratio;
FIG. 3 shows the code conversion values of six sequences, comparing the code sum of each sequence with its molecular weight;
FIG. 4 shows the pattern of an example sequence identified using the DNA sequence code: whether complementary binding is possible is checked from the code sum of each pair of bases, and stem-loop structure formation and pattern are identified from the number of such bonds and the number of connected bases; and
FIG. 5 shows the efficiency of the code standardization of the present invention, confirmed by applying the code to an SNP sequence identified in breast cancer patients: the SNP sequence in which the A base at the 14th position from exon 2 is changed to G is converted into the code and arranged as binary numbers, and the code sums of the normal and variant sequences are calculated and compared.

Hereinafter, the present invention is described in detail through non-limiting examples. The following examples are provided to illustrate the present invention, and the scope of the present invention is not to be construed as limited by them.

Example 1: Code standardization according to the molecular weight of each base

To standardize the four bases that determine a DNA sequence as codes expressed as two-digit binary numbers, the language of computers, the molecular weight of each base was analyzed and is shown in FIG. 1. The deoxyribonucleotides in which each base G, A, T, or C is linked to one phosphate group are denoted dGMP, dAMP, dTMP, and dCMP, respectively.

The molecular weights decrease in the order G, A, T, C. Summing the molecular weights of G and C, which pair through hydrogen bonds, and of A and T, which pair complementarily, gave 654.4 (= 347.2 + 307.2) and 653.4 (= 331.2 + 322.2), confirming that the bases pair with an approximately 1:1 molecular mass. The sum for A and T is 1 less than the sum for G and C because the G≡C pair has one more nitrogen (N), and the A=T pair has one more carbon (C) and one more hydrogen (H), than the other pair, so the pair sums differ by the difference between the mass of N and the combined mass of C + H (14 > 12 + 1), that is, by 1. Accordingly, A and T, lacking the additional O or N capable of hydrogen bonding, form two hydrogen bonds and thus a weaker bond than the G≡C pair, which forms three.
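The arithmetic in this paragraph can be checked with a few lines of Python. The molecular weights are the values quoted above; the variable names are chosen only for this illustration.

```python
# Molecular weights of the deoxyribonucleoside monophosphates quoted in the text.
MW = {"dGMP": 347.2, "dAMP": 331.2, "dTMP": 322.2, "dCMP": 307.2}

gc_sum = round(MW["dGMP"] + MW["dCMP"], 1)  # G paired with C
at_sum = round(MW["dAMP"] + MW["dTMP"], 1)  # A paired with T

difference = round(gc_sum - at_sum, 1)      # N (14) versus C + H (12 + 1)
ratio = gc_sum / at_sum                     # close to 1:1

# Nucleotides ordered from lowest to highest molecular weight.
ascending = sorted(MW, key=MW.get)
```

The ascending order (dCMP, dTMP, dAMP, dGMP) is the order in which the codes 00, 01, 10, and 11 are assigned.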

Accordingly, the code for each base was assigned to reflect the molecular structure of DNA and the principle of the paired mass ratio described above. C, T, A, and G, in order from the lowest molecular weight to the highest, were assigned the binary values 00, 01, 10, and 11. (FIG. 1)

The code values were designed so that, when G pairs with C and A pairs with T, the code sums of the two pairs are in a 1:1 ratio, the same as the actual mass ratio. (FIG. 2)

The code sum is the sum of the code values after the code of each base is converted to a decimal number; the code sums of G and C and of A and T are both '3'.
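Because each complementary pair sums to 3 (binary 11), the partner of a base can also be obtained by flipping both bits of its code. The sketch below illustrates this consequence of the code design; the XOR formulation and the function names are additions for illustration and are not stated in the source.

```python
# 2-bit codes: C=00, T=01, A=10, G=11.
CODE = {"C": 0b00, "T": 0b01, "A": 0b10, "G": 0b11}
BASE = {value: base for base, value in CODE.items()}


def complement(base: str) -> str:
    """Partner base, found by flipping both bits of the code (code XOR 11)."""
    return BASE[CODE[base.upper()] ^ 0b11]


def can_pair(a: str, b: str) -> bool:
    """Two bases are complementary exactly when their code sum is 3."""
    return CODE[a.upper()] + CODE[b.upper()] == 3


partners = {base: complement(base) for base in "CTAG"}
```

Only the combinations 0 + 3 and 1 + 2 give a sum of 3, so the code-sum test accepts G-C and A-T pairs and rejects every other combination.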

Example 2: Optimization of reflecting the molecular weight ratio of DNA fragments and aptamers

Because the codes were assigned in order from the lowest to the highest mass according to the molecular weight of each DNA base, the total code sum of a DNA fragment is calculated in a way that reflects the relative molecular weight of each sequence. (FIG. 3) To check how well the code reflects molecular weight, the code sums and molecular weights of six example sequences were compared.

The example sequences are given to check how well the code reflects molecular weight, and the scope is not to be construed as limited to the sequences of SEQ ID NOs: 1 to 6.

The sequences of SEQ ID NOs: 1 to 6 are as follows.

5' AGAGCTCGCGCCGGAGTTCTCAATGCAAGAGC 3' (SEQ ID NO: 1)

5' GCGGCGGTGGCCTGAAGTCTGGCGGTGGCCCC 3' (SEQ ID NO: 2)

5' GCGGCGGTGGCCAGAAGTCTCGCGGTGGCGGC 3' (SEQ ID NO: 3)

5' GTGGAGGCGGTGGCCAGTCTCGCGGTGGCGGC 3' (SEQ ID NO: 4)

5' GTGGCGGTGGCCAGCATAGTGGCGGTGGCCAG 3' (SEQ ID NO: 5)

5' GTGGAGGCGGTGGCCGTGGAGGCGGAGGCCGC 3' (SEQ ID NO: 6)

The six example sequences are all 32-mers: they have the same length but differ in base composition and order. The code conversion values of each base are shown in FIG. 3. The code sum is the total obtained after converting the code of each base to a decimal number, so the code sum varies with the base composition of each sequence and reflects its molecular weight.

Compared with the molecular weight (Mw) of each sequence, sequences with a lower molecular weight had a smaller code sum and sequences with a higher molecular weight had a larger code sum. (FIG. 3)

Thus, by assigning and converting the codes so as to reflect molecular weight ratios, the code sum can be used to compare the relative molecular weights of sequences.
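The code sums and base proportions of the six example sequences can be computed as in the following sketch. The helper names `code_sum` and `composition` are hypothetical, and the values shown in FIG. 3 are not reproduced here because the figure is not part of this text.

```python
from collections import Counter

# Decimal values of the 2-bit codes: C=00, T=01, A=10, G=11.
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}

# SEQ ID NOs: 1 to 6 as listed above.
SEQS = {
    1: "AGAGCTCGCGCCGGAGTTCTCAATGCAAGAGC",
    2: "GCGGCGGTGGCCTGAAGTCTGGCGGTGGCCCC",
    3: "GCGGCGGTGGCCAGAAGTCTCGCGGTGGCGGC",
    4: "GTGGAGGCGGTGGCCAGTCTCGCGGTGGCGGC",
    5: "GTGGCGGTGGCCAGCATAGTGGCGGTGGCCAG",
    6: "GTGGAGGCGGTGGCCGTGGAGGCGGAGGCCGC",
}


def code_sum(seq: str) -> int:
    """Sum of the decimal code values of all bases."""
    return sum(CODE[base] for base in seq)


def composition(seq: str) -> dict[str, float]:
    """Fraction of each base in the sequence."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "CTAG"}


for seq_id, seq in SEQS.items():
    print(seq_id, code_sum(seq), composition(seq))
```

Since the code sum equals (number of T) + 2 × (number of A) + 3 × (number of G), it depends only on base composition and not on base order.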

Example 3: Optimization of pattern identification in DNA fragments and aptamers

By converting the sequences of DNA fragments and aptamers into the binary base code and comparing the codes, the method was optimized for identifying specific patterns, secondary structure, and the like contained in a sequence. To show this, a DNA sequence consisting of 9 bases was used as an example sequence. (FIG. 4)

The example sequence is given to illustrate code patterns, and the scope is not to be construed as limited to the example sequence of SEQ ID NO: 7.

The example sequence of SEQ ID NO: 7 is as follows.

5' GCGGTGGCG 3' (SEQ ID NO: 7)

Converting the example sequence into base codes gives the following.

11 00 11 11 01 11 11 00 11 (example sequence code 1)
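The conversion of SEQ ID NO: 7 into its code string, and of that string into decimal values, can be reproduced with the short sketch below. The function names `to_code` and `to_decimal` are hypothetical.

```python
# 2-bit codes written as strings: C=00, T=01, A=10, G=11.
CODE = {"C": "00", "T": "01", "A": "10", "G": "11"}


def to_code(seq: str) -> str:
    """Space-separated 2-bit codes, in 5' to 3' order."""
    return " ".join(CODE[base] for base in seq.upper())


def to_decimal(code_string: str) -> list[int]:
    """Decimal value of each 2-bit code."""
    return [int(code, 2) for code in code_string.split()]


code_string = to_code("GCGGTGGCG")
decimals = to_decimal(code_string)
```

The decimal list is the form used when checking which pairs of positions have a code sum of 3.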

The codes are designed so that the code sum of each base and the complementary base with which it can form hydrogen bonds is '3', and a run of such pairs can form a stem structure in a DNA aptamer sequence. (FIG. 4; Stem)

In most stem-loop patterns in DNA, two or more consecutive bases capable of forming a stem are present at each end, and a loop can form when three or more bases that cannot pair complementarily, because the code sum of the facing bases is greater or less than 3, are connected in the center.

The example sequence can form two stem-loop structures, which can be readily seen from the base code arrangement. Apart from the adjacent 00 code, the code that can pair complementarily with the first code, 11, is the 8th code, 00 (FIG. 4; ① red arrow). The second code, 00, can pair with the 6th 11 (FIG. 4; ③ green arrow), the 7th 11, and the 9th 11. Likewise, the 3rd code, 11, can pair with the 8th 00 (FIG. 4; ② blue arrow). Because the stem of a stem-loop structure requires two or more consecutive paired bases, the pairing adjoining the red arrow or the pairing adjoining the blue arrow in FIG. 4 can form a stem (FIG. 4; dotted circle), whereas the pairing at the green arrow is a single isolated pair and cannot form a stem. In both cases that can form a stem, four bases that can form a loop lie in the middle, so formation of a stem-loop structure is predicted to be possible.
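The reasoning above can be sketched as a small search over position pairs. This is a simplified illustration rather than the program of the invention: it looks only for stems of exactly two consecutive pairs, requires at least three bases between the inner pair, and does not separately verify that the loop bases are unable to pair. The function name `find_stem_loops` and its parameter are hypothetical.

```python
# Decimal values of the 2-bit codes: C=00, T=01, A=10, G=11.
CODE = {"C": 0, "T": 1, "A": 2, "G": 3}


def find_stem_loops(seq: str, min_loop: int = 3) -> list[tuple[int, int, int]]:
    """Return (i, j, loop_len) for each two-pair stem that encloses a loop.

    i and j are the 1-based positions of the outer pair, the inner pair is
    (i + 1, j - 1), and loop_len counts the bases strictly inside the inner pair.
    """
    codes = [CODE[base] for base in seq.upper()]
    hits = []
    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            loop_len = j - i - 3  # bases between positions i + 1 and j - 1
            if loop_len < min_loop:
                continue
            outer_pairs = codes[i] + codes[j] == 3
            inner_pairs = codes[i + 1] + codes[j - 1] == 3
            if outer_pairs and inner_pairs:
                hits.append((i + 1, j + 1, loop_len))
    return hits


stem_loops = find_stem_loops("GCGGTGGCG")
```

For SEQ ID NO: 7 the search reports two candidates, with outer pairs at positions 1 and 8 and at positions 2 and 9, each enclosing a four-base loop, which agrees with the two stem-loop structures described for FIG. 4.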

이와 같이 각 염기를 코드로 표준화함으로써 염기 코드 합에 따라 각 염기와의 상보 결합 가능 여부를 예측할 수 있으며 각 서열의 상보 결합의 수와 그에 연결된 염기의 수에 따라 DNA 서열의 2차 구조 및 패턴 등을 예측하는데 용이한 것으로 확인하였다. By standardizing each base into a code in this way, it is possible to predict whether or not complementary bonding with each base is possible according to the sum of the base codes, and the secondary structure and pattern of the DNA sequence, etc. It was confirmed that it was easy to predict.

실시예 4: 코드 표준화로 인한 SNP 파악의 최적화Example 4: Optimization of SNP identification due to code standardization

DNA 서열을 코드로 변환하고 각 서열의 코드합을 비교함으로써 특정 DNA 단편의 염기서열 변이 여부를 파악하는데 최적화하였다. SNP서열은 염기 1개가 변이된 DNA 단편 서열이기 때문에 코드를 SNP 서열에 적용하고 정상 서열과 비교함으로써 변이 존재 여부와 위치를 파악하는데 용이한 것을 확인하였다. 다양한 SNP 서열 중에 하나이며 84%의 유방암 환자에게서 확인되는 CD44유전자의 SNP 서열에 적용하여 코드 표준화의 효율성을 확인하였다. [Zhou, J., Nagarkatti, P. S., Zhong, Y., Creek, K., Zhang, J., & Nagarkatti, M. (2010). Unique SNP in CD44 intron 1 and its role in breast cancer development. Anticancer research, 30(4), 1263-1272.]By converting the DNA sequence into a code and comparing the code sum of each sequence, it was optimized to determine whether the nucleotide sequence of a specific DNA fragment was changed. Since the SNP sequence is a DNA fragment sequence in which one base is mutated, it was confirmed that it was easy to identify the presence and location of the mutation by applying the code to the SNP sequence and comparing it with the normal sequence. It is one of various SNP sequences and applied to the SNP sequence of the CD44 gene, which is identified in 84% of breast cancer patients, to confirm the efficiency of code standardization. [Zhou, J., Nagarkatti, PS, Zhong, Y., Creek, K., Zhang, J., & Nagarkatti, M. (2010). Unique SNP in CD44 intron 1 and its role in breast cancer development. Anticancer research , 30 (4), 1263-1272.]

The SNP found in these breast cancer patients lies in the first intron (intron 1) of the gene: the A base at the 14th position from exon 2 is changed to G. This sequence was converted into codes and arranged as binary numbers, and the code sums of the normal sequence and the variant sequence were then calculated and compared (Fig. 5).

When the codes of the normal sequence and the variant sequence were each converted to decimal numbers and summed, the normal sequence gave 39 and the variant sequence gave 40, so the sum of the variant sequence was larger by 1. This matches the change from A (10, decimal 2) to G (11, decimal 3). Thus the presence of a variation in a DNA fragment can be determined from the code sum alone, and the code sum may differ by 1 to 3 depending on the type of base that has changed. In addition, the position of the variation in the sequence can be identified by comparing the individual code values.
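The comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the program of the invention: the two 7-base sequences are hypothetical stand-ins (the actual CD44 fragment is shown only in Fig. 5), and the function names `to_codes`, `code_sum`, and `compare` are illustrative.

```python
# Two-bit base codes as defined in this document: C=00, T=01, A=10, G=11.
CODE = {"C": 0b00, "T": 0b01, "A": 0b10, "G": 0b11}


def to_codes(seq: str) -> list[int]:
    """Convert a DNA sequence into a list of decimal code values (0 to 3)."""
    return [CODE[base] for base in seq.upper()]


def code_sum(seq: str) -> int:
    """Sum of the decimal code values of all bases in the sequence."""
    return sum(to_codes(seq))


def compare(normal: str, variant: str) -> tuple[int, list[int]]:
    """Return (code-sum difference, 1-based positions whose codes differ)."""
    if len(normal) != len(variant):
        raise ValueError("sequences must have the same length")
    difference = code_sum(variant) - code_sum(normal)
    positions = [
        index + 1
        for index, (n, v) in enumerate(zip(to_codes(normal), to_codes(variant)))
        if n != v
    ]
    return difference, positions


# Hypothetical sequences: an A-to-G substitution at position 5.
normal = "tcagatc"
variant = "tcaggtc"
print(code_sum(normal), code_sum(variant))  # 9 10
print(compare(normal, variant))             # (1, [5])
```

As in the CD44 example, an A-to-G substitution raises the code sum by 1, and comparing the individual code values locates the changed position.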

In this way, by converting the DNA fragment sequences found in a normal control group and the specific variant sequences found in a disease group into codes and comparing their code sums, differences between sequences can be identified quickly and the presence of an SNP can be screened for easily. Applying the code sum to the identified SNP sequence can then be used for disease diagnosis.

<110> SON, In sik
<120> A METHOD CODING STANDARDIZATION OF DNA AND A BIOTECHNOLOGICAL USE OF THE METHOD
<130> P19-0005HS
<160> 7
<170> KopatentIn 2.0

<210> 1
<211> 32
<212> DNA
<213> Artificial Sequence
<220>
<223> Oligonucleotide
<400> 1
agagctcgcg ccggagttct caatgcaaga gc 32

<210> 2
<211> 32
<212> DNA
<213> Artificial Sequence
<220>
<223> Oligonucleotide
<400> 2
gcggcggtgg cctgaagtct ggcggtggcc cc 32

<210> 3
<211> 32
<212> DNA
<213> Artificial Sequence
<220>
<223> Oligonucleotide
<400> 3
gcggcggtgg ccagaagtct cgcggtggcg gc 32

<210> 4
<211> 32
<212> DNA
<213> Artificial Sequence
<220>
<223> Oligonucleotide
<400> 4
gtggaggcgg tggccagtct cgcggtggcg gc 32

<210> 5
<211> 32
<212> DNA
<213> Artificial Sequence
<220>
<223> Oligonucleotide
<400> 5
gtggcggtgg ccagcatagt ggcggtggcc ag 32

<210> 6
<211> 32
<212> DNA
<213> Artificial Sequence
<220>
<223> Oligonucleotide
<400> 6
gtggaggcgg tggccgtgga ggcggaggcc gc 32

<210> 7
<211> 9
<212> DNA
<213> Artificial Sequence
<220>
<223> Oligonucleotide
<400> 7
gcggtggcg 9

Claims

A computer program stored in a computer-readable medium for providing information on the presence or absence of a nucleotide sequence variation in a specific DNA fragment, the program causing a computer to perform the following steps:
(a) designating C, T, A, and G of the nucleotide sequence of the specific DNA fragment as 00, 01, 10, and 11, respectively; and
(b) converting the binary number sequence of step (a) into decimal numbers, calculating their sum, and comparing the sum with that of the normal sequence, wherein a variation is determined to exist when the sums differ by 1 to 3.