KR100836166B1

KR100836166B1 - Apparatus for prediction of tertiary structure from the protein amino acid sequences and prediction method thereof

Info

Publication number: KR100836166B1
Application number: KR1020060082296A
Authority: KR
Inventors: 김동섭; 정찬석; 이민호
Original assignee: 한국과학기술원 (Korea Advanced Institute of Science and Technology, KAIST)
Priority date: 2006-08-29
Filing date: 2006-08-29
Publication date: 2008-06-09
Also published as: KR20080019857A

Abstract

The present invention relates to an apparatus for predicting the tertiary structure of a protein from its amino acid sequence, and to a prediction method using the apparatus. Specifically, it relates to an apparatus for predicting the tertiary structure of an unknown protein, and a corresponding prediction method, the apparatus comprising: an input means; a means for preprocessing the unknown protein sequence entered through the input means; a template protein database; a pair comparison means; a protein similarity network database; a global comparison means for globally comparing the unknown protein with the template proteins on the basis of the protein similarity network database; a template selection means; a modeling means; a verification means for verifying the predicted protein structure; and an output means. Because the invention predicts the tertiary structure of a protein more accurately and precisely than existing prediction methods, it reduces the cost and time spent determining protein structures experimentally, and can therefore be useful in research aimed at treating various diseases.

Protein, tertiary structure, prediction method, protein similarity network database

Description

Apparatus for prediction of tertiary structure from the protein amino acid sequences and prediction method

FIG. 1 is a block diagram of an apparatus for predicting the structure of an unknown protein according to one embodiment of the present invention.

FIG. 2 is a block diagram of a template protein database according to one embodiment of the present invention.

FIG. 3 is a block diagram of a protein similarity network database according to one embodiment of the present invention.

FIG. 4 is a flowchart illustrating a method for predicting the structure of an unknown protein according to one embodiment of the present invention.

FIG. 5 is a graph comparing the performance of conventional protein tertiary structure prediction programs.

FIG. 6 is a graph comparing the performance of a conventional protein tertiary structure prediction program with that of the prediction apparatus according to the present invention.

FIG. 7 compares a protein structure predicted by a conventional method with the actual structure.

FIG. 8 compares a protein structure predicted by the method of the present invention with the actual structure.

The present invention relates to an apparatus for predicting the tertiary structure of a protein from its amino acid sequence, and to a prediction method using the apparatus.

Proteins are the foundation of life processes: they both drive and regulate them. A protein's function is determined by its tertiary structure. Determining the tertiary structure from the sequence and predicting the function from it is therefore key both to understanding life processes and to treating the various diseases that arise when protein function goes wrong. For this reason, elucidating the tertiary structure of proteins is very important.

Protein structures can be determined either experimentally or by computational prediction. Experimental methods such as X-ray crystallography and nuclear magnetic resonance (NMR) yield accurate structures, but they are time-consuming and expensive, which makes them difficult to apply at genome scale. Computational methods predict the tertiary structure from the amino acid sequence and fall into two classes: template-based methods, which predict the structure from a template protein expected to have a similar structure, and ab initio methods, which predict the structure from the physicochemical properties of the protein [Baker, D. and Sali, A. 2001, Science, 294, 93-96]. Template-based prediction in particular can be applied to relatively long protein sequences and has a wide range of applicability, so it can be applied at genome scale [McGuffin, L. J., et al., 2004, Nucleic Acids Res., 2004, D196-D199].

In template-based modeling, the step of selecting the template protein whose structure is most similar to that of the unknown protein has a large effect on overall performance.

As a rule, a relatively accurate structure can be predicted when a template with at least 30% sequence identity is selected. When no template satisfying this condition can be found, the physicochemical properties of the amino acids or sequence profiles are sometimes used in addition [Altschul, S. F., et al. 1997, Nucleic Acids Res., 25, 3389-3402; Sadreyev, R. and Grishin, N. 2003, J. Mol. Biol., 326, 317-336].
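The 30% criterion can be made concrete with a small calculation. The following is a minimal sketch, assuming that identity is measured over aligned (non-gap) columns of an existing pairwise alignment; the source does not define how identity is computed, and the function names `percent_identity` and `meets_identity_threshold` are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of aligned (non-gap) columns that hold identical residues.

    Assumption: both strings are rows of the same pairwise alignment,
    so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned_columns = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # columns with a gap are not counted
        aligned_columns += 1
        if x == y:
            identical += 1
    if aligned_columns == 0:
        return 0.0
    return 100.0 * identical / aligned_columns


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 30.0) -> bool:
    """True when the template reaches the identity level named in the text."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Other definitions of identity (for example, dividing by the length of the shorter sequence) give different values, so the denominator should be stated whenever a threshold is quoted.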

Many conventional protein structure prediction programs rely mainly on sequence alignment to find the template structure needed to predict an unknown protein structure.

BLAST is the tool generally used to search for protein similarity. It finds template structures by comparing the input sequence with known DNA and amino acid sequences. BLAST first builds a list of words (character blocks) that score at or above a threshold when aligned with the query sequence. It then uses a precomputed table to perform local alignment, extending each similar region outward until the similarity between the query sequence and the database sequence is lost. Because BLAST reports its results in order of high-scoring segment pair (HSP) score, the hits are not listed in positional order from the start to the end of the gene.

PSI-BLAST (Position-Specific Iterated BLAST), which brings specific domains or motifs to bear on pairwise sequence alignment, starts from a single sequence and uses gapped local multiple alignment. It is widely used alongside BLAST.

Korean Patent Laid-Open Publication No. 2005-0064644 discloses a method and apparatus for predicting the structure of an unknown protein whose amino acid sequence is known, the method comprising: (a) determining template protein candidates according to the degree of sequence similarity, by comparing the sequence information of the unknown protein against a database storing sequence information of known proteins; (b) determining which property group the unknown protein belongs to, on the basis of a database storing property information of known proteins; and predicting the structure of the unknown protein on the basis of the results of steps (a) and (b).

However, these methods depend solely on pairwise comparison between the unknown protein and each template protein, and therefore use only local information in protein structure space. As a result, when sequence similarity is below 30%, they fail to find proteins of similar structure or function (remote homologs) effectively.

To address this problem, methods that incorporate a protein network are being developed; a representative current example is RankProp. RankProp builds on the existing PSI-BLAST algorithm and adds weights derived from the measured sequence similarity between the template proteins in the database. Even so, because it relies on sequence information alone, it still fails to find proteins of similar structure or function effectively.

While studying how to find proteins of similar structure or function effectively even when sequence similarity is below 30%, the present inventors found that such proteins can be found at low sequence similarity by first performing pair comparison between the unknown protein and the template proteins and then comparing structural similarity globally using a protein similarity network database, and thereby completed the present invention.

An object of the present invention is to provide an apparatus for predicting the tertiary structure of an unknown protein.

Another object of the present invention is to provide a method for predicting the tertiary structure of an unknown protein.

To achieve the above objects, the present invention provides an apparatus for predicting tertiary structure from the amino acid sequence of a protein.

The present invention also provides a method for predicting tertiary structure from the amino acid sequence of a protein.

Hereinafter, the present invention is described in detail.

The present invention provides an apparatus for predicting the tertiary structure of an unknown protein, comprising:

an input means;

a sequence preprocessing means for preprocessing the unknown protein sequence entered through the input means;

a template protein database storing information on template proteins;

a pair comparison means for comparing the unknown protein with the template proteins on the basis of the template protein database;

a protein similarity network database constructed from, and storing, the structural similarities among the template proteins;

a global comparison means for globally comparing the unknown protein with the template proteins on the basis of the protein similarity network database;

a template selection means for selecting one or more template proteins structurally similar to the unknown protein on the basis of the results of the pair comparison means and the global comparison means;

a modeling means for predicting the three-dimensional structure of the unknown protein on the basis of the result of the template selection means;

a verification means for verifying the predicted protein structure; and

an output means for outputting the predicted protein structure.

Hereinafter, the present invention is described in more detail with reference to FIG. 1.

FIG. 1 is a block diagram of an apparatus for predicting the structure of an unknown protein according to one embodiment of the present invention.

As shown in FIG. 1, the apparatus for predicting the structure of an unknown protein according to one embodiment of the present invention comprises: an input means (10); a sequence preprocessing means (100); a template protein database (102); a pair comparison means (101) that performs pair comparison between the unknown protein and each template protein in the template protein database (102); a protein similarity network database (104); a global comparison means (103) that performs global comparison using the protein similarity network database (104); a template selection means (105) that selects template proteins by considering both the pair comparison scores and the global comparison scores; a modeling means (106) that builds a structural model of the unknown protein using sequence alignment and a modeling tool; a model verification means (107) that verifies whether the generated model is actually suitable; and an output means (20).

The sequence data of the unknown protein can be entered into the sequence input window of the tertiary structure prediction apparatus through a conventional input means (10). A keyboard is typically used as the input means (10); when the sequence data are obtained from a file or over the Internet, they can be entered with a mouse alone, without the keyboard. Besides a keyboard and/or mouse, a tablet, trackball, electronic pen, scanner, or the like may be used. The entered sequence data can be saved as a file, so that the file can be loaded later for further work.

The sequence preprocessing means (100) uses its own program or external programs to extract and preprocess information such as secondary structure, solvent accessibility, and profile from the entered sequence of the unknown protein. A program such as PSIPRED can be used to extract secondary structure information from the sequence [Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices, J Mol Biol, 292, 195-202]; a method based on artificial neural networks can be used to predict solvent accessibility [Ahmad, S. and Gromiha, M. M. (2002) NETASA: neural network based prediction of solvent accessibility, Bioinformatics, 18, 819-824]; and a program such as PSI-BLAST can be used to generate the profile [Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 25, 3389-3402].
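Before any external predictor is run, the entered sequence has to be read and checked. The sketch below illustrates only that first step, under the assumption that the sequence arrives as a single FASTA record; the source does not name an input format, and `read_single_fasta` is a hypothetical helper. Secondary structure, solvent accessibility, and profile generation are left to the external programs cited above.

```python
from typing import Tuple

# The 20 standard amino acids in one-letter code.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def read_single_fasta(text: str) -> Tuple[str, str]:
    """Parse one FASTA record and return (header, upper-case sequence).

    Assumption: exactly one record; non-standard residue codes are
    rejected rather than silently dropped.
    """
    header = ""
    seen_header = False
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header or chunks:
                raise ValueError("expected a single FASTA record")
            seen_header = True
            header = line[1:].strip()
        else:
            chunks.append(line.upper())
    sequence = "".join(chunks)
    if not sequence:
        raise ValueError("no sequence found")
    unknown = sorted(set(sequence) - STANDARD_RESIDUES)
    if unknown:
        raise ValueError("non-standard residue codes: " + ", ".join(unknown))
    return header, sequence
```

Rejecting unexpected characters early keeps later alignment and scoring steps from failing on malformed input.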

The template protein database (102) stores data related to the template proteins. The data that can be stored include the protein sequence (200), the actual structure of the protein (201), structure data predicted from the protein sequence (202), and the profile (203) (see FIG. 2). The template protein database (102) is not limited to a particular source; for example, the Protein Data Bank (PDB) or the Structural Classification of Proteins (SCOP) can be used [Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures, J Mol Biol, 247, 536-540]. SCOP is one of the commonly used tertiary structure databases; it classifies proteins by similarity and scope into three levels, fold, superfamily, and family, and proteins within the same fold or superfamily are generally used.
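The four kinds of data listed for the template protein database (200 to 203) can be pictured as one record per template. The following is a minimal in-memory sketch; the class names, field names, and field types are assumptions chosen for illustration, since the source specifies only what is stored and not how.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class TemplateRecord:
    template_id: str
    sequence: str                                   # (200) amino acid sequence
    coordinates: List[Tuple[float, float, float]]   # (201) actual structure, e.g. one point per residue
    predicted_structure: str = ""                   # (202) structure data predicted from the sequence
    profile: List[Dict[str, float]] = field(default_factory=list)  # (203) per-position residue scores


class TemplateDatabase:
    """Hypothetical container keyed by template identifier."""

    def __init__(self) -> None:
        self._records: Dict[str, TemplateRecord] = {}

    def add(self, record: TemplateRecord) -> None:
        if record.template_id in self._records:
            raise KeyError("duplicate template id: " + record.template_id)
        self._records[record.template_id] = record

    def get(self, template_id: str) -> TemplateRecord:
        return self._records[template_id]

    def __iter__(self) -> Iterator[TemplateRecord]:
        return iter(self._records.values())

    def __len__(self) -> int:
        return len(self._records)
```

Keeping the predicted structure data and profile alongside the sequence means they need to be computed only once per template rather than once per query.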

The pair comparison means (101) expresses numerically, on the basis of sequence alignment, the similarity between the unknown protein and each protein in the template protein database. Pair comparison methods that can be used include, without limitation: BLAST [Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, J Mol Biol, 215, 403-410]; PSI-BLAST [Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 25, 3389-3402]; alignment based on dynamic programming [Needleman, S. B. and Wunsch, C. D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol., 48, 443-453; Smith, T. F. and Waterman, M. S. (1981) Identification of common molecular subsequences, J. Mol. Biol., 147, 195-197]; profile-based alignment [Wallner, B., Fang, H., Ohlson, T., Frey-Skott, J., and Elofsson, A. (2004) Using evolutionary information for the query and target improves fold recognition, Proteins, 54, 342-350]; and threading methods [Shi, J., Blundell, T. L., and Mizuguchi, K. (2001) J. Mol. Biol., 310, 243-257; Sadreyev, R. and Grishin, N. (2003) COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance, J. Mol. Biol., 326, 317-336].
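Of the options listed, dynamic-programming alignment is the simplest to show in full. The sketch below computes a global alignment score in the style of Needleman and Wunsch. The match, mismatch, and linear gap values are illustrative assumptions; a practical pair comparison means would more likely use a substitution matrix and affine gap penalties, which the source does not specify.

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Global alignment score of sequences a and b (linear gap penalty).

    Only the score is returned; two rows of the DP table are kept,
    so memory use is proportional to len(b).
    """
    # Row 0: aligning the empty prefix of a with prefixes of b costs gaps only.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = previous[j] + gap                     # gap in b
            left = current[j - 1] + gap                # gap in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]
```

Running this function once per template yields one pair comparison score per template, which is the input the global comparison step expects.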

The protein similarity network database (104) preferably represents the proteins in the template protein database (102) as nodes and connects the nodes by links on the basis of at least one of the structure and the function of the template proteins, so that template proteins of similar structure or function are neighbors (see FIG. 3).
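One way to picture the network is as an adjacency map built from pairwise structural similarity values. The sketch below links two templates when their similarity reaches a cutoff. Both the use of a Z-score as the similarity value at this stage and the cutoff `z_cutoff=2.0` are assumptions for illustration; the source says only that similar templates should be neighbors.

```python
from typing import Dict, Iterable, Tuple


def build_similarity_network(
    template_ids: Iterable[str],
    z_scores: Dict[Tuple[str, str], float],
    z_cutoff: float = 2.0,  # hypothetical threshold, not taken from the source
) -> Dict[str, Dict[str, float]]:
    """Return {template: {neighbor: similarity}} as an undirected graph."""
    network: Dict[str, Dict[str, float]] = {tid: {} for tid in template_ids}
    for (p, d), z in z_scores.items():
        if p not in network or d not in network:
            raise KeyError("similarity given for an unknown template")
        if p == d or z < z_cutoff:
            continue  # no self-links, and dissimilar pairs are not neighbors
        network[p][d] = z
        network[d][p] = z  # similarity is treated as symmetric
    return network
```

Because the network depends only on the templates, it can be built once and reused for every query.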

The global comparison means (103) performs global comparison using the protein similarity network database (104). Specifically, it computes the global comparison score as a weighted sum, weighted by similarity, of the pair comparison scores of the proteins that are neighbors in the protein similarity network database (104). The weight between neighboring proteins is preferably based on the similarity of their structure or function, with higher similarity receiving a higher weight.

For example, if the global comparison score is computed using the Z-score $Z_{P,D}$, which expresses the structural similarity between template proteins P and D, as the weight between P and D on the protein network, the global comparison score can be expressed by Equation 1 below.

[Equation 1 and its symbols appeared as images in the original publication and are not reproduced in this text.]

In Equation 1, the terms denote, in order:

the global comparison score between the unknown protein Q and the template protein P;

the pair comparison score between the unknown protein Q and the template protein P;

the pair comparison score between the unknown protein Q and the template protein D;

the weight between template proteins P and D on the protein network; and

the Z-score expressing structural similarity, a statistical value derived from the raw structure-comparison score of the proteins.
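The description of the global comparison score as a similarity-weighted sum of neighbors' pair comparison scores can be sketched in code. The form used below is an assumption, not the patent's Equation 1: the global score of template P is taken as its own pair score plus the sum over neighbors D of w(P,D) times the pair score of D, with w(P,D) equal to the Z-score of the link divided by the sum of Z-scores over all neighbors of P. The function and argument names are hypothetical.

```python
from typing import Dict


def global_comparison_score(
    template: str,
    pair_scores: Dict[str, float],
    network: Dict[str, Dict[str, float]],
) -> float:
    """Assumed form: S_pair(Q,P) + sum over neighbors D of w(P,D) * S_pair(Q,D).

    pair_scores maps each template to its pair comparison score against
    the query Q; network maps each template to {neighbor: Z-score}.
    The normalisation of the weights is an illustrative choice.
    """
    neighbors = network.get(template, {})
    total_z = sum(neighbors.values())
    score = pair_scores[template]
    if total_z > 0:
        for neighbor, z in neighbors.items():
            weight = z / total_z  # higher structural similarity, higher weight
            score += weight * pair_scores.get(neighbor, 0.0)
    return score
```

Under this form, a template with a weak direct match to the query can still rank highly if its close structural neighbors match the query well, which is the behaviour the text attributes to global comparison.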


The template selection means (105) selects the template protein most similar to the unknown protein; preferably, the template protein with the highest computed global comparison score is selected.
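Selection by highest global comparison score reduces to ranking. The sketch below returns the best `top_n` templates, since the apparatus is described elsewhere as selecting one or more templates. The function name and the alphabetical tie-breaking rule are assumptions.

```python
from typing import Dict, List


def select_templates(global_scores: Dict[str, float], top_n: int = 1) -> List[str]:
    """Return the top_n template ids ranked by global comparison score.

    Ties are broken alphabetically by template id so that the result
    is deterministic (an illustrative choice).
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    ranked = sorted(global_scores.items(), key=lambda item: (-item[1], item[0]))
    return [template_id for template_id, _ in ranked[:top_n]]
```

With `top_n=1` this gives the single preferred template; a larger value supplies several candidates to the modeling step.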

The modeling means (106) takes the template protein chosen by the template selection means as the template and, using a sequence alignment method and a modeling tool, predicts the three-dimensional structure of the unknown protein to generate a structural model. Dynamic programming, threading, or the like can be used as the sequence alignment method, and programs such as MODELLER and SWISS-MODEL can be used as the modeling tool [Sali, A. and Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints, J Mol Biol, 234, 779-815; Arnold, K., Bordoli, L., Kopp, J., and Schwede, T. (2006) The SWISS-MODEL workspace: a web-based environment for protein structure modelling, Bioinformatics, 22, 195-201].

The model verification means (107) uses a verification tool to check whether the predicted model is a structure that could actually exist stably. Usable verification tools include PROCHECK and WHATCHECK [Laskowski, R. A., MacArthur, M. W., Moss, D. S., and Thornton, J. M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures, J. Appl. Cryst., 26, 283-291; Hooft, R. W. W., Vriend, G., Sander, C., and Abola, E. E. (1996) Errors in protein structures, Nature, 381, 272-272].

The verified tertiary structure model is output through the output means (20) of the computer. The most basic form of output is display on a screen. The tertiary structure can be displayed in real time as the work proceeds, the output can be printed on a printer, and it can be saved in various graphic file formats or in file formats for representing molecules.

The graphic file format is not particularly limited, and bitmap, image, and vector formats are all applicable. The bitmap or image format is preferably, though not necessarily, selected from the group consisting of JPEG, JPG, GIF, TIF, PIC, and BMP, and the vector format is preferably, though not necessarily, selected from the group consisting of CAD, WMF, DWG, CDR, and AI.

Any file format used to store molecular structures is applicable as the file format for representing molecules. The format is not particularly limited, but PDB or mmCIF is preferred.

The present invention also provides a method for predicting the structure of an unknown protein, comprising:

(a) extracting and preprocessing, through the sequence preprocessing means, information from the sequence of the unknown protein entered through the input means (step 1);

(b) comparing, through the pair comparison means, the unknown protein pairwise with each template protein stored in the template protein database to compute the similarity between the proteins (step 2);

(c) computing, through the global comparison means, the global similarity between the unknown protein and the template proteins on the protein similarity network database (step 3);

(d) selecting, through the template selection means, template protein candidates on the basis of the results of steps 2 and 3 (step 4);

(e) modeling, through the modeling means, the structure of the unknown protein on the basis of the selected template protein (step 5); and

(f) verifying the predicted protein structure through the verification means and then outputting the predicted protein structure through the output means (step 6).
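Steps 1 to 6 form a linear pipeline in which each step consumes the output of the previous ones. The sketch below shows only that control flow; every stage is passed in as a callable, and all parameter names are hypothetical. The source does not say what happens when verification fails, so the sketch simply returns the verification result alongside the model.

```python
from typing import Any, Callable, Dict, List, Tuple


def predict_structure(
    sequence: str,
    preprocess: Callable[[str], Any],
    pair_compare: Callable[[Any], Dict[str, float]],
    global_compare: Callable[[Dict[str, float]], Dict[str, float]],
    select_templates: Callable[[Dict[str, float], Dict[str, float]], List[str]],
    build_model: Callable[[Any, List[str]], Any],
    validate: Callable[[Any], bool],
) -> Tuple[Any, bool]:
    """Run steps 1 to 6 in order and return (model, passed_verification)."""
    features = preprocess(sequence)                          # step 1: extract and preprocess
    pair_scores = pair_compare(features)                     # step 2: pair comparison scores
    global_scores = global_compare(pair_scores)              # step 3: global comparison scores
    templates = select_templates(pair_scores, global_scores) # step 4: choose template candidates
    model = build_model(features, templates)                 # step 5: build the structural model
    passed = validate(model)                                 # step 6: verify before output
    return model, passed
```

Injecting the stages as callables mirrors the way the description allows each means to be realised by different external programs.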

이하, 도 4를 참조하여 본 발명을 상세히 설명한다.Hereinafter, the present invention will be described in detail with reference to FIG. 4 .

도 4는 본 발명의 일실시형태에 따른 미지 단백질의 구조를 예측하기 위한 방법의 흐름도이다. 4 is a flow chart of a method for predicting the structure of an unknown protein according to one embodiment of the invention.

도 4에 나타낸 바와 같이, 단계 1(400)은 입력수단을 통해 미지 단백질 서열이 입력되면 입력된 미지 단백질의 서열로부터 정보를 추출하고 전처리하는 단계이다. 이때, 상기 정보는 2차 구조, 용매 노출도, 프로파일 등이 있으며, 상기 미지 단백질의 서열로부터 2차 구조의 정보를 추출하는 데에는 Psipred 등의 프로그램을 사용할 수 있고[Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices, J Mol Biol, 292, 195-202], 용매 노출도 예측에는 인공신경망을 이용한 방법 등을 이용할 수 있으며[Ahmad, S. and Gromiha, M. M. (2002) NETASA: neural network based prediction of solvent accessibility, Bioinformatics, 18, 819-824], 프로파일 생성에는 PSI-BLAST 등의 프로그램을 사용할 수 있다[Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 25, 3389-3402].As shown in FIG . 4 , step 1 400 is a step of extracting and preprocessing information from an input unknown protein sequence when an unknown protein sequence is input through an input means. At this time, the information includes secondary structure, solvent exposure degree, profile, and the like, and a program such as Psipred can be used to extract secondary structure information from the sequence of the unknown protein [Jones, DT (1999) Protein secondary structure prediction based on position-specific scoring matrices, J Mol Biol , 292, 195-202], and using neural networks to predict solvent exposure [Ahmad, S. and Gromiha, MM (2002) NETASA: neural network based prediction of solvent accessibility, Bioinformatics , 18, 819-824], and programs such as PSI-BLAST can be used for profile generation [Altschul, SF, Madden, TL, Schaffer, AA, Zhang, J., Zhang, Z., Miller, W. and Lipman, DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res , 25, 3389-3402.

Next, step 2 (401) takes the information produced by the preprocessing as input and, through the pair comparison means, compares the unknown protein pairwise with each template protein stored in the template protein database (407) to calculate the similarity between each pair of proteins.

In this step, the similarity between the unknown protein and each protein in the template protein database is expressed as a numerical value. The similarity is preferably calculated on the basis of at least one of the sequence and the profile of the unknown protein. Examples of similarity calculation methods include, but are not limited to, BLAST, PSI-BLAST, alignment based on dynamic programming, profile-based alignment, and threading [Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, J Mol Biol, 215, 403-410; Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 25, 3389-3402; Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins, J Mol Biol, 48, 443-453; Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences, J Mol Biol, 147, 195-197; Wallner, B., Fang, H., Ohlson, T., Frey-Skott, J. and Elofsson, A. (2004) Using evolutionary information for the query and target improves fold recognition, Proteins, 54, 342-350; Shi, J., Blundell, T.L. and Mizuguchi, K. (2001) J Mol Biol, 310, 243-257; Sadreyev, R. and Grishin, N. (2003) COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance, J Mol Biol, 326, 317-336].
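As one illustration of the dynamic-programming alignment named among the similarity calculation methods, the following is a minimal sketch of a Needleman-Wunsch global alignment score. The function name and the scoring parameters (match, mismatch, and a linear gap penalty) are assumptions chosen for brevity; the specification does not fix a scoring scheme, and a practical implementation would more likely use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score of sequences a and b by dynamic programming.

    The scoring values are illustrative assumptions, not values taken
    from the specification.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Or place a gap in one of the two sequences.
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]
```

The returned score could serve as the numerical pairwise similarity between the unknown protein and one template; higher values indicate more similar sequences under the assumed scoring.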

Next, step 3 (402) calculates, through the global comparison means, the global similarity between the unknown protein and the template proteins on the protein similarity network database (408).

In this step, the global comparison is performed using the protein similarity network database (408). Specifically, the global comparison score is calculated as a weighted sum of the pair comparison results between the template proteins and the unknown protein, with weights set according to the similarity between neighboring proteins along the shortest links connecting them in the protein similarity network database (408). The weight between neighboring proteins is preferably based on the similarity of their structure or function, with higher similarity receiving a higher weight.
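The weighted sum described above can be sketched as follows. This is one possible reading, not the definitive formula: the function name, the dictionary layout of the network, the restriction to directly linked neighbors, and the inclusion of the template's own pair score with a weight of 1 are all assumptions made for illustration.

```python
from typing import Dict


def global_scores(pair_score: Dict[str, float],
                  links: Dict[str, Dict[str, float]],
                  self_weight: float = 1.0) -> Dict[str, float]:
    """Global comparison score for every template.

    pair_score[t]  : pairwise similarity between the unknown protein and template t
    links[t][n]    : weight of the link between templates t and n
                     (for example a structural-similarity Z-score)
    self_weight    : weight given to the template's own pair score (assumption)
    """
    scores = {}
    for template, own in pair_score.items():
        total = self_weight * own
        # Add each neighbor's pair score, weighted by the link to that neighbor.
        for neighbor, weight in links.get(template, {}).items():
            total += weight * pair_score.get(neighbor, 0.0)
        scores[template] = total
    return scores
```

Under this sketch, a template whose structurally similar neighbors also score well against the unknown protein is promoted, even if its own pairwise score is modest.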

Next, step 4 (403) selects template protein candidates, through the template selection means, based on the results of steps 2 and 3.

In this step, the template protein is selected by considering both the pair comparison similarity and the global comparison similarity; preferably, a template protein with a high global comparison similarity is selected.
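A minimal sketch of such a selection rule is given below. Ranking primarily by global comparison score follows the stated preference; using the pair comparison score and then the template name to break ties is an assumption, as are the function and parameter names.

```python
from typing import Dict, List


def select_templates(global_score: Dict[str, float],
                     pair_score: Dict[str, float],
                     k: int = 1) -> List[str]:
    """Return the k best template identifiers.

    Templates are ranked by global score (descending); ties are broken by
    pair score (descending) and then by name, which is an assumed convention.
    """
    ranked = sorted(
        global_score,
        key=lambda t: (-global_score[t], -pair_score.get(t, 0.0), t),
    )
    return ranked[:k]
```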

Next, step 5 (404) models the structure of the unknown protein, through the modeling means, based on the selected template protein.

In this step, a structural model is generated by predicting the three-dimensional structure of the unknown protein using a sequence alignment method and a modeling tool, with the selected template protein as the template. Dynamic programming or threading can be used as the sequence alignment method, and programs such as MODELLER or SWISS-MODEL can be used as the modeling tool [Sali, A. and Blundell, T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints, J Mol Biol, 234, 779-815; Arnold, K., Bordoli, L., Kopp, J. and Schwede, T. (2006) The SWISS-MODEL workspace: a web-based environment for protein structure modelling, Bioinformatics, 22, 195-201].

Next, step 6 (405) verifies the predicted protein structure through the verification means and then outputs it through the output means.

In this step, a verification tool is used to check whether the predicted model is a structure that can actually exist stably. Available verification tools include PROCHECK and WHAT_CHECK [Laskowski, R.A., MacArthur, M.W., Moss, D.S. and Thornton, J.M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures, J. Appl. Cryst., 26, 283-291; Hooft, R.W.W., Vriend, G., Sander, C. and Abola, E.E. (1996) Errors in protein structures, Nature, 381, 272-272]. The final result that passes verification is output through the output means as the predicted tertiary structure of the unknown protein (406).
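The six steps of FIG. 4 can be summarized as a small orchestration sketch. Everything here is hypothetical scaffolding: the class name, the field names, and the callable signatures are not from the specification, and each callable stands in for an external tool (for example a profile generator, a modeling program, or a stereochemistry checker). Falling back to the next candidate template when verification fails is also an assumption; the text only states that the result passing verification is output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Pipeline:
    preprocess: Callable[[str], dict]                                   # step 1 (400)
    pair_compare: Callable[[dict], Dict[str, float]]                    # step 2 (401)
    global_compare: Callable[[Dict[str, float]], Dict[str, float]]      # step 3 (402)
    select: Callable[[Dict[str, float], Dict[str, float]], List[str]]   # step 4 (403)
    build_model: Callable[[dict, str], object]                          # step 5 (404)
    verify: Callable[[object], bool]                                    # step 6 (405)

    def predict(self, sequence: str) -> Optional[object]:
        """Run steps 1 to 6 and return the first model that passes verification."""
        features = self.preprocess(sequence)
        pair = self.pair_compare(features)
        glob = self.global_compare(pair)
        for template in self.select(glob, pair):
            model = self.build_model(features, template)
            if self.verify(model):
                return model  # output (406)
        return None  # no candidate produced a verified model
```

Keeping each step behind a callable mirrors the separate "means" of the apparatus and lets any one tool be swapped without touching the rest of the flow.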

The present invention can also be embodied as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device that stores data readable by a computer system, for example ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices, and also includes implementations in the form of a carrier wave (for example, transmission over the Internet). The computer-readable recording medium can also be distributed over network-connected computer systems so that the computer-readable code is stored and executed in a distributed fashion.

Hereinafter, the present invention will be described in more detail by way of examples. The following examples merely illustrate the present invention, and the scope of the present invention is not limited by them.

<Example 1>

The following experiment was performed to assess the accuracy of the apparatus for predicting the tertiary structure of an unknown protein according to the present invention.

An apparatus for predicting the tertiary structure of an unknown protein was built as shown in FIG. 1, and its performance was measured alongside that of conventional protein tertiary structure prediction programs. Performance is reported as ROC₅₀. A result is counted as a true positive if it places the unknown protein in the correct family and as a false positive if it places it in a different family; ROC₅₀ is the area under the curve of true positives plotted against false positives up to the point where 50 false positives have accumulated. The ROC₅₀ values of the conventional programs and of the apparatus according to the present invention are plotted as points in FIG. 5 and FIG. 6.
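The ROC₅₀ measure can be sketched as below. The text defines it only as the area accumulated up to 50 false positives; dividing by n times the total number of true positives so that the value lies between 0 and 1 is the commonly used normalization and is an assumption here, as is the treatment of ranked lists that end before n false positives are reached. The function name and input format are also hypothetical.

```python
from typing import Sequence


def roc_n(ranked_labels: Sequence[bool], n: int = 50) -> float:
    """Normalized ROC_n score for one query.

    ranked_labels lists the hits in decreasing score order;
    True marks a hit in the correct family (true positive),
    False marks a hit in another family (false positive).
    """
    total_tp = sum(ranked_labels)
    if total_tp == 0:
        return 0.0
    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_tp in ranked_labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            # Each false positive contributes the true positives ranked above it.
            area += tp_seen
            if fp_seen == n:
                break
    # Assumption: if the list ends before n false positives, the missing
    # false positives are treated as ranked below every true positive.
    area += (n - fp_seen) * total_tp
    return area / (n * total_tp)
```

A value of 1 means every true positive is ranked above the first false positive; a value near 0 means false positives dominate the top of the list.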

FIG. 5 is a graph comparing the performance of PSI-BLAST and RankProp, both conventional programs, and FIG. 6 is a graph comparing the performance of PSI-BLAST and the method according to the present invention. As shown in FIG. 5, the ROC₅₀ values of PSI-BLAST and RankProp are distributed roughly along a straight line, but the distribution is shifted slightly toward RankProp, which uses a protein network, relative to PSI-BLAST, which does not.

However, as shown in FIG. 6, the ROC₅₀ values of the method according to the present invention are mostly distributed between 0.8 and 1, regardless of the ROC₅₀ value of PSI-BLAST. This confirms that the prediction method according to the present invention selects templates with high accuracy.

<Example 2>

For the 1bf2_2 domain (database: SCOP b.71.1.1), whose actual structure is known, a template was selected and modeling was performed by the tertiary structure prediction method of FIG. 4, using the apparatus built in Example 1, to predict the structure of the 1bf2_2 domain.

Profile-profile alignment was used as the sequence alignment method, and MODELLER was used as the modeling tool to generate the model. PROCHECK was used to evaluate the generated model.

<Comparative Example 1>

The tertiary structure of the 1bf2_2 domain was predicted using the PSI-BLAST program, a conventional method.

<Analysis>

(1) Comparison of template selection

With the method according to the present invention, 1m53a1, a protein in the same database family (b.71.1.1), was selected as the best template. With the conventional method, PSI-BLAST, the best template selected was 1epfa1, a protein belonging to a family (b.1.1.4) different from that of 1bf2_2 (b.71.1.1). Comparing the ten highest-ranked templates in the list of templates scored by structural similarity, the method according to the present invention selected eight templates from the same database family as 1bf2_2 (b.71.1.1), whereas the conventional method selected four templates from that family and the remaining six from other folds. This confirms that the prediction method according to the present invention selects templates with high accuracy.

(2) Comparison of the predicted model and the actual structure

Structures were modeled from 1m53a1 and 1epfa1, the best templates selected by the method according to the present invention (Example 1) and by the conventional method, PSI-BLAST, respectively. To assess how closely each final model resembles the actual structure of 1bf2_2, the model was compared with the actual structure using the structure alignment program CE. CE reports structural similarity as a Z-score; a value of 3.7 or higher can be taken to indicate biologically meaningful similarity. The results are shown in Table 1.

Method      Template   Z-score
Example 1   1m53a1     3.9
PSI-BLAST   1epfa1     2.0

As shown in Table 1, the model predicted by the method according to the present invention has a Z-score of 3.9 and therefore shows meaningful similarity to the actual structure, whereas the model generated by the conventional method, PSI-BLAST, has a Z-score of 2.0 and therefore differs from the actual structure.

The predicted models, superimposed on the actual structure of 1bf2_2, are shown in FIG. 7 and FIG. 8.

FIG. 7 shows the actual structure of 1bf2_2 (solid line) superimposed on the structure (thick line) modeled from 1epfa1, the best template selected by PSI-BLAST. As shown in FIG. 7, the two structures differ from each other.

FIG. 8 shows the actual structure of 1bf2_2 (solid line) superimposed on the structure (thick line) modeled from 1m53a1, the best template selected by the method according to the present invention. As shown in FIG. 8, the secondary structure elements (α-helices, β-sheets, etc.) of the actual 1bf2_2 structure and of the modeled structure are similar.

Therefore, the prediction method according to the present invention can predict the tertiary structure of a protein more accurately and precisely than conventional protein structure prediction methods.

As described above, the present invention predicts the tertiary structure of a protein more accurately and precisely than conventional protein structure prediction methods, and thereby reduces the cost and time spent on determining protein structures experimentally. It can therefore be used to advantage in research on the treatment of various diseases.

Claims

1. An apparatus for predicting the tertiary structure of an unknown protein, comprising:

input means;

sequence preprocessing means for preprocessing the unknown protein sequence input through the input means;

a template protein database in which information on template proteins is stored;

pair comparison means for comparing the unknown protein with the template proteins based on the template protein database;

a protein similarity network database constructed from, and storing, the structural similarities between the template proteins;

global comparison means for obtaining a global comparison score by taking a weighted sum of the pair comparison scores of neighboring proteins in the protein similarity network database, weighted according to structural similarity;

template selection means for selecting one or more template proteins having the largest global comparison scores based on the results of the pair comparison means and the global comparison means;

modeling means for predicting the three-dimensional structure of the unknown protein using sequence alignment software and protein structure modeling software, with the protein selected by the template selection means as a template;

verification means for examining, using protein structure analysis software, whether the predicted protein structure is a structure that can exist stably; and

output means for outputting the predicted protein structure.

2. The apparatus for predicting the tertiary structure of an unknown protein according to claim 1, wherein the preprocessing means extracts at least one item of information selected from the group consisting of secondary structure, solvent accessibility, and profile from the input unknown protein sequence.

3. The apparatus for predicting the tertiary structure of an unknown protein according to claim 1, wherein the template protein database stores at least one item of information selected from the group consisting of the sequence of a template protein, its actual protein structure, a structure predicted from the sequence, and a profile.

4. The apparatus for predicting the tertiary structure of an unknown protein according to claim 1, wherein the protein similarity network database has each template protein as a node and forms links based on at least one of structure and function between the template proteins.

5. The apparatus for predicting the tertiary structure of an unknown protein according to claim 1, wherein, in the global comparison means, the weights between neighboring proteins in the protein similarity network database are Z-scores (Z_{P,D}) indicating structural similarity.

6. A method for predicting the structure of an unknown protein, comprising:

(a) extracting and preprocessing, through sequence preprocessing means, information from the sequence of the unknown protein input through the input means (step 1);

(b) comparing the unknown protein pairwise, through pair comparison means, with each template protein stored in the template protein database to calculate the similarity between each pair of proteins (step 2);

(c) calculating, through global comparison means, a global comparison score of the unknown protein and the template protein on the protein similarity network database (step 3);

(d) selecting, through template selection means, one or more template protein candidates with the highest global comparison scores based on the results of steps 2 and 3 (step 4);

(e) modeling, through modeling means, the structure of the unknown protein using sequence alignment software and protein structure modeling software, with the selected template protein as a template (step 5); and

(f) verifying the predicted protein structure through verification means and outputting the predicted protein structure through output means (step 6).

7. The method for predicting the structure of an unknown protein according to claim 6, wherein the preprocessing of step 1 comprises extracting at least one item of information selected from the group consisting of secondary structure, solvent accessibility, and profile from the unknown protein sequence input through the input means.

8. The method for predicting the structure of an unknown protein according to claim 6, wherein the similarity between the template protein and the unknown protein in step 2 is calculated based on at least one of the sequence and the profile of the unknown protein.

9. The method for predicting the structure of an unknown protein according to claim 6, wherein the global comparison score of the unknown protein and the template protein in step 3 is calculated by weighting the pair comparison results between the template proteins and the unknown protein according to the structural similarity between neighboring proteins on the protein similarity network database.

10. The method for predicting the structure of an unknown protein according to claim 9, wherein the weights between the template proteins on the protein similarity network database are Z-scores (Z_{P,D}) indicating structural similarity.

11. A computer-readable recording medium for executing the method of any one of claims 6 to 10 on a computer.

12. A recording medium on which is recorded a data structure storing a template protein similarity network database, the database consisting of nodes, each being a protein in the template protein database, and links numerically representing the similarity, based on at least one of structure and function, between the template proteins corresponding to the nodes.