KR20040036691A

KR20040036691A - High performance sequence searching system and method for dna and protein in distributed computing environment

Info

Publication number: KR20040036691A
Application number: KR1020040010964A
Authority: KR
Inventors: 이관수; 김병진; 선충현
Original assignee: 학교법인 한국정보통신학원
Priority date: 2003-12-23
Filing date: 2004-02-19
Publication date: 2004-04-30
Also published as: KR100538451B1

Abstract

PURPOSE: A system and a method for searching for similar gene and protein sequences in a distributed computing environment are provided. They apply a dynamic programming algorithm (hereinafter, dynamic algorithm) to sequence search by means of distributed computing, and parallelize both the sequence search and the statistical analysis of its results to suit a distributed computing grid environment. CONSTITUTION: A terminal controller (100) presents a query sequence, which is the reference for the similarity search. Terminals (200) connect to the terminal controller and perform the similar sequence search for the query sequence. A database stores a plurality of gene and protein sequence files and is divided into a plurality of database fragments (300) by a sequence search parallelization method based on the dynamic algorithm. The database fragments are allotted to the terminals.

Description

HIGH PERFORMANCE SEQUENCE SEARCHING SYSTEM AND METHOD FOR DNA AND PROTEIN IN DISTRIBUTED COMPUTING ENVIRONMENT

The present invention relates to a system for searching for similar sequences of genes and proteins.

To search for similar sequences, the similarity of a pair of sequences must be calculated, which first requires aligning the two sequences. Conventional sequence alignment techniques include the dynamic programming algorithm (hereinafter referred to as the dynamic algorithm) and the heuristic algorithm.

Examples of dynamic algorithms include the global alignment technique published by Needleman and Wunsch in 1970, the local alignment technique of Smith and Waterman published in 1981, and many similar techniques derived from them.

The dynamic algorithm is an optimization technique for sequence alignment: it finds the optimal alignment by examining every possible alignment. After the Smith-Waterman dynamic algorithm, Gotoh modified it and proposed a formulation with improved processing speed. In addition, in 1994 Chao applied a linear space algorithm to optimize memory consumption when implementing the dynamic algorithm.

The dynamic algorithm yields the optimal solution for sequence alignment, but computation time limits the length of the sequence pairs it can handle, because the cost of computing the alignment and similarity is proportional to the square of the sequence length. If the dynamic algorithm were used for similar sequence search, calculating the similarity against every sequence in the database would take an enormous amount of time. For this reason, heuristic algorithms rather than dynamic algorithms are used for sequence search.

Representative heuristic algorithms are FASTA, developed by Wilbur, Lipman, and others in 1983, and BLAST, developed by Altschul and many others since 1990.

FASTA and BLAST were developed for sequence search rather than sequence alignment, and are widely used as the most practical method of similar sequence search. A heuristic algorithm finds a similar portion of the sequences and completes the alignment starting from that portion. By repeating this process and applying statistical techniques, it presents the alignment pair most likely to be optimal as the optimal alignment pair.

However, heuristic algorithms such as BLAST and FASTA are less accurate than the dynamic algorithm. To obtain more accurate research results, the dynamic algorithm therefore needs to be applied to sequence search. The dynamic algorithm, however, is very slow, which makes it difficult to use for today's large-scale sequence analysis, and meeting this demand requires a supercomputer.

An object of the present invention is to provide a system and method that apply the dynamic algorithm to sequence search by means of distributed computing.

A further object of the present invention is to provide a system and method that parallelize the dynamic-algorithm-based sequence search and the statistical analysis of its results so that they suit a distributed computing grid environment.

FIG. 1 is a diagram showing the configuration of a similar sequence search system according to an embodiment of the present invention.

FIG. 2 is a flowchart of the tasks performed in a distributed terminal device according to an embodiment of the present invention.

FIG. 3 is a diagram showing a GUI screen of the similar sequence search system according to an embodiment of the present invention.

FIG. 4 is a diagram showing the results of the similar sequence search and statistical analysis according to an embodiment of the present invention.

To solve these problems, a gene and protein sequence search system according to a feature of the present invention includes: a terminal control device that presents a query sequence; a plurality of terminal devices that are connected to the terminal control device and perform a similar sequence search for the query sequence using a dynamic programming algorithm; and a database that stores a plurality of gene and protein sequence files and is divided into a plurality of parts to be searched by the terminal devices.

The terminal devices are connected to the terminal control device by a cluster-based or grid-based parallel algorithm.

The database is divided into a number of database fragments greater than or equal to the number of terminal devices.

In addition, each of the plurality of terminal devices searches a database fragment covering a different range.

The plurality of terminal devices perform the similar sequence search simultaneously.

A gene and protein sequence search method according to a feature of the present invention searches for gene and protein sequences in a distributed computing environment that includes a plurality of terminal devices and a database divided into a plurality of database fragments, and includes:

a) comparing the query sequence against every sequence in the selected database fragment using a dynamic programming algorithm and calculating the similarity; b) performing a statistical analysis using the calculated similarity; and c) sorting the results of the statistical analysis in descending order of similarity and storing them, together with the list of similar sequences found, in a predetermined directory.

Step b) includes:

i) obtaining the mean and standard deviation of the homology scores between the query sequence and the sequences stored in the database; ii) obtaining the parameters of the Gumbel distribution; iii) obtaining a z score that standardizes the homology score; iv) using the parameters to obtain a p value, which is the probability that a sequence having a score greater than or equal to the z score is retrieved from the entire database; and v) using the p value to obtain an e value, which is the probability that a sequence having a score equal to the z score is retrieved from the entire database.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described here. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly. Like parts are designated by like reference numerals throughout the specification.

The distributed computing environment to which the dynamic algorithm according to the present invention is applied can be built with a PC cluster technique, which parallelizes several PCs joined by a physical network, or with grid technology, which parallelizes personal computers (or supercomputers or clusters) distributed across the Internet. Because these parallelization techniques are already known, a detailed description is omitted.

The dynamic algorithm for sequence alignment is difficult to parallelize because the alignment of a single sequence pair is computed sequentially. Moreover, the computers would have to exchange information during processing, and the resulting loss of time is very large. The present invention therefore parallelizes the sequence search by assigning a different search range to each computer.

FIG. 1 shows a distributed computing environment to which the dynamic-algorithm-based similar sequence search method according to an embodiment of the present invention is applied.

As shown in FIG. 1, the distributed computing environment according to an embodiment of the present invention includes a terminal control device (100), terminal devices (200), and database fragments (300).

The terminal control device (100) presents the query sequence that serves as the reference for the similarity search, and the terminal devices (200) are connected to the terminal control device and perform the similar sequence search for the query sequence. The database stores a plurality of gene and protein sequence files and is divided into a plurality of database fragments (300) by the parallelization technique for dynamic-algorithm-based sequence search; each database fragment (300) is assigned to a terminal device (200).

That is, as shown in FIG. 1, the database of the system according to an embodiment of the present invention is divided into a number of database fragments (300) greater than or equal to the number of terminal devices (200) participating in the search.

When the terminal control device (100) provides a query sequence, each distributed terminal device (200) is assigned the query sequence and the range of the database it must search. Each distributed terminal device (200) reads the database fragments (300) that belong to its search range and performs the sequence search. The distributed terminal devices (200) process the sequence search simultaneously and exchange no messages with one another. The statistical processing is performed in the same way.
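The partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `split_database` and `assign_fragments`, the contiguous split, and the round-robin assignment are all hypothetical choices. The description only requires that there be at least as many fragments as terminals and that no fragment be searched twice.

```python
from typing import List, Sequence


def split_database(records: Sequence[str], n_fragments: int) -> List[List[str]]:
    """Split records into n_fragments contiguous, non-overlapping fragments."""
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    base, extra = divmod(len(records), n_fragments)
    fragments, start = [], 0
    for i in range(n_fragments):
        # The first `extra` fragments take one additional record each.
        size = base + (1 if i < extra else 0)
        fragments.append(list(records[start:start + size]))
        start += size
    return fragments


def assign_fragments(n_fragments: int, n_terminals: int) -> List[List[int]]:
    """Assign fragment indices to terminals round-robin (assumed policy).

    Every fragment index appears exactly once, so no fragment is searched twice.
    """
    if n_terminals < 1 or n_fragments < n_terminals:
        raise ValueError("need at least as many fragments as terminals")
    return [list(range(t, n_fragments, n_terminals)) for t in range(n_terminals)]
```

With five fragments and two terminals, for example, the first terminal would search fragments 0, 2, and 4 and the second would search fragments 1 and 3, with no messages needed between them.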

The divided database files are read over the physical network. A database fragment that has been read once is not read again, either by the computer that read it or by any other computer.

The similar sequence search according to an embodiment of the present invention runs as a single program executed simultaneously on each distributed terminal device (200), but the instances are independent of one another and do not depend on one another's timing.

Each computer aligns the given query sequence pairwise with each sequence of its selected database fragment and calculates the similarity. In this embodiment, the alignment and similarity calculation combine Gotoh's algorithm, which is faster than the Smith-Waterman algorithm, with the linear space algorithm, so that the work is done as quickly as possible and with minimal memory. Because the technique of combining Gotoh's algorithm with the linear space algorithm is already known, its description is omitted.
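As a rough illustration of the kind of computation each terminal performs, the sketch below scores a local alignment with affine gap penalties (Gotoh's recurrences) while keeping only one previous row in memory. It is an assumption-laden sketch: the function name, the scoring values, and the simple match/mismatch scheme are hypothetical, and it returns only the score. Recovering the alignment itself in linear space needs a divide-and-conquer scheme that is not shown here.

```python
def gotoh_local_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score with affine gaps, using O(len(b)) memory.

    A gap of length k costs gap_open + (k - 1) * gap_extend (assumed convention).
    """
    neg_inf = float("-inf")
    n = len(b)
    h_prev = [0] * (n + 1)   # H[i-1][j]: best score ending at (i-1, j)
    f = [neg_inf] * (n + 1)  # F[i-1][j]: best score ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (n + 1)
        e = neg_inf          # E[i][j-1]: best score ending in a horizontal gap
        for j in range(1, n + 1):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f[j] = max(h_prev[j] + gap_open, f[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            h = max(0, h_prev[j - 1] + s, e, f[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev = h_cur
    return best
```

The nested loops make the quadratic cost per sequence pair explicit, which is why the search is parallelized across database fragments rather than inside a single alignment.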

Next, the similar sequence search operation of the similar sequence search system according to an embodiment of the present invention is described in detail with reference to FIG. 2.

FIG. 2 is a flowchart of the tasks performed in each distributed terminal device (200).

As shown in FIG. 2, once a query sequence is given, each distributed terminal device (200) calculates the similarity between the query sequence and every sequence contained in its assigned database fragment (300) (S200). It then performs the statistical analysis using the calculated similarities (S201).

When the similarity calculation and statistical analysis are finished, the terminal device checks whether a sequence file and a list file are already stored in the predetermined directory (S202). If they are not, it sorts the results in descending order of similarity and stores the similar sequences found and the list in the predetermined directory (S203).

If the check in step S202 shows that a sequence file and a list file are already stored in the predetermined directory, the terminal device checks whether the file is locked (S204). If the file is locked, it concludes that another computer is updating the file, waits for a fixed time (S205), and returns to the step of checking whether the file is locked (S204).

If the file is not locked in step S204, the terminal device first locks the file so that no other computer can open it during the update (S206), then opens the file, merges it with the newly created list, and re-sorts it, thereby updating the sequence file and the list file (S207). After updating the file, it releases the lock (S208).
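The lock, merge, and unlock flow of steps S202 to S208 can be sketched as below. This is a hedged illustration, not the patent's code: the function name `merge_results`, the JSON file layout, the `hits.json` file name, and the use of an exclusive-create lock file are all assumptions. The sketch also takes the lock before checking whether the file exists, a small deviation from the flowchart order that avoids a race when two terminals create the first file at the same time.

```python
import json
import os
import time


def merge_results(directory, new_hits, poll_seconds=0.1, filename="hits.json"):
    """Merge (score, sequence_id) hits into a shared list sorted by score."""
    path = os.path.join(directory, filename)
    lock = path + ".lock"
    while True:
        try:
            # S206: creating the lock file fails if another terminal holds it.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll_seconds)  # S204/S205: wait, then check again
    try:
        hits = []
        if os.path.exists(path):  # S202: a result file already exists
            with open(path) as fh:
                hits = [tuple(h) for h in json.load(fh)]
        hits.extend(new_hits)
        hits.sort(key=lambda h: h[0], reverse=True)  # S203/S207: highest first
        with open(path, "w") as fh:
            json.dump(hits, fh)
    finally:
        os.close(fd)
        os.remove(lock)  # S208: release the lock
    return hits
```

Whether exclusive file creation is atomic on a given network file system depends on that file system, so a real deployment would need to verify the locking primitive it relies on.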

In this way, the distributed terminal devices (200) receive the same query sequence from the terminal control device (100) and carry out the same task simultaneously; only the part of the database they refer to differs. The work on each computer is independent of the others, both in content and in timing. Consequently, as the number of participating computers increases, the speed of the task increases in proportion.

Next, the statistical analysis technique for the similar sequence search according to an embodiment of the present invention is described in detail.

The homology scores between a given query sequence and all sequences in the database follow a Poisson distribution, and in particular this distribution follows the Gumbel positive extreme value distribution. The homology score is given by Equation 1 below.
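The formula for Equation 1 is not reproduced in this text. For orientation only, the textbook density of the Gumbel extreme value distribution, written with the λ and μ used in this description, is given below; that it coincides with the patent's Equation 1 is an assumption.

```latex
f(x) = \lambda \, e^{-\lambda (x - \mu)} \exp\!\left(-e^{-\lambda (x - \mu)}\right)
```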

Here, λ and μ are the parameters that determine the scale and the location of the distribution curve, respectively. To determine these values, the mean (x_mean) and the standard deviation (σ) of the homology scores are obtained. The mean and standard deviation are calculated by Equation 2 below.

In a distributed computing environment, however, each node holds only the homology scores obtained from its own database fragment (300), and the overall mean cannot be known until every computer has finished. Storing every calculated homology score just to obtain the mean would waste memory, so to save memory the expression for the standard deviation is rewritten as in Equation 3 below.
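The formulas for Equations 2 and 3 are not reproduced in this text. A standard form consistent with the description is sketched below, where n is the number of homology scores, x_i is the i-th score, and \bar{x} corresponds to x_mean. The population form (division by n rather than n − 1) is an assumption. The right-hand expression needs only the running sum and the running sum of squares, which is what allows each node to avoid storing individual scores.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^{2}}
       = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^{2} - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^{2}}
```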

Each time an alignment finishes, a computer adds the homology score to a running sum and its square to a running sum of squares. When it has finished searching every sequence in its part of the database, it computes the mean and standard deviation and stores the values in the designated directory. If a file already exists in that directory, the computer opens it, accumulates its own values into the stored ones, computes the new mean and standard deviation, and updates the file. In this way the mean and standard deviation are updated every time a computer completes its calculation.
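The accumulation described above can be sketched as follows. The class name `ScoreStats` and its fields are hypothetical, and the population form of the standard deviation is an assumption, since the patent's formula is not reproduced in this text.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoreStats:
    """Running totals kept by one terminal (hypothetical structure)."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, score: float) -> None:
        """Fold one homology score into the running totals."""
        self.n += 1
        self.total += score
        self.total_sq += score * score

    def merge(self, other: "ScoreStats") -> "ScoreStats":
        """Combine totals from two terminals; no individual scores are needed."""
        return ScoreStats(self.n + other.n,
                          self.total + other.total,
                          self.total_sq + other.total_sq)

    def mean(self) -> float:
        return self.total / self.n

    def std(self) -> float:
        # Population standard deviation from the sum and the sum of squares.
        variance = self.total_sq / self.n - self.mean() ** 2
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero
```

Because only three numbers are stored per terminal, merging is order-independent: whichever terminal finishes next can fold its totals into the shared file.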

To standardize the homology score, the z score is obtained as in Equation 4 below.

Given the score between the query sequence and an arbitrary sequence in the database, the probability that a score equal to or greater than that score is retrieved from the entire database is expressed as the p value, which is obtained from the following equation.

The Gumbel distribution parameters λ and μ required here are calculated by the following formula.

In addition, the number of sequences with this score that are expected to appear in the entire database is expressed as the e value, which is obtained from the following equation.

Here, D is the number of sequences contained in the database.
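The formulas for the z score, the Gumbel parameters, the p value, and the e value are not reproduced in this text. The sketch below uses the textbook forms: the method-of-moments Gumbel fit, the Gumbel upper-tail probability, and the expected count D × p. Whether these match the patent's equations exactly, and whether the patent evaluates the tail on the raw score or on the z score, are assumptions; all function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def z_score(x, mean, std):
    """Standardize a homology score."""
    return (x - mean) / std


def gumbel_params(mean, std):
    """Method-of-moments estimates of the Gumbel scale (lam) and location (mu)."""
    lam = math.pi / (std * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam
    return lam, mu


def p_value(x, lam, mu):
    """P(score >= x) under a Gumbel distribution: 1 - exp(-exp(-lam * (x - mu)))."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


def e_value(p, db_size):
    """Expected number of database sequences reaching the score (D * p)."""
    return db_size * p
```

To work on the standardized scale, one would call `gumbel_params(0.0, 1.0)` and pass z scores to `p_value`; to work on the raw scale, one would pass the measured mean and standard deviation and the raw score instead.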

FIG. 3 shows a GUI screen of the similar sequence search system according to an embodiment of the present invention.

As shown in FIG. 3, the user first enters the sequence to be searched and selects a database. In FIG. 3, "Requirements" and "Rank" indicate the minimum and maximum required specifications of the computers participating in the task, and the user can change these values. If an e-mail address is entered under "E-mail", the results of the task are sent to that address.

FIG. 4 shows the results of the similar sequence search and statistical analysis according to an embodiment of the present invention, obtained by searching with query sequences of various lengths, from 100 to 5,000 characters, on one computer and on eight distributed computers.

As shown in FIG. 4, the search speed improves by a factor of about 8 on average, varying slightly with the length of the sequence.

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to them, and various other changes and modifications are possible.

As described above, the dynamic algorithm offers optimal sequence alignment and similarity calculation, but its very long computation time makes it almost unusable for gene and protein sequence search on a single computer. By implementing it in a distributed computing environment such as a PC cluster or a grid, the present invention makes it practical to use. In addition, dynamic-algorithm-based similar sequence search becomes possible without a high-performance computer.

Claims

A gene and protein sequence search system comprising:

a terminal control device that presents a query sequence;

a plurality of terminal devices connected to the terminal control device, each performing a similar sequence search for the query sequence using a dynamic programming algorithm; and

a database that stores a plurality of gene and protein sequence files and is divided into a plurality of parts to be searched by the terminal devices.

The system of claim 1,

wherein the terminal devices are connected to the terminal control device by a cluster-based or grid-based parallel algorithm.

The system of claim 1,

wherein the database is divided into a number of database fragments greater than or equal to the number of terminal devices.

The system according to claim 1 or 3,

wherein each of the plurality of terminal devices searches a database fragment covering a different range.

The system of claim 1,

wherein the plurality of terminal devices perform the similar sequence search simultaneously.

A method of searching for gene and protein sequences in a distributed computing environment that includes a plurality of terminal devices and a database divided into a plurality of database fragments, the method comprising:

a) comparing a query sequence against every sequence in a selected database fragment using a dynamic programming algorithm and calculating the similarity;

b) performing a statistical analysis using the calculated similarity; and

c) sorting the results of the statistical analysis in descending order of similarity and storing them in a predetermined directory together with the list of similar sequences found.

The method of claim 6,

wherein step b) comprises:

i) obtaining the mean and standard deviation of the homology scores between the query sequence and the sequences stored in the database;

ii) obtaining the parameters of the Gumbel distribution;

iii) obtaining a z score that standardizes the homology score;

iv) using the parameters to obtain a p value, which is the probability that a sequence having a score greater than or equal to the z score is retrieved from the entire database; and

v) using the p value to obtain an e value, which is the probability that a sequence having a score equal to the z score is retrieved from the entire database.

The method of claim 7,

wherein in step i) the standard deviation is calculated by the following formula.

The method according to claim 7 or 8,

wherein the parameters are calculated by the following formula.

The method of claim 9,

wherein the p value in step iv) is calculated by the following formula.

The method of claim 6,

wherein in step c),

if a sequence file and a list file are already stored in the predetermined directory, the files are opened, merged with the newly created list, and re-sorted so that the files are updated and stored.

The method of claim 13,

wherein in step c),

the file is kept locked while the file is being updated.