KR100681795B1

KR100681795B1 - A protocol for genome sequence alignment on grid environment

Info

Publication number: KR100681795B1
Application number: KR1020060119714A
Authority: KR
Inventors: 이관수; 김민성; 선충현; 김진기; 김수영
Original assignee: 한국정보통신대학교 산학협력단
Priority date: 2006-11-30
Filing date: 2006-11-30
Publication date: 2007-02-12

Abstract

A method for aligning genome sequences in a grid computing environment, and a program storage medium, are provided so that existing sequence alignment programs, which are difficult to use for comparing large quantities of genome sequences because of limited computing resources, can be applied efficiently in the grid computing environment. The first and second genome sequences are cut into pieces with fixed overlapping sections, taking into account the sizes of the two sequences and the computational cost of the alignment algorithm, and repeated sequences in the pieces are marked (S100). The pieces of the first and second genome sequences are distributed to the computers of the grid computing environment and aligned using a sequence alignment program (S200). The resulting alignments are combined, and only statistically meaningful alignment information is extracted (S300).

Description

A protocol for genome sequence alignment on grid environment

FIG. 1 is a flowchart giving an overview of a method for genome alignment in a grid computing environment according to a preferred embodiment of the present invention;

FIG. 2 is a flowchart detailing the preprocessing step of the method, which prepares the genome sequences to be compared for alignment in the grid environment;

FIG. 3 is a flowchart detailing the step of the method in which the genome sequence fragments are distributed to the computers of the grid computing environment and aligned using a sequence alignment program;

FIG. 4 is a flowchart detailing the post-processing step of the method, in which alignment information taken from the alignment results is extracted, removed, or merged so that only statistically meaningful alignment information remains;

FIG. 5 is a diagram detailing how alignment information fragmented by the cutting of the genome sequences is reassembled into merged alignment information;

FIG. 6 is a diagram detailing how alignment information in which one part of a sequence aligns with several parts of the other sequence is handled, and how overlapping or adjacent alignment information is merged; and

FIG. 7 presents data showing how the sequence alignment results change after post-processing.

The present invention relates to bioinformatics, specifically to the comparative analysis of genome sequences in a grid computing environment. Genome alignment compares the genome sequences of different species to find similar regions. From these regions, genes with similar functions in other species can be inferred, and differences between the sequences can be used to reveal phylogenetic relationships, among other information.

Genome sequence alignment, one of the most basic and important techniques in biological research, began to be studied in earnest with the Smith-Waterman algorithm, which is based on dynamic programming. Research accelerated with the BLAST algorithm in 1990, and more recently a variety of methods such as BLAT, MUMmer, and BLASTZ have been developed.

Alongside these advances in genome alignment, advances in genome sequencing have revealed the genome sequences of many species, from simple organisms to higher organisms. This made it possible to compare the whole genome sequence of one species, rather than a part of it, with the whole sequence of another species, which in turn yielded more detailed information about each genome, such as repeat sequences and synteny.

However, as the volume of genome sequence data grew, sequence alignment programs shifted away from their initial focus on how best to find conserved regions and toward how to finish the work quickly with existing computing resources, and a variety of heuristic methods were developed. These methods made it possible to compare even the complete genome sequences of different species, but at the cost of reduced accuracy.

To overcome these limitations of the conventional methods, the present invention builds a grid computing environment and proposes a method for efficiently applying, in that environment, existing sequence alignment programs that were difficult to use for large-scale genome sequence comparison because of limited computing resources.

To achieve this object, a preferred embodiment of the present invention provides: a preprocessing step (a) in which, when the genome sequences to be aligned are large, they are cut into pieces with a fixed overlapping section, and repeated sequences in the pieces are marked; a step (b) in which the genome sequence pieces from step (a) are distributed to the computers of a grid computing environment and aligned using a sequence alignment program; and a post-processing step (c) in which the alignment results from step (b) are combined and only statistically meaningful alignment information is extracted.

A preferred embodiment of the genome alignment method in a grid computing environment according to the present invention is described in detail below with reference to the accompanying drawings. The embodiment does not limit the scope of the invention and is presented only as an example.

FIG. 1 is a flowchart giving an overview of the genome alignment method in a grid computing environment according to a preferred embodiment of the present invention.

As shown in FIG. 1, step S100 performs preprocessing for genome sequence alignment in the grid environment. This preprocessing not only prepares the work for proper distribution but is also necessary for seeing the degree of gene conservation between the compared genome sequences more clearly during alignment.

FIG. 2 is a flowchart detailing the preprocessing described above, which prepares the genome sequences for alignment in the grid environment.

As shown in FIG. 2, step S110 cuts the two genome sequences to be aligned into pieces with overlapping sections. Whether to cut at all, and the size of the pieces, depend on the size of the genome sequences and the computational cost of the alignment algorithm to be used. Next, step S120 marks the repeated sequences in the fixed-size pieces using a specialized program such as RepeatMasker. The many repeat sequences in the genomes of higher organisms such as human and mouse interfere with extracting the gene conservation information of interest, so they are marked in order to exclude them from the later alignment. Step S130 then writes the pieces to files so that the work can be distributed properly to the computers of the grid environment: the pieces of one sequence are written to multiple files, and the pieces of the other sequence to a single file.
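The cutting in step S110 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `cut_with_overlap` and the parameters `piece_size` and `overlap` are hypothetical, and the text does not prescribe specific values for either.

```python
def cut_with_overlap(sequence: str, piece_size: int, overlap: int) -> list[tuple[int, str]]:
    """Cut a sequence into pieces of at most piece_size bases.

    Consecutive pieces share `overlap` bases. Each piece is returned with
    its 0-based start offset in the original sequence so that alignment
    positions can later be mapped back to whole-sequence coordinates.
    """
    if not 0 <= overlap < piece_size:
        raise ValueError("overlap must satisfy 0 <= overlap < piece_size")
    step = piece_size - overlap  # distance between consecutive piece starts
    pieces = []
    start = 0
    while True:
        pieces.append((start, sequence[start:start + piece_size]))
        if start + piece_size >= len(sequence):
            break  # this piece reaches the end of the sequence
        start += step
    return pieces


if __name__ == "__main__":
    for offset, piece in cut_with_overlap("ACGTACGTAC", piece_size=4, overlap=2):
        print(offset, piece)
```

Keeping the start offset with each piece is a design choice made here for the later post-processing steps, which need positions in whole-sequence coordinates.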

FIG. 3 is a flowchart detailing the step of the method in which the genome sequence pieces are distributed to the computers of the grid computing environment and aligned using a sequence alignment program.

As shown in FIG. 3, step S210 converts the genome files to the input format required by the genome sequence alignment program. Step S220 then distributes to each computer in the grid environment one of the multiple files of the first sequence together with the single file of the other sequence. Step S230 aligns the genome sequence pieces using the sequence alignment program.
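The pairing in step S220 could look roughly like the sketch below. The text does not say how files are assigned to computers, so the round-robin assignment is an assumption, and `plan_jobs`, the file names, and the node names are all hypothetical.

```python
from itertools import cycle


def plan_jobs(first_files: list[str], second_file: str, nodes: list[str]) -> dict[str, list[tuple[str, str]]]:
    """Pair every file of the first sequence with the single file of the
    second sequence, and assign the pairs to nodes in round-robin order
    (an assumed policy, not one stated in the text)."""
    if not nodes:
        raise ValueError("at least one node is required")
    plan: dict[str, list[tuple[str, str]]] = {node: [] for node in nodes}
    for first_file, node in zip(first_files, cycle(nodes)):
        plan[node].append((first_file, second_file))
    return plan


if __name__ == "__main__":
    jobs = plan_jobs(["a1.fa", "a2.fa", "a3.fa"], "b.fa", ["node1", "node2"])
    for node, pairs in jobs.items():
        print(node, pairs)
```

Each pair is an independent alignment job, which is what allows the work to be spread across the grid without communication between nodes.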

FIG. 4 is a flowchart detailing the post-processing step of the method, in which alignment information taken from the alignment results is extracted, removed, or merged so that only statistically meaningful alignment information remains.

As shown in FIG. 4, step S310 extracts alignment information from the genome alignment results of the preceding step. The result format depends on the sequence alignment program used: the positions of each alignment, its score, and the actual aligned sequence may be expressed in FASTA, XML, or other formats. The extraction is therefore designed as a module, so that alignment information in the same format is obtained whichever sequence alignment program is used. Step S320 orders the alignment information by position in one of the two compared sequences.
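One way to read steps S310 and S320 is as a set of per-format parsers that all produce the same record type, followed by a sort. The sketch below assumes a made-up tab-separated input format; the `Alignment` fields, the `PARSERS` registry, and the format name `"tabular"` are hypothetical and do not correspond to the output of any specific alignment program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    start1: int   # start on the first sequence
    end1: int     # end on the first sequence
    start2: int   # start on the second sequence
    end2: int     # end on the second sequence
    score: float  # alignment score reported by the aligner


def parse_tabular(line: str) -> Alignment:
    """Parse one line of a hypothetical format: start1, end1, start2, end2, score."""
    s1, e1, s2, e2, score = line.split("\t")
    return Alignment(int(s1), int(e1), int(s2), int(e2), float(score))


# One parser per aligner output format; supporting another program means
# registering another parser, while downstream code stays unchanged.
PARSERS = {"tabular": parse_tabular}


def extract(lines: list[str], fmt: str) -> list[Alignment]:
    """S310: convert raw result lines into common Alignment records."""
    parse = PARSERS[fmt]
    return [parse(line) for line in lines if line.strip()]


def order_by_first_sequence(alignments: list[Alignment]) -> list[Alignment]:
    """S320: order records by their position on the first sequence."""
    return sorted(alignments, key=lambda a: (a.start1, a.end1))


if __name__ == "__main__":
    raw = ["300\t400\t10\t110\t50.0", "100\t200\t500\t600\t80.0"]
    for record in order_by_first_sequence(extract(raw, "tabular")):
        print(record)
```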

Step S330 handles alignment information that was fragmented by the cutting of the genome sequences. Fragmented alignment information appears in various patterns depending on where it occurs and how long it is. Most of it disappears in a later step, where one of several overlapping alignments is selected by alignment score, and so needs no special handling. What this step looks for is the alignment information that the later step does not resolve: alignments of a conserved sequence longer than the configured overlapping section, for which one end extends beyond the overlapping section and the other end is cut off by the fragmentation. Step S340 then finds, in the genome sequence pieces adjacent to the pieces containing these alignments, the matching truncated alignments, generates merged alignment information, and recalculates the alignment score of the merged alignment using the scoring scheme used during alignment.
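The pattern sought in S330 and the merge in S340 can be sketched on the coordinates of one sequence. This is a simplified illustration under several assumptions that are not in the text: positions are 0-based and half-open, a truncated hit reaches the piece edge exactly, every piece has the same length, and only first-sequence coordinates are tracked. The score recalculation of S340 is omitted because it depends on the aligner's scoring scheme. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocalHit:
    piece_offset: int  # start of the piece in whole-sequence coordinates
    start: int         # hit start within the piece (inclusive)
    end: int           # hit end within the piece (exclusive)


def cut_on_right(hit: LocalHit, piece_len: int, overlap: int) -> bool:
    """The hit runs into the right edge of its piece and starts before the
    overlapping section, so the next piece cannot contain it whole."""
    return hit.end == piece_len and hit.start < piece_len - overlap


def cut_on_left(hit: LocalHit, overlap: int) -> bool:
    """The hit starts at the left edge of its piece and extends past the
    overlapping section."""
    return hit.start == 0 and hit.end > overlap


def merge_across_boundary(left: LocalHit, right: LocalHit,
                          piece_len: int, overlap: int) -> Optional[tuple[int, int]]:
    """Return the merged (start, end) in whole-sequence coordinates, or None
    if the two hits are not halves of one alignment split by the cutting."""
    pieces_adjacent = right.piece_offset == left.piece_offset + piece_len - overlap
    if pieces_adjacent and cut_on_right(left, piece_len, overlap) and cut_on_left(right, overlap):
        return (left.piece_offset + left.start, right.piece_offset + right.end)
    return None


if __name__ == "__main__":
    left = LocalHit(piece_offset=0, start=50, end=100)
    right = LocalHit(piece_offset=80, start=0, end=60)
    print(merge_across_boundary(left, right, piece_len=100, overlap=20))
```

A hit that lies entirely inside the overlapping section appears whole in one of the two pieces, which is why it is left to the later overlap-selection step rather than merged here.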

FIG. 5 is a diagram detailing how the alignment information fragmented by the cutting of the genome sequences, as described above, is reassembled into merged alignment information.

FIG. 5 first shows a hypothetical distribution of alignment information when the alignment program is run on uncut genome sequences, and then how that alignment information is distributed across the genome sequence pieces when the sequences are cut before alignment. It also shows a concrete example of the pattern described above, in which a conserved sequence longer than the configured overlapping section leaves one end of the alignment outside the overlapping section and the other end cut off by the fragmentation, and outlines how such alignments are merged.

Step S350 uses the repeat sequences that were marked before alignment: alignment information that overlaps the marked regions by a set proportion or more is removed.
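The filter in step S350 can be sketched as an interval-overlap computation. The threshold value, the representation of hits and repeats as `(start, end)` pairs, and the function names are assumptions for illustration; the text only says that alignments overlapping the marked regions by a certain proportion or more are removed.

```python
def overlap_len(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Length of the intersection of two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def repeat_fraction(start: int, end: int, repeats: list[tuple[int, int]]) -> float:
    """Fraction of [start, end) covered by marked repeats.

    Assumes end > start and that the repeat intervals do not overlap
    one another.
    """
    covered = sum(overlap_len(start, end, r_start, r_end) for r_start, r_end in repeats)
    return covered / (end - start)


def drop_repeat_hits(hits: list[tuple[int, int]], repeats: list[tuple[int, int]],
                     threshold: float = 0.5) -> list[tuple[int, int]]:
    """Keep only hits whose repeat coverage is below the threshold
    (0.5 is an assumed default, not a value from the text)."""
    return [h for h in hits if repeat_fraction(h[0], h[1], repeats) < threshold]


if __name__ == "__main__":
    hits = [(0, 100), (200, 300)]
    repeats = [(40, 150)]
    print(drop_repeat_hits(hits, repeats))
```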

Step S360 handles alignment information in which a given part of one sequence aligns at several places in the other sequence, using a configured reference similarity. Alignments whose similarity is higher than the reference similarity are extracted and stored in a separate file. Among alignments whose similarity is lower, only the one with the highest alignment score calculated by the sequence alignment program is selected. The information stored in the separate file is not important for genome alignment itself, but it is useful for studying the repeated sequences of a species. Removing unnecessary alignment information through this selection also makes the degree of gene conservation clearer.
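A minimal sketch of the S360 rule for one group of alignments that share the same region of one sequence is given below. The dictionary keys, the function name, and the treatment of a similarity exactly equal to the reference value (counted here as not higher) are assumptions; the text does not state how similarity is measured.

```python
def split_multi_hits(group: list[dict], reference_similarity: float) -> tuple[list[dict], list[dict]]:
    """Split one group of multi-location hits.

    Returns (set_aside, kept): hits above the reference similarity are set
    aside for a separate file; among the remaining hits only the one with
    the highest alignment score is kept.
    """
    set_aside = [h for h in group if h["similarity"] > reference_similarity]
    low = [h for h in group if h["similarity"] <= reference_similarity]
    kept = [max(low, key=lambda h: h["score"])] if low else []
    return set_aside, kept


if __name__ == "__main__":
    group_b = [
        {"name": "B1", "similarity": 0.60, "score": 40.0},
        {"name": "B2", "similarity": 0.70, "score": 75.0},
    ]
    set_aside, kept = split_multi_hits(group_b, reference_similarity=0.9)
    print([h["name"] for h in set_aside], [h["name"] for h in kept])
```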

Step S370 merges alignment information that runs in the same direction in both compared sequences, that is, increasing in both or decreasing in both, and that either overlaps or lies within a configured limit of each other, into larger alignment information. The alignment score of the merged alignment is recalculated using the scoring scheme of the sequence alignment program.
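The merge in step S370 can be sketched for the forward direction, where positions increase in both sequences. The sketch makes simplifying assumptions that are not in the text: each block is compared only with the most recently merged block, the reverse direction is not handled, and score recalculation is left out. `Block`, `can_merge`, `merge_collinear`, and `max_gap` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    start1: int
    end1: int
    start2: int
    end2: int


def can_merge(a: Block, b: Block, max_gap: int) -> bool:
    """True if b follows a in the same (increasing) direction on both
    sequences and the gap on each sequence is at most max_gap.
    A negative gap means the two blocks overlap."""
    gap1 = b.start1 - a.end1
    gap2 = b.start2 - a.end2
    return b.start2 >= a.start2 and gap1 <= max_gap and gap2 <= max_gap


def merge_collinear(blocks: list[Block], max_gap: int) -> list[Block]:
    """Greedily merge overlapping or adjacent forward-direction blocks."""
    merged: list[Block] = []
    for b in sorted(blocks, key=lambda x: (x.start1, x.start2)):
        if merged and can_merge(merged[-1], b, max_gap):
            last = merged[-1]
            merged[-1] = Block(last.start1, max(last.end1, b.end1),
                               last.start2, max(last.end2, b.end2))
        else:
            merged.append(b)
    return merged


if __name__ == "__main__":
    blocks = [Block(0, 100, 0, 100), Block(90, 200, 95, 210),
              Block(250, 300, 260, 310), Block(1000, 1100, 50, 150)]
    for block in merge_collinear(blocks, max_gap=60):
        print(block)
```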

FIG. 6 is a diagram detailing, as described above, how alignment information in which one part of a sequence aligns with several parts of the other sequence is handled, and how overlapping or adjacent alignment information is merged.

The upper part of the two compared genome sequences in FIG. 6 shows two cases in which part of one sequence aligns at several places in the other. In the first case, A, genome sequence 1 has alignments A1, A2, and A3. Their similarity is higher than the reference similarity, so all three are extracted and stored in a separate file. In the second case, B, genome sequence 2 has alignments B1 and B2. Their similarity is lower than the reference similarity, so B2, which has the higher alignment score, is selected and B1 is removed.

The lower part of the two genome sequences shows an example, D, of alignments that overlap in the same direction in both sequences, and an example, C, of alignments adjacent within the limit, and how each is merged into one.

Step S380 selects, from alignment information that overlaps partly or entirely, only the alignment information expected to be statistically more meaningful. The criterion depends on how similar the compared sequences are: if the similarity is high, selection is based on the alignment score; if it is low, a score is calculated according to Equation 1 below and selection is based on that score.

ω in Equation 1 is calculated from the sequence length (v) and the alignment score (u) of the alignment information. When similarity is low, genes with the same function may still differ in sequence as a result of evolution, so the length of the alignment is weighted more heavily than its alignment score.
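The selection in step S380 can be sketched with the ranking function left as a parameter. Equation 1 is not reproduced in this text, so no formula for ω is assumed here: the caller passes whichever ranking applies, for example the alignment score when similarity is high. The greedy best-first strategy, the `(start, end, score)` tuple layout, and the function name are assumptions for illustration.

```python
from typing import Callable

Hit = tuple[int, int, float]  # (start, end, alignment score), half-open positions


def select_non_overlapping(hits: list[Hit], rank: Callable[[Hit], float]) -> list[Hit]:
    """Visit hits from best to worst rank and keep a hit only if it does
    not overlap any hit already kept. The result is ordered by position."""
    kept: list[Hit] = []
    for h in sorted(hits, key=rank, reverse=True):
        if all(h[1] <= k[0] or h[0] >= k[1] for k in kept):
            kept.append(h)
    return sorted(kept)


if __name__ == "__main__":
    hits = [(0, 100, 50.0), (50, 150, 80.0), (200, 300, 10.0)]
    # High-similarity case: rank by the aligner's score.
    print(select_non_overlapping(hits, rank=lambda h: h[2]))
```

For the low-similarity case, a function computing ω from the length `v = end - start` and the score `u` would be passed as `rank` instead.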

FIG. 7 is a graph of the results of aligning human chromosome 20 with mouse chromosome 2 using BLAST; the x-axis is the position on human chromosome 20 and the y-axis is the position on mouse chromosome 2. The left graph is cluttered by the large number of alignment results, while the right graph shows that post-processing yields a clearer result.

The present invention was made in recognition of, and to overcome, the fact that some existing sequence alignment programs, despite their high accuracy, cannot be used for comparative analysis of large genome sequences because of the computing resources their methods require.

The invention makes it possible to use existing high-accuracy sequence alignment programs to obtain results that are more detailed or more meaningful than the data currently available. It also gives working laboratories the opportunity, rather than relying only on the limited data provided by a few large organizations, to pool their own computing resources, actively align large sequences such as mammalian chromosome sequences, and apply the results to their ongoing research.

In addition, the genome conservation results found using the present invention can aid the discovery of new genes and are therefore expected to have a broad impact.

Claims

A genome alignment method in a grid computing environment for aligning a first genome sequence with a second genome sequence, the method comprising:

(a) a preprocessing step of cutting the first and second genome sequences into pieces with fixed overlapping sections and marking the repeated sequences in the cut pieces;

(b) distributing the first and second genome sequence pieces to each computer in the grid computing environment and then aligning them using a sequence alignment program; and

(c) a post-processing step of combining the alignment results generated through step (b) and extracting only statistically meaningful alignment information.

The method of claim 1, wherein step (a) comprises:

(a1) cutting the first and second genome sequences into pieces of constant size having overlapping sections, in consideration of the sizes of the first and second genome sequences and the computational cost of the algorithm for aligning the sequences;

(a2) marking sequences that are repeated in the genome sequence pieces generated through step (a1); and

(a3) making the first genome sequence pieces into multiple files and the second genome sequence pieces into one file, such that the sequence alignment of step (b) is properly distributed to each computer in the grid computing environment.

The method of claim 2, wherein step (b) comprises:

(b1) changing the format of the files containing the genome sequence pieces according to the input format required by the genome sequence alignment program;

(b2) distributing one of the first genome files made in step (a3), together with the second genome file, to each computer in the grid computing environment; and

(b3) aligning the genome sequence pieces of the first genome file and the second genome file at each computer in the grid computing environment using the sequence alignment program.

The method of claim 3, wherein step (c) comprises:

(c1) extracting alignment information from the genome sequence alignment results generated through step (b);

(c2) ordering the alignment information according to position in one of the first and second genome sequences;

(c3) finding, among the various patterns of alignment information fragmented by the cutting of the genome sequences, alignment information longer than the configured overlapping section, that is, alignment information having one end outside the overlapping section and the other end cut off by the fragmentation;

(c4) finding, in the genome sequence pieces adjacent to the piece containing the alignment information found in step (c3), the corresponding truncated alignment information, generating merged alignment information, and recalculating the alignment score of the merged alignment information using the alignment score calculation method used by the sequence alignment program;

(c5) removing alignment information that overlaps the marked repeated sequences by a predetermined portion or more;

(c6) for alignment information in which a portion of one of the first and second genome sequences is aligned at several places in the other sequence, extracting and storing in a separate file the alignment information whose similarity is higher than a configured reference similarity, and, for alignment information whose similarity is lower, selecting only the alignment information with the highest alignment score calculated by the sequence alignment program;

(c7) merging alignment information that, in the first and second genome sequences, overlaps in the same direction or is adjacent within a configured limit into larger alignment information, and recalculating the alignment score of the merged alignment information using the alignment score calculation method used by the sequence alignment program; and

(c8) selecting, among alignment information that overlaps, the alignment information expected to be statistically meaningful, based either on the alignment score calculated by the sequence alignment program or on a score (ω) calculated from the sequence length (v) and the alignment score (u) of the alignment information.

The method of claim 1, wherein the grid computing environment is a PC cluster based system or a grid system.

A program storage medium having stored thereon a program for executing the method of any one of claims 1 to 5.