KR102380935B1

KR102380935B1 - System and method for searching genomic regions

Info

Publication number: KR102380935B1
Application number: KR1020210170338A
Authority: KR
Inventors: 한헌종; 조유경
Original assignee: 주식회사 쓰리빌리언
Priority date: 2021-12-01
Filing date: 2021-12-01
Publication date: 2022-04-01

Abstract

The present invention relates to a genomic region search system. The system searches, among reference regions that consist of sequence data and have various section lengths (M) expressed by a start position and an end position, for the reference regions whose sections overlap a specific region to be searched. It comprises: a length correction unit that corrects each reference region into a combination of comparison target regions having a preset adjustment length (L), the section of each comparison target region being expressed by a start position and an end position; a region alignment unit that sorts the comparison target regions in ascending order of start position; a binary search unit that uses binary search to identify a comparison target region whose section overlaps the specific region; and a reference region search unit that compares the specific region with the reference regions sorted before and after the identified comparison target region to find the reference regions whose sections overlap it.

Description

SYSTEM AND METHOD FOR SEARCHING GENOMIC REGIONS

The present invention relates to a genomic region search system and method, and more particularly to a search system and method that uses binary search to search genomic regions quickly and accurately.

With the recent development of next-generation sequencing (NGS), it has become possible to analyze the entire genome of organisms, including humans, quickly and at low cost.

However, efficiently processing the vast amount of genomic data produced by whole-genome sequencing is very important in genetic engineering. In particular, searching genomic data while comparing it against other genomic data is used in many steps of genomic data analysis.

For example, such a search is needed to determine in which gene a mutation found in a patient is located, or in which gene the reads from sequencing data were found.

To search the results of genome sequencing in this way, genomic regions, each denoting a specific section of the nucleotide sequence of a specific chromosome, must be compared with one another.

However, two problems arise when genomic regions are compared and searched. First, the search takes a long time because of the vast amount of genomic data. Second, because the section lengths (intervals) of genomic regions are not uniform, even a binary search cannot find all the data whose sections overlap, which lowers accuracy.

A conventional tool for searching the results of genome sequencing is the binary interval search method (Layer et al., 2013).

This method uses binary search to increase speed and determines the intervals of two results to make the search accurate. However, it loses the speed advantage of binary search at the point where the intervals are computed, and when several regions have the same start and end positions it may return incorrect position information rather than information covering all of them.

An object of the present invention is to provide a genomic region search system and method that can accurately and quickly search a large number of genomic regions whose section lengths (intervals) are not uniform.

To solve this problem, a genomic region search system according to an embodiment of the present invention searches, among reference regions that consist of sequence data and have various section lengths (M) expressed by a start position and an end position, for the reference regions whose sections overlap a specific region to be searched. The system includes: a length correction unit that corrects each reference region into a combination of comparison target regions having a preset adjustment length (L); a region alignment unit that sorts the comparison target regions in ascending order of their start positions, the section of each comparison target region being expressed by a start position and an end position; a binary search unit that identifies, by binary search, a comparison target region whose section overlaps the specific region; and a reference region search unit that compares the specific region with the reference regions sorted before the comparison target region identified by the binary search and with the reference regions sorted after it, and thereby finds the reference regions whose sections overlap the specific region.

The length correction unit may correct the reference region into n comparison target regions satisfying Equation 1 below.

[Equation 1]

S + (n-1)*L ≤ E ≤ S + n*L - 1

(Here, S is the start position of the reference region, E is the end position of the reference region, L is the adjustment length, and n is an integer.)

The start position of the k-th comparison target region may be S + (k-1)*L, and its end position may be S + k*L - 1.

(Here, S is the start position of the reference region, L is the adjustment length, and k is an integer satisfying 1 ≤ k ≤ n.)

The length correction unit may match each reference region with its corrected comparison target regions.

The reference region search unit may check whether the section of the reference region matched with the comparison target region identified by the binary search overlaps the section of the specific region.

The binary search unit may compare the (N/2)-th comparison target region with the specific region, N being the number of sorted comparison target regions. If the start position of the (N/2)-th comparison target region is greater than the end position of the specific region, the search continues over the 1st to (N/2-1)-th comparison target regions. If the start position of the specific region is greater than the end position of the (N/2)-th comparison target region, the search continues over the (N/2+1)-th to N-th comparison target regions. If the sections of the specific region and the comparison target region overlap, the search ends and that comparison target region is identified.

The reference region search unit may search the comparison target regions sorted before the identified comparison target region until one no longer overlaps the specific region, and likewise search the comparison target regions sorted after it until one no longer overlaps the specific region.

In addition to the technical problem mentioned above, other features and advantages of the present invention are described below or will be clearly understood from that description by those of ordinary skill in the art to which the present invention pertains.

The present invention described above has the following effects.

By using binary search, the present invention can search a large amount of region information quickly, and by correcting the section lengths of regions that have various section lengths, it can search without missing any region, improving both search speed and accuracy.

Further features and advantages of the present invention may also become apparent from its embodiments.

FIG. 1 is a block diagram showing a schematic configuration of a genomic region search system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a reference region consisting of sequence data according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a length correction unit according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a region alignment unit according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining a binary search unit according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a reference region search unit according to an embodiment of the present invention.
FIG. 7 is a diagram comparing a comparative example according to the prior art with an embodiment according to the present invention.
FIGS. 8 to 11 are diagrams showing the genomic region search system according to an embodiment of the present invention finding the regions whose sections overlap a specific region of chromosome 1.
FIG. 12 is a diagram comparing the effects of comparative examples according to the prior art with those of the embodiment according to the present invention.

The terms used in this specification should be understood as follows.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "first" and "second" serve only to distinguish one element from another, and the scope of rights should not be limited by these terms.

Terms such as "include" or "have" should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, preferred embodiments of the present invention, devised to solve the above problems, are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a schematic configuration of a genomic region search system according to an embodiment of the present invention. FIG. 2 is a diagram for explaining a reference region consisting of sequence data, FIG. 3 is a diagram for explaining the length correction unit, FIG. 4 is a diagram for explaining the region alignment unit, FIG. 5 is a diagram for explaining the binary search unit, and FIG. 6 is a diagram for explaining the reference region search unit, each according to an embodiment of the present invention.

Referring to FIGS. 1 to 6, the genomic region search system 1000 according to an embodiment of the present invention includes a length correction unit 100, a region alignment unit 300, a binary search unit 500, a reference region search unit 700, and a database (DB) 900.

The genomic region search system 1000 searches, among reference regions 10 that consist of sequence data G and have various section lengths M expressed by a start position S and an end position E, for the reference regions 10 whose sections overlap the specific region 30 to be searched.

Sequence data means data that carries ordered sequence information. The sequence data G shown in FIG. 2 is chromosome sequence data representing the nucleotide sequence of a chromosome, but the invention is not limited to this, and any data carrying ordered sequence information may be used. For example, protein sequence data, which represents amino acid properties in the order of the protein residues, may also be used.

As used herein, the specific region 30 is the sequence data to be searched, and the reference region 10 is sequence data that is already known and may be stored in the database 900.

The reference region 10 and the specific region 30 can each be expressed by a start position and an end position, and each has a section length.

For example, in FIG. 2 the first reference region 10a consists of the chromosomal nucleotides A, T, and G, with start position 1 and end position 3. That is, its position is (1, 3) and its section length is 3. The second reference region 10b consists of the nucleotides G, T, T, C, and G, with start position 4 and end position 8, so its position is (4, 8) and its section length is 5. The third reference region 10c consists of the nucleotides A, G, A, and C, with start position 9 and end position 12, so its position is (9, 12) and its section length is 4.
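The position and section-length convention above can be captured in a few lines. The following is a minimal sketch, not the patent's implementation: the class name `Region` and its `overlaps` helper are hypothetical, and the overlap test assumes closed intervals, which is what the 1-based positions of FIG. 2 suggest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Closed interval [start, end] on a sequence (1-based, both ends included)."""

    start: int
    end: int

    @property
    def length(self) -> int:
        # Both endpoints belong to the region, hence the +1.
        return self.end - self.start + 1

    def overlaps(self, other: "Region") -> bool:
        # Closed intervals share a position unless one ends before the other starts.
        return self.start <= other.end and other.start <= self.end


# Positions of reference regions 10a, 10b, and 10c from FIG. 2.
first, second, third = Region(1, 3), Region(4, 8), Region(9, 12)
print(first.length, second.length, third.length)  # 3 5 4
```

Under this convention the position (4, 8) has section length 8 - 4 + 1 = 5, matching the values stated for FIG. 2.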

The length correction unit 100 may correct each reference region 10, which has one of various section lengths M, into a combination of comparison target regions 15 having a preset adjustment length L.

Each comparison target region 15 has its own start position and end position, but all of them have the same section length L. The section length L may be chosen arbitrarily.

The length correction unit 100 may correct each reference region 10, whatever its section length, into n comparison target regions satisfying Equation 1 below.

[Equation 1]

S + (n-1)*L ≤ E ≤ S + n*L - 1

Here, S is the start position of the reference region, E is the end position of the reference region, L is the adjustment length, and n is an integer.

The coordinates of the k-th comparison target region used to correct the reference region are (S+(k-1)*L, S+k*L-1), where k is an integer satisfying 1≤k≤n. Since n comparison target regions are used to correct the reference region, the reference region is represented by the combination of the comparison target regions (S, S+L-1), ..., (S+(n-1)*L, S+n*L-1).
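The condition on n can be turned into a closed form. The derivation below is a sketch that assumes Equation 1 reads S + (n-1)L ≤ E ≤ S + nL - 1, that is, the end position E falls inside the n-th comparison target region (S+(n-1)*L, S+n*L-1). That form is inferred from the coordinates and worked examples in this document, not quoted from the original drawing of the equation.

```latex
S + (n-1)L \le E \le S + nL - 1
\;\Longleftrightarrow\;
(n-1)L < E - S + 1 \le nL
\;\Longleftrightarrow\;
n = \left\lceil \frac{E - S + 1}{L} \right\rceil
```

The middle step uses the fact that all quantities are integers, so (n-1)L ≤ E - S is the same as (n-1)L < E - S + 1. Since E - S + 1 is the section length M of the reference region, n is simply M/L rounded up.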

For example, the second reference region 10b has position (4, 8) and section length 5. When the adjustment length L is set to 3, the value satisfying 4 + (n-1)*3 ≤ 8 ≤ 4 + n*3 - 1 is n = 2.

The second reference region 10b is therefore represented by two comparison target regions, whose positions are (4, 6) and (7, 9).
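The length correction step can be sketched as a small function. This is an illustration under stated assumptions rather than the patented implementation: the name `split_region` is hypothetical, and n is computed as the section length divided by L, rounded up, which reproduces every worked example in this document.

```python
def split_region(start: int, end: int, adjust_len: int) -> list[tuple[int, int]]:
    """Cover the closed interval [start, end] with equal-length pieces.

    Piece k (1 <= k <= n) spans (start + (k-1)*L, start + k*L - 1), so the
    last piece may extend past `end`.
    """
    if end < start or adjust_len < 1:
        raise ValueError("need start <= end and a positive adjustment length")
    section_len = end - start + 1
    n = -(-section_len // adjust_len)  # ceiling division without floats
    return [
        (start + (k - 1) * adjust_len, start + k * adjust_len - 1)
        for k in range(1, n + 1)
    ]


print(split_region(4, 8, 3))  # [(4, 6), (7, 9)]
```

Note that the second piece (7, 9) reaches one position beyond the end of the reference region (4, 8). This is why a match on a piece still has to be confirmed against the original coordinates of the reference region.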

The length correction unit 100 may match each reference region with its corrected comparison target regions and store their start and end positions in the database 900.

The region alignment unit 300 may sort the comparison target regions 15 in ascending order of start position.

Referring to FIG. 4, the comparison target regions 15 are sorted in ascending order of start position and numbered with Arabic numerals in that order. To match each corrected comparison target region with its reference region, the letter of the matched reference region is prefixed to the number of the comparison target region.

For example, the comparison target regions used to correct the reference region h may be labeled h9 and h11, and the sort order and position of each reference region and comparison target region may be stored in the database 900.
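The matching and sorting steps can be sketched together. Everything named here is an assumption made for illustration: the `Piece` tuple, the `build_index` function, the letters a, b, c standing for reference regions 10a, 10b, 10c of FIG. 2, and an adjustment length of 3 applied to all three regions.

```python
from typing import NamedTuple


class Piece(NamedTuple):
    start: int
    end: int
    ref_id: str  # letter of the reference region this piece was cut from


def build_index(references: dict[str, tuple[int, int]], adjust_len: int) -> list[Piece]:
    """Split every reference region into equal-length pieces and sort them by start."""
    pieces = []
    for ref_id, (start, end) in references.items():
        n = -(-(end - start + 1) // adjust_len)  # number of pieces, rounded up
        for k in range(1, n + 1):
            pieces.append(
                Piece(start + (k - 1) * adjust_len, start + k * adjust_len - 1, ref_id)
            )
    pieces.sort(key=lambda piece: piece.start)  # ascending start position
    return pieces


refs = {"a": (1, 3), "b": (4, 8), "c": (9, 12)}
for rank, piece in enumerate(build_index(refs, 3), start=1):
    # Label = reference letter + rank in the sorted order, e.g. "b3".
    print(f"{piece.ref_id}{rank}", (piece.start, piece.end))
```

Keeping the reference letter on every piece is what lets a later hit on a piece be traced back to the reference region it came from.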

The binary search unit 500 may use binary search to identify, among the comparison target regions 15, a region whose section overlaps the specific region 30.

In binary search, to find a section overlapping the specific region among the comparison target regions sorted in ascending order, the comparison target region in the middle of the order is selected and compared with the specific region, and the search range is narrowed repeatedly until the middle comparison target region overlaps the specific region.

Suppose the first middle region selected, the (N/2)-th of N comparison target regions, has no section overlapping the specific region. If the start position of the specific region is greater than the end position of the (N/2)-th comparison target region, the binary search continues over the (N/2+1)-th to N-th comparison target regions. If the start position of the (N/2)-th comparison target region is greater than the end position of the specific region, the binary search continues over the 1st to (N/2-1)-th comparison target regions.

If N/2 is not an integer, it may be rounded either up or down.

Referring to FIG. 5, when the 6th comparison target region e6, the middle of the 12 comparison target regions 15, is compared with the specific region 30, the two have no overlapping section. Because the start position of the 6th comparison target region e6 is greater than the end position of the specific region 30, the search continues over the 1st to 5th comparison target regions. Comparing the 3rd comparison target region b3 with the specific region 30 again shows no overlapping section. Because the start position of the specific region 30 is greater than the end position of the 3rd comparison target region b3, the search continues over the 4th and 5th comparison target regions. The 4th comparison target region c4 overlaps the specific region 30, so the binary search stops and the 4th comparison target region c4 is identified.
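The three-way comparison described above can be sketched as follows. The function name and the twelve coordinates are hypothetical, since FIG. 5 does not give positions in the text. They were chosen so that the search visits the 6th, 3rd, and 4th regions in the same order as the walkthrough. The sketch relies on all pieces having the same length, so sorting by start position also sorts by end position.

```python
def find_overlapping_piece(pieces, query):
    """Return the index of one piece overlapping `query`, or None.

    pieces: (start, end) tuples of equal length, sorted by start position.
    query : (start, end) of the specific region.
    """
    q_start, q_end = query
    lo, hi = 0, len(pieces) - 1
    while lo <= hi:
        mid = (lo + hi) // 2  # middle of the current range, rounded down
        p_start, p_end = pieces[mid]
        if p_start > q_end:      # piece lies entirely after the query
            hi = mid - 1
        elif p_end < q_start:    # piece lies entirely before the query
            lo = mid + 1
        else:                    # sections overlap: stop here
            return mid
    return None


pieces = [(1, 10), (8, 17), (15, 24), (30, 39), (38, 47), (50, 59),
          (55, 64), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109)]
print(find_overlapping_piece(pieces, (32, 45)))  # 3, i.e. the 4th piece
```

The search returns as soon as it finds any overlapping piece, not necessarily the first one in sorted order. Here the 5th piece (38, 47) also overlaps the query, and finding it is left to the neighbor scan that follows the binary search.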

The reference region search unit 700 may compare the reference region matched with the comparison target region identified by the binary search against the specific region to check whether their sections overlap. The reference region search unit 700 may also compare the specific region with the reference regions sorted before the identified comparison target region and with those sorted after it, to find the reference regions whose sections overlap the specific region.

The reference region search unit 700 may search the comparison target regions sorted before the identified comparison target region until one no longer overlaps the specific region, and likewise search the comparison target regions sorted after it until one no longer overlaps the specific region.

Referring to FIG. 6, the reference region search unit 700 may retrieve the reference region c matched with the identified comparison target region c4. It may then compare the comparison target regions b3, b2, and a1, which are sorted before c4, with the specific region 30 in turn. Because the comparison target region b3 does not overlap the specific region 30, the search of the regions sorted before c4 stops there.

The comparison target regions sorted after c4 may then be compared with the specific region 30 in order. The comparison target region d5 overlaps the specific region 30, so the reference region d is retrieved and returned. The comparison target region e6 does not overlap the specific region 30, so the search of the regions sorted after it stops.
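The scan around the identified comparison target region can be sketched as below. The function name `collect_overlaps` and all coordinates are hypothetical. The data only mimics the shape of FIG. 6 (pieces a1, b2, b3, c4, d5, e6 with a hit on c4) and assumes an adjustment length of 10.

```python
def collect_overlaps(pieces, references, hit, query):
    """Return the ids of reference regions whose sections overlap `query`.

    pieces     : (start, end, ref_id) tuples, equal length, sorted by start
    references : ref_id -> (start, end) of the original reference region
    hit        : index of a piece already known to overlap `query`
    """
    q_start, q_end = query

    def overlaps(start, end):
        return start <= q_end and q_start <= end

    found = []

    def check(index):
        # A piece can extend past its reference region, so confirm the
        # overlap against the original coordinates before reporting it.
        ref_id = pieces[index][2]
        if ref_id not in found and overlaps(*references[ref_id]):
            found.append(ref_id)

    check(hit)
    i = hit - 1
    while i >= 0 and overlaps(pieces[i][0], pieces[i][1]):  # towards the front
        check(i)
        i -= 1
    i = hit + 1
    while i < len(pieces) and overlaps(pieces[i][0], pieces[i][1]):  # towards the back
        check(i)
        i += 1
    return found


references = {"a": (1, 10), "b": (8, 22), "c": (30, 36), "d": (38, 47), "e": (50, 59)}
pieces = [(1, 10, "a"), (8, 17, "b"), (18, 27, "b"),
          (30, 39, "c"), (38, 47, "d"), (50, 59, "e")]
print(collect_overlaps(pieces, references, 3, (32, 45)))  # ['c', 'd']
```

Stopping at the first non-overlapping piece is safe only because every piece has the same length: pieces that overlap the query then form one contiguous run in the sorted order.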

In this way, the genomic region search system 1000 according to an embodiment of the present invention can retrieve the desired region information quickly by using binary search and, by correcting the section lengths of regions with various section lengths, can search without missing any region, improving both search speed and accuracy.

For example, the genomic region search system according to an embodiment of the present invention can quickly determine, without omission, in which gene region among numerous genes a specific variant found in a patient is located.

FIG. 7 is a diagram comparing a comparative example according to the prior art with an embodiment according to the present invention.

Referring to FIG. 7, in the comparative example, binary search is used to look for the reference regions 10 whose sections overlap the specific region x. The reference region e is compared with the specific region x first. Since they do not overlap, the reference region g is compared next, and since it does overlap, the reference region g is returned as the final result. In this comparative example the reference region c, which also overlaps the specific region x, is missed, so accuracy may suffer.
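The failure mode of the comparative example can be reproduced with a toy case. This is a hedged sketch with made-up coordinates, not the data of FIG. 7. The `naive_search` function is a hypothetical stand-in for binary search applied directly to variable-length regions, and it is even given a neighbor scan so that the comparison is generous to it.

```python
def brute_force(regions, query):
    """Check every region; slow but never misses an overlap."""
    q_start, q_end = query
    return sorted(r for r in regions if r[0] <= q_end and q_start <= r[1])


def naive_search(regions, query):
    """Binary search on variable-length regions sorted by start, plus a neighbor scan."""
    regions = sorted(regions)
    q_start, q_end = query

    def overlaps(region):
        return region[0] <= q_end and q_start <= region[1]

    lo, hi, hit = 0, len(regions) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if regions[mid][0] > q_end:
            hi = mid - 1
        elif regions[mid][1] < q_start:
            lo = mid + 1
        else:
            hit = mid
            break
    if hit is None:
        return []
    found = [regions[hit]]
    i = hit - 1
    while i >= 0 and overlaps(regions[i]):  # stops at the first non-overlap
        found.append(regions[i])
        i -= 1
    i = hit + 1
    while i < len(regions) and overlaps(regions[i]):
        found.append(regions[i])
        i += 1
    return sorted(found)


regions = [(1, 100), (5, 10), (20, 30), (40, 50), (60, 70)]
query = (45, 65)
print(naive_search(regions, query))  # [(40, 50), (60, 70)]
print(brute_force(regions, query))   # [(1, 100), (40, 50), (60, 70)]
```

The long region (1, 100) is hidden behind shorter regions that start after it but end before the query, so both the binary search and the backward scan give up before reaching it. Cutting every region into equal-length pieces removes this case, because end positions then follow the same order as start positions.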

In the embodiment, by contrast, the comparison target region e7 overlapping the specific region x is identified by binary search. The system then checks whether the reference region e matched with e7 overlaps the specific region x, and compares the reference regions sorted before and after e7 with the specific region x in order until a non-overlapping region is reached.

In this way, the embodiment can retrieve the reference regions c, f, g, and i, which overlap the specific region x, without omission.

FIGS. 8 to 11 are diagrams showing the genomic region search system according to an embodiment of the present invention finding the regions whose sections overlap a specific region of chromosome 1.

Referring to FIG. 8, the length correction unit may correct a total of five reference regions on chromosome 1 into comparison target regions having a preset adjustment length.

The five reference regions, given as (original start position, original end position), are (10001, 13000), (10035, 10273), (11501, 12300), (10406, 12589), and (12601, 12900), and the preset adjustment length is 1000.

The reference region at (10001, 13000) is corrected into three comparison target regions: (10001, 11000), (11001, 12000), and (12001, 13000).

The reference region at (10035, 10273) is corrected into one comparison target region: (10035, 11034).

The reference region at (11501, 12300) is corrected into one comparison target region: (11501, 12500).

(10406, 12589) 위치를 가지는 참조영역은 (10406, 11405), (11406, 12405), (12406, 13405)로 보정된 3개의 비교대상영역으로 조합될 수 있다.The reference region having the position (10406, 12589) can be combined into three comparison target regions corrected by (10406, 11405), (11406, 12405), and (12406, 13405).

(12601, 12900) 위치를 가지는 참조영역은 (12601, 13600)으로 보정된 1개의 비교대상영역으로 조합될 수 있다.Reference regions having positions (12601, 12900) can be combined into one comparison target region corrected by (12601, 13600).
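
The correction step of FIG. 8 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `correct_length` is hypothetical, and the rule it applies (n pieces of length L starting at S, with n the smallest count that covers the region) is an assumption inferred from the worked values above.

```python
from math import ceil

def correct_length(start, end, adj_len):
    """Split one reference region into comparison target regions of equal length.

    Assumed rule (inferred from the FIG. 8 example): n is the smallest count
    such that n * adj_len covers the region; piece k spans
    [start + (k - 1) * adj_len, start + k * adj_len - 1].
    """
    n = ceil((end - start + 1) / adj_len)
    return [(start + (k - 1) * adj_len, start + k * adj_len - 1)
            for k in range(1, n + 1)]

# The five reference regions of chromosome 1 used in the example.
references = [(10001, 13000), (10035, 10273), (11501, 12300),
              (10406, 12589), (12601, 12900)]

for ref in references:
    print(ref, "->", correct_length(ref[0], ref[1], 1000))
```

Because the last piece may extend past the original end position, each comparison target region has to stay linked to its reference region so that the real overlap can be checked later.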

Referring to FIG. 9, the region alignment unit may sort the comparison target regions in ascending order of their start positions.
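
A minimal sketch of this sorting step, assuming each comparison target region is stored as a hypothetical tuple `(start, end, matched_reference)`. Since every comparison target region has the same length, sorting by start position also orders the end positions, which is what lets a plain binary search work afterwards.

```python
# Comparison target regions from the FIG. 8 example, each kept together with
# the reference region it was derived from (the tuple layout is an assumption).
comparison = [
    (10001, 11000, (10001, 13000)),
    (11001, 12000, (10001, 13000)),
    (12001, 13000, (10001, 13000)),
    (10035, 11034, (10035, 10273)),
    (11501, 12500, (11501, 12300)),
    (10406, 11405, (10406, 12589)),
    (11406, 12405, (10406, 12589)),
    (12406, 13405, (10406, 12589)),
    (12601, 13600, (12601, 12900)),
]

# Ascending order of start position.
comparison.sort(key=lambda region: region[0])

starts = [region[0] for region in comparison]
ends = [region[1] for region in comparison]
print(starts)
```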

Referring to FIG. 10, the binary search unit may specify, among the sorted comparison target regions, the comparison target region (12001, 13000), whose section overlaps the specific region (12600, 12800) to be searched.
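
The binary search of FIG. 10 might look like the sketch below. The function name `find_overlap` is hypothetical, and the midpoint is computed as `(lo + hi) // 2` rather than literally as the N/2-th element; the three-way comparison follows the description.

```python
def find_overlap(sorted_regions, q_start, q_end):
    """Return the index of one comparison target region overlapping the query,
    or None if no region overlaps. Regions must be equal-length (start, end)
    tuples sorted by start position."""
    lo, hi = 0, len(sorted_regions) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        start, end = sorted_regions[mid]
        if start > q_end:        # region lies entirely after the query
            hi = mid - 1
        elif q_start > end:      # region lies entirely before the query
            lo = mid + 1
        else:                    # sections overlap: stop here
            return mid
    return None

# Sorted comparison target regions from the FIG. 9 example.
sorted_regions = [
    (10001, 11000), (10035, 11034), (10406, 11405),
    (11001, 12000), (11406, 12405), (11501, 12500),
    (12001, 13000), (12406, 13405), (12601, 13600),
]

hit = find_overlap(sorted_regions, 12600, 12800)
print(hit, sorted_regions[hit])
```

The search returns whichever overlapping region it reaches first, so neighbouring regions still have to be examined in both directions afterwards.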

Next, referring to FIG. 11, the reference region search unit may take the specified comparison target region (12001, 13000) as the basis and compare its matched reference region, the reference regions aligned in front of it, and the reference regions aligned behind it with the specific region, thereby searching for reference regions whose sections overlap.

Here the reference region search unit may continue checking in each direction until a comparison target region no longer overlaps the specific region (12600, 12800); the regions checked are shown in gray.

Among the regions shown in gray, the reference regions whose sections overlap the specific region (12600, 12800) can be retrieved. The retrieved reference regions are (10001, 13000) and (12601, 12900), which are shown in bold.
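
The expansion around the specified comparison target region can be sketched as follows. The names `overlaps` and `search_references` and the `(start, end, matched_reference)` tuple layout are assumptions made for illustration; the starting index 6 is the position of (12001, 13000) in the sorted list.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True when two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def search_references(regions, hit, q_start, q_end):
    """Walk backward and forward from index `hit` while the comparison target
    regions still overlap the query, collecting every matched reference
    region that itself overlaps the query."""
    found = set()

    def check(i):
        start, end, ref = regions[i]
        if not overlaps(start, end, q_start, q_end):
            return False                      # stop walking in this direction
        if overlaps(ref[0], ref[1], q_start, q_end):
            found.add(ref)                    # the real reference region overlaps
        return True

    i = hit
    while i >= 0 and check(i):                # the hit itself, then regions in front
        i -= 1
    i = hit + 1
    while i < len(regions) and check(i):      # regions behind
        i += 1
    return sorted(found)

regions = [
    (10001, 11000, (10001, 13000)), (10035, 11034, (10035, 10273)),
    (10406, 11405, (10406, 12589)), (11001, 12000, (10001, 13000)),
    (11406, 12405, (10406, 12589)), (11501, 12500, (11501, 12300)),
    (12001, 13000, (10001, 13000)), (12406, 13405, (10406, 12589)),
    (12601, 13600, (12601, 12900)),
]

result = search_references(regions, 6, 12600, 12800)
print(result)
```

In this example the comparison target region (12406, 13405) overlaps the query, but its reference region (10406, 12589) does not, which is why the check against the matched reference region is needed.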

FIG. 12 is a diagram comparing the effects of comparative examples according to the prior art with those of the embodiment according to the present invention.

Referring to FIG. 12, the chart compares the speed and accuracy of the embodiment according to the present invention with those of comparative examples according to the prior art. For each of 25 chromosomes (chromosomes 1 to 22 and the X, Y, and MT chromosomes), 10000 regions were generated at random, and the size of each region was also chosen at random from 1 to 1000.

This produced a total of 25*10000=250000 regions, which were defined as the reference regions.

A further 250000 specific regions to be searched were generated in the same way, and the task of finding the reference regions that overlap each specific region was performed.
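
A test set of this shape could be generated roughly as shown below. This is only a sketch: the coordinate upper bound `max_pos`, the seed, and the function names are assumptions, since the text specifies only the number of regions and the size range. The brute-force checker is one way to obtain ground truth for measuring accuracy; it is not described in the source.

```python
import random

# 25 chromosomes: 1 to 22, X, Y, and MT.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]

def random_regions(per_chrom, max_size=1000, max_pos=1_000_000, seed=0):
    """Generate (chromosome, start, end) regions with sizes from 1 to max_size.
    max_pos is a hypothetical coordinate bound not given in the text."""
    rng = random.Random(seed)
    regions = []
    for chrom in CHROMOSOMES:
        for _ in range(per_chrom):
            size = rng.randint(1, max_size)
            start = rng.randint(1, max_pos)
            regions.append((chrom, start, start + size - 1))
    return regions

def brute_force(query, references):
    """Ground truth: every reference on the same chromosome that overlaps."""
    chrom, q_start, q_end = query
    return [r for r in references
            if r[0] == chrom and r[1] <= q_end and q_start <= r[2]]

# A small run; per_chrom=10000 would give the 250000 regions described above.
references = random_regions(per_chrom=100, seed=1)
queries = random_regions(per_chrom=100, seed=2)
print(len(references), len(queries))
```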

Comparative Example 1 is the binary interval search method (Layer et al., 2013), which uses binary search while determining the section lengths (intervals) of the specific region and the reference regions.

Comparative Example 2 is the 'binning' method used by BEDtools (Aaron and Ira, 2010), which divides the entire genome into bins of a fixed length, assigns each region to the bins it falls in, and then compares only regions assigned to the same bin.
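
For contrast, the binning idea described for Comparative Example 2 can be illustrated with a simplified fixed-width scheme. This is not BEDtools' actual implementation; the function names and the single bin size are assumptions chosen to keep the sketch short.

```python
from collections import defaultdict

def build_bins(references, bin_size):
    """Assign each (start, end) region to every fixed-width bin it touches."""
    bins = defaultdict(list)
    for start, end in references:
        for b in range(start // bin_size, end // bin_size + 1):
            bins[b].append((start, end))
    return bins

def query_bins(bins, bin_size, q_start, q_end):
    """Compare the query only with regions stored in the bins it touches."""
    hits = set()
    for b in range(q_start // bin_size, q_end // bin_size + 1):
        for start, end in bins.get(b, []):
            if start <= q_end and q_start <= end:
                hits.add((start, end))
    return sorted(hits)

references = [(10001, 13000), (10035, 10273), (11501, 12300),
              (10406, 12589), (12601, 12900)]
bins = build_bins(references, 1000)
binned = query_bins(bins, 1000, 12600, 12800)
print(binned)
```

In this sketch a long region is stored in several bins, so lookup cost depends on how the bin size relates to the region lengths.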

The execution times and accuracies of the embodiment, Comparative Example 1, and Comparative Example 2 show that the embodiment is superior to the comparative examples in both speed and accuracy.

The present invention described above is not limited to the foregoing embodiments and the accompanying drawings, and it will be apparent to those of ordinary skill in the art to which the present invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the present invention.

10: reference region 15: comparison target region
30: specific region 100: length correction unit
300: region alignment unit 500: binary search unit
700: reference region search unit 900: database
1000: genome region search system

Claims

1. A genome region search system for searching, among reference regions that consist of sequence data and have various section lengths (M) expressed by a start position and an end position, for a reference region whose section overlaps a specific region to be searched, the system comprising:
a length correction unit that corrects each reference region into a combination of comparison target regions having a preset adjustment length (L);
a region alignment unit that sorts the comparison target regions, each of whose section lengths is expressed by a start position and an end position, in ascending order of the start position of the comparison target region;
a binary search unit that uses a binary search method to specify, among the comparison target regions, a comparison target region whose section overlaps the specific region; and
a reference region search unit that searches for reference regions whose sections overlap the specific region by comparing the specific region with the reference regions aligned in front of the comparison target region specified by the binary search method and with the reference regions aligned behind that comparison target region.

2. The genome region search system of claim 1, wherein the length correction unit corrects the reference region into n comparison target regions satisfying Equation 1 below.
[Equation 1]
$(n - 1) \times L < E - S + 1 \le n \times L$
(where S is the start position of the reference region, E is the end position of the reference region, L is the adjustment length, and n is a constant)

3. The genome region search system of claim 2, wherein the start position of the comparison target region is
$S + (k - 1) \times L$
and the end position of the comparison target region is
$S + k \times L - 1$
(where S is the start position of the reference region, L is the adjustment length, and k is an integer that satisfies $1 \le k \le n$).

4. The genome region search system of claim 1, wherein the length correction unit matches each reference region with the comparison target regions corrected from it.

5. The genome region search system of claim 4, wherein the reference region search unit checks whether the section of the reference region matched with the comparison target region specified by the binary search method overlaps the section of the specific region.

6. The genome region search system of claim 1, wherein the binary search unit
compares the N/2-th comparison target region with the specific region,
if the start position of the N/2-th comparison target region is larger than the end position of the specific region, continues the search over the 1st to (N/2-1)-th comparison target regions,
if the start position of the specific region is larger than the end position of the N/2-th comparison target region, continues the search over the (N/2+1)-th to N-th comparison target regions, and
if the sections of the specific region and the comparison target region overlap, terminates the search and specifies that comparison target region.

7. The genome region search system of claim 1, wherein the reference region search unit
searches until the section of a comparison target region aligned in front of the specified comparison target region no longer overlaps the section of the specific region, and
searches until the section of a comparison target region aligned behind the specified comparison target region no longer overlaps the section of the specific region.