KR101521212B1 - Apparatus for sequence alignment based on graph data and method thereof

Info

Publication number: KR101521212B1
Application number: KR1020140054131A
Authority: KR
Inventors: 박상현; 이준수; 여윤구; 노홍찬; 윤영미
Original assignee: 연세대학교 산학협력단
Priority date: 2014-05-07
Filing date: 2014-05-07
Publication date: 2015-05-18
Also published as: WO2015170816A1

Abstract

Disclosed are an apparatus and a method for base sequence alignment based on graph data. The apparatus according to the present invention comprises: a reference graph generation unit which converts base sequence data into graph data and generates reference graph data from the converted graph data; a read graph generation unit which generates read graph data from the generated reference graph data; a candidate detection unit which detects error-tolerant candidate paths in the generated read graph data; and a final candidate detection unit which detects one of the detected candidate paths as a final candidate path.

Description

APPARATUS FOR SEQUENCE ALIGNMENT BASED ON GRAPH DATA AND METHOD THEREOF

The present invention relates to a base sequence alignment algorithm, and more particularly, to an apparatus and a method for base sequence alignment based on graph data converted from sequence data.

Since the Genome Project of 2003, genome analysis technology has advanced rapidly. In the early days of the technology, analyzing an entire human genome consumed astronomical amounts of money and time, but with the rapid development of so-called Next-Generation Sequencing technology, the time and cost required for analysis have fallen sharply.

Along with advances in genetic information analysis technology, the importance of base sequence alignment algorithms has also come to the fore. A base sequence alignment algorithm compares the differences between two sequences by aligning one sequence to a reference sequence.

Such base sequence alignment algorithms are widely used throughout biology, for example in genome re-sequencing, the search for genomic variations, the search for transcription factor binding sites (TFBS), and the investigation of the functions of newly discovered genes or proteins.

Using rapidly advancing next-generation sequencing technology, the sequence under study can be mass-produced in the form of reads, which are short base sequences. By aligning these reads to the genome and comparing the differences, the base sequence of interest can be analyzed quickly and inexpensively.

The main requirements for a base sequence alignment algorithm are throughput and alignment quality. Of these, throughput is pressing: read output is growing explosively as next-generation sequencing technology advances, so a sequence alignment algorithm with throughput high enough to keep up is needed.

Accordingly, to solve these problems of the prior art, an object of the present invention is to provide an apparatus and a method for base sequence alignment based on graph data, which generate reference graph data by converting sequence data, generate read graph data from the generated reference graph data, and find final candidate data using the generated read graph data.

However, the objects of the present invention are not limited to the one mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above objects, an apparatus for base sequence alignment based on graph data according to one aspect of the present invention may include: a reference graph generation unit which converts base sequence data into graph data and generates reference graph data as a result of the conversion; a read graph generation unit which generates read graph data from the generated reference graph data; a candidate detection unit which detects error-tolerant candidate paths in the generated read graph data; and a final candidate detection unit which detects one of the detected candidate paths as a final candidate path.

Preferably, the reference graph generation unit generates cut sequence data by sequentially cutting the base sequence data up to (k-1) times starting from the first data item, divides each piece of cut sequence data into k-mers to generate k-mer sequence data composed of a plurality of nodes, and converts the generated k-mer sequence data into a single set of graph data to generate the reference graph.

Preferably, the reference graph generation unit divides each piece of cut sequence data into 3-mers to generate 3-mer sequence data composed of a plurality of nodes.

Preferably, the read graph generation unit divides read sequence data into k-mers of the same k used to generate the reference graph data, and retrieves the nodes corresponding to that k-mer sequence data from the reference graph data to generate the read graph data.

Preferably, the candidate detection unit detects, as a candidate path, the longest connectable path in the generated read graph data that tolerates errors of fewer than n hops.

Preferably, the candidate detection unit checks whether nodes located fewer than n hops from the last node of each edge in the generated read graph can be connected, and, based on the result, adds a virtual path between two connectable nodes and merges the offset information of the two nodes into the virtual path, thereby detecting the candidate path.

Preferably, the final candidate detection unit detects one final candidate path among the candidate paths using a glocal alignment algorithm.

A method for base sequence alignment based on graph data according to another aspect of the present invention may include: converting base sequence data into graph data and generating reference graph data as a result of the conversion; generating read graph data from the generated reference graph data; detecting error-tolerant candidate paths in the generated read graph data; and detecting one of the detected candidate paths as a final candidate path.

Preferably, the step of generating the reference graph data generates cut sequence data by sequentially cutting the base sequence data up to (k-1) times starting from the first data item, divides each piece of cut sequence data into k-mers to generate k-mer sequence data composed of a plurality of nodes, and converts the generated k-mer sequence data into a single set of graph data to generate the reference graph.

Preferably, the step of generating the reference graph data divides each piece of cut sequence data into 3-mers to generate 3-mer sequence data composed of a plurality of nodes.

Preferably, the step of generating the read graph data divides read sequence data into k-mers of the same k used to generate the reference graph data, and retrieves the nodes corresponding to that k-mer sequence data from the reference graph data to generate the read graph data.

Preferably, the step of detecting the candidate paths detects, as a candidate path, the longest connectable path in the generated read graph data that tolerates errors of fewer than n hops.

Preferably, the step of detecting the candidate paths checks whether nodes located fewer than n hops from the last node of each edge in the generated read graph can be connected, and, based on the result, adds a virtual path between two connectable nodes and merges the offset information of the two nodes into the virtual path, thereby detecting the candidate path.

Preferably, the step of detecting the final candidate path detects one final candidate path among the candidate paths using a glocal alignment algorithm.

In this way, the present invention generates reference graph data by converting sequence data, generates read graph data from the generated reference graph data, and finds final candidate data using the generated read graph data, which has the effect of allowing the base sequence of interest to be analyzed quickly.

FIG. 1 is a diagram illustrating an apparatus for base sequence alignment according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a method for base sequence alignment according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a process of converting sequence data into reference graph data.
FIG. 4 is a diagram for explaining a process of generating read graph data from reference graph data.
FIG. 5 is a diagram for explaining a process of detecting candidate paths in read graph data.
FIG. 6 is a diagram for explaining a process of detecting a final candidate path among the candidate paths.
FIG. 7 is a diagram illustrating a sequence alignment process using Trinity according to an embodiment of the present invention.

Hereinafter, an apparatus and a method for base sequence alignment based on graph data according to an embodiment of the present invention will be described with reference to the accompanying drawings. The description focuses on the parts needed to understand the operation and effect of the present invention.

In describing the components of the present invention, components of the same name may be given different reference numerals depending on the drawing, and the same reference numeral may be used even in different drawings. Even in such a case, however, this does not mean that the component has different functions depending on the embodiment, or that it has the same function in different embodiments; the function of each component should be determined from the description of that component in the corresponding embodiment.

In particular, the present invention proposes a new scheme that generates reference graph data by converting sequence data, generates read graph data from the generated reference graph data, and finds final candidate data using the generated read graph data.

FIG. 1 is a diagram illustrating an apparatus for base sequence alignment according to an embodiment of the present invention.

As shown in FIG. 1, the apparatus for base sequence alignment according to the present invention may include a reference graph generation unit 110, a read graph generation unit 120, a candidate detection unit 130, a final candidate detection unit 140, and the like.

The reference graph generation unit 110 may convert base sequence data (sequence data) into graph data and generate reference graph data as a result of the conversion.

The read graph generation unit 120 may generate read graph data from the generated reference graph data. Specifically, the read graph generation unit 120 extracts read sequence data from the reference graph data and generates the read graph data using the extracted read sequence data.

The candidate detection unit 130 may detect error-tolerant candidate paths in the generated read graph data. Specifically, it detects, as a candidate path, the longest connectable path in the generated read graph data that tolerates errors of fewer than a preset n hops.

The final candidate detection unit 140 may detect one final candidate path among the detected candidate paths. Specifically, it uses a glocal alignment algorithm to detect the candidate path with the highest score among the detected candidate paths as the single final candidate path.

Here, "glocal alignment" is a portmanteau of global alignment and local alignment; the algorithm is based on the Needleman-Wunsch algorithm and the Smith-Waterman algorithm.

FIG. 2 is a diagram illustrating a method for base sequence alignment according to an embodiment of the present invention.

As shown in FIG. 2, the apparatus for base sequence alignment according to the present invention (hereinafter, the base sequence alignment apparatus) may convert sequence data into graph data and generate reference graph data as a result of the conversion (S210).

FIG. 3 is a diagram for explaining a process of converting sequence data into reference graph data.

As shown in FIG. 3, the original sequence data (310) is cut sequentially 0, 1, 2, …, (k-1) times starting from the first data item, producing the cut sequence data.

For example, assume the original sequence data is “AACCTTGGGAACCTTTG…”. When it is cut 0 times, nothing is removed, so the cut sequence data is the original sequence data (311).

When it is cut once, one item is removed starting from the first data item, giving the cut sequence data “ACCTTGGGAACCTTTG…” (312); when it is cut twice, two items are removed, giving “CCTTGGGAACCTTTG…” (313).

In this manner, up to (k-1) items are cut sequentially, starting from the first data item.

Each piece of cut sequence data is then divided into k-mers, producing k-mer sequence data composed of a plurality of nodes. Here, k is preferably 3, but is not necessarily limited to this value.

For example, the sequence data cut 0 times, “AACCTTGGGAACCTTTG…”, is divided into k-mers to produce the k-mer sequence data “[AAC]-[CTT]-[GGG]-[AAC]-[CTT]-…”; the sequence data cut once, “ACCTTGGGAACCTTTG…”, produces “[ACC]-[TTG]-[GGA]-[ACC]-[TTT]-…”; and the sequence data cut twice, “CCTTGGGAACCTTTG…”, produces “[CCT]-[TGG]-[GAA]-[CCT]-[TTG]-…”.
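The cutting and splitting just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `shifted_kmer_chunks` is hypothetical, the input is just the visible prefix of the example sequence (the text elides the rest with "…"), and dropping a trailing chunk shorter than k is an assumption, since the text does not say how an incomplete final chunk is handled.

```python
def shifted_kmer_chunks(sequence, k=3):
    """Cut 0..k-1 leading bases off `sequence`, then split each result
    into consecutive, non-overlapping k-mers.

    Assumption: a trailing chunk shorter than k is dropped.
    """
    rows = []
    for shift in range(k):
        trimmed = sequence[shift:]
        usable = len(trimmed) - len(trimmed) % k  # length covered by whole k-mers
        rows.append([trimmed[i:i + k] for i in range(0, usable, k)])
    return rows


rows = shifted_kmer_chunks("AACCTTGGGAACCTTTG", k=3)
for shift, row in enumerate(rows):
    print(shift, "-".join(f"[{kmer}]" for kmer in row))
```

For k = 3 the three printed rows reproduce the three k-mer chains listed in the example above.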

Here, an edge exists between each pair of adjacent nodes, and each edge carries offset information. The offset information indicates a position in the reference sequence. For example, the offset '1' between [AAC] and [CTT] in FIG. 3 indicates the first position in the reference sequence.

Nodes constructed in this way can later be used as an index corresponding to read sequences.

The k-mer sequence data composed of a plurality of nodes generated through this process is converted into a single set of graph data, and the reference graph is generated as a result.
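One possible in-memory shape for such a reference graph is a dictionary keyed by edge. The sketch below is an assumption-laden illustration, not the representation used by the invention: `build_reference_graph` is a hypothetical name, and the offset stored on an edge is taken to be the 1-based start position of the edge's first k-mer, a convention chosen only because it agrees with the offset '1' on the [AAC]-[CTT] edge mentioned above.

```python
from collections import defaultdict


def build_reference_graph(reference, k=3):
    """Map each edge (k-mer, next k-mer) to the list of reference offsets
    at which it occurs.

    Assumption: an offset is the 1-based start position of the edge's
    first k-mer in the reference sequence.
    """
    edges = defaultdict(list)
    for shift in range(k):
        # Walk the non-overlapping k-mers of the sequence cut `shift` times.
        for start in range(shift, len(reference) - 2 * k + 1, k):
            left = reference[start:start + k]
            right = reference[start + k:start + 2 * k]
            edges[(left, right)].append(start + 1)
    return dict(edges)


graph = build_reference_graph("AACCTTGGGAACCTTTG")
print(graph[("AAC", "CTT")])  # the edge occurs twice in this fragment
```

Because identical k-mer pairs collapse onto one key, a repeated pair such as [AAC]-[CTT] becomes a single edge carrying several offsets.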

Next, the base sequence alignment apparatus may generate read graph data from the generated reference graph data (S220).

FIG. 4 is a diagram for explaining a process of generating read graph data from reference graph data.

FIG. 4 shows an example of finding where the read sequence data “…ATCATGtATTGATGT TGGCAG…” aligns to the reference sequence data “…ATCATGCATTGATGTaTGGCAG…”. Here, in the read sequence data, C has been substituted with t and a has been deleted.

To generate read graph data from the reference graph data, the read sequence data is first divided into k-mers of the same k used to generate the reference graph data, and the nodes corresponding to that k-mer sequence data are retrieved from the reference graph data to generate the read graph data.

For example, ATCATG in the read sequence is divided into 3-mers to obtain the relation ATC->ATG, and the nodes corresponding to ATC and ATG are then extracted from the reference graph data. The offset information on the corresponding edge of the reference graph data is extracted as well, and the two nodes are connected.

Through this process the read graph data is finally generated; this is carried out for each read.
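The lookup described above can be sketched as follows. The names `reference_edges` and `build_read_graph` are hypothetical, the offset convention (1-based start of the edge's first k-mer) is an assumption, and the inputs are only the visible fragments of the FIG. 4 example written in upper case.

```python
from collections import defaultdict


def reference_edges(reference, k=3):
    """All (k-mer, next k-mer) edges of the reference with 1-based offsets.
    Scanning every start position covers all k cut variants at once."""
    edges = defaultdict(list)
    for start in range(len(reference) - 2 * k + 1):
        pair = (reference[start:start + k], reference[start + k:start + 2 * k])
        edges[pair].append(start + 1)
    return dict(edges)


def build_read_graph(read, ref_edges, k=3):
    """Split the read into non-overlapping k-mers and keep each consecutive
    pair that exists as an edge in the reference, with that edge's offsets."""
    kmers = [read[i:i + k] for i in range(0, len(read) - k + 1, k)]
    graph = []
    for left, right in zip(kmers, kmers[1:]):
        offsets = ref_edges.get((left, right))
        if offsets:
            graph.append((left, right, offsets))
    return graph


ref_edges = reference_edges("ATCATGCATTGATGTATGGCAG")
read_graph = build_read_graph("ATCATGTATTGATGTTGGCAG", ref_edges)
for edge in read_graph:
    print(edge)
```

On this fragment the sketch finds ATC->ATG at offset 1, TGA->TGT at offset 10, and TGG->CAG at offset 17, with the edges around the substitution and the deletion missing. It also reports ATG->TAT at offset 12, because 3-mers are short enough to repeat by chance within the reference.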

Next, the base sequence alignment apparatus may detect error-tolerant candidate paths in the generated read graph data (S230). That is, it detects, as a candidate path, the longest connectable path in the generated read graph data that tolerates errors of fewer than a preset n hops. Here, a path means two or more connected edges.

FIGS. 5a and 5b are diagrams for explaining a process of detecting candidate paths in the read graph data.

As shown in FIGS. 5a and 5b, the longest connectable path that tolerates errors within the read graph is detected as a candidate path. The reason for finding the longest error-tolerant path is to connect edges that have been disconnected by polymorphism.

To find such candidates, for each edge in the read graph, the nodes located fewer than n hops from the last node of that edge are examined, and the two offsets are compared to check whether the nodes can be connected.

A new virtual path is added between two connectable nodes, and the information of the two offsets is merged into it. If a polymorphism exists between the two nodes, the polymorphism information between them is stored in the new virtual path.

Referring to FIG. 5a, connectability is checked between the ATG node and the nodes fewer than 2 hops from it. The TGA node is not within fewer than 2 hops of the ATG node, so its connectability is not checked.

The TGG node, on the other hand, is fewer than 2 hops from the TGT node, so connectability is checked. That is, a new virtual path is added from the TGA node connected to the TGT node to the CAG node connected to the TGG node, and the offset information of the two nodes TGA and CAG, namely 10, 60, 200, 17, 67, 501, is merged and stored in the added virtual path.

In addition, because a polymorphism exists between the TGT node and the TGG node, the polymorphism information is stored in the new virtual path from the TGA node to the CAG node.

As a result, the new virtual path from the TGA node to the CAG node is added to the initial read graph data, and the path from the TGA node to the CAG node becomes one candidate path.

Referring to FIG. 5b, connectability is checked between the ATG node and the nodes fewer than 3 hops from it. The TGA node is fewer than 3 hops from the ATG node, so connectability is checked.

Likewise, the TGG node is fewer than 2 hops from the TGT node, so connectability is checked.

That is, a new virtual path is added from the ATC node connected to the ATG node to the CAG node connected to the TGG node, and the offset information of the three nodes ATG, TGT, and CAG, namely 1, 24, 51, 10, 60, 200, 17, 67, 501, is merged and stored in the added virtual path.
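The hop rule in the two examples above can be illustrated with a small sketch. Several things here are assumptions rather than the patented procedure: the function name `candidate_paths` is hypothetical, the read's k-mer order is taken from the FIG. 4 read, the grouping of the quoted offsets into three per edge is an inference from the text, and the sketch applies only the "fewer than n hops" rule and simply concatenates offsets, omitting the offset comparison and the storage of polymorphism information.

```python
def candidate_paths(read_kmers, edge_offsets, n):
    """Group matched read edges into candidate paths.

    edge_offsets[i] holds the reference offsets of the edge
    read_kmers[i] -> read_kmers[i + 1], or None when the reference graph
    has no such edge. A later matched edge is joined to the current path
    when its first node is fewer than n hops from the path's last node.
    Returns (first node, last node, merged offsets) for each group.
    """
    groups = []
    current = None  # (first node index, last node index, merged offsets)
    for i, offsets in enumerate(edge_offsets):
        if offsets is None:
            continue
        if current is not None and i - current[1] < n:
            current = (current[0], i + 1, current[2] + list(offsets))
        else:
            if current is not None:
                groups.append(current)
            current = (i, i + 1, list(offsets))
    if current is not None:
        groups.append(current)
    return [(read_kmers[a], read_kmers[b], offs) for a, b, offs in groups]


kmers = ["ATC", "ATG", "TAT", "TGA", "TGT", "TGG", "CAG"]
offsets = [[1, 24, 51], None, None, [10, 60, 200], None, [17, 67, 501]]
two_hop = candidate_paths(kmers, offsets, n=2)
three_hop = candidate_paths(kmers, offsets, n=3)
print(two_hop)
print(three_hop)
```

With n = 2 the TGA->TGT and TGG->CAG edges merge into one path from TGA to CAG, while ATC->ATG stays separate; with n = 3 everything joins into a single path from ATC to CAG. Since the text defines a path as two or more connected edges, a caller might discard groups that still consist of a single edge.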

Next, the base sequence alignment apparatus may detect one final candidate path among the detected candidate paths (S240). That is, it performs glocal alignment on the candidate paths to obtain the single final candidate path with the highest score.

FIG. 6 is a diagram for explaining a process of detecting a final candidate path among the candidate paths.

As shown in FIG. 6, glocal alignment performs global alignment or local alignment depending on the situation. Global alignment is used to align the sequence between seeds (exactly aligned positions), and local alignment is used to align the region before the first seed and the region after the last seed.
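For readers unfamiliar with the two building blocks, the sketch below gives textbook score-only versions of Needleman-Wunsch (global) and Smith-Waterman (local) alignment. It is not the glocal procedure of the invention: the function names are hypothetical, the scoring values (match +1, mismatch -1, gap -1) are assumptions, and the way seeds select which routine runs on which region is not modelled.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score aligning all of `a` against all of `b`."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of `a` and `b`."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


# Region between two seeds: one substitution separates the sequences.
print(global_score("ATGCATTGA", "ATGTATTGA"))
# Region outside the seeds: only the shared core "ACG" contributes.
print(local_score("TTTACGTTT", "GGACGGG"))
```

The global routine charges for every unaligned base, which suits a region pinned at both ends by seeds; the local routine can ignore poorly matching flanks, which suits the open-ended regions before the first seed and after the last one.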

Through this glocal alignment, the final candidate node with the highest score is detected.

Meanwhile, the base sequence alignment method according to the present invention can perform sequence alignment using Trinity, an in-memory graph-based distributed processing system well suited to processing graph-structured data.

FIG. 7 is a diagram illustrating a sequence alignment process using Trinity according to an embodiment of the present invention.

As shown in FIG. 7, the distributed processing system Trinity used in the present invention is structured, in simple terms, into three node types by role: client, slave, and proxy.

A client sends requests to a slave or proxy and receives the responses. Slaves store the graph data and perform the actual computation. A proxy acts as an intermediate layer between the slaves and the client: it merges the results received from the slaves and sends the final result to the client.

The base sequence alignment method according to the present invention, that is, the BulkAligner method, consists of two steps in total.

In the first step (step 1), the client or proxy sends a message to each slave, and each slave that receives the message loads the reference data on its disk into memory and generates the reference graph.

In the second step (step 2), the client or proxy sends read queries for sequence alignment to each slave, and each slave performs sequence alignment for the read queries. The client or proxy then receives the results of the read queries from each slave and merges them.
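The two-step flow can be mimicked in a single process to show the division of labour. The sketch below uses plain Python objects and does not call any Trinity API; the class names `Slave` and `Proxy`, their methods, and the sample data are all hypothetical, and the "alignment" is reduced to a lookup of the read's first k-mer as a placeholder.

```python
class Slave:
    """Hypothetical stand-in for a slave node holding one part of the reference."""

    def __init__(self, name, stored_reference):
        self.name = name
        self.stored_reference = stored_reference  # plays the role of data on disk
        self.index = None                         # built in step 1

    def load_reference(self, k=3):
        # Step 1: bring the reference into memory as a k-mer -> positions index.
        self.index = {}
        for pos in range(len(self.stored_reference) - k + 1):
            kmer = self.stored_reference[pos:pos + k]
            self.index.setdefault(kmer, []).append(pos + 1)

    def align(self, read, k=3):
        # Step 2 placeholder: report where the read's first k-mer occurs.
        return [(self.name, pos) for pos in self.index.get(read[:k], [])]


class Proxy:
    """Hypothetical coordinator that fans requests out and merges the replies."""

    def __init__(self, slaves):
        self.slaves = slaves

    def load_all(self):
        for slave in self.slaves:
            slave.load_reference()

    def query(self, reads):
        return {read: [hit for slave in self.slaves for hit in slave.align(read)]
                for read in reads}


proxy = Proxy([Slave("slave-0", "AACCTTGGG"), Slave("slave-1", "AACCTTTG")])
proxy.load_all()                          # step 1
results = proxy.query(["AAC", "TTT"])     # step 2
print(results)
```

Positions in the result are local to each slave's stored fragment; a real deployment would also need to map them back to global reference coordinates.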

Meanwhile, even though all the components constituting the embodiments of the present invention described above are described as being combined into one or operating in combination, the present invention is not necessarily limited to such embodiments. That is, within the scope of the object of the present invention, one or more of the components may be selectively combined and operated. In addition, although each of the components may be implemented as a single independent piece of hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. Such a computer program may be stored in computer-readable media such as a USB memory, a CD, or a flash memory, and read and executed by a computer to implement embodiments of the present invention. The storage media for the computer program may include magnetic recording media, optical recording media, carrier wave media, and the like.

The embodiments described above are merely examples, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of rights of the present invention.

110: reference graph generation unit
120: read graph generation unit
130: candidate detection unit
140: final candidate detection unit

Claims

1. An apparatus for base sequence alignment based on graph data, comprising:
a reference graph generation unit that converts base sequence data into graph data and generates reference graph data from the result of the conversion;
a read graph generation unit that generates read graph data from the generated reference graph data;
a candidate detection unit that detects candidate paths that allow errors in the generated read graph data; and
a final candidate detection unit that detects one of the detected candidate paths as a final candidate path,
wherein the reference graph generation unit converts the entire original base sequence data into a single set of graph data and generates the reference graph data as the result of the conversion, and
wherein the candidate detection unit checks, in the generated read graph data, whether a connection is possible between nodes located within n hops of the last node of each trunk, and detects a long connectable path as a candidate path.

2. The apparatus of claim 1,
wherein the reference graph generation unit:
generates cut sequence data by sequentially cutting the base sequence data up to (k-1) times, with the first data as the reference point;
generates k-mer sequence data composed of a plurality of nodes by dividing each piece of the cut sequence data into k-mers; and
converts the generated k-mer sequence data into a single set of graph data and generates the reference graph from the result of the conversion.
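
The construction recited in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `build_reference_graph`, the choice to key nodes by k-mer string, the stored start offsets, and the rule that an edge links two k-mers that are consecutive within the same cut sequence are all assumptions made for illustration.

```python
from collections import defaultdict


def build_reference_graph(sequence, k=3):
    """Hypothetical sketch: merge the k-mers of all cut sequences into one graph.

    Assumed representation (not taken from the patent text):
      offsets[kmer] -> start positions of that k-mer in the original sequence
      edges[kmer]   -> k-mers that directly follow it in some cut sequence
    """
    offsets = defaultdict(list)
    edges = defaultdict(set)
    # shift 0 is the uncut sequence; shifts 1..k-1 are the (k-1) successive cuts
    for shift in range(k):
        previous = None
        # divide the cut sequence into consecutive, non-overlapping k-mers
        for start in range(shift, len(sequence) - k + 1, k):
            kmer = sequence[start:start + k]
            offsets[kmer].append(start)
            if previous is not None:
                edges[previous].add(kmer)
            previous = kmer
    return dict(offsets), {node: sorted(nxt) for node, nxt in edges.items()}


offsets, edges = build_reference_graph("ACGTACGT", k=3)
print(offsets)
print(edges)
```

Under these assumptions, the k shifted sequences together cover every start position of the original sequence, so a single graph holds every k-mer occurrence.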

3. The apparatus of claim 2,
wherein the reference graph generation unit
generates 3-mer sequence data composed of a plurality of nodes by dividing each piece of the cut sequence data into 3-mers.

4. The apparatus of claim 1,
wherein the read graph generation unit
divides read sequence data into k-mers of the same length used to generate the reference graph data, and retrieves the nodes corresponding to the resulting k-mer sequence data from the reference graph data to generate the read graph data.
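
The lookup recited in this claim can be sketched as follows, as an illustration rather than the patented implementation. The helper names `index_reference` and `build_read_graph`, the flat k-mer index standing in for the reference graph, and the choice to split the read into non-overlapping k-mers are assumptions; the claim does not state whether the read k-mers overlap.

```python
from collections import defaultdict


def index_reference(reference, k):
    """Map each k-mer to every start offset at which it occurs in the reference."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return dict(index)


def build_read_graph(read, index, k):
    """Hypothetical sketch: fetch the reference nodes matching each read k-mer.

    Returns a list of (read_position, kmer, reference_offsets) tuples.
    """
    read_graph = []
    # assumption: the read is divided into non-overlapping k-mers
    for read_pos in range(0, len(read) - k + 1, k):
        kmer = read[read_pos:read_pos + k]
        # a k-mer absent from the reference (e.g. a read error) yields no nodes
        read_graph.append((read_pos, kmer, index.get(kmer, [])))
    return read_graph


index = index_reference("ACGTACGT", k=3)
for entry in build_read_graph("CGTACG", index, k=3):
    print(entry)
```

A read k-mer with an empty offset list marks a gap in the read graph; gaps of this kind are what the candidate detection step would need to bridge.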

5. (Deleted)

6. The apparatus of claim 1,
wherein the candidate detection unit:
checks whether a connection is possible between nodes located within n hops of the last node of each trunk in the generated read graph; and
adds a virtual path between two nodes found to be connectable as a result of the check, and detects the path with the added virtual path as the candidate path based on offset information of the two nodes.

7. The apparatus of claim 1,
wherein the final candidate detection unit
detects one final candidate path among the candidate paths using a glocal alignment algorithm.

8. A method for base sequence alignment based on graph data, comprising:
converting base sequence data into graph data and generating reference graph data from the result of the conversion;
generating read graph data from the generated reference graph data;
detecting candidate paths that allow errors in the generated read graph data; and
detecting one of the detected candidate paths as a final candidate path,
wherein generating the reference graph data comprises converting the entire original base sequence data into a single set of graph data and generating the reference graph data as the result of the conversion, and
wherein detecting the candidate paths comprises checking, in the generated read graph data, whether a connection is possible between nodes located within n hops of the last node of each trunk, and detecting a long connectable path as a candidate path.

9. The method of claim 8,
wherein generating the reference graph data comprises:
generating cut sequence data by sequentially cutting the base sequence data up to (k-1) times, with the first data as the reference point;
generating k-mer sequence data composed of a plurality of nodes by dividing each piece of the cut sequence data into k-mers; and
converting the generated k-mer sequence data into a single set of graph data and generating the reference graph from the result of the conversion.

10. The method of claim 9,
wherein generating the reference graph data comprises
generating 3-mer sequence data composed of a plurality of nodes by dividing each piece of the cut sequence data into 3-mers.

11. The method of claim 8,
wherein generating the read graph data comprises
dividing read sequence data into k-mers of the same length used to generate the reference graph data, and retrieving the nodes corresponding to the resulting k-mer sequence data from the reference graph data to generate the read graph data.

12. (Deleted)

13. The method of claim 8,
wherein detecting the candidate paths comprises:
checking whether a connection is possible between nodes located within n hops of the last node of each trunk in the generated read graph; and
adding a virtual path between two nodes found to be connectable as a result of the check, and detecting the path with the added virtual path as the candidate path based on offset information of the two nodes.

14. The method of claim 8,
wherein detecting the final candidate path comprises
detecting one final candidate path among the candidate paths using a glocal alignment algorithm.