KR20190052416A - Method and apparatus for performing join operation of sequence data in a map reduce environment

Info

Publication number: KR20190052416A
Application number: KR1020170148083A
Authority: KR
Inventors: 박경현; 원희선
Original assignee: 한국전자통신연구원
Priority date: 2017-11-08
Filing date: 2017-11-08
Publication date: 2019-05-16

Abstract

Disclosed are a method for performing a join operation on sequence data in a MapReduce environment, and a device therefor. According to the present invention, the method, executed in a MapReduce distributed processing system, comprises a map phase and a reduce phase that receives, as list values, the values associated with the same key in the output of the map phase. The map phase comprises the following steps: acquiring, as an input value, first subsequence data divided from first sequence data; pairing the partition information of all partitions of second sequence data, which is to be joined with the first sequence data, with the partition information indicating the first subsequence data to generate at least one output key; and coupling the at least one generated output key with the input value to output the coupled value. Accordingly, join operations can be efficiently supported in a MapReduce environment.

Description

The present invention relates to a method and apparatus for performing a join operation on sequence data in a MapReduce environment and, more particularly, to a method that supports join operations in a MapReduce environment by distributing the sequence data evenly before the join operation is performed.

MapReduce is a software framework that Google developed for processing large volumes of data with distributed parallel computing and announced in 2004. MapReduce was developed to support parallel processing of petabyte-scale or larger data on clusters built from unreliable computers. By handling the complex issues that parallel processing must address, such as parallelization, failure recovery, and load balancing, it has become the most popular framework for large-scale distributed parallel processing.

MapReduce is built on two functions, Map and Reduce, that are commonly used in functional programming. The map function receives a (key, value) pair as input, performs the work of the user-defined map function, and outputs another key-value pair.

The reduce function receives a key and the list of values for that key, (key, list(values)), as input and outputs another key-value pair, (key2, value2).
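The two signatures described above can be illustrated with a toy, single-process driver. This is only a sketch for orientation: `run_mapreduce`, `map_fn`, `reduce_fn`, and the per-host traffic records are hypothetical names and data, not part of the disclosure or of any real MapReduce API.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: apply map_fn, group outputs by key, then apply reduce_fn."""
    groups = defaultdict(list)
    for key, value in records:
        # map: (key, value) -> zero or more (key, value) pairs
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        # reduce: (key, list(values)) -> zero or more (key2, value2) pairs
        results.extend(reduce_fn(key, groups[key]))
    return results


# Hypothetical example: total traffic volume per host.
def map_fn(_record_no, record):
    host, volume = record
    yield host, volume


def reduce_fn(host, volumes):
    yield host, sum(volumes)


records = [(0, ("a", 3)), (1, ("b", 5)), (2, ("a", 4))]
totals = run_mapreduce(records, map_fn, reduce_fn)
```

A real framework runs the map and reduce calls on different nodes; the grouping in the middle is what the shuffling step provides.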

Meanwhile, sequence data such as stock prices and network traffic are widely used in database applications. In particular, the sequence join operation is widely used to search for sequence data similar to query sequence data.

However, current MapReduce does not support data models, indexes, query languages, and the like, so it cannot provide the functions that conventional databases support. In particular, it does not support join operations.

An object of the present invention, devised to solve the above problems, is to provide a method of performing a join operation on sequence data in a MapReduce environment.

Another object of the present invention, devised to solve the above problems, is to provide a MapReduce distributed processing system that performs a join operation on sequence data.

To achieve the above object, one aspect of the present invention provides a method of performing a join operation on sequence data, executed in a MapReduce distributed processing system.

Here, the method of performing a join operation on sequence data executed in the MapReduce distributed processing system may include a map phase and a reduce phase that receives, as list values, the values associated with the same key in the output of the map phase.

Here, the method of performing the join operation may further include a shuffling step after the map phase. The shuffling step may include generating list values from the values associated with the same key in the output of the map phase.

Here, the map phase may include: obtaining, as an input value, first subsequence data obtained by dividing first sequence data; generating at least one output key by pairing all partition information of second sequence data, which is to be joined with the first sequence data, with the partition information indicating the first subsequence data; and combining the at least one generated output key with the input value and outputting the result.

Here, the reduce phase may include: receiving the list values; separating the received list values into subsequence data of the first sequence data and subsequence data of the second sequence data; and performing a join operation between the separated subsequence data of the first sequence data and the subsequence data of the second sequence data.

Here, if the first sequence data is the DB sequence data, the second sequence data is the query sequence data, and if the first sequence data is the query sequence data, the second sequence data may be the DB sequence data.

Here, the step of separating the list values into subsequence data of the first sequence data and subsequence data of the second sequence data may be a step of determining whether each piece of subsequence data belongs to the query sequence data or to the DB sequence data.

Here, the reduce phase may be performed individually on each node of the MapReduce distributed processing system.

Here, the step of separating the list values into subsequence data of the first sequence data and subsequence data of the second sequence data may include storing the separated subsequence data of the first sequence data and the subsequence data of the second sequence data in a query buffer and a data buffer, respectively.

Here, the step of performing a join operation between the subsequence data of the first sequence data and the subsequence data of the second sequence data may further include running the SM-T algorithm with the subsequence data of the first sequence data and the subsequence data of the second sequence data as the query sequence data and the DB sequence data.

Here, the step of performing a join operation between the subsequence data of the first sequence data and the subsequence data of the second sequence data may further include running the SM-D algorithm with the subsequence data of the first sequence data and the subsequence data of the second sequence data as the query sequence data and the DB sequence data.

To achieve the above object, another aspect of the present invention provides a MapReduce distributed processing system that performs a join operation on sequence data.

Here, the MapReduce distributed processing system may include at least one processor and a memory storing instructions that direct the at least one processor to perform at least one step.

Here, the at least one step may include a map phase and a reduce phase that receives, as list values, the values associated with the same key in the output of the map phase.

Here, the at least one step may further include a shuffling step after the map phase.

Here, the shuffling step may include generating list values from the values associated with the same key in the output of the map phase.

With the method of performing a join operation on sequence data in a MapReduce distributed processing system, and with the MapReduce distributed processing system, according to the present invention as described above, work is distributed fairly across the reducers, so the computing power of the MapReduce environment can be used to the fullest.

Accordingly, the cost of performing a join operation in a MapReduce environment can be reduced.

FIG. 1 is an exemplary diagram for explaining the map phase in a method of performing a join operation on sequence data according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining how shuffling is performed after the map phase according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram for explaining the reduce phase according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method of performing a join operation on sequence data executed in a MapReduce distributed processing system according to an embodiment of the present invention.
FIG. 5 is a flowchart of a method of performing a join operation according to a first embodiment of the present invention.
FIG. 6 is a flowchart of a method of performing a join operation according to a second embodiment of the present invention.
FIG. 7 is a block diagram of a MapReduce distributed processing system that performs a join operation on sequence data according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the drawings, like reference numerals are used for like elements.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be termed a second element, and similarly a second element may be termed a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that intervening elements may also be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no intervening elements are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

A conventional subsequence matching algorithm finds the positions of sequence data similar to the query sequence data. A subsequence matching algorithm run on a single machine builds an R*-tree and performs a range search in order to reduce computation cost. However, because MapReduce is oriented toward a shared-nothing architecture, building a tree-based global index incurs a high cost (depth cost). Therefore, unlike conventional subsequence matching algorithms, the present invention proposes a scheme in which each node of the MapReduce environment can run a subsequence matching algorithm independently.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram for explaining the map phase in a method of performing a join operation on sequence data according to an embodiment of the present invention.

Referring to FIG. 1, performing a join operation on sequence data is a process of finding, in a database, sequence data similar to query sequence data, so the query sequence data and the DB sequence data, that is, the sequence data stored in the database, may be the input files. However, "DB sequence data" is merely a name. It does not mean only sequence data actually stored in a database, and may include any sequence data to be compared with the query sequence data.

Referring to FIG. 1, the DB sequence data (D) may be divided, depending on the purpose, into two pieces of subsequence data, D1 and D2, and each piece of subsequence data may contain sequence data (d1 to d4). Here, D1 and D2 may be used as symbols indicating the respective subsequence data. Hereinafter, symbols that refer to individual pieces of subsequence data, such as D1 and D2, may all be referred to as partition information. As an example of dividing sequence data, when the sequence data is time-series data, it may be divided by month into pieces of subsequence data, each consisting of the data for a particular month.
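The month-based split mentioned above could look like the following minimal sketch. The function name `partition_by_month`, the `(date, value)` record layout, and the sample values are assumptions made for illustration; the disclosure does not fix a record format or a partition key.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(series):
    """Split (date, value) records into subsequences keyed by (year, month)."""
    partitions = defaultdict(list)
    for day, value in series:
        # The (year, month) pair plays the role of the partition information.
        partitions[(day.year, day.month)].append((day, value))
    return dict(partitions)


# Hypothetical daily time series spanning two months.
series = [
    (date(2017, 1, 30), 10.0),
    (date(2017, 1, 31), 10.5),
    (date(2017, 2, 1), 11.0),
]
parts = partition_by_month(series)
```

Each key of `parts` then corresponds to one partition such as D1 or D2 in the text.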

Likewise, the query sequence data (Q) may be divided, depending on the purpose, into pieces of subsequence data indicated by two pieces of partition information, Q1 and Q2, and each piece of subsequence data may contain sequence data (q1 to q4). Here, Q1 and Q2 may be used as symbols indicating the respective subsequence data and may be collectively referred to as partition information.

The map phase according to an embodiment of the present invention is characterized in that it reads the sequence data received as input data and generates keys for the input sequence data by pairing the partition information of the input data with all partition information of the sequence data with which the join operation is to be performed.

Referring to the first map phase, when the subsequence data of the DB sequence data indicated by partition information D1 is received as input data, one or more keys may be generated by pairing the partition information D1 with all partition information (Q1, Q2) of the query sequence with which the join operation is to be performed.

In the first map phase, once all pairings have been made, (D1, Q1) and (D1, Q2) may be generated as keys.

The result of the first map phase may then be output with each data item (d1, d2) of subsequence D1, the input data, combined with the previously generated keys ((D1, Q1), (D1, Q2)).

As before, another map phase is performed for the other subsequence D2 of the DB sequence data. As a result, D2 is paired with Q1 and Q2, the subsequence data of the query sequence, so that (D2, Q1) and (D2, Q2) may be generated as key values.

Referring to the second map phase, Q1, a piece of subsequence data partitioned from the query sequence data (Q), may be received as input data. While the received input data is read line by line, the partition information Q1 of the subsequence data is paired with D1 and D2, all the partition information of the DB sequence data, to generate the key values (D1, Q1) and (D2, Q1). The generated key values may be combined with the input data Q1 and output as the result of the second map phase.

In summary, the pieces of subsequence data into which the query sequence data and the DB sequence data are partitioned are each received as the input of a map phase, and each map phase may generate key values for its input subsequence data. The key values are generated in an all-pairs manner, by pairing the partition information of the input subsequence data with all partition information of the sequence data with which it is to be joined.

For example, if the input data is the data indicated by partition information D1 among the subsequence data of the DB sequence data, the map phase may generate the keys (D1, Q1) and (D1, Q2) and produce two (key, value) sets: <(D1, Q1), DB subsequence data> and <(D1, Q2), DB subsequence data>.
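The all-pairs key generation described above can be sketched as a small generator. The name `map_partition`, the string placeholders for the sequence data, and the "D"/"Q" source tag attached to each value are assumptions; the tag is one possible way to let a later reduce step tell the two sources apart, not something the disclosure prescribes.

```python
def map_partition(source, partition_id, records, other_partition_ids):
    """Emit (key, value) pairs for one input partition.

    source is "D" for DB sequence data or "Q" for query sequence data.
    Keys are always ordered (DB partition, query partition) so that both
    sides of a pairing produce the same key.
    """
    for record in records:
        for other_id in other_partition_ids:
            if source == "D":
                key = (partition_id, other_id)
            else:
                key = (other_id, partition_id)
            # Tagging the value with its source is an assumption of this sketch.
            yield key, (source, record)


# First map phase: DB partition D1 paired with all query partitions.
d1_output = list(map_partition("D", "D1", ["d1", "d2"], ["Q1", "Q2"]))
# Second map phase: query partition Q1 paired with all DB partitions.
q1_output = list(map_partition("Q", "Q1", ["q1", "q2"], ["D1", "D2"]))
```

Here `d1_output` carries the keys (D1, Q1) and (D1, Q2), and `q1_output` carries (D1, Q1) and (D2, Q1), matching the two map phases in the text.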

In this example the query sequence data and the DB sequence data are each divided into two pieces of subsequence data and thus each have two pieces of partition information. However, each may be divided into two or more subsequences, so the query sequence data and the DB sequence data may each have two or more pieces of partition information.

FIG. 2 is an exemplary diagram for explaining how shuffling is performed after the map phase according to an embodiment of the present invention.

Referring to FIG. 2, once the map phase has been performed multiple times and key values have been generated for all subsequences, the shuffling step may group the subsequence data by identical key.

For example, the DB subsequence data with key (D1, Q1) generated by the first map phase of FIG. 1 may include d1 and d2, and the query subsequence data with key (D1, Q1) generated by the second map phase may include q1 and q2.

In the shuffling step, for the <key, subsequence data> pairs generated by the first and second map phases, the subsequence data having the same key may be grouped together.

Referring to FIG. 2, the shuffling step may group all the subsequence data (d1, d2, q1, q2) having the same key (D1, Q1) to generate the list value (d1, d2, q1, q2). The resulting <key, list value> may be provided as input to the reduce phase described below. In particular, the reduce phase can be performed on a different node for each generated <key, list value>, and because the generated list values have data sizes that are as even as possible across keys, distributed processing in the MapReduce environment can be performed efficiently.
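The grouping performed by the shuffling step can be sketched as follows. In a real MapReduce framework this grouping is normally done by the framework itself; the `shuffle` helper and the string placeholders d1, d2, q1, q2 are hypothetical stand-ins for illustration.

```python
from collections import defaultdict


def shuffle(map_outputs):
    """Group map outputs so that each key maps to the list of its values."""
    grouped = defaultdict(list)
    for key, value in map_outputs:
        grouped[key].append(value)
    return dict(grouped)


# Outputs of the first map phase (input D1) and the second map phase (input Q1).
map_outputs = [
    (("D1", "Q1"), "d1"), (("D1", "Q2"), "d1"),
    (("D1", "Q1"), "d2"), (("D1", "Q2"), "d2"),
    (("D1", "Q1"), "q1"), (("D2", "Q1"), "q1"),
    (("D1", "Q1"), "q2"), (("D2", "Q1"), "q2"),
]
grouped = shuffle(map_outputs)
```

Only the key (D1, Q1) has received values from both sides at this point; the lists for (D1, Q2) and (D2, Q1) fill up once the map phases for Q2 and D2 have also run.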

FIG. 3 is an exemplary diagram for explaining the reduce phase according to an embodiment of the present invention.

As the input of the reduce phase according to an embodiment of the present invention, a list value of sequence data having the same key value may be obtained. Referring to FIG. 3, the sequence data (d1, d2, q1, q2) for the same key value (D1, Q1) may be obtained as a list value.

The list value is then separated by data type into query sequence data and DB sequence data. For example, for the list value (d1, d2, q1, q2), the query sequence data q1 and q2 may be classified as query subsequence data, and the DB sequence data d1 and d2 may be classified as DB subsequence data.

Next, a join operation (which may also be called a similarity join operation) may be performed using the separated query subsequence data and DB subsequence data. The join operation may be performed by measuring the distance for every matching relationship between the individual data items (q1, q2) of the query subsequence data and the individual data items (d1, d2) of the DB subsequence data. Here, the distance may be the Euclidean distance. For example, matching (d1, d2) against (q1, q2) yields the matching relationships (q1, d1), (q1, d2), (q2, d1), and (q2, d2), and the distance between the matched data items may be measured for each.

Referring to FIG. 3, as part of the results measured for these matching relationships, the distance between q1 and d1 is 4, the distance between q1 and d2 is 2, and the distance between q2 and d2 is 10.
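The separation and all-pairs distance measurement in the reduce phase can be sketched as below. The `reduce_join` name, the `(tag, name, vector)` value layout, and the coordinate values are assumptions; the vectors were chosen only so that three of the resulting distances equal the values quoted for FIG. 3, and they are not data from the disclosure.

```python
import math


def reduce_join(values):
    """Split a list value into query and DB items, then measure all pairs."""
    # Separate by source tag; these two lists act as the query and data buffers.
    query_buffer = [(name, vec) for tag, name, vec in values if tag == "Q"]
    data_buffer = [(name, vec) for tag, name, vec in values if tag == "D"]
    # Euclidean distance for every (query item, DB item) matching relationship.
    return {
        (q_name, d_name): math.dist(q_vec, d_vec)
        for q_name, q_vec in query_buffer
        for d_name, d_vec in data_buffer
    }


# Hypothetical list value for key (D1, Q1).
values = [
    ("D", "d1", (4.0, 0.0)),
    ("D", "d2", (0.0, 2.0)),
    ("Q", "q1", (0.0, 0.0)),
    ("Q", "q2", (0.0, 12.0)),
]
distances = reduce_join(values)
```

With these illustrative vectors, `distances` holds four entries, one per matching relationship (q1, d1), (q1, d2), (q2, d1), and (q2, d2).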

FIG. 4 is a flowchart of a method of performing a join operation on sequence data executed in a MapReduce distributed processing system according to an embodiment of the present invention.

Referring to FIG. 4, the method of performing a join operation on sequence data may include a map phase (S100) and a reduce phase (S120) that receives, as list values, the values associated with the same key in the output of the map phase.

Here, the method of performing the join operation may further include a shuffling step (S110) after the map phase (S100). The shuffling step (S110) may include generating list values from the values associated with the same key in the output of the map phase (S111). The shuffling step (S110) thus prepares the input data for the reduce phase (S120). The shuffling step (S110) may be defined by an individual user, or it may be predefined and provided by the MapReduce framework so that it is performed automatically in the distributed processing system.

Here, the map phase (S100) may include: obtaining, as an input value, first subsequence data obtained by dividing first sequence data (S101); generating at least one output key by pairing all partition information of second sequence data, which is to be joined with the first sequence data, with the partition information indicating the first subsequence data (S102); and combining the at least one generated output key with the input value and outputting the result (S103).

Here, the reduce phase (S120) may include: receiving the list values (S121); separating the received list values into subsequence data of the first sequence data and subsequence data of the second sequence data (S122); and performing a join operation between the separated subsequence data of the first sequence data and the subsequence data of the second sequence data (S123).

Here, if the first sequence data is the DB sequence data, the second sequence data is the query sequence data, and if the first sequence data is the query sequence data, the second sequence data may be the DB sequence data. Accordingly, the step of separating the list values into subsequence data of the first sequence data and subsequence data of the second sequence data (S122) may be a step of determining whether each piece of subsequence data belongs to the query sequence data or to the DB sequence data.

Here, the reduce phase (S120) may be performed individually at each node of the MapReduce distributed processing system.

Here, the step (S122) of separating the list input value into subsequence data of the first sequence data and subsequence data of the second sequence data may include storing the separated subsequence data of the first sequence data and the subsequence data of the second sequence data in a query buffer and a data buffer, respectively. Because the query buffer and the data buffer store the query sequence data and the DB sequence data separately, the algorithm that subsequently performs the join operation can easily distinguish and obtain the query subsequence data and the DB subsequence data.
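The separation step (S122) can be sketched as follows. The sketch assumes each value in the list input carries a source tag (`"query"` or `"db"`) attached in the map phase; the tag and the function name `split_buffers` are hypothetical.

```python
from typing import List, Tuple

Subsequence = List[float]


def split_buffers(values: List[Tuple[str, Subsequence]]
                  ) -> Tuple[List[Subsequence], List[Subsequence]]:
    """Separate a reducer's list input into a query buffer and a data buffer."""
    query_buffer: List[Subsequence] = []
    data_buffer: List[Subsequence] = []
    for source, subsequence in values:
        # Decide whether the subsequence belongs to the query sequence data
        # or the DB sequence data, based on the assumed source tag.
        if source == "query":
            query_buffer.append(subsequence)
        else:
            data_buffer.append(subsequence)
    return query_buffer, data_buffer
```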

Hereinafter, a method of performing a join operation between the query subsequence data and the DB subsequence data derived in the reduce phase is described in detail.

FIG. 5 is a flowchart of a method of performing a join operation according to a first embodiment of the present invention.

Referring to FIG. 5, the method of performing a join operation according to the first embodiment is an algorithm (referred to as the SM-D algorithm) that determines the similarity between two sequences by directly comparing the distance between the DB subsequence data and the query subsequence data.

The SM-D algorithm is a basic (brute-force) algorithm that directly measures the distance between the DB sequence data and the query sequence. It may include: dividing the DB sequence data into sliding windows having the same length as the query sequence data (S200); measuring the distance between each sliding window and the query sequence data (S210); determining whether the measured distance is less than or equal to a preset threshold (ε) (S220); and, if the distance is less than or equal to the threshold, determining that the two sequences match or are similar and storing the result (S230).

Here, the steps from measuring the distance between a sliding window and the query sequence data (S210) through storing the result (S230) may be repeated for all of the sliding windows produced in step S200.

That is, the SM-D algorithm may receive as input the subsequence data of the first sequence data and the subsequence data of the second sequence data that are the subject of the join step (S123) of FIG. 4. In this case, the subsequence data of the first sequence data and the subsequence data of the second sequence data serve as the query sequence data and the DB sequence data of the SM-D algorithm described above.

As an example of the step (S200) of dividing the DB sequence data into sliding windows having the same length as the query sequence data, when the DB sequence data has length 5 and the query sequence data has length 3, the DB sequence data is divided into three sliding windows of length 3. That is, because sequence data consists of consecutive data, if the length of the DB sequence data is n and the length of the query sequence data is a, where a is less than or equal to n, the DB sequence data can be divided into n - a + 1 sliding windows.
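Steps S200 to S230 of the SM-D algorithm can be sketched as below. The description does not fix the distance measure in this passage, so Euclidean distance is used here as an assumption; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sliding_windows(db: Sequence[float], length: int) -> List[Sequence[float]]:
    """S200: split `db` into its n - a + 1 overlapping windows of size `length`."""
    return [db[i:i + length] for i in range(len(db) - length + 1)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed distance measure; not specified in this passage.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sm_d(db: Sequence[float], query: Sequence[float],
         epsilon: float) -> List[Tuple[int, float]]:
    """Brute-force matching: return (offset, distance) for every match."""
    matches = []
    for offset, window in enumerate(sliding_windows(db, len(query))):
        distance = euclidean(window, query)       # S210
        if distance <= epsilon:                   # S220
            matches.append((offset, distance))    # S230
    return matches
```

With a DB sequence of length 5 and a query of length 3, `sliding_windows` yields 5 - 3 + 1 = 3 windows, which agrees with the example in the text.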

FIG. 6 is a flowchart of a method of performing a join operation according to a second embodiment of the present invention.

The method of performing a join operation according to the second embodiment improves on the SM-D algorithm of FIG. 5. It uses a tree structure to obtain candidate sequence data and measures distances only for those candidates to determine similarity (hereinafter referred to as the SM-T algorithm).

Here, the SM-T algorithm may take a query buffer and a data buffer as input data. The data buffer may be a set of DB sequence data, and the query buffer may be a set of query sequence data.

The SM-T algorithm presupposes that feature vectors are first extracted from the DB sequence data and organized into a tree structure. Accordingly, the first step of the SM-T algorithm may include extracting feature vectors from the DB sequence data and constructing the tree structure. Specifically, in this step, DB sequence data is first obtained from the data buffer, and each piece of DB sequence data may be divided into sliding windows having a preset length (ω). Whereas the SM-D algorithm of FIG. 5 divides the data into sliding windows whose length equals that of the query sequence data, the SM-T algorithm may divide the DB sequence data into windows smaller than the length of the query sequence data. The preset length (ω) may therefore be shorter than the query sequence data. Next, a feature vector is extracted for each sliding window, and the extracted feature vectors are inserted into the nodes of a tree data structure to construct the tree. The feature vector may generally be extracted with a lower dimension than that of the DB sequence data; this lower-dimensional extraction reduces the dimensionality of the data in order to shorten data search time. The tree structure can be constructed by repeating the above process for all DB sequence data in the data buffer.
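The index-building step can be sketched as follows, under two loudly stated assumptions that are not taken from the description: the feature of a window is reduced to a single value (its mean), and a sorted Python list maintained with `bisect.insort` stands in for the tree data structure. A real implementation would more likely use a multidimensional tree index.

```python
from bisect import insort
from typing import List, Sequence, Tuple

# Index entry: (feature value, DB sequence id, window offset).
Entry = Tuple[float, int, int]


def feature(window: Sequence[float]) -> float:
    # Assumed 1-dimensional feature: the mean of the window.
    return sum(window) / len(window)


def build_index(data_buffer: List[Sequence[float]], omega: int) -> List[Entry]:
    """Index every sliding window of length `omega` of every DB sequence."""
    index: List[Entry] = []  # kept sorted; stand-in for the tree structure
    for seq_id, seq in enumerate(data_buffer):
        for offset in range(len(seq) - omega + 1):
            window = seq[offset:offset + omega]
            insort(index, (feature(window), seq_id, offset))
    return index
```

Keeping the sequence id and offset alongside each feature lets a later search map a matching feature back to the DB subsequence it came from.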

The second step of the SM-T algorithm may be a step of querying the constructed tree. Here, querying means obtaining, from the tree structure, feature vectors similar to the feature vector of the query sequence data.

In the querying step, query sequence data is first obtained from the query buffer, and the obtained query sequence data may be divided into at least one disjoint window having the preset length (ω) (S300). Here, a disjoint window is a window that begins at the position immediately following the previous window, so that consecutive windows do not overlap. For example, if the sequence data has length 6 and ω is 3, it can be divided into two disjoint windows. Next, a first feature vector of the query sequence data may be extracted for each disjoint window (S310), and a range search may be performed with the extracted first feature vector on the tree structure constructed in the first step, thereby obtaining a set of feature vectors of the DB sequence data adjacent to the first feature vector (referred to as one or more second feature vectors). That is, the querying step may include performing a range search on the tree data structure in which the feature vectors of the DB sequence data are stored, to obtain one or more second feature vectors similar to the first feature vector (S320).
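Steps S300 and S320 can be sketched as follows. The sketch assumes a 1-dimensional feature and an index that is a sorted list of `(feature, sequence id, offset)` tuples standing in for the tree; both are illustrative assumptions, as are the function names and the `radius` parameter.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple

Entry = Tuple[float, int, int]  # (feature value, DB sequence id, window offset)


def disjoint_windows(query: Sequence[float], omega: int) -> List[Sequence[float]]:
    """S300: split `query` into non-overlapping windows of length `omega`.

    A trailing remainder shorter than `omega` is dropped in this sketch.
    """
    return [query[i:i + omega] for i in range(0, len(query) - omega + 1, omega)]


def range_search(index: List[Entry], center: float, radius: float) -> List[Entry]:
    """S320: return entries whose feature lies in [center - radius, center + radius].

    `index` must be sorted; it stands in for the tree structure.
    """
    lo = bisect_left(index, (center - radius,))
    hi = bisect_right(index, (center + radius, float("inf")))
    return index[lo:hi]
```

A sequence of length 6 with ω = 3 yields two disjoint windows, as in the example in the text.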

The third step of the SM-T algorithm may be a step of determining the actual similarity between sequence data for the search results. That is, the similarity determination may include: measuring the distance between the DB sequence data corresponding to the obtained one or more second feature vectors and the query sequence data corresponding to the first feature vector (S330); determining whether the measured distance is smaller than a preset threshold (S340); and, if the distance is smaller than the preset threshold, determining that the two compared pieces of sequence data are similar and storing the result (S350).

Steps S310 through S350 described above may be repeated for all of the disjoint windows produced in step S300.
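The verification in steps S330 to S350 can be sketched as below. The description does not state how a candidate DB window is aligned with the full query; this sketch assumes that a DB window at offset `j` matched by the query's `i`-th disjoint window implies a candidate subsequence starting at `j - i * omega`. That alignment rule, the Euclidean distance, and the function name `verify` are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def verify(db_seq: Sequence[float], query: Sequence[float],
           candidate_offsets: List[int], window_index: int,
           omega: int, epsilon: float) -> List[Tuple[int, float]]:
    """Check candidates found for the query's `window_index`-th disjoint window.

    Returns (start offset in db_seq, distance) for each confirmed match.
    """
    results = []
    for db_offset in candidate_offsets:
        # Assumed alignment of the candidate window with the whole query.
        start = db_offset - window_index * omega
        if start < 0 or start + len(query) > len(db_seq):
            continue  # the aligned subsequence would fall outside db_seq
        distance = math.dist(db_seq[start:start + len(query)], query)  # S330
        if distance < epsilon:                                          # S340
            results.append((start, distance))                           # S350
    return results
```

Only the candidates returned by the range search are measured, which is what distinguishes this step from the brute-force comparison of SM-D.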

FIG. 7 is a block diagram of a MapReduce distributed processing system that performs a join operation of sequence data according to an embodiment of the present invention.

Referring to FIG. 7, the MapReduce distributed processing system (100) may include at least one processor (110) and a memory (120) storing instructions that direct the at least one processor to perform at least one step.

Here, the MapReduce distributed processing system (100) may further include a communication module (130) capable of communicating among the distributed nodes over a wired or wireless network.

Here, the MapReduce distributed processing system (100) may further include storage (140) that is built into all or some of the distributed nodes and stores input or output data, or intermediate data generated during data processing.

Here, the at least one step may further include a shuffling step after the map phase. The shuffling step may include generating, from the output of the map phase, the values coupled to the same key as a list input value.
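The grouping performed by the shuffling step can be sketched in a single process as follows. In an actual MapReduce framework this grouping is done by the framework across nodes; the function name `shuffle` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple, TypeVar

V = TypeVar("V")


def shuffle(map_output: Iterable[Tuple[Hashable, V]]) -> Dict[Hashable, List[V]]:
    """Group map-phase output so each key maps to the list of its values."""
    grouped: Dict[Hashable, List[V]] = defaultdict(list)
    for key, value in map_output:
        grouped[key].append(value)
    return dict(grouped)
```

Each resulting list corresponds to the list input value that one reduce call receives for its key.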

In this case, if the first sequence data is the DB sequence data, the second sequence data is the query sequence data; if the first sequence data is the query sequence data, the second sequence data may be the DB sequence data. Accordingly, the step of separating the list input value into subsequence data of the first sequence data and subsequence data of the second sequence data may be a step of determining whether each piece of subsequence data belongs to the query sequence data or to the DB sequence data.

Here, the step of separating the list input value into subsequence data of the first sequence data and subsequence data of the second sequence data may include storing the separated subsequence data of the first sequence data and the subsequence data of the second sequence data in a query buffer and a data buffer, respectively.

Here, the MapReduce distributed processing system (100) is not limited to a single device and should be understood as a combination of a plurality of data processing devices (which may also be referred to as a plurality of nodes). Accordingly, the methods of performing a join operation described with reference to FIGS. 1 to 6 should be construed as being executable, in whole or in part, in a distributed manner across the plurality of nodes.

In the present invention, examples of the individual devices constituting the plurality of data processing devices include a communication-capable desktop computer, laptop computer, notebook, smartphone, tablet PC, mobile phone, smart watch, smart glasses, e-book reader, portable multimedia player (PMP), portable game console, navigation device, digital camera, digital multimedia broadcasting (DMB) player, digital audio recorder, digital audio player, digital video recorder, digital video player, and personal digital assistant (PDA).

In addition, the operating method of the MapReduce distributed processing system (100) should be construed as including the operating methods described above with reference to FIGS. 1 to 6, and a detailed description is omitted to avoid repetition.

The methods according to the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software.

Examples of the computer-readable medium include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as at least one software module in order to perform the operations of the present invention, and vice versa.

In addition, the methods or devices described above may be implemented with all or some of their configurations or functions combined, or implemented separately.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the present invention can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the following claims.

Claims

A method of performing a join operation of sequence data, performed in a MapReduce distributed processing system, the method comprising:
a map phase; and
a reduce phase of receiving, as list values, values coupled by the same key from the output of the map phase,
wherein the map phase comprises:
obtaining, as an input value, first subsequence data into which first sequence data has been divided;
generating at least one output key by pairing all partition information of second sequence data, on which the join operation with the first sequence data is to be performed, with partition information indicating the first subsequence data; and
combining the generated at least one output key with the input value and outputting the result.