KR101515304B1

KR101515304B1 - Reduce-side join query processing method for hadoop-based reduce-side join processing system

Info

Publication number: KR101515304B1
Application number: KR1020130135266A
Authority: KR
Inventors: 이정준; 정형용
Original assignee: 한국산업기술대학교산학협력단
Priority date: 2013-11-08
Filing date: 2013-11-08
Publication date: 2015-07-02

Abstract

The present invention relates to a reduce-side join query processing method of a Hadoop-based reduce-side join processing system, the method comprising the steps of: (a) generating bitmap interval filter data based on result key values extracted from the result of a join query and storing the bitmap interval filter data in a BIF database; (b) when a new join query is input, searching the BIF database for bitmap interval filter data corresponding to the new join query; (c) when bitmap interval filter data is found in step (b), transmitting the found bitmap interval filter data to each mapper of Hadoop; (d) transmitting, by each mapper, records filtered on the basis of the bitmap interval filter data to a reducer of Hadoop; and (e) outputting, by the reducer, the result of the new join query on the basis of the records transmitted from the mappers.

Description

REDUCE-SIDE JOIN QUERY PROCESSING METHOD FOR HADOOP-BASED REDUCE-SIDE JOIN PROCESSING SYSTEM

The present invention relates to a reduce-side join query processing method of a Hadoop-based reduce-side join processing system and, more particularly, to a reduce-side join query processing method of a Hadoop-based reduce-side join processing system that uses an interval filter to reduce the network and computing costs caused by records irrelevant to a join query, and that represents the interval filter as bits so that memory usage can be optimized.

Interest in and research on big data have been increasing in recent years. In the past, big data could not be put to use because the technology for analyzing and managing it was lacking and the cost of analysis and management was excessive. Recent advances in cloud computing technology, however, have made it easier and cheaper to manage and analyze big data. Accordingly, the use of big data is spreading to a variety of fields such as politics, society, economy, and culture.

Cloud computing technologies are also being applied in the existing field of data warehousing. A data warehouse operates a large database extracted and integrated from the databases of various operational systems. Its main characteristics are that queries with repeated patterns are common and that large volumes of data are uploaded at regular intervals. Data warehouse systems also use Hadoop, Hive, and the like to process large amounts of structured and unstructured data.

Hadoop, an Apache open source project, is a representative cloud computing technology consisting of MapReduce, which processes large amounts of data in a distributed manner, and HDFS (Hadoop Distributed File System), which stores large amounts of data. Hive, built on Hadoop, makes MapReduce easier to use and manage through HiveQL, a language similar to SQL.

However, because Hadoop and Hive are structured differently from a relational DBMS (database management system), data operations that were handled by an existing relational DBMS are costly to process. MapReduce, which performs the data processing in Hadoop and Hive, has a shared-nothing architecture. MapReduce has the advantage of accepting a wide variety of data as input, but it has only a simple structure for processing that data.

MapReduce processing can be broadly divided into a map phase and a reduce phase: distributed data is processed by the mapper of each map node, and the results are sent to a small number of reducers and processed there. Because data moves between nodes in the distributed structure of MapReduce only at the transition from map to reduce, a bottleneck occurs at this stage.

FIG. 1 is a diagram for explaining the data processing procedure of MapReduce, which is a component of Hadoop. Describing the processing of a join query with reference to FIG. 1, the MapReduce data processing procedure consists of three phases. In the first phase, the map phase, the distributed data is read and processed into key-value pairs.

In the shuffle phase between the mapper and the reducer, the Hadoop framework sorts the data output by the map phase by key, converts it into lists of values, and sends it to the reducer. The reduce phase then processes the data received in the form of a key and a list and outputs the result.
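The three phases described above can be sketched in plain Python without Hadoop. This is only an in-memory illustration; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the sample input are hypothetical and do not come from the specification.

```python
from collections import defaultdict

def map_phase(records):
    """Map: turn each raw record into a (key, value) pair."""
    for record in records:
        key, value = record.split(",", 1)
        yield key, value

def shuffle_phase(pairs):
    """Shuffle: group the values of each key into a list, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Reduce: process each (key, list of values); here it counts the values."""
    return [(key, len(values)) for key, values in grouped]

# Hypothetical input lines of the form "key,value".
lines = ["b,x1", "a,x2", "b,x3"]
result = reduce_phase(shuffle_phase(map_phase(lines)))
```

In real Hadoop the shuffle moves data between nodes over the network, which is where the bottleneck mentioned earlier arises; here it is just a dictionary.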

While Hadoop offers the flexibility to process many kinds of data, it requires not only programmers who want to use Hadoop but also existing data analysts to learn how to write the distributed processing programs that Hadoop demands.

Hive is a framework that gives such users a convenient way to process data in a manner similar to SQL. Hive is a Hadoop subproject, a Hadoop-based data warehouse system created at Facebook. The concepts and structure of Hadoop are unfamiliar to users of existing database tools and demand considerable time and effort, so research has been carried out on offering existing users an easier scripting language or an SQL-like form. Hive thus makes MapReduce programming easier through HiveQL, a language similar to SQL.

Meanwhile, for processing join queries, Hadoop's MapReduce typically provides a map-side join method and a reduce-side join method.

A map-side join is a join method in which the join is performed in the mapper; because there is no reduce phase, it has the advantage of being faster than a reduce-side join. However, the map-side join method can be used only when specific requirements are met: each input dataset to be joined must be divided into the same number of partitions, and within each dataset the records must be sorted by the same join key.

The reduce-side join method, in contrast, is used in more general cases, and the join is performed in the reduce phase. Referring to FIG. 2, in the map phase the mapper reads records from a block and outputs, for each record, the join key value as the key and the record with a tag carrying its table information attached as the value.

The map output is sorted and stored temporarily on the local disk of each mapper, and when each map task has completed, all of the output is copied over the network, by join key, to the local disk of each reducer. In the reduce phase, these temporary files are merged and sorted, after which the records gathered for each join key are joined using their tags.
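The tagging and per-key joining described above can be illustrated with a small in-memory sketch. It is not Hadoop code: the names `map_with_tag` and `reduce_side_join`, the tags `T1` and `T2`, and the sample rows are assumptions made for illustration, and the shuffle is modelled as a dictionary.

```python
from collections import defaultdict
from itertools import product

def map_with_tag(table_name, rows, key_index=0):
    """Map: emit (join key, (table tag, record)) for every row."""
    for row in rows:
        yield row[key_index], (table_name, row)

def reduce_side_join(left_name, left_rows, right_name, right_rows):
    # Shuffle: every tagged record is grouped by its join key.
    groups = defaultdict(list)
    for key, tagged in map_with_tag(left_name, left_rows):
        groups[key].append(tagged)
    for key, tagged in map_with_tag(right_name, right_rows):
        groups[key].append(tagged)

    # Reduce: within each key group, the tag tells which table a record
    # came from, so the two sides can be paired up.
    joined = []
    for key in sorted(groups):
        left = [row for tag, row in groups[key] if tag == left_name]
        right = [row for tag, row in groups[key] if tag == right_name]
        joined.extend(product(left, right))
    return joined

table1 = [("A", "a1"), ("B", "b1")]
table2 = [("B", "b2"), ("C", "c2")]
joined = reduce_side_join("T1", table1, "T2", table2)
```

Note that every row of both tables passes through the grouping step even though only key `B` produces output.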

However, in the reduce-side join method of Hadoop's MapReduce, records irrelevant to the join query are also sent over the network to the reducer, and the reducer performs sorting and other processing on them as well, which causes network and computing costs to be wasted unnecessarily.

FIG. 3 is a diagram for explaining how network and computing costs are consumed in the reduce-side join method of Hadoop's MapReduce. Referring to FIG. 3, as described above, all records of Table1 and Table2 are read in the map phase and sent to the reduce phase. In the reduce phase, it is found that the two tables have join key B in common, and the records with join key B are output as the result.

In this process, however, the records with join key A and the records with join key C, for which no join occurs, are also sent to the reduce phase, and it can be seen that unnecessary cost is incurred in handling the records with join keys A and C in the shuffle and reduce phases.
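To make the wasted work concrete, the following sketch counts how many records are shuffled and how many of them can actually take part in a join. The helper `shuffle_cost` is hypothetical, and the input assumes one record per join key in each table, which is an illustrative guess rather than the actual contents of FIG. 3.

```python
from collections import Counter

def shuffle_cost(left_keys, right_keys):
    """Return (records shuffled, records whose key exists in both tables)."""
    left, right = Counter(left_keys), Counter(right_keys)
    common = left.keys() & right.keys()
    shuffled = sum(left.values()) + sum(right.values())
    useful = sum(left[key] + right[key] for key in common)
    return shuffled, useful

# Table1 holds join keys A and B, Table2 holds B and C (assumed contents).
shuffled, useful = shuffle_cost(["A", "B"], ["B", "C"])
wasted = shuffled - useful
```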

Accordingly, the present invention has been devised to solve the above problems, and its object is to provide a reduce-side join query processing method of a Hadoop-based reduce-side join processing system that can reduce the network and computing costs caused by records irrelevant to a join query and can optimize memory usage.

According to the present invention, the above object is achieved by a reduce-side join query processing method of a Hadoop-based reduce-side join processing system, the method comprising the steps of: (a) generating, in the Hadoop-based reduce-side join processing system, bitmap interval filter data based on result key values extracted from the result of a join query, and storing it in a BIF database; (b) when a new join query is input to the Hadoop-based reduce-side join processing system, searching the BIF database, in the Hadoop-based reduce-side join processing system, for bitmap interval filter data corresponding to the new join query; (c) when bitmap interval filter data is found in step (b), transmitting the found bitmap interval filter data to each mapper of Hadoop; (d) transmitting, by each mapper, records filtered on the basis of the bitmap interval filter data to a reducer of Hadoop; and (e) outputting, by the reducer, the result of the new join query on the basis of the records transmitted from the mappers; wherein step (a), performed by the Hadoop-based reduce-side join processing system, comprises the steps of: (a1) extracting the result key values from the result of the join query; (a2) dividing the join key domain into a plurality of BIF sections in units of a BIF interval; (a3) checking whether an extracted result key value belongs to each BIF section; and (a4) generating bitmap interval filter data containing information on the plurality of BIF sections and, for each BIF section, a BIF value indicating whether an extracted result key value belongs to that BIF section, and storing it in the BIF database; and wherein, in step (d), the mapper transmits to the reducer the records in the BIF section, among the plurality of BIF sections of the bitmap interval filter data, to which the join key value of a table record involved in the new join query belongs.

Here, in step (a2), the BIF interval may be calculated by the equation L = D/R (where L is the BIF interval, D is the number of elements in the join key domain, and R is the number of extracted result key values).

The bitmap interval filter data may further include the BIF interval and the start join key value of the join key domain, and in step (d) the mapper may calculate an offset value based on the distance between the join key value of a table record involved in the new join query and the start join key value; extract, based on the join key value of the table record, the offset value, and the BIF interval, the BIF section to which the join key value of the table record belongs; and determine, based on the BIF value of the extracted BIF section, whether to transmit the records in that BIF section to the reducer.

In addition, the mapper may extract the BIF section to which the join key value of a table record involved in the new join query belongs by the equation I = (V-Base)/L (where I is the index of the BIF section to which the join key value of the table record belongs, V is the join key value of the table record involved in the new join query, Base is the start join key value of the join key domain, and L is the BIF interval).

Here, the method may further comprise the steps of: (f1) when no bitmap interval filter data is found in step (b), outputting the result of the new join query by a preset processing method of the Hadoop-based reduce-side join processing system; (f2) extracting, by the Hadoop-based reduce-side join processing system, result key values from the result of the new join query output in step (f1); and (f3) performing, by the Hadoop-based reduce-side join processing system, steps (a2) to (a4) on the result key values extracted in step (f2), so that bitmap interval filter data for the new join query is generated and stored in the BIF database.
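The hit and miss paths of steps (b) through (f3) can be summarized as a small control-flow sketch. Everything here is hypothetical scaffolding: `process_join_query` and its callable parameters stand in for the actual join execution and filter construction, the BIF database is modelled as a dictionary, and result rows are assumed to be `(join key, record)` pairs.

```python
def process_join_query(signature, bif_db, run_filtered, run_plain, build_filter):
    """Hit: run the join with the stored filter.
    Miss: run the default join, then build and store a filter."""
    bif = bif_db.get(signature)                     # (b) look up the BIF database
    if bif is not None:
        return run_filtered(bif)                    # (c)-(e) filtered join
    rows = run_plain()                              # (f1) preset processing
    result_keys = {key for key, _ in rows}          # (f2) extract result keys
    bif_db[signature] = build_filter(result_keys)   # (f3) build and store
    return rows

# Stub callables that only record which path was taken.
calls = []

def run_plain():
    calls.append("plain")
    return [(5, "row5"), (9, "row9")]

def run_filtered(bif):
    calls.append("filtered")
    return [(5, "row5"), (9, "row9")]

db = {}
first = process_join_query("A join B", db, run_filtered, run_plain, frozenset)
second = process_join_query("A join B", db, run_filtered, run_plain, frozenset)
```

Here `frozenset` is used as a placeholder for the filter builder; in the method described, the stored object would be the bitmap interval filter data.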

With the above configuration, the present invention provides a reduce-side join query processing method of a Hadoop-based reduce-side join processing system that can reduce the network and computing costs caused by records irrelevant to a join query and can optimize memory usage.

FIG. 1 is a diagram for explaining the data processing procedure of MapReduce, which is a component of Hadoop.
FIG. 2 is a diagram for explaining the reduce-side join method of Hadoop's MapReduce.
FIG. 3 is a diagram for explaining how network and computing costs are consumed in the reduce-side join method of Hadoop's MapReduce.
FIG. 4 is a diagram for explaining the reduce-side join query processing method of the Hadoop-based reduce-side join processing system according to the present invention.
FIG. 5 is a diagram for explaining the process of generating bitmap interval filter data in the reduce-side join query processing method of the Hadoop-based reduce-side join processing system according to the present invention.
FIG. 6 is a diagram for explaining the effect of filtering using bitmap interval filter data according to the present invention.
FIG. 7 is a diagram showing the relationship among the elements of the bitmap interval filter data according to the present invention.
FIG. 8 is a diagram showing an example of a Hive system for implementing the reduce-side join query processing method of the Hadoop-based reduce-side join processing system according to the present invention.
FIGS. 9 to 12 are diagrams showing experimental results of the reduce-side join query processing method of the Hadoop-based reduce-side join processing system according to the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. BIF, as used in this specification, is an abbreviation of 'Bitmap for Interval Filter', a name defined arbitrarily to express the technical idea of the present invention.

FIG. 4 is a diagram for explaining the reduce-side join query processing method of the Hadoop-based reduce-side join processing system according to the present invention. Referring to FIG. 4, when a query is input to the Hadoop-based reduce-side join processing system (S40), it is determined whether the input query is a join query (S41). In the present invention, the Hadoop-based reduce-side join processing system is described taking a Hadoop-based Hive system as an example.

If it is determined in step S41 that the input query is not a join query, the query processing of the existing Hadoop-based reduce-side join processing system, for example a Hadoop-based Hive system, is performed (S50), and the result is output (S46).

On the other hand, if it is determined in step S41 that the input query is a join query, the Hadoop-based reduce-side join processing system searches the BIF database for bitmap interval filter data corresponding to the input query, that is, the join query (S42). Whether stored bitmap interval filter data corresponds is decided by whether the join query that was input when that bitmap interval filter data was generated and the join query input in step S40 are similar queries. Here, similar queries are queries that join the same tables under the same join condition; the select, group by, and having clauses of the SQL statement may differ, and the where clause may contain additional conditions that compare against constants.
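One way to realize the similar-query test is to key the BIF database on nothing but the joined tables and the join conditions. The sketch below assumes the query has already been parsed into a table list and a list of equality conditions; `join_signature`, the dictionary `bif_db`, and the case-insensitive comparison of identifiers are illustrative assumptions, not details given in the specification.

```python
def join_signature(tables, join_conditions):
    """Build a lookup key from the joined tables and join conditions only.

    The select, group by, and having clauses and constant comparisons in
    the where clause are deliberately left out of the key.
    """
    normalized = frozenset(
        frozenset(side.strip().lower() for side in condition.split("="))
        for condition in join_conditions
    )
    return frozenset(table.lower() for table in tables), normalized

bif_db = {}  # hypothetical in-memory stand-in for the BIF database

first = join_signature(["A", "B"], ["A.id = B.id"])
bif_db[first] = "filter built from the first query"

# Same tables and join condition, written in the other order: a similar query.
second = join_signature(["B", "A"], ["b.id = a.id"])
# A different join condition: not a similar query.
third = join_signature(["A", "B"], ["A.id = B.ref"])
```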

The bitmap interval filter data stored in the BIF database is generated from the result of a join query during earlier join query processing, and is registered and updated in the BIF database. The process of generating bitmap interval filter data in the reduce-side join query processing method of the Hadoop-based reduce-side join processing system according to the present invention is described in detail below with reference to FIG. 5.

First, result key values are extracted from the result of the join query (S60). Here, the result key values are the join key values of the result of the join query, that is, of the result records. Next, the join key domain is divided into a plurality of BIF sections in units of the BIF interval (S62). Here, the BIF interval means the length of one BIF section.

The BIF interval used to divide the domain into BIF sections may be registered according to user settings; in the present invention, the BIF interval is, by way of example, calculated by [Equation 1] from the number of elements in the join key domain and the number of extracted result key values.

[Equation 1]

L = D / R

In [Equation 1], L is the BIF interval, D is the number of elements in the join key domain, and R is the number of extracted result key values.

Next, it is checked whether a result key value belongs to each of the divided BIF sections, that is, whether a result key value exists in that BIF section (S63). If a result key value exists in the BIF section, the BIF value, which indicates whether a result key value exists in that BIF section, is set to 1 (S64); that is, its bit is set to 1. If no result key value exists in the BIF section, the BIF value for that BIF section is set to 0 (S65); that is, its bit is set to 0.

In this process, until all BIF sections have been set (S68), the next BIF section is examined (S68) and its BIF value is set.
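The generation steps S60 to S65 can be sketched as a single function. The text walks over the sections and checks each for a result key; the sketch below walks over the result keys and marks their sections, which yields the same bits. The name `build_bif_bits`, the use of integer division for [Equation 1], the lower bound of 1 on the interval, and the requirement that all keys lie inside the domain are assumptions, since the text does not address rounding or out-of-range keys. The sample keys are hypothetical.

```python
def build_bif_bits(result_keys, base, domain_size):
    """Return (interval, bits) for the join key domain
    [base, base + domain_size)."""
    keys = set(result_keys)
    if not keys:
        raise ValueError("at least one result key is required")
    # Equation 1: L = D / R (integer division is an assumption).
    interval = max(1, domain_size // len(keys))
    section_count = -(-domain_size // interval)   # ceiling division
    bits = [0] * section_count                    # S65: default is 0
    for key in keys:
        bits[(key - base) // interval] = 1        # S64: section holds a result key
    return interval, bits

# Hypothetical domain 100..107 with three result keys.
interval, bits = build_bif_bits([100, 101, 107], base=100, domain_size=8)
```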

An example in which bitmap interval filter data is generated through the above process is described below with reference to [Table 1].

[Table 1]

[Table 1] takes as an example a join query performed on two tables. When table A and table B are joined, the join key domain of the join key values of the two tables ranges from 0 to 39.

Among the join key values in the join key domain, the join key values of the records in the query result, which belong to both tables, that is, the result key values, are 0, 1, 5, 9, 19, 23, 29, 36, 37, and 39. In other words, the join key values of the records output as the result are extracted as the result key values.

Here, the BIF interval is calculated as the number of elements in the join key domain divided by the number of result key values, which in this example is 40 / 10 = 4. Dividing the entire join key domain by this BIF interval yields 10 BIF sections, as shown in [Table 2].

[Table 2]

In [Table 2], the index value is an ID that identifies a BIF section, and the BIF value is set according to whether any of the result key values 0, 1, 5, 9, 19, 23, 29, 36, 37, and 39 exists in that BIF section.

The BIF value is set to 1 if a result key value exists in the BIF section and to 0 if none exists. The BIF values make it possible to identify the ranges of join key values that are irrelevant to the result records of the join query.
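The section boundaries and BIF values for this example can be recomputed from the result key values and the domain stated in the text. The sketch below uses only those stated numbers; it is a recomputation under the assumption that indexes start at 0 and sections are contiguous, not a copy of the original table, whose contents are not reproduced here.

```python
result_keys = [0, 1, 5, 9, 19, 23, 29, 36, 37, 39]
domain = range(0, 40)                       # join key domain 0 to 39

interval = len(domain) // len(result_keys)  # Equation 1: 40 / 10
hit_sections = {(key - domain.start) // interval for key in result_keys}

rows = []
for index in range(len(domain) // interval):
    low = domain.start + index * interval
    high = low + interval - 1
    rows.append((index, low, high, 1 if index in hit_sections else 0))

for index, low, high, bif_value in rows:
    print(f"index {index}: keys {low}-{high} -> BIF value {bif_value}")

# The BIF values of all sections, in index order, as one bit string.
bitmap = "".join(str(row[3]) for row in rows)
```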

Filtering with the BIF sections whose BIF value is 0 in this way makes it possible to remove the records with join key values 5, 7, 12, 21, 23, 27, and 35 from table A and the records with join key values 4, 5, 23, 26, 32, and 34 from table B.

Through the above process, bitmap interval filter data containing information on the plurality of BIF sections and, for each BIF section, a BIF value indicating whether a result key value belongs to it is generated (S69) and stored in the BIF database. In the present invention, the bitmap interval filter data may be generated so as to also include the BIF interval and the start join key value of the join key domain, that is, the join key value '0' in [Table 1].

FIG. 6 is a diagram for explaining the effect of filtering using bitmap interval filter data according to the present invention. Referring to FIG. 6, the basic concept of the bitmap interval filter data according to the present invention is, as described above, to filter using ranges of join key values rather than individual join key values. That is, the entire join key domain is divided at a fixed spacing, for example the BIF interval described above, and it is checked whether a result key value exists in each BIF section.

Furthermore, by storing the bitmap interval filter data in the form of a bitmap, so that ranges rather than individual join key values are stored, information on a larger number of join key values can be held in memory, which makes it possible to optimize memory usage.
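The memory argument can be illustrated by packing the BIF values into bytes, one bit per section. The helpers `pack_bits` and `get_bit` and the bit order within a byte are assumptions for illustration; the specification does not prescribe a storage layout, and the ten-section bitmap below is hypothetical.

```python
def pack_bits(bits):
    """Pack a list of 0/1 BIF values into bytes, eight sections per byte."""
    packed = bytearray((len(bits) + 7) // 8)
    for index, bit in enumerate(bits):
        if bit:
            packed[index // 8] |= 1 << (index % 8)
    return bytes(packed)

def get_bit(packed, index):
    """Read back the BIF value of the section with the given index."""
    return (packed[index // 8] >> (index % 8)) & 1

# A hypothetical ten-section bitmap.
bits = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
packed = pack_bits(bits)
unpacked = [get_bit(packed, index) for index in range(len(bits))]
```

With a BIF interval of 4, these ten bits would describe 40 join key values in two bytes, whereas storing the keys individually would need at least one entry per key.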

Referring again to FIG. 4, if bitmap interval filter data corresponding to the join query input in step S40 is found in step S42, the bitmap interval filter data is transmitted to each mapper of Hadoop (S43). Here, the distribution of the bitmap interval filter data is, by way of example, performed by Hadoop's DistributedCache.

Each mapper then filters the records of its own table using the bitmap interval filter data (S44). Here, each mapper transmits to the reducer only the records in the BIF sections, among the plurality of BIF sections of the bitmap interval filter data, to which the join key values of the table records involved in the join query belong, and filters out the remaining records.

The filtering method of the mapper according to the present invention is described below with reference to FIG. 7. FIG. 7 is a diagram showing the relationship among the elements of the bitmap interval filter data according to the present invention.

In FIG. 7, D denotes the join key domain, and Base is the start join key value of the join key domain. L is the BIF interval, and I is the index value of a BIF section. The offset value, Offset, denotes the distance between the start join key value and a join key value, and the join key value of a table record involved in the currently input join query is defined as V.

First, the mapper calculates the offset value from the distance between the join key value of a table record involved in the currently input join query and the start join key value. Taking as an example a table record whose join key value is 37, and referring to the example of [Table 1] and [Table 2], the join key value V is 37, the BIF interval is 4, and the start join key value Base is 0.

The offset value is the distance between V and Base, that is, their difference, and is therefore calculated as 37. The mapper then uses the calculated offset value and the BIF interval to extract the BIF section to which the join key value V belongs. That is, when the BIF interval is 4 and the offset value is 37, the BIF section to which the join key value V belongs can be extracted from the quotient obtained by dividing the offset value by the BIF interval.

This method can be defined as in [Equation 2].

[Equation 2]

I = (V - Base) / L

Here, I is the index of the BIF section to which the join key value of a table record included in the currently input join query belongs, V is the join key value of that table record, Base is the start join key value of the join key domain, and L is the BIF interval.

In this example, the join key value 37 is found to belong to the BIF section whose index value is 9 (the quotient of 37 divided by 4). This assumes that index values start from 0; the calculation may of course differ depending on the starting index number.
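The index calculation of [Equation 2] can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the invention: the function name `bif_index` is hypothetical, join keys are assumed to be integers, and the "quotient" is assumed to be taken by integer floor division.

```python
def bif_index(v: int, base: int, interval: int) -> int:
    """Return the 0-based index I of the BIF section containing join key v.

    Assumes integer join keys and floor division for the quotient.
    """
    if interval <= 0:
        raise ValueError("BIF interval must be positive")
    offset = v - base  # Offset: distance from the start join key value
    if offset < 0:
        raise ValueError("join key lies below the start of the join key domain")
    return offset // interval  # quotient of Offset divided by L


# Values from the example in the text: V = 37, Base = 0, L = 4
print(bif_index(37, 0, 4))  # 9
```

With a 1-based index convention, the same quotient would simply be incremented by one.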

Once the BIF section to which the current join key value belongs has been extracted as described above, the mapper performs the filtering by deciding, based on the BIF value (that is, the bit value) of the extracted BIF section, whether to transmit the records belonging to that BIF section. That is, when the BIF value is 1, the records belonging to the BIF section are transmitted to the reducer (S45); when the BIF value is 0, they are filtered out.

Through the above process, each mapper transmits to the reducer only the records of BIF sections whose BIF value is 1, which eliminates the network load caused by transmitting records unrelated to the input join query.
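The mapper-side filtering described above can be sketched as follows. The record shape `(join key, payload)`, the function name `filter_records`, and the treatment of keys outside the join key domain (they are dropped) are assumptions made for illustration; the BIF values are held in a plain list for readability.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Record = Tuple[int, str]  # (join key value, payload); hypothetical record shape


def filter_records(
    records: Iterable[Record],
    bif_values: Sequence[int],
    base: int,
    interval: int,
) -> Iterator[Record]:
    """Yield only records whose BIF section has BIF value 1."""
    for key, payload in records:
        index = (key - base) // interval  # I = (V - Base) / L
        # Assumption: keys outside the join key domain are filtered out.
        if 0 <= index < len(bif_values) and bif_values[index] == 1:
            yield key, payload


# Ten BIF sections of interval 4 over keys 0..39; only section 9 is set.
bif_values = [0] * 10
bif_values[9] = 1
records = [(5, "a"), (37, "b"), (38, "c"), (20, "d")]
kept = list(filter_records(records, bif_values, base=0, interval=4))
print(kept)  # [(37, 'b'), (38, 'c')]
```

Every record whose key falls in a section marked 1 is forwarded, so the filter works at section granularity and the reducer still performs the actual join on what it receives.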

The reducer likewise receives only the records filtered by the mappers, processes the join query, and outputs the result (S46). It therefore processes fewer records than in the conventional reduce-side join method, which reduces the computing cost.

In addition, during filtering by the mapper, the time required to locate the BIF section to which a join key value belongs and to look up its BIF value (that is, the bit value) remains short regardless of the size of the filter, that is, the size of the bitmap section filter data.
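One way to see why the lookup time does not depend on the filter size is to store one bit per BIF section in a packed byte array: the section index is computed arithmetically and the bit is read by direct addressing, with no search. The class below is a hypothetical sketch (the name `BitmapSectionFilter` and its methods are not from the source), assuming integer join keys.

```python
class BitmapSectionFilter:
    """Hypothetical bit-packed filter: one bit per BIF section."""

    def __init__(self, base: int, interval: int, num_sections: int) -> None:
        self.base = base
        self.interval = interval
        self.num_sections = num_sections
        self.bits = bytearray((num_sections + 7) // 8)  # 8 sections per byte

    def set_section(self, index: int) -> None:
        """Mark a BIF section as containing at least one result key."""
        self.bits[index >> 3] |= 1 << (index & 7)

    def passes(self, key: int) -> bool:
        """Constant-time check: one division, one byte read, one bit test."""
        index = (key - self.base) // self.interval
        if not 0 <= index < self.num_sections:
            return False
        return bool(self.bits[index >> 3] & (1 << (index & 7)))


bif = BitmapSectionFilter(base=0, interval=4, num_sections=10)
bif.set_section(9)
print(bif.passes(37), bif.passes(20), len(bif.bits))  # True False 2
```

In this sketch ten sections occupy two bytes, which also illustrates the memory saving of expressing the filter in bits.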

Meanwhile, when no bitmap section filter data is found in step S42 of FIG. 4, the Hadoop-based reduce-side join processing system according to the present invention processes the join query in the conventional manner, for example with a conventional Hadoop-based Hive system (S47). For example, the result of the join query is generated through the conventional MapReduce reduce-side join method, the map-side join method, or a combination thereof.

At this time, result key values are extracted from the result of the join query processed in the conventional manner (S48), and bitmap section filter data for that join query is generated through the process shown in FIG. 5 and stored in the BIF database (S49), so that it can be used when a similar join query is input later.
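The generation of bitmap section filter data from extracted result key values can be sketched as below, following the steps recited in the claims (divide the join key domain into BIF sections, then record for each section whether any result key falls in it) and the relation L = D / R from claim 2. Several points are assumptions made for illustration: the function name `generate_bif`, the dictionary layout of the result, counting R over distinct keys, and flooring L to an integer of at least 1. The sample keys are made up and do not reproduce [Table 1] or [Table 2].

```python
import math
from typing import Dict, Iterable


def generate_bif(result_keys: Iterable[int], base: int, domain_size: int) -> Dict[str, object]:
    """Build bitmap section filter data from result key values (sketch)."""
    distinct = sorted(set(result_keys))
    if not distinct:
        raise ValueError("no result key values to build a filter from")
    if distinct[0] < base or distinct[-1] >= base + domain_size:
        raise ValueError("result key outside the join key domain")
    # L = D / R; floored to an integer >= 1 (assumption).
    interval = max(1, domain_size // len(distinct))
    num_sections = math.ceil(domain_size / interval)
    bif_values = [0] * num_sections
    for key in distinct:
        bif_values[(key - base) // interval] = 1  # section contains a result key
    return {"base": base, "interval": interval, "bif_values": bif_values}


# Hypothetical input: domain of 40 keys starting at 0, ten distinct result keys.
bif = generate_bif([0, 1, 2, 3, 16, 17, 36, 37, 38, 39], base=0, domain_size=40)
print(bif["interval"], bif["bif_values"])  # 4 [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
```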

FIG. 8 is a diagram showing an example of a Hive system for implementing the Hadoop-based reduce-side join query processing method according to the present invention. To implement the join query processing method according to the present invention, QueryMatcher, JoinkeyDistributer, RecordFilter, JoinkeyExtractor, and BIFGenerator are added to the existing Hadoop-based Hive system.

QueryMatcher determines whether a query received from the user is a join query and, if it is, searches the BIF database for corresponding bitmap section filter data.

When bitmap section filter data exists, JoinkeyDistributer distributes the retrieved bitmap section filter data to each mapper. The RecordFilter implemented in each mapper then performs the filtering process described above using the bitmap section filter data.

On the other hand, when no bitmap section filter data exists, JoinkeyExtractor extracts the result key values, and BIFGenerator uses the extracted result key values to generate the bitmap section filter data for that join query.

In the example of FIG. 8, QueryMatcher is implemented in SemanticAnalyzer, which analyzes queries; JoinkeyDistributer is implemented in ExecDriver; RecordFilter is implemented in the mapper, that is, ExecMapper; and JoinkeyExtractor is implemented in ExecReducer. BIFGenerator is designed, in this example, to operate independently of the query flow of the join query.
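The division of work among these components can be summarized in a short control-flow sketch. It is not Hive code: the function `process_join_query`, the in-memory dictionary standing in for the BIF database, the matching of queries by exact text, and the three callables are all hypothetical stand-ins, and BIF generation is called inline here for brevity even though BIFGenerator is described as running independently of the query flow.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, str]    # (join key value, payload); hypothetical result row shape
Bif = Dict[str, object]  # stand-in for stored bitmap section filter data


def process_join_query(
    query: str,
    bif_db: Dict[str, Bif],
    run_filtered_join: Callable[[str, Bif], List[Row]],
    run_conventional_join: Callable[[str], List[Row]],
    build_bif: Callable[[List[int]], Bif],
) -> List[Row]:
    bif = bif_db.get(query)                   # QueryMatcher: search the BIF database
    if bif is not None:
        # JoinkeyDistributer sends the filter to the mappers; RecordFilter applies it.
        return run_filtered_join(query, bif)
    rows = run_conventional_join(query)       # conventional processing (S47)
    result_keys = [key for key, _ in rows]    # JoinkeyExtractor (S48)
    bif_db[query] = build_bif(result_keys)    # BIFGenerator, then store (S49)
    return rows


calls: List[str] = []


def conventional(query: str) -> List[Row]:
    calls.append("conventional")
    return [(37, "x")]


def filtered(query: str, bif: Bif) -> List[Row]:
    calls.append("filtered")
    return [(37, "x")]


def build(keys: List[int]) -> Bif:
    return {"keys": sorted(set(keys))}


db: Dict[str, Bif] = {}
process_join_query("Q1", db, filtered, conventional, build)  # no filter yet
process_join_query("Q1", db, filtered, conventional, build)  # filter reused
print(calls)  # ['conventional', 'filtered']
```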

Hereinafter, experimental results of applying the Hadoop-based reduce-side join query processing method according to the present invention to Hive are described with reference to FIGS. 9 to 12.

In the experiments, the processing operation of the Hadoop-based reduce-side join query processing method according to the present invention was divided into two cases. When there is no bitmap section filter data corresponding to the user query, the result key values are extracted during query processing. When bitmap section filter data exists, the bitmap section filter data corresponding to the query is distributed to each mapper to remove records unrelated to the join result. The query processing performance of these two cases was compared experimentally with that of conventional Hive.

The experiments were divided into binary joins, in which the join is processed with a single MapReduce job, and multi-way joins, in which the join is processed with two or more MapReduce jobs. The data model used in the experiments is that of TPC-H, which is commonly used to measure the performance of data warehouse systems. The binary join used the LineItem and Orders tables, and the multi-way join added the customer table to these two tables. The SF (Scale Factor) used here is a coefficient that determines the number of records generated for each table.

The join queries used in the experiments are shown in [Table 3]. Because the TPC-H benchmark queries are heavily influenced by other factors such as aggregate functions, queries consisting of the following join conditions and simple where conditions were used instead. A where clause is included because the join keys are extracted from the join result and therefore reflect the where-clause conditions as well as the join condition. The where-clause conditions are also used to control the join selectivity. The binary join used orderdate as the where-clause condition, with the value domain given below. The multi-way join additionally used an mktsegment condition in the where clause.

orderdate: 1992-01-01 ~ 1998-12-31

mktsegment: AUTOMOBILE, BUILDING, FURNITURE, MACHINERY, HOUSEHOLD

[Table 3]

FIG. 9 is a graph of the join query processing time measured as the number of nodes increases in a binary join. In describing the graphs, the case in which no bitmap section filter data exists is referred to as 'without BIF', and the case in which bitmap section filter data exists is referred to as 'using BIF'.

Referring to FIG. 9 and comparing conventional Hive with the 'without BIF' case, the query processing time decreases in both graphs as the number of nodes increases. In the 'without BIF' case, the result key values are extracted during query processing; as FIG. 9 shows, the process of extracting the join key values does not significantly affect performance.

Comparing the 'using BIF' case with conventional Hive, the query processing time when BIF is used is improved by about 20% to 26% relative to conventional Hive. This is because filtering in the mappers removes records unrelated to the join query, which reduces file I/O and network cost in the shuffle phase.

FIG. 10 is a graph of the join query processing time measured as the number of nodes increases in a multi-way join, FIG. 11 is a graph of the join query processing time measured according to data size in a binary join, and FIG. 12 is a graph of the join query processing time measured according to data size in a multi-way join.

As shown in FIGS. 10 to 12, when BIF is used, that is, when the Hadoop-based reduce-side join query processing method according to the present invention is applied, performance is improved compared with conventional Hive.

Although several embodiments of the present invention have been shown and described, those of ordinary skill in the art to which the present invention pertains will appreciate that these embodiments may be modified without departing from the principles or spirit of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A reduce-side join query processing method for a Hadoop-based reduce-side join processing system, the method comprising:
(a) generating, by the Hadoop-based reduce-side join processing system, bitmap section filter data based on result key values extracted from a result of a join query, and storing the bitmap section filter data in a BIF database;
(b) when a new join query is input to the Hadoop-based reduce-side join processing system, searching, by the Hadoop-based reduce-side join processing system, the BIF database for bitmap section filter data corresponding to the new join query;
(c) when the bitmap section filter data is found in step (b), transmitting the found bitmap section filter data to each mapper of Hadoop;
(d) transmitting, by each mapper, records filtered on the basis of the bitmap section filter data to a reducer of Hadoop; and
(e) outputting, by the reducer, a result of the new join query based on the records transmitted from each mapper,
wherein step (a), performed by the Hadoop-based reduce-side join processing system, comprises:
(a1) extracting the result key values from the result of the join query;
(a2) dividing a join key domain into a plurality of BIF sections in units of a BIF interval;
(a3) checking whether the extracted result key values belong to each BIF section; and
(a4) generating bitmap section filter data that includes information on the plurality of BIF sections and a BIF value indicating whether an extracted result key value belongs to each BIF section, and storing the bitmap section filter data in the BIF database, and
wherein, in step (d), the mapper transmits to the reducer the records in the BIF section, among the plurality of BIF sections of the bitmap section filter data, to which the join key value of a table record included in the new join query belongs.

2. The method according to claim 1,
wherein, in step (a2), the BIF interval is calculated by the equation L = D / R (where L is the BIF interval, D is the number of elements of the join key domain, and R is the number of extracted result key values).

3. The method of claim 2,
wherein the bitmap section filter data further includes the BIF interval and a start join key value of the join key domain, and
wherein, in step (d), the mapper:
calculates an offset value based on the gap between the join key value of a table record included in the new join query and the start join key value;
extracts the BIF section to which the join key value of the table record included in the new join query belongs, based on the join key value of the table record, the offset value, and the BIF interval; and
determines, based on the BIF value of the extracted BIF section, whether to transmit the records in the extracted BIF section to the reducer.

4. The method of claim 3,
wherein the mapper extracts the BIF section to which the join key value of a table record included in the new join query belongs by the equation
I = (V - Base) / L
(where I is the index of the BIF section to which the join key value of the table record included in the new join query belongs, V is the join key value of the table record included in the new join query, Base is the start join key value of the join key domain, and L is the BIF interval).

5. The method according to any one of claims 1 to 4, further comprising:
(f1) when the bitmap section filter data is not found in step (b), outputting, by the Hadoop-based reduce-side join processing system, a result of the new join query through a predetermined processing method;
(f2) extracting, by the Hadoop-based reduce-side join processing system, result key values from the result of the new join query output in step (f1); and
(f3) performing, by the Hadoop-based reduce-side join processing system, steps (a2) to (a4) on the result key values extracted in step (f2) to generate bitmap section filter data, and storing the bitmap section filter data in the BIF database.