KR20180085633A

KR20180085633A - Method and apparatus for processing query

Info

Publication number: KR20180085633A
Application number: KR1020170009426A
Authority: KR
Inventors: 정문영; 이태휘; 김성수; 송혜원; 원종호
Original assignee: 한국전자통신연구원
Priority date: 2017-01-19
Filing date: 2017-01-19
Publication date: 2018-07-27
Also published as: US20180203896A1

Abstract

Provided are a query processing method and apparatus, which can increase query processing speed. The query processing method comprises selecting a partition corresponding to an input query if there are partitions in a data table when the query is input; selecting at least one partition column set corresponding to the input query when there is at least one partition column set in the selected partition; and processing the query for the selected partition column set.

Description

The present invention relates to a method and apparatus for processing a query.

Recently, as big data processing has become an important research topic, Hadoop, an open-source project that supports parallel processing of large-scale data, has been studied extensively. Hadoop consists of the Hadoop Distributed File System (HDFS), a platform for distributed storage and management of large-scale data, and MapReduce (MR), a framework for distributed parallel processing of large-scale data. Many query processing techniques based on MapReduce have been studied.

SQL (Structured Query Language)-on-Hadoop refers to systems that provide SQL query processing for data stored in the Hadoop Distributed File System (HDFS). Most SQL-on-Hadoop systems do not use the MapReduce architecture provided by Hadoop and are instead built on new distributed processing models and frameworks. Numerous SQL-on-Hadoop systems exist, including Apache Hive, Apache Tajo, Cloudera's Impala, and Facebook's Presto.

An SQL-on-Hadoop system can distribute the processing of queries over large volumes of data spread across multiple nodes. However, moving the data to the nodes that process the query requires a large amount of disk I/O (input/output) and network transfer, which slows query processing. To improve the slow processing of distributed, HDFS-based data, techniques such as materialized views, query column sets, and data partitions are used.

The problem to be solved by the present invention is to provide a method and apparatus capable of further improving query processing speed.

A query processing method according to one aspect of the present invention is a method by which a query processing apparatus processes a query, the method comprising: when a query is input and a data table has partitions, selecting a partition corresponding to the input query; when the selected partition has at least one partition column set, selecting at least one partition column set corresponding to the input query; and processing the query against the selected partition column set.

When the data table is divided into at least one horizontal partition, the partition column set may be a data structure in which, for each horizontal partition, a column set grouping at least one column of the data table is stored as a cache table.

At least one partition column set may be selectively formed for each partition of the data table, and the number of partition column sets and the types of columns forming them may differ from partition to partition.

Selecting the partition column set may include analyzing the condition clauses of the input query and, when at least one partition column set has been formed for the selected partition, selecting one partition column set from among them based on the analysis result.

The query processing method may further include: processing the query against the data table when the data table has no partitions; and processing the query against the selected partition when the selected partition has no partition column set.

The query processing apparatus may be a distributed query processing engine.

A configuration method according to another aspect of the present invention is a method of configuring column sets for query processing, the method comprising: analyzing a query workload and dividing a data table into a plurality of horizontal partitions; and, for each horizontal partition, selectively configuring at least one partition column set, which groups at least one column of the data table, based on the result of analyzing the query workload.

The number of partition column sets may differ from one horizontal partition to another.

The types of columns constituting the partition column sets may differ from one horizontal partition to another.

The configuring may include storing the partition column set as a cache table.

The configuring may further include, when a plurality of partition column sets have been formed for each horizontal partition, merging at least two of the plurality of partition column sets into one for at least one horizontal partition.

A query processing apparatus according to yet another aspect of the present invention includes: an input/output unit configured to receive a query; and a processor connected to the input/output unit and configured to perform query processing. When a query is input through the input/output unit, the processor selects, from among the horizontal partitions of a data table, the horizontal partition corresponding to the input query; when the selected horizontal partition has at least one partition column set, the processor selects at least one partition column set corresponding to the input query and processes the query against the selected partition column set.

The processor may be configured to analyze the condition clauses of the input query and, when at least one partition column set has been formed for the selected partition, to select one partition column set from among them based on the analysis result.

The data blocks corresponding to the horizontal partitions and the data blocks corresponding to the partition column sets are stored in a distributed manner across multiple nodes of a distributed file system, and the query processing apparatus may process the query by reading the data blocks corresponding to the partition column set of the horizontal partition that corresponds to the input query.

According to an embodiment of the present invention, query processing speed can be improved by processing queries with partition column sets, which combine partitions and query column sets. In addition, by building horizontal partitions for a table of distributed data and building partition column sets from an analysis of the query workload for each horizontal partition, the data a query has to process is filtered in advance, which improves query processing performance.

In particular, in a distributed query processing system with many OLAP (OnLine Analytical Processing) operations, only the partition column set corresponding to the query needs to be read and sent to the relevant node, so query processing speed can be increased.

FIG. 1 is an exemplary diagram showing partition column sets according to an embodiment of the present invention.
FIG. 2 is a flowchart of a process of configuring partition column sets according to an embodiment of the present invention.
FIG. 3 is a flowchart of a query processing method according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram showing a query processing procedure according to an embodiment of the present invention.
FIG. 5 is a structural diagram of a query processing apparatus according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily carry them out. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the present invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Hereinafter, a query processing method and apparatus according to an embodiment of the present invention are described with reference to the drawings.

In an embodiment of the present invention, a query is processed using a partition column set, which combines a database partition with a query column set.

Database partitioning is the physical division of a table into smaller parts called partitions. Horizontal partitioning divides the data of a table, that is, its records, into several sub-tables based on the value of a specific key column; depending on the criterion used to divide the records, methods such as range partitioning and hash partitioning are mainly used. Vertical partitioning divides the data of a table into several sub-tables such that the sub-tables hold disjoint, non-overlapping sets of columns. In this case, the key columns of the original table are duplicated in every sub-table. With database partitions, an incoming query can be processed using only the partitions it needs rather than the data of the whole table, which improves query performance. In particular, when distributed data is processed, the cost of transferring data generated at intermediate stages of the query tree (for example, an abstract syntax tree, AST) between nodes is very high, so database partitions can improve processing speed by filtering out unneeded data early.
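The two partitioning schemes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `hash_partition` and `vertical_partition`, the integer key, and the sample rows are hypothetical and do not come from the specification.

```python
def hash_partition(rows, key, n_parts):
    """Horizontal partitioning: assign each record to a sub-table by its key value."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # Assumption: integer keys, so a simple modulo serves as the hash.
        parts[row[key] % n_parts].append(row)
    return parts


def vertical_partition(rows, key_columns, column_groups):
    """Vertical partitioning: disjoint column groups, key columns repeated in each."""
    return [
        [{c: row[c] for c in key_columns + group} for row in rows]
        for group in column_groups
    ]


rows = [
    {"ORDERKEY": 1, "QUANTITY": 17, "TAX": 0.02},
    {"ORDERKEY": 2, "QUANTITY": 36, "TAX": 0.06},
    {"ORDERKEY": 3, "QUANTITY": 8, "TAX": 0.00},
]
h_parts = hash_partition(rows, "ORDERKEY", 2)
v_parts = vertical_partition(rows, ["ORDERKEY"], [["QUANTITY"], ["TAX"]])
```

Each horizontal sub-table keeps every column but only some records, whereas each vertical sub-table keeps every record but only the key plus one column group.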

A query column set is obtained by analyzing the query workload and physically materializing only the columns frequently used in clauses such as WHERE, GROUP BY, and HAVING; when a query later arrives, the query column set can be used to speed up its processing. Query column sets are effective at speeding up query processing in systems dominated by OLAP (OnLine Analytical Processing) operations, such as SQL (Structured Query Language)-on-Hadoop systems. A query column set differs from a horizontal partition in that it does not store all the columns of the original table.

In an embodiment of the present invention, queries are processed by combining database partitions with query column sets; specifically, a partition column set structure that combines horizontal partitions with query column sets is provided. The embodiment provides a data structure in which a database table (also called a data table) is divided into horizontal partitions that are physically stored, and column sets are generated for the resulting horizontal partitions and physically stored. Here, a column set for a horizontal partition of a table is called a partition column set.

FIG. 1 is an exemplary diagram showing partition column sets according to an embodiment of the present invention.

As shown in FIG. 1, partition column sets 121, 122, 131, 141, and 142 are formed for the data table 100 (FIG. 1(a)).

The data table 100 is divided, according to a criterion (for example, a range or a hash value), into horizontal partitions that do not overlap and together contain all the data. For example, as shown in FIG. 1, the data table 100 can be divided into three horizontal partitions 120, 130, and 140 according to the value of SHIPDATE (FIG. 1(b)): horizontal partition 1 (120), containing the data whose SHIPDATE is less than "1994-01-01"; horizontal partition 2 (130), containing the data whose SHIPDATE is greater than or equal to "1994-01-01" and less than "1997-01-01"; and horizontal partition 3 (140), containing the data whose SHIPDATE is greater than "1997-01-01".
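The SHIPDATE range partitioning of FIG. 1 could be expressed, as a rough sketch, with a sorted list of boundaries and a binary search. The function name `partition_index` is hypothetical. One assumption is made loudly here: the text does not say where a SHIPDATE of exactly "1997-01-01" belongs, so this sketch places it in horizontal partition 3 so that the partitions cover all the data.

```python
from bisect import bisect_right
from datetime import date

# Range boundaries taken from the FIG. 1 example.
BOUNDARIES = [date(1994, 1, 1), date(1997, 1, 1)]


def partition_index(shipdate):
    """Return 0, 1, or 2 for horizontal partitions 120, 130, and 140.

    bisect_right counts the boundaries that are <= shipdate, so a date equal
    to a boundary falls into the later partition (an assumption for the
    1997-01-01 boundary, which the text leaves open).
    """
    return bisect_right(BOUNDARIES, shipdate)


samples = [date(1993, 9, 2), date(1994, 1, 1), date(1996, 12, 31), date(1998, 5, 5)]
indexes = [partition_index(d) for d in samples]
```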

In addition, query column sets 110 and 111 can be configured for the data table 100. A query column set is a data structure in which a frequently used set of columns of a data table is stored as a cache table. A query column set groups and stores the columns of the data table that are frequently used together in clauses such as WHERE, HAVING, and GROUP BY, and zero or more query column sets may be created for a given table. For example, a total of two column sets can be configured for the data table 100: query column set 110 for {ORDERKEY, PARTKEY, LINENUMBER, SUPPKEY} and column set 111 for {ORDERKEY, TAX, QUANTITY, SHIPDATE} (FIG. 1(b)).

Building on the concepts of the horizontal partitions 120, 130, and 140 and the query column sets 110 and 111 that can be configured for the data table 100, an embodiment of the present invention configures partition column sets for the data table. A partition column set is data in which the frequently used query column sets of each horizontal partition of the data table are stored as cache tables. For example, for each of the horizontal partitions 120, 130, and 140 configured for the data table 100, columns that are frequently used together can be grouped to form partition column sets 121, 122, 131, 141, and 142 (FIG. 1(d)). When partition column sets are configured in this way, the number of partition column sets and the types of columns in each column set may differ from partition to partition.
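One way to picture the resulting data structure is a horizontal partition that owns zero or more projected cache tables. The sketch below is an assumption-laden illustration: the class names `PartitionColumnSet` and `HorizontalPartition`, the method `add_column_set`, and the sample rows are hypothetical, and an in-memory list stands in for a physically stored cache table.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionColumnSet:
    columns: frozenset  # columns grouped in this set
    rows: list          # cache table: the partition's rows projected onto `columns`


@dataclass
class HorizontalPartition:
    name: str
    rows: list
    column_sets: list = field(default_factory=list)

    def add_column_set(self, columns):
        """Materialize a column set for this partition only."""
        projected = [{c: r[c] for c in columns} for r in self.rows]
        pcs = PartitionColumnSet(frozenset(columns), projected)
        self.column_sets.append(pcs)
        return pcs


p1 = HorizontalPartition("P1", [
    {"ORDERKEY": 1, "PARTKEY": 10, "TAX": 0.02, "SHIPDATE": "1993-09-02"},
    {"ORDERKEY": 2, "PARTKEY": 11, "TAX": 0.06, "SHIPDATE": "1993-11-20"},
])
p2 = HorizontalPartition("P2", [
    {"ORDERKEY": 3, "PARTKEY": 12, "TAX": 0.00, "SHIPDATE": "1995-03-10"},
])

# The number of sets and their columns may differ per partition; P2 gets none.
p1.add_column_set(["ORDERKEY", "PARTKEY"])
p1.add_column_set(["ORDERKEY", "TAX", "SHIPDATE"])
```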

Configuring partition column sets in this way allows much of the data in the full data table that a query does not need to be filtered out in advance, so query processing speed can be increased.

FIG. 2 is a flowchart of a process of configuring partition column sets according to an embodiment of the present invention.

The description here takes as an example a distributed query processing engine that configures the partition column sets, but the present invention is not limited to this.

The distributed query processing engine stores the query workload produced by query processing. To configure partition column sets, as shown in FIG. 2, the query workload is first analyzed (S100) and candidate horizontal partitions are selected (S110). Based on the result of the workload analysis, the engine may decide the criterion by which to divide the table, construct horizontal partitions for the data table, and select at least one of them as a candidate horizontal partition. Alternatively, if horizontal partitions have already been configured for the data table, at least one of the existing horizontal partitions may be selected as a candidate horizontal partition based on the result of the workload analysis.

Next, candidate partition column sets are configured for each of the selected candidate horizontal partitions (S120). For each candidate horizontal partition, frequently used columns are grouped into candidate partition column sets based on the result of the workload analysis. For example, on the assumption that column sets used frequently in the past will continue to be used, the columns of such a column set can be grouped to form a candidate partition column set.

The candidate partition column sets generated in this way may fit the query workload too closely: specific queries run quickly, but overall query performance may actually degrade. Merging several candidate partition column sets may make particular queries somewhat slower but can improve overall query performance. Therefore, to improve overall query performance, several candidate partition column sets are merged (S130). The partition column sets are finally configured through this merging (S140). Step S130 is optional.
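Steps S120 and S130 can be sketched as follows. The specification does not state how frequency is measured or which candidates are merged, so the frequency threshold `min_count`, the overlap threshold `min_overlap`, the greedy merge order, and both function names are assumptions made purely for illustration.

```python
from collections import Counter


def candidate_column_sets(workload, min_count):
    """S120 (sketch): keep column groups seen at least `min_count` times.

    `workload` lists, per past query on one partition, the set of columns used.
    """
    counts = Counter(frozenset(q) for q in workload)
    return [cs for cs, n in counts.items() if n >= min_count]


def merge_column_sets(candidates, min_overlap):
    """S130 (sketch): greedily merge candidates sharing >= `min_overlap` columns."""
    merged = []
    for cs in sorted(candidates, key=sorted):  # deterministic processing order
        for i, existing in enumerate(merged):
            if len(cs & existing) >= min_overlap:
                merged[i] = existing | cs
                break
        else:
            merged.append(cs)
    return merged


workload = [
    {"ORDERKEY", "TAX"}, {"ORDERKEY", "TAX"},
    {"ORDERKEY", "TAX", "QUANTITY"}, {"ORDERKEY", "TAX", "QUANTITY"},
    {"PARTKEY", "SUPPKEY"}, {"PARTKEY", "SUPPKEY"},
    {"LINENUMBER"},
]
candidates = candidate_column_sets(workload, min_count=2)
final_sets = merge_column_sets(candidates, min_overlap=2)
```

In this toy workload the two overlapping candidates collapse into one wider set: a query that needs only ORDERKEY and TAX now reads one extra column, but one fewer cache table has to be stored and maintained.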

Meanwhile, zero or more partition column sets according to an embodiment of the present invention may be formed for a horizontal partition. That is, at least one partition column set may be formed for one horizontal partition, while no partition column set is formed for another. In other words, partition column sets may be selectively formed for each horizontal partition.

Next, a method of processing a query using partition column sets configured as described above is described.

FIG. 3 is a flowchart of a query processing method according to an embodiment of the present invention.

When a query is input from a terminal (S300), the query processing apparatus first determines whether the data tables the query refers to have partitions (S310). If a data table referred to by the query has partitions (for example, horizontal partitions), the apparatus analyzes the query and selects the partition it needs (S320). For example, it analyzes condition clauses of the query such as WHERE, GROUP BY, and HAVING and selects, from among the partitions, the partition that holds the data the query needs.

Next, the apparatus determines whether the selected partition has a column set (for example, a partition column set) (S330). If the selected partition has a column set, for example a partition column set, the apparatus analyzes the query and selects the partition column set it needs (S340). That is, it analyzes clauses of the query such as WHERE, GROUP BY, HAVING, and SELECT to select the required partition column set. Once a partition has been selected and a partition column set has been selected for that partition, the query processing apparatus processes the query against the selected partition column set (S350) and returns the result (S380).

If the selected partition has no partition column set, the query processing apparatus processes the query against the selected partition (S360) and returns the result (S380).

If, in step S310, the data tables have no partitions, the query processing apparatus processes the query against the data table (S390) and returns the result (S380).
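The decision flow of FIG. 3 can be traced with a small sketch. Everything about the representation is an assumption for illustration: the dictionary layout of `table` and `query`, the names `choose_source` and `process_query`, ISO date strings compared lexicographically, and the fallback to the partition when no column set covers the query's columns (the text only describes the case where the partition has no column set at all).

```python
def choose_source(table, query):
    """Pick the smallest stored structure that can answer the query (S310-S340)."""
    partitions = table.get("partitions")
    if not partitions:                                    # S310: no partitions
        return "table", table["rows"]                     # S390
    part = next(p for p in partitions
                if p["lo"] <= query["key"] < p["hi"])     # S320: select partition
    for cs in part.get("column_sets", []):                # S330: any column sets?
        if query["columns"] <= cs["columns"]:             # S340: set covers the query
            return "column_set", cs["rows"]               # S350
    return "partition", part["rows"]                      # S360


def process_query(table, query):
    """Run the filter and projection on the chosen source and return the result (S380)."""
    label, rows = choose_source(table, query)
    result = [{c: r[c] for c in query["select"]} for r in rows if query["where"](r)]
    return label, result


rows = [
    {"ORDERKEY": 1, "TAX": 0.02, "COMMENT": "a", "SHIPDATE": "1993-09-02"},
    {"ORDERKEY": 2, "TAX": 0.06, "COMMENT": "b", "SHIPDATE": "1995-03-10"},
]
table = {
    "rows": rows,
    "partitions": [
        {"lo": "0000-01-01", "hi": "1994-01-01", "rows": rows[:1],
         "column_sets": [{"columns": {"ORDERKEY", "TAX", "SHIPDATE"},
                          "rows": [{"ORDERKEY": 1, "TAX": 0.02,
                                    "SHIPDATE": "1993-09-02"}]}]},
        {"lo": "1994-01-01", "hi": "9999-12-31", "rows": rows[1:],
         "column_sets": []},
    ],
}
query = {"key": "1993-09-02", "columns": {"TAX", "SHIPDATE"}, "select": ["TAX"],
         "where": lambda r: r["SHIPDATE"] == "1993-09-02"}
label, result = process_query(table, query)
```

A query that also needs COMMENT would fall through to the partition itself, and a table stored without partitions would be scanned in full.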

In a distributed file system such as Hadoop, the original table is divided into logical partitions by, for example, a user-specified range, and the logical partitions are physically replicated and distributed across multiple nodes. When a query arrives, operations such as scans and joins need to be performed only on the partitions that contain the data required to process it. Because only the required data partitions have to be moved to the node that processes the data, unnecessary disk I/O and network transfer can be reduced.

Based on the query processing method according to the embodiment of the present invention described above, the process of handling a query when partition column sets are applied to a distributed file system is as follows.

FIG. 4 is an exemplary diagram showing a query processing procedure according to an embodiment of the present invention.

It is assumed here that the horizontal partitions and the partition column sets are stored in a distributed manner across multiple nodes of the distributed file system.

The data table 410 is divided into logical horizontal partitions according to a criterion (a range or hash value; here, for example, SHIPDATE), and the horizontal partitions 411, 412, and 413 are split into data blocks 431, 433, and 435 of a size determined by the system configuration and stored in a distributed manner across multiple data nodes N1 to N4. Here, a data node refers to a Hadoop data node. Partition column sets 421 to 423 are configured for the horizontal partitions, and the partition column sets 421 to 423 are likewise split into data blocks 432, 434, and 436 of a predetermined size and stored in a distributed manner.

With the data blocks for the horizontal partitions of the data table, and for the partition column sets of each horizontal partition, stored in a distributed manner across the data nodes N1 to N4, a horizontal partition is selected according to the query when a query 401 arrives. For example, horizontal partition 411 is selected from among the partitions according to the SHIPDATE value (1993-09-02) in the WHERE clause of query 401. Partition column set 421 exists for the selected horizontal partition 411 and contains all the columns the query needs, so the query is processed against partition column set 421.

At this point, data node N1, which processes the query, reads and fetches data block 432 in order to process the query. Compared with the cost of reading the entire data table and sending it to the node, or of reading the relevant horizontal partition and sending it to the node, the embodiment of the present invention only needs to read the partition column set and send it to the node, so query processing speed can be increased. In other words, both the I/O cost of reading the data and the cost of transferring the data to the node that processes the query are reduced, which improves query processing speed. With the partition column set structure, data that the query does not need is filtered out at an early stage, reducing unnecessary disk I/O and network transfer.
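The size of the saving can be illustrated with back-of-the-envelope arithmetic. All figures below (row counts, per-column byte widths, and the function name `scan_bytes`) are made-up values chosen only to show the calculation; they are not measurements from the specification.

```python
# Hypothetical average column widths in bytes.
COLUMN_WIDTHS = {"ORDERKEY": 8, "TAX": 8, "SHIPDATE": 10, "COMMENT": 44}


def scan_bytes(row_count, columns=None):
    """Bytes that must be read and shipped when scanning `columns` of `row_count` rows."""
    cols = COLUMN_WIDTHS if columns is None else columns
    return row_count * sum(COLUMN_WIDTHS[c] for c in cols)


TABLE_ROWS = 6_000_000      # hypothetical size of the whole data table
PARTITION_ROWS = 2_000_000  # hypothetical size of the selected horizontal partition

full_table = scan_bytes(TABLE_ROWS)                              # whole table
one_partition = scan_bytes(PARTITION_ROWS)                       # selected partition
column_set = scan_bytes(PARTITION_ROWS, ["TAX", "SHIPDATE"])     # partition column set
```

Under these assumed numbers, the full table costs 420 MB, the selected partition 140 MB, and the partition column set 36 MB, so the horizontal and the column-wise filtering compound.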

FIG. 5 is a structural diagram of a query processing apparatus according to an embodiment of the present invention.

As shown in FIG. 5, the query processing apparatus 1 according to an embodiment of the present invention includes a processor 10, a memory 20, and an input/output unit 30. The processor 10 may be configured to implement the methods described above with reference to FIGS. 1 to 4.

The memory 20 is connected to the processor 10 and stores various information related to the operation of the processor 10. The memory 20 may store instructions for the operations to be performed by the processor 10, or may load instructions from a storage device (not shown) and store them temporarily.

The processor 10 can execute the instructions stored in or loaded into the memory 20. The processor 10 and the memory 20 are connected to each other through a bus (not shown), and an input/output interface (not shown) may also be connected to the bus.

The input/output unit 30 is configured to output the processing results of the processor 10 or to receive a query and provide it to the processor 10.

Embodiments of the present invention are not implemented only through the apparatus and/or method described above. They may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which such a program is recorded. Such implementations can readily be carried out by those skilled in the art to which the present invention pertains from the description of the embodiments above.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

A method by which a query processing apparatus processes a query, the method comprising:
Selecting, when a query is input, a partition corresponding to the input query if partitions exist in a data table;
Selecting at least one partition column set corresponding to the input query if at least one partition column set exists in the selected partition; and
Processing the query for the selected partition column set.

The method according to claim 1,
Wherein the partition column set is a data structure in which, when the data table is divided into at least one horizontal partition, a column set including at least one column of the data table is stored as a cache table for each horizontal partition.

The method according to claim 1,
Wherein at least one partition column set is selectively formed for each partition of the data table, and the number of partition column sets formed and the type of columns forming each partition column set differ from partition to partition.

The method of claim 3,
Wherein the step of selecting the partition column set comprises analyzing a condition of the input query and, if at least one partition column set is formed for the selected partition, selecting one of the at least one partition column set based on the analysis result.

The method according to claim 1,
Further comprising:
Processing the query for the data table if the data table does not have a partition; and
Processing the query for the selected partition if the selected partition does not have a partition column set.

The method according to claim 1,
Wherein the query processing device is a distributed query processing engine.

A method for constructing a column set for query processing, the method comprising:
Analyzing a query workload to divide a data table into a plurality of horizontal partitions; and
Selectively constructing, for each horizontal partition, at least one partition column set including at least one column of the data table, based on the analysis result of the query workload.

The method of claim 7,
Wherein the number of partition column sets formed differs for each horizontal partition.

The method of claim 7,
Wherein the type of columns constituting a partition column set differs for each horizontal partition.

The method of claim 7,
Wherein the constructing step comprises storing the partition column set as a cache table.

The method of claim 7,
Wherein the constructing step comprises:
Integrating at least two of the plurality of partition column sets for at least one horizontal partition when a plurality of partition column sets are formed for each horizontal partition.

A query processing apparatus comprising:
An input/output unit configured to receive a query; and
A processor connected to the input/output unit and configured to perform query processing,
Wherein the processor, when a query is input through the input/output unit, selects a horizontal partition corresponding to the input query from among horizontal partitions of a data table, selects at least one partition column set corresponding to the input query if at least one partition column set exists in the selected horizontal partition, and processes the query for the selected partition column set.

The apparatus according to claim 12,
Wherein the partition column set is a data structure in which, when the data table is divided into at least one horizontal partition, a column set including at least one column of the data table is stored as a cache table for each horizontal partition.

The apparatus according to claim 12,
Wherein at least one partition column set is selectively formed for each partition of the data table, and the number of partition column sets formed and the type of columns forming each partition column set differ from partition to partition.

The apparatus according to claim 14,
Wherein the processor analyzes a condition of the input query and, if at least one partition column set is formed for the selected partition, selects one of the at least one partition column set based on the analysis result.

The apparatus according to claim 12,
Wherein a data block corresponding to the horizontal partition and a data block corresponding to the partition column set are stored in a distributed manner across multiple nodes of a distributed file system, and
Wherein the query processing apparatus reads the data block corresponding to the partition column set of the horizontal partition corresponding to the input query, and processes the query.