KR20160064569A

KR20160064569A - SQL query processing method using MapReduce

Info

Publication number: KR20160064569A
Application number: KR1020140168339A
Authority: KR
Inventors: 강우람; 김현규
Original assignee: 삼육대학교산학협력단
Priority date: 2014-11-28
Filing date: 2014-11-28
Publication date: 2016-06-08
Also published as: KR101638048B1

Abstract

The present invention relates to a method of processing big data in a data processing system consisting of: a web server (10) that provides the data to be processed; a master node (20) that divides the data of the web server (10) and delivers the divided data to multiple distributed nodes so that a given task can be processed in parallel; a mapper (30), a subnode to which the master node (20) allocates map tasks; and a reducer (40), a subnode to which reduce tasks are allocated. The method comprises: an SQL query analyzing step (S10) of receiving the data from the web server (10) and determining, when an SQL query is issued, which data attributes are to be delivered to the mapper (30) and the reducer (40); a data dividing step (S20) of extracting from the input data only the attributes identified in the SQL query analyzing step (S10) and dividing the extracted data for the mapper (30) and the reducer (40); a map step (S30) of transmitting the divided data to the mapper (30) and outputting only the identification (ID) of each record that satisfies the given SQL conditions; and a reduce step (S40) in which the reducer (40) uses the record IDs output by the mapper (30) to receive the corresponding record information from a distributed file system (DFS) and then performs aggregation. According to the present invention, a large amount of data generated by network services can be processed in a short period of time.

Description

SQL QUERY PROCESSING METHOD USING MAPREDUCE

The present invention relates to an SQL query processing method using MapReduce and, more particularly, to a method for improving big data processing performance in such a method.

As the Internet has developed, enormous amounts of data are generated and circulated online every day. For many companies, especially search engine companies and web portals, the ability to collect and accumulate as much of this data as possible, and to process it to extract meaningful information as quickly as possible, has become a source of competitiveness.

To this end, many companies are researching technologies for distributed management of large data sets and distributed parallel processing of jobs on large clusters built at low cost. Among these technologies, the MapReduce model is attracting attention as a representative distributed parallel processing method.

The MapReduce model is a distributed parallel processing programming model proposed by Google to support distributed parallel computation over large volumes of data stored on clusters of many low-cost nodes. A job written by a user in the MapReduce model consists of two stages that run in sequence: a Map stage centered on a user-written Map function, and a Reduce stage centered on a user-written Reduce function. Within each stage, the work is replicated as multiple tasks across multiple nodes and executed in parallel. The Map stage basically extracts key/value pairs from the input data, and the Reduce stage applies business logic to the key/value pairs extracted in the Map stage to produce the desired final key/value pairs.
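The two-stage model described above can be illustrated with a minimal single-process sketch. It is not Google's implementation or any framework's API: the names run_mapreduce, count_words_map, and sum_reduce are hypothetical, and the distribution of tasks across nodes is deliberately left out so that only the key/value flow is visible.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_mapreduce(
    records: Iterable[Tuple[object, object]],
    map_fn: Callable[[object, object], Iterator[Tuple[object, object]]],
    reduce_fn: Callable[[object, List[object]], Iterator[Tuple[object, object]]],
) -> Dict[object, object]:
    """Run a Map stage, group by key, then run a Reduce stage (one process)."""
    # Map stage: extract intermediate key/value pairs from each input record.
    groups: Dict[object, List[object]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)

    # Reduce stage: apply the user's logic once per intermediate key.
    results: Dict[object, object] = {}
    for out_key, values in groups.items():
        for final_key, final_value in reduce_fn(out_key, values):
            results[final_key] = final_value
    return results


def count_words_map(_line_no, line):
    """Hypothetical Map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1


def sum_reduce(word, counts):
    """Hypothetical Reduce function: sum the counts collected for one word."""
    yield word, sum(counts)


word_counts = run_mapreduce(enumerate(["a b a", "b c"]), count_words_map, sum_reduce)
```

A real framework would run many map and reduce tasks on different nodes; the sketch keeps everything in one process to show only how key/value pairs move between the two stages.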

However, the MapReduce model was designed to process large volumes of one-off data (e.g., log data), and it reads and processes the data assigned to each Map function from beginning to end. Scanning the entire input in this way every time degrades job processing performance.

To address this problem, there is Korean Registered Patent No. 10-1331383 (Data Processing Method and Apparatus; hereinafter, the "prior art").

The prior art is a method of processing data in the MapReduce manner, comprising: storing input data; checking whether a stored index file exists for the input data; if the index file does not exist, generating an index file for the input data; if the index file exists, using the stored index file to select from the input data only the specific data that satisfies a search condition on the index; and processing the input data or the specific data in the MapReduce manner. The index file generation step comprises scanning the input data to generate an index and generating an index file that includes the generated index.

However, even with the prior art, some of the input data may not be used in the map or reduce tasks, so unnecessary data transfer is not prevented and computer performance is not improved.

Korean Registered Patent No. 10-1331383 (Data Processing Method and Apparatus, registered November 13, 2013)

The present invention was conceived to solve the above problem. Its object is to improve data processing performance by reducing network and local I/O: when an SQL query is given, the method first determines which data the map and reduce tasks need, and then extracts and selectively transmits only that data.

To achieve this object, the present invention comprises: an SQL query analysis step (S10) of receiving the data of the web server (10) and, when an SQL query is issued, determining which data attributes must be delivered to the mapper (30) and the reducer (40); a data dividing step (S20) of extracting from the input data only the attributes identified in the SQL query analysis step (S10) and dividing the extracted data for the mapper (30) and the reducer (40); a map step (S30) of transmitting the divided data to the mapper (30) and outputting only the IDs of the records that satisfy the given SQL condition; and a reduce step (S40) in which the reducer (40) receives from the DFS the record information for the record IDs output by the mapper (30) and then performs aggregation.

In addition, the Map and Reduce functions used by the mapper (30) and the reducer (40) are defined to take a key-value pair as input and output a key-value pair of another form, and the Reduce function is called once per aggregation key and produces the final aggregation result.

In addition, the map step (S30) comprises: a data transfer step (S31) of passing the record ID and the record information as the input key-value pair; a parsing step (S32) of splitting the received text-format data into individual attribute values and extracting them; and a data output determination step (S33) of determining, from the attribute values extracted in the parsing step (S32), whether the Map function should output a given record.

In addition, the reduce step (S40) comprises: a selected-record checking step (S41) of determining which records were selected by the Map function; a step (S42) of reading the record for each ID from the DFS using that ID; a step (S43) of calculating the required aggregation result over the records fetched from the DFS; and an output step (S44) of finally outputting the calculated aggregation result together with the key value.

Accordingly, the SQL query processing method using MapReduce shortens SQL query processing time in the MapReduce framework, so that large amounts of data generated by social network services such as Facebook and Twitter can be processed more quickly.

FIG. 1 is a conceptual diagram showing a conventional data processing method.
FIG. 2 is a conceptual diagram showing the data processing method of the present invention.
FIG. 3 is a configuration diagram of the data processing system of the present invention.
FIG. 4 is a flowchart of the SQL query processing method of the present invention.
FIG. 5 is a detailed flowchart of the map step (S30) in FIG. 4.
FIG. 6 is a detailed flowchart of the reduce step (S40) in FIG. 4.
FIG. 7 is a table of log data for web page advertisement clicks according to an embodiment of the present invention.

Hereinafter, the present invention will be described in detail with reference to the drawings.

FIG. 1 is a conceptual diagram showing a conventional data processing method, and FIG. 2 is a conceptual diagram showing the data processing method of the present invention.

Referring to FIGS. 1 and 2, to process a given job in MapReduce, the conventional data processing method first places the input file in a distributed file system (hereinafter, "DFS") and then divides the file into several logical units called splits.

Once the input splits are ready, the master node assigns a map or reduce task to each subnode (slave node) and has the job carried out in the steps below. For convenience, a node assigned a map task is called a mapper (30), and a node assigned a reduce task is called a reducer (40).

(1) Map phase (S30): Each split is sent to one mapper (30). Each mapper (30) performs a selection/filtering operation on each record in the split. The results are stored on each mapper's local disk for checkpointing.

(2) Reduce phase (S40): When all mappers (30) have finished, the reducer (40) reads the data stored on the mappers' disks and performs an aggregation operation. The final aggregation result is stored back in the DFS.

As described above, the conventional data processing method has the problem that network and local I/O occur continuously in order to carry out a single MapReduce job.

The conventional method is needed to improve failure recovery, but at the same time it greatly reduces data processing efficiency.

To solve this problem, when data is analyzed with an SQL query in the MapReduce framework, the present invention selectively extracts and transmits only the data needed by the map and reduce tasks, thereby improving data processing performance.

FIG. 3 is a configuration diagram of the data processing system of the present invention.

Referring to FIG. 3, the system consists of: a web server (10) that provides the data for big data processing; a master node (20) that divides the data of the web server (10) and moves it to multiple distributed nodes so that a given job can be processed in parallel; a mapper (30), a subnode to which the master node (20) assigns a map task; and a reducer (40), a subnode to which a reduce task is assigned.

FIG. 4 is a flowchart of the SQL query processing method of the present invention, FIG. 5 is a detailed flowchart of the map step (S30) in FIG. 4, and FIG. 6 is a detailed flowchart of the reduce step (S40) in FIG. 4.

Referring to FIGS. 4 to 6, the present invention is a method of processing big data in the data processing system of FIG. 3, comprising: an SQL query analysis step (S10) of receiving the data of the web server (10) and, when an SQL query is issued, determining which data attributes must be delivered to the mapper (30) and the reducer (40); a data dividing step (S20) of extracting from the input data only the attributes identified in the SQL query analysis step (S10) and dividing the extracted data for the mapper (30) and the reducer (40); a map step (S30) of transmitting the divided data to the mapper (30) and outputting only the IDs of the records that satisfy the given SQL condition; and a reduce step (S40) in which the reducer (40) receives from the DFS the record information for the record IDs output by the mapper (30) and then performs aggregation.
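As a rough illustration of the data dividing step (S20), the following sketch keeps only the attributes identified in step S10 and cuts the projected records into fixed-size splits. The function name project_and_split, the in-memory row layout, the split size, and all sample values are assumptions made for this example; the specification does not prescribe how the projection or the splitting is implemented.

```python
from typing import Dict, List, Sequence, Tuple

# (record ID, attribute name -> value); an assumed in-memory layout.
Row = Tuple[int, Dict[str, str]]


def project_and_split(
    rows: Sequence[Row], attributes: Sequence[str], split_size: int
) -> List[List[Row]]:
    """Keep only the needed attributes, then cut the rows into splits."""
    # Projection: drop every attribute that the query analysis did not select.
    projected = [
        (row_id, {name: row[name] for name in attributes}) for row_id, row in rows
    ]
    # Division: consecutive chunks of at most split_size rows, one per mapper.
    return [
        projected[start:start + split_size]
        for start in range(0, len(projected), split_size)
    ]


# Hypothetical sample rows using the ADClick attribute names.
rows = [
    (1, {"ADUrl": "http://example.com/a", "ADName": "A", "ADCategory": "Fashion",
         "UserID": "u1", "Nationality": "Korea"}),
    (2, {"ADUrl": "http://example.com/b", "ADName": "B", "ADCategory": "Game",
         "UserID": "u2", "Nationality": "USA"}),
    (3, {"ADUrl": "http://example.com/c", "ADName": "C", "ADCategory": "Game",
         "UserID": "u3", "Nationality": "Korea"}),
]

splits = project_and_split(rows, ["ADCategory", "Nationality"], split_size=2)
```

With three input rows and a split size of two, the sketch yields two splits, and each projected row carries only the two selected attributes plus its record ID.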

Here, the Map and Reduce functions are defined to take a key-value pair as input and output a key-value pair of another form.

The Reduce function is called once per aggregation key and produces the final aggregation result.

Here, a key-value pair is a pair consisting of a key and a value: the key is data in text format, and the value is the count (total) of the key data.

In addition, the map step (S30) comprises: a data transfer step (S31) of passing the record ID and the record information as the input key-value pair; a parsing step (S32) of splitting the received text-format data into individual attribute values and extracting them; and a data output determination step (S33) of determining, based on the attribute values extracted in the parsing step (S32), whether the Map function should output a given record.

In addition, the reduce step (S40) comprises: a selected-record checking step (S41) of determining which records were selected by the Map function; a step (S42) of reading the record for each ID from the DFS using that ID; a step (S43) of calculating the required aggregation result over the records fetched from the DFS; and an output step (S44) of finally outputting the calculated aggregation result together with the key value.

FIG. 7 is a table of log data for web page advertisement clicks according to an embodiment of the present invention.

As one embodiment, the SQL query processing method in MapReduce of the present invention is described using ADClick, a table of log data for web page advertisement clicks.

Suppose we want to count the clicks made by Korean users in ADClick, with the click count aggregated per advertisement category (Category), such as fashion or home appliances.

The SQL query for this is as follows.

Q1. SELECT ADCategory, COUNT(*)
    FROM ADClick
    WHERE Nationality = "Korea"
    GROUP BY ADCategory

Given such a query, it must first be determined which data attributes are to be delivered to the mapper (30) and the reducer (40). The attributes delivered to the mapper (30) are the key used for aggregation and the attributes used for the record selection operation.

Here, 'ADUrl', 'ADName', 'ADCategory', 'UserID', and 'Nationality' referred to in connection with Q1 are the record identifiers (attribute names) used in ADClick and are not limited to any particular meaning.

The aggregation key is used internally to build the ID list of the selected records, and it corresponds to the attribute defined in the GROUP BY clause. The aggregation result is therefore output per attribute defined in the GROUP BY clause; in Q1 the aggregation key is ADCategory, and a click count is calculated for each ADCategory. The selection attributes correspond to the attributes defined in the WHERE clause of the query; in Q1, Nationality is used as the selection attribute.

Therefore, when Q1 is given, only the ADCategory and Nationality attribute values are selectively delivered to the mapper (30).

The attributes delivered to the reducer (40) are the key used for aggregation and the attributes that are finally output. In Q1, ADCategory is used as the aggregation key, and the output attributes correspond to the attributes defined in the SELECT clause of the query. Because Q1 specifies no additional output attribute, only ADCategory, among the attributes defined in FIG. 7, is delivered to the reducer (40).
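The attribute selection rules above (GROUP BY and WHERE attributes for the mapper; GROUP BY and SELECT output attributes for the reducer) can be sketched as follows. This is only an illustration under strong assumptions: analyze_query is a hypothetical helper, and its regular expression understands nothing beyond a single-table SELECT with an optional WHERE of simple comparisons and an optional GROUP BY. The specification does not say how the query analysis step (S10) is actually implemented.

```python
import re
from typing import Dict, List

# Hypothetical, deliberately narrow pattern:
# SELECT ... FROM table [WHERE ...] [GROUP BY ...]
_QUERY = re.compile(
    r"SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<table>\w+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?"
    r"(?:\s+GROUP\s+BY\s+(?P<group>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def _unique(names: List[str]) -> List[str]:
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(names))


def analyze_query(sql: str) -> Dict[str, List[str]]:
    """Decide which attributes the mapper and the reducer need for a query."""
    match = _QUERY.match(sql.strip())
    if match is None:
        raise ValueError("unsupported query shape")

    select_items = [item.strip() for item in match.group("select").split(",")]
    # Aggregate calls such as COUNT(*) are computed by the reducer, not shipped.
    output_attrs = [item for item in select_items if "(" not in item]
    group_keys = [
        key.strip() for key in (match.group("group") or "").split(",") if key.strip()
    ]
    # Attribute names on the left-hand side of comparisons in the WHERE clause.
    where_attrs = re.findall(
        r"(\w+)\s*(?:<=|>=|<>|!=|=|<|>)", match.group("where") or ""
    )
    return {
        "mapper_attributes": _unique(group_keys + where_attrs),
        "reducer_attributes": _unique(group_keys + output_attrs),
    }


Q1 = """SELECT ADCategory, COUNT(*)
FROM ADClick
WHERE Nationality = "Korea"
GROUP BY ADCategory"""

plan = analyze_query(Q1)
```

For Q1 the sketch selects ADCategory and Nationality for the mapper and only ADCategory for the reducer, matching the description. A production system would use a real SQL parser rather than a regular expression.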

Once the attributes to be delivered have been determined, a MapReduce program is generated to execute the given query. The MapReduce program consists of two functions: a Map function and a Reduce function.

The Map function is called once per record and decides whether to output the given record to the reducer (40); the Reduce function is called once per aggregation key and produces the final aggregation result.

Both functions are thus defined to take a key-value pair as input and output a key-value pair of another form.

C1. Function Map(RowID, Record)
      Parse Record into (ADCategory, Nationality)
      If Nationality = "Korea"
        Output (ADCategory, RowID)
      End If
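The pseudocode C1 could be written in Python roughly as follows. The comma-separated record format and the sample records are assumptions for illustration only (FIG. 7 is not reproduced in this text); map_q1 is a hypothetical name, and the selection condition is the WHERE clause of Q1.

```python
from typing import Dict, Iterator, Tuple


def map_q1(row_id: int, record: str) -> Iterator[Tuple[str, int]]:
    """Map function for Q1: emit (ADCategory, RowID) for Korean users only."""
    # Parsing: the projected record is assumed to be "ADCategory,Nationality".
    ad_category, nationality = record.split(",")
    # Selection: the WHERE clause of Q1 is applied as-is.
    if nationality == "Korea":
        # Only the record ID is emitted, not the record itself.
        yield ad_category, row_id


# Hypothetical projected input, keyed by RowID.
projected_records: Dict[int, str] = {
    1: "Fashion,Korea",
    2: "Game,USA",
    3: "Game,Korea",
    4: "Fashion,Japan",
    5: "Electronics,Korea",
    6: "Electronics,Korea",
}

map_output = [
    pair
    for row_id, record in projected_records.items()
    for pair in map_q1(row_id, record)
]
```

Writing the function as a generator lets it emit zero or one pair per record, which mirrors the "output or skip" decision in C1.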

C1 shows the implementation of the Map function for processing query Q1. For convenience, the implementation is written in pseudocode. The record ID and the record information are passed as the input key-value pair of the Map function; in C1 these two parameters are written RowID and Record. RowID indicates the logical location at which the given Record is stored in the distributed file system (DFS), so the record can be accessed in the DFS at any time using its RowID.

Record is the record holding the actual values, and parsing splits it into individual attribute values.

Parsing therefore extracts the ADCategory and Nationality attribute values; the remaining attributes are not extracted because they were never delivered to the mapper (30).

After parsing, the Map function decides whether to output the given record, using the condition defined in the WHERE clause of the given query as the selection condition. The per-category click count in Q1 must be calculated only over records that satisfy the WHERE clause, so a key-value pair is output only when the selection condition is met.

The key of this key-value pair is ADCategory, and the value is RowID.

Passing the RowID rather than the Record to the reducer (40) reduces network I/O.

Once the Map function has been called for all input records, the MapReduce framework internally groups the output records by the value of the output key, ADCategory.

The Reduce function is then called so that a count can be calculated for each ADCategory value. It is called once per ADCategory value and receives a key-value pair as its input parameters.

The key is the ADCategory value, and the value is the group of RowIDs that share that ADCategory value.

For example, in the log data table for web page advertisement clicks in FIG. 7, the records that satisfy the WHERE condition are the 1st, 3rd, 5th, and 6th records. The Map function therefore outputs four records: <"Fashion", 1>, <"Game", 3>, <"Electronics", 5>, and <"Electronics", 6>.

When the Map function has finished, the MapReduce framework groups the output records by ADCategory value. In this case three groups are created: <"Fashion", 1>, <"Game", 3>, and <"Electronics", (5, 6)>.

The Reduce function is then called for each group.
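The grouping performed by the framework between the two functions can be sketched with the four Map outputs from the example. In a real MapReduce framework this grouping happens internally; group_by_key is a hypothetical stand-in that shows only the resulting structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# The four (ADCategory, RowID) pairs output by the Map function in the example.
map_output = [("Fashion", 1), ("Game", 3), ("Electronics", 5), ("Electronics", 6)]


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect the RowIDs that share the same ADCategory value."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for ad_category, row_id in pairs:
        groups[ad_category].append(row_id)
    return dict(groups)


groups = group_by_key(map_output)
```

Each key in groups then corresponds to one call of the Reduce function, with the list of RowIDs as its value.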

C2. Function Reduce(Key, RowIDs)
      Count := 0
      RecList := getRecordsFromDFS(RowIDs)
      For each record in RecList
        Count := Count + 1
      End For
      Output (Key, Count)

The Reduce function consults RowIDs to find out which records were selected by the Map function. Because the Map function outputs the IDs of the records that satisfy the WHERE clause rather than the records themselves, the Reduce function must use those IDs to read the corresponding records from the distributed file system (DFS).

The getRecordsFromDFS(RowIDs) function is written for this purpose. After the records are fetched with getRecordsFromDFS(RowIDs), the required aggregation result is calculated over them. Q1 is defined to calculate a record count, so the desired result is obtained by defining a Count variable, initializing it to 0, and incrementing it by 1 for each record. The resulting aggregate is finally output together with the ADCategory key value.
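A minimal Python rendering of C2 is sketched below. The dictionary DFS is a hypothetical in-memory stand-in for the distributed file system, and get_records_from_dfs only mimics the role of getRecordsFromDFS; the specification does not describe how that function accesses the DFS, and the sample records are invented for the example.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical stand-in for the DFS: RowID -> stored record text.
DFS: Dict[int, str] = {
    1: "Fashion",
    3: "Game",
    5: "Electronics",
    6: "Electronics",
}


def get_records_from_dfs(row_ids: Sequence[int]) -> List[str]:
    """Fetch the records for the given RowIDs (plays the role of getRecordsFromDFS)."""
    return [DFS[row_id] for row_id in row_ids]


def reduce_q1(key: str, row_ids: Sequence[int]) -> Tuple[str, int]:
    """Reduce function for Q1: count the selected records of one ADCategory."""
    count = 0
    for _record in get_records_from_dfs(row_ids):
        count += 1
    # The aggregate is output together with the ADCategory key value.
    return key, count


groups = {"Fashion": [1], "Game": [3], "Electronics": [5, 6]}
results = [reduce_q1(key, row_ids) for key, row_ids in groups.items()]
```

For a plain COUNT(*) the fetched records are only counted; the fetch matters more when the query outputs or aggregates attributes that were not shipped through the mapper.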

10: Web server
20: Master node
30: Mapper
40: Reducer

Claims

An SQL query processing method using MapReduce, for processing big data in a data processing system comprising a web server (10) that provides data for big data processing, a master node (20) that divides the data of the web server (10) and moves it to a plurality of distributed nodes so that a given job can be processed in parallel, a mapper (30) that is a subnode assigned a map task by the master node (20), and a reducer (40) that is a subnode assigned a reduce task, the method comprising:
an SQL query analysis step (S10) of receiving the data of the web server (10) and, when an SQL query is issued, determining which data attributes are to be delivered to the mapper (30) and the reducer (40);
a data dividing step (S20) of extracting from the input data only the attributes identified in the SQL query analysis step (S10) and then dividing the extracted data for the mapper (30) and the reducer (40);
a map step (S30) of transmitting the divided data to the mapper (30) and outputting only the IDs of the records that satisfy the given SQL condition; and a reduce step (S40) in which the reducer (40) receives from the DFS the record information for the record IDs output by the mapper (30) and then performs an aggregation process.

The method according to claim 1,
wherein the Map and Reduce functions are defined to take a key-value pair as input and to output a key-value pair of another form.

The method according to claim 1,
wherein the Reduce function is called once per aggregation key and generates a final aggregation result.

The method according to claim 1,
wherein the map step (S30) comprises:
a data transfer step (S31) of passing the record ID and the record information as the input key-value pair;
a parsing step (S32) of splitting the received text-format data into a plurality of attribute values and extracting them; and
a data output determination step (S33) of determining, from the attribute values extracted in the parsing step (S32), whether the Map function outputs a specific record.

The method according to claim 1,
wherein the reduce step (S40) comprises:
a selected-record checking step (S41) of determining which records have been selected by the Map function;
a step (S42) of reading the record for each ID from the DFS system (20) using the corresponding ID;
a step (S43) of fetching the records from the DFS system (20) and calculating the required aggregation result over them; and
an output step (S44) of finally outputting the calculated aggregation result together with the key value.