KR101640231B1

KR101640231B1 - Cloud Driving Method for supporting auto-scaled Hadoop Distributed Parallel Processing System

Info

Publication number: KR101640231B1
Application number: KR1020150021461A
Authority: KR
Inventors: 송동호; 인연진; 김영필; 조성완; 이승
Original assignee: 소프트온넷(주)
Priority date: 2015-02-12
Filing date: 2015-02-12
Publication date: 2016-07-18

Abstract

The present invention relates to a cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system. The method includes the steps of: loading, by a client, an input file and generating input splits; analyzing, by a master node, the available resources of a slave node and receiving VM resource pool information from a cloud management server to allocate a job to the slave node and VM nodes; receiving, by the slave node, the generated input splits, allocating map tasks to the task trackers of the slave node and the VM nodes, respectively, and transmitting the generated intermediate output to reduce tasks; returning, when an allocated map task is finished, the resources of the corresponding VM node to the VM resource pool; executing, by the task trackers of the slave node and the VM nodes, the reduce tasks with the intermediate output as input, and storing the result data in the HDFS of the slave node; and returning, when an allocated reduce task is finished, the resources of the corresponding VM node to the VM resource pool. According to the present invention, the computing resources required by data that must be processed in real time are analyzed and computing resources are increased or decreased in real time, so that they are used efficiently.

Description

[0001] Cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system

The present invention relates to a cloud driving method, and more particularly to a cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system that, when big data is processed using a Hadoop system, analyzes the input data in real time and automatically increases or decreases the required computing resources according to the processing time expected for the size and type of the data, thereby improving the efficiency of MapReduce processing.

Today, the explosion of unstructured data that accompanies the spread of social media and multimedia (that is, data of various types such as text messages, voice, video, and location, as opposed to structured data such as a company's sales figures or an individual's age and gender) has given rise to big data: an enormous volume of data on a scale so large that it is difficult to store, manage, and analyze by conventional means.

Smartphones, wearable devices, and the like have enormously increased the amount of data that individuals generate, and by analyzing this big data companies can now identify not only consumers' purchasing patterns but also their relationships and interests, and even their lifestyle habits.

Hadoop was developed as one way of analyzing such big data. Hadoop is an open-source framework that implements the Hadoop Distributed File System (HDFS), which stands in a complementary relationship to the Google File System, and MapReduce, a technique that distributes data for processing and then merges the results into one. The core of Hadoop consists of two parts: a storage part (HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks (for example, 64 MB or 128 MB) and stores these blocks distributed across the nodes in a cluster. To process the stored data, Hadoop MapReduce ships the executable code to the nodes that hold each piece of data, and these nodes process the data simultaneously in parallel. Because the data and the code run on the same node, this approach processes data quickly and efficiently.
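The block-based storage described above can be illustrated with a minimal Python sketch. It only shows how a file is cut into fixed-size blocks and spread over nodes; the function names and the round-robin placement are assumptions made for illustration and are not how HDFS actually chooses block locations.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=128 * MB):
    """Return (offset, length) pairs that cover a file in fixed-size blocks."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_blocks(blocks, nodes):
    """Spread blocks over nodes round-robin (a simplified, hypothetical policy)."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(block)
    return placement

blocks = split_into_blocks(300 * MB)
placement = place_blocks(blocks, ["slave-a", "slave-b"])
```

With a 300 MB file and 128 MB blocks, this yields two full blocks and one 44 MB tail block, which is the unit at which map tasks are later scheduled close to the data.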

However, the MapReduce used in conventional Hadoop systems is not effective when the node host servers differ in performance, and it does not provide efficient multi-job scheduling when several MapReduce jobs run simultaneously in one cluster.

In addition, the data analysis method used in conventional Hadoop systems processes data only within the computing resources that were allocated to the analysis when MapReduce started. Consequently, where the amount of data grows exponentially, an analysis that has already started cannot use idle computing resources even if they exist.

Meanwhile, Korean Patent Laid-Open Publication No. 10-2012-0041907 (Prior Art Document 1) discloses a "MapReduce-based method and system for distributed computation of large-volume data" which, to guarantee parallel processing of large-scale data such as genomic computation using data mining, performs distributed computation on MapReduce without trusting the mappers and reducers, using homomorphic encryption (HE) to obtain results with consistent accuracy and ultimately to guarantee end-to-end confidentiality. Korean Patent Laid-Open Publication No. 10-2014-0080795 (Prior Art Document 2) discloses a "load balancing method and system for Hadoop MapReduce in a virtualized environment" in which each of a plurality of virtual machines receives from the virtualization platform the task completion time of each of a plurality of slots; each virtual machine computes the time remaining from the present until all slots in that virtual machine complete their tasks and passes it to the master node; and the master node computes the average of the received remaining times and adjusts the CPU allocation of each virtual machine according to that average so that the task execution times of the virtual machines become equal. The aim is to improve performance by having all MapReduce tasks in a virtualized cluster processed in a uniform amount of time.

However, Prior Art Document 1 is directed at the problem that privacy may be violated while data is being mapped and reduced, even though MapReduce, the framework used as a distributed processing technique in cloud computing environments to provide storage and parallel processing of large-volume data, can be applied widely to data mining, genomic computation, and the like. Prior Art Document 2 is directed at the serious performance degradation that arises because, when Hadoop MapReduce distributes tasks across a cluster, differences in data locality and in the configured cluster environment cause tasks to finish in waves rather than simultaneously on every node, while a MapReduce job ends only when all of its tasks have completed. Prior Art Documents 1 and 2 therefore still have the problems described above. That is, they do not provide efficient multi-job scheduling when several MapReduce jobs run simultaneously in one cluster, and where the amount of data grows exponentially, an analysis that has already started cannot use idle computing resources even if they exist.

Korean Patent Laid-Open Publication No. 10-2012-0041907 (published May 3, 2012)
Korean Patent Laid-Open Publication No. 10-2014-0080795 (published July 1, 2014)

The present invention was devised in view of the above, and its object is to provide a cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system that analyzes the computing resources required by data that must be processed in real time and increases or decreases computing resources in real time, so that computing resources can be used efficiently.

To achieve the above object, the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention supports a Hadoop system for big data processing. Based on a cloud system comprising a client, a master node, a slave node, a cloud management server, and a cloud host server, the method analyzes the workload of a job requested by the client in real time, at compile time or at run time, calculates the spare virtual machines (VMs) available from the virtual machine (VM) resource pool of the cloud system, and then automatically increases or decreases the number of VM nodes and reassigns and runs the job, thereby reducing the execution time of the job. The method is characterized by comprising the steps of:
a) loading, by the client, an input file and generating input splits;
b) analyzing, by the master node, the available resources of the slave node, receiving VM resource pool information from the cloud management server, and assigning a job to the slave node and the VM nodes;
c) receiving, by the slave node, the generated input splits, assigning map tasks to the task trackers of the slave node and the VM nodes, respectively, and delivering the generated intermediate output to reduce tasks;
d) when an assigned map task finishes, returning the resources of the corresponding VM node to the VM resource pool through the cloud management server;
e) executing, by the task trackers of the slave node and the VM nodes, the reduce tasks with the intermediate output as input, and storing the result data in the HDFS of the slave node; and
f) when an assigned reduce task finishes, returning the resources of the corresponding VM node to the VM resource pool through the cloud management server.
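The borrow-and-return pattern in steps d) and f) can be sketched as follows. This is a minimal Python illustration under stated assumptions: `VMResourcePool`, `leased_vm`, and `run_phase` are hypothetical names, and the pool is a plain in-memory list rather than anything the patent specifies for the cloud management server.

```python
from contextlib import contextmanager

class VMResourcePool:
    """Hypothetical stand-in for the VM resource pool kept by the cloud management server."""

    def __init__(self, vm_ids):
        self.free = list(vm_ids)
        self.in_use = set()

    def acquire(self):
        """Hand out a spare VM, or None when the pool is exhausted."""
        if not self.free:
            return None
        vm = self.free.pop()
        self.in_use.add(vm)
        return vm

    def release(self, vm):
        """Return a VM's resources to the pool."""
        self.in_use.discard(vm)
        self.free.append(vm)

@contextmanager
def leased_vm(pool):
    """Lease a VM for one task and return it even if the task raises."""
    vm = pool.acquire()
    try:
        yield vm
    finally:
        if vm is not None:
            pool.release(vm)

def run_phase(pool, tasks):
    """Run each task (map or reduce) on a leased VM; vm is None if no VM was spare."""
    results = []
    for task in tasks:
        with leased_vm(pool) as vm:
            results.append((vm, task()))
    return results
```

Calling `run_phase` once for the map tasks and once for the reduce tasks mirrors the two return points: after either phase, every VM that was borrowed is back in `pool.free`.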

Here, the generating of input splits in step a) may comprise:
a-1) reading, by the job client of the client, the whole of the input file stored in the HDFS of the slave node, and analyzing the file with an analysis method selected according to the file format;
a-2) writing input split data that specifies in detail how the input file, which is stored in chunk units, is to be divided according to the file format; and
a-3) generating, from the input split data and according to the file format, input split information consisting of key/value pairs that the map tasks can easily access.
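For a line-oriented text format, split generation of this kind might look like the sketch below. It assumes that splits are cut at line boundaries and that the key is the byte offset of each line and the value is the line itself; the function names `make_splits` and `read_records` are hypothetical, and other file formats would need their own analysis step.

```python
def make_splits(data, max_split_bytes):
    """Describe how to divide line-oriented data as (start, length) pairs cut at line boundaries."""
    splits = []
    start = 0   # first byte of the split being built
    pos = 0     # first byte of the current line
    for line in data.splitlines(keepends=True):
        # Close the current split if adding this line would exceed the limit.
        if pos > start and pos + len(line) - start > max_split_bytes:
            splits.append((start, pos - start))
            start = pos
        pos += len(line)
    if pos > start:
        splits.append((start, pos - start))
    return splits

def read_records(data, split):
    """Yield (key, value) pairs for one split: key = byte offset of the line, value = its text."""
    start, length = split
    offset = start
    for line in data[start:start + length].splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode("utf-8")
        offset += len(line)

data = b"aa\nbbb\nc\n"
splits = make_splits(data, max_split_bytes=5)
records = [list(read_records(data, s)) for s in splits]
```

Each split descriptor is small and self-contained, so it can be handed to whichever slave node or VM node ends up running the corresponding map task.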

In addition, the assigning of a job to the slave node and the VM nodes in step b) may comprise:
b-1) analyzing, by the job tracker of the master node, information on which slave node's HDFS physically holds each chunk of the input file together with the available resources of the slave node, and calculating the number of map tasks and reduce tasks that can be assigned to the slave node;
b-2) receiving, from the resource pool management unit of the cloud management server, VM resource pool information from which VM nodes can be allocated, and recalculating the number of usable VM nodes and the map tasks and reduce tasks that can be assigned to the VM nodes;
b-3) analyzing, from a data or processing standpoint, the size of the jobs that are registered in the job queue of the master node but not yet assigned, and rearranging the jobs in the queue using the information on the available VM resource pool;
b-4) assigning the unassigned map tasks, on the basis of the calculated information, to the task trackers of the slave node and the VM nodes, statically at compile time and dynamically at run time, and executing them; and
b-5) periodically checking whether the map tasks have completed and, if they have not, repeating the process of analyzing the available resources of the slave node, receiving the VM resource pool information, and assigning the job to the slave node and the VM nodes.
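The capacity calculation in b-1) and b-2) and the assignment in b-4) can be sketched in Python as below. The slot model, the fixed `slots_per_vm` value, and the names `Node` and `plan_map_assignment` are assumptions for illustration only; the patent does not state how the number of VM nodes is derived.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_map_slots: int
    is_vm: bool = False

def plan_map_assignment(pending_tasks, slaves, spare_vms, slots_per_vm=2):
    """Fill slave slots first, then add only as many VM nodes as the shortfall needs.

    Returns (assignment, vms_used, still_pending).
    """
    capacity = sum(node.free_map_slots for node in slaves)
    shortfall = max(0, len(pending_tasks) - capacity)
    # Ceiling division: VM nodes needed to cover the shortfall, capped by the pool.
    vms_needed = min(len(spare_vms), -(-shortfall // slots_per_vm))
    vms_used = spare_vms[:vms_needed]
    nodes = list(slaves) + [Node(vm, slots_per_vm, is_vm=True) for vm in vms_used]

    assignment = {}
    queue = list(pending_tasks)
    for node in nodes:
        take, queue = queue[:node.free_map_slots], queue[node.free_map_slots:]
        if take:
            assignment[node.name] = take
    # Tasks left in the queue wait for the next periodic round (compare b-5).
    return assignment, vms_used, queue
```

For example, seven pending map tasks, two slaves with two free slots each, and three spare VMs lead to two VM nodes being added; the third VM stays in the pool for other jobs.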

In addition, the delivering of the generated intermediate output to reduce tasks in step c) may comprise:
c-1) executing, by the task trackers of the slave node and the VM nodes, the map tasks, which apply the user-defined operation in accordance with the key/value input format, to generate intermediate output;
c-2) exchanging and merging, by the partition management units of each slave node and VM node, the intermediate output of the running map tasks according to key; and
c-3) distributing the intermediate output generated by the map tasks to the reduce tasks running on the task trackers of the slave node and the VM nodes.
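Exchanging intermediate output by key, as in c-2) and c-3), is commonly done by hashing each key to a reducer. The Python sketch below assumes string keys and uses a CRC32-based partition function so that every node computes the same partition for the same key; `partition_for` and `shuffle` are hypothetical names, and the patent does not specify the partitioning rule.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_reducers):
    """Map a key to a reducer index with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs from all map tasks into one dict per reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)][key].append(value)
    return partitions

# Intermediate output of two map tasks (word-count style pairs).
map_outputs = [[("cloud", 1), ("hadoop", 1)], [("cloud", 1)]]
partitions = shuffle(map_outputs, num_reducers=2)
```

Because all values for a given key land in exactly one partition, each reduce task can work independently of the others regardless of whether it runs on a slave node or a VM node.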

In addition, the storing of the result data in the HDFS of the slave node in step e) may comprise:
e-1) sorting and merging the intermediate output received in distributed form from the partition management units of the slave node and the VM nodes; and
e-2) executing, by the task trackers of the slave node and the VM nodes, the reduce tasks with the merged intermediate output as input, and storing the output file in the HDFS of the slave node.
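The sort-and-merge step followed by the reduce call can be sketched with Python's standard library. The example assumes that each node delivers its intermediate output already sorted by key and uses `sum` as the user-defined reducer; `merge_sorted_runs` and `reduce_stream` are hypothetical names, and writing the result to HDFS is left out.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_sorted_runs(runs):
    """Merge key-sorted (key, value) runs from several nodes into one sorted stream."""
    return heapq.merge(*runs, key=itemgetter(0))

def reduce_stream(sorted_pairs, reducer):
    """Call reducer once per key on the values grouped under that key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, reducer(value for _, value in group)

# Two runs as they might arrive from a slave node and a VM node.
runs = [
    sorted([("hadoop", 1), ("cloud", 2)]),
    sorted([("cloud", 3), ("vm", 1)]),
]
result = list(reduce_stream(merge_sorted_runs(runs), sum))
```

Merging sorted runs keeps memory use low because the reducer sees one key group at a time instead of the whole intermediate data set.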

According to the present invention as described, the computing resources required by data that must be processed in real time are analyzed and computing resources are increased or decreased in real time, so that computing resources can be used efficiently.

In addition, by configuring a separate job cluster for each MapReduce job through the cloud system, effective multi-job scheduling is provided and MapReduce performance can be improved.

FIG. 1 schematically shows the configuration of the cloud system employed to implement the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention.
FIG. 2 shows the internal configuration of each of the client, the master node, and the slave node of the cloud system shown in FIG. 1.
FIG. 3 shows the internal configuration of the job tracker module built into the master node of the cloud system shown in FIG. 1.
FIG. 4 shows the internal configuration of the task tracker module built into the slave node of the cloud system shown in FIG. 1.
FIG. 5 shows the internal configurations of the cloud management server and the cloud host server of the cloud system shown in FIG. 1.
FIG. 6 shows the flow of data during MapReduce execution in the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention.
FIG. 7 is a flowchart showing the overall execution of the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention.
FIG. 8 is a flowchart showing the process of generating input splits in the cloud driving method according to the present invention.
FIG. 9 is a flowchart showing the process of assigning a job to the slave node and the VM nodes in the cloud driving method according to the present invention.
FIG. 10 is a flowchart showing the process of delivering the generated intermediate output to the reduce tasks in the cloud driving method according to the present invention.
FIG. 11 is a flowchart showing the process of storing the result data in the HDFS of the slave node in the cloud driving method according to the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

Before the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention is described, the cloud system employed to implement the method is described first.

FIG. 1 schematically shows the configuration of the cloud system employed to implement the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention.

Referring to FIG. 1, the cloud system 100 employed to implement the cloud driving method of the present invention comprises a client 110, a master node 120, a slave node 130, a cloud management server 140, and a cloud host server 150.

Here, the master node 120 and the slave node 130 constitute the VM resource pool 102, and the cloud management server 140 and the cloud host server 150 constitute the physical server farm 104. The VM resource pool 102 is a pool of virtualized resources that allocates CPU, memory, HDD, and network resources to the VM nodes 130e to 130g (see FIG. 6) so that they can operate as virtual machines (VMs).

As shown in FIG. 2, the client 110 may comprise a job client 111 and a communication interface 112. The job client 111 analyzes the input file stored in the Hadoop Distributed File System (HDFS) 133 of the slave node 130; stores in the HDFS 133 of the slave node 130 the input splits, a configuration file (for example, a file carrying information such as which of the several slave nodes to use), and a specific file (a kind of code that performs the work, for example a file such as "Map Reduce.jar"); and notifies the job tracker 121 that it is ready to start MapReduce. The communication interface 112 performs data communication with the master node 120 and the slave node 130 over the network 160.

As shown in FIG. 2, the master node 120 may comprise a job tracker 121 that assigns the job requested by the job client 111 to the task trackers of the slave node 130 and the virtual machine nodes as map tasks and reduce tasks; a master node management unit 122 that manages the directory structure of the HDFS 133 (described later) of the slave node 130; and a communication interface 123 that performs data communication with the slave node 130 over the network 160.

Here, as shown in FIG. 3, the job tracker 121 may comprise a job management unit 121a that manages requested jobs in a queue and assigns jobs to the task trackers 131 of the slave node 130 or the VM nodes 130e to 130g (see FIG. 6); a resource analysis unit 121b that analyzes the resources of the slave node 130 and the VM resource pool 102 managed by the cloud management server 140; and a job analysis unit 121c that analyzes the time needed to process the jobs assigned to the task trackers 131.

As shown in FIG. 2, the slave node 130 may comprise a task tracker 131 that manages and executes the map tasks and reduce tasks assigned in priority order by the job management unit 121a of the job tracker 121 of the master node 120; a resource monitor unit 132 that delivers the available resource information of the slave node as a heartbeat to the resource analysis unit 121b of the job tracker 121 of the master node 120; an HDFS 133 that stores the actual data on which MapReduce runs; a slave node management unit 134 that manages the data stored in the HDFS 133; and a communication interface 135 that performs data communication with the master node 120 over the network 160. The slave node 130 may consist of a plurality of slave nodes 130a to 130c.

Here, as shown in FIG. 4, the task tracker 131 may comprise a task management unit 131a that manages the map tasks and reduce tasks assigned by the job management unit 121a of the job tracker 121 of the master node 120, and a partition management unit 131b that converts the intermediate output produced by the map tasks into input data that the reduce tasks can process.

As shown in FIG. 5, the cloud management server 140 may comprise a VM node connection unit 141 that connects to and controls the virtual machine (VM) nodes running on the cloud host server 150; a cluster management unit 142 that manages the creation, execution, and deletion of the VM nodes running on the cloud host server 150; a resource pool management unit 143 that manages the resource pool used by the VM nodes; and a communication interface 144 that performs data communication with the master node 120 over the network 160.

As shown in FIG. 5, the cloud host server 150 may comprise a VM node management unit 151 that manages the VM nodes 130e to 130g (see FIG. 6) activated at the request of the VM node connection unit 141 of the cloud management server 140; a virtualization management unit 152 that provides all the functions needed for virtualization through a hypervisor engine; a cluster agent unit 153 that carries out requests to create, run, and delete VM nodes as required by the cluster management unit 142 of the cloud management server 140; a resource monitor unit 154 that reports the CPU, memory, and disk usage of the cloud host server and the usage of the VM nodes to the resource pool management unit 143 of the cloud management server 140; and a communication interface 155 that performs data communication with the cloud management server 140 over the network 160. The cloud host server 150 may consist of a plurality of cloud host servers 150a to 150c.

Here, the VM nodes 130e to 130g receive CPU, memory, HDD, and network resources from the cloud host server 150 and act as virtual machines (VMs), performing the same operations as the local physical slave node 130.

The cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention is now described on the basis of the cloud system configured as above.

FIG. 6 shows the flow of data during MapReduce execution in the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention, and FIG. 7 is a flowchart showing the overall execution of the cloud driving method according to an embodiment of the present invention.

Referring to FIGS. 6 and 7, the cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system according to the present invention supports a Hadoop system for big data processing. Based on the cloud system comprising the client 110, the master node 120, the slave node 130, the cloud management server 140, and the cloud host server 150 described above, it analyzes the workload of a job requested by the client 110 in real time, at compile time or at run time, calculates the spare virtual machines (VMs) available from the virtual machine (VM) resource pool of the cloud system, and then automatically increases or decreases the number of VM nodes and reassigns and runs the job, with the aim of reducing the execution time of the job.

In this method, the job client 111 of the client 110 first loads the input file 601 for MapReduce processing and generates the input splits 602a to 602d (step S710).

The generation of the input splits 602a to 602d will now be described in more detail with reference to FIG. 8.

FIG. 8 is a flowchart showing the process of generating the input splits in the cloud driving method for supporting the auto-scaled Hadoop distributed parallel processing system according to the present invention.

Referring to FIG. 8, the job client 111 of the client 110 first reads all of the input file 601 stored in the HDFS 133 of the slave node 130 and analyzes the file, selecting the analysis method according to the file format (step S711).

Then, input split (602a to 602d) data is created that specifies in detail, according to the file format, how the input file 601, which is stored in chunk units, is to be divided (step S712).

Next, according to the file format, input split (602a to 602d) information consisting of key-value pairs that a map task can easily access is generated from the input split data (step S713).
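For illustration only, steps S712 and S713 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `InputSplit` class and the functions `make_splits` and `read_records` are hypothetical names, the input is assumed to be line-oriented text, and the key of each record is assumed to be the byte offset of its line.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass(frozen=True)
class InputSplit:
    """Hypothetical description of one split: which bytes of the file it covers."""
    path: str
    offset: int
    length: int


def make_splits(path: str, file_size: int, chunk_size: int) -> List[InputSplit]:
    """Sketch of step S712: describe one split per chunk of the input file."""
    return [
        InputSplit(path, offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def read_records(data: bytes, split: InputSplit) -> Iterator[Tuple[int, str]]:
    """Sketch of step S713: yield (key, value) = (byte offset, line text).

    Assumed convention: a split that does not start at byte 0 skips past the
    first newline it finds, and every split keeps reading lines that start at
    or before its end, so each line is emitted by exactly one split.
    """
    pos = split.offset
    end = split.offset + split.length
    if pos != 0:
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1
```

Calling `make_splits("input.txt", 17, 8)` and then `read_records` on each split of a 17-byte file yields every line once, even when a line crosses a chunk boundary. The chunk size of 8 bytes is chosen only to keep the example small.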

When generation of the input splits 602a to 602d is complete, the job tracker 121 of the master node 120 analyzes the available resources of the slave node 130 through the resource monitor unit 132 of the slave node 130, receives from the resource pool management unit 143 of the cloud management server 140 information on the VM resource pool 102 from which the VM nodes 130e to 130g can be allocated, and assigns the job to the slave node 130 and the VM nodes 130e to 130g (step S720).

This job assignment will now be described in more detail with reference to FIG. 9.

FIG. 9 is a flowchart showing the process of assigning a job to the slave node and the VM nodes in the cloud driving method for supporting the auto-scaled Hadoop distributed parallel processing system according to the present invention.

Referring to FIG. 9, the job tracker 121 of the master node 120 first analyzes information on which slave node 130 physically holds each chunk of the input file 601 in its HDFS 133, together with the available resources of the slave node 130, and calculates the number of map tasks and reduce tasks that can be assigned to the slave node 130 (step S721).

The job tracker then receives, from the resource pool management unit 143 of the cloud management server 140, information on the VM resource pool 102 from which the VM nodes 130e to 130g can be allocated, and recalculates the number of usable VM nodes 130e to 130g and the map tasks and reduce tasks that can be assigned to the VM nodes 130e to 130g (step S722).
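As an illustration of steps S721 and S722, the capacity calculation might look like the Python sketch below. The source does not state a formula, so everything here is an assumption: the `NodeResources` fields, the rule that one task needs one free core and `memory_per_task_mb` of free memory, and the policy of requesting VM nodes only for the shortfall are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NodeResources:
    """Hypothetical snapshot reported for one slave node or one pooled VM."""
    name: str
    free_cores: int
    free_memory_mb: int


def task_slots(node: NodeResources, memory_per_task_mb: int = 1024) -> int:
    """Assumed rule: one task needs one free core and a fixed amount of memory."""
    return max(0, min(node.free_cores, node.free_memory_mb // memory_per_task_mb))


def plan_capacity(
    slaves: List[NodeResources],
    vm_pool: List[NodeResources],
    pending_tasks: int,
) -> Tuple[int, List[str]]:
    """Return (slots on slave nodes, names of VM nodes to request from the pool)."""
    slave_slots = sum(task_slots(node) for node in slaves)  # step S721
    shortfall = max(0, pending_tasks - slave_slots)
    chosen: List[str] = []
    for vm in vm_pool:                                      # step S722
        if shortfall <= 0:
            break
        slots = task_slots(vm)
        if slots > 0:
            chosen.append(vm.name)
            shortfall -= slots
    return slave_slots, chosen
```

With this policy, VM nodes are requested only when the slave nodes alone cannot hold the pending tasks; other policies (for example, weighting by data locality) would fit the description equally well.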

Thereafter, for jobs that are registered in the job queue of the master node 120 but not yet assigned, the size of each job is analyzed from a data or processing perspective, and the jobs are rearranged in the queue using the information on the available VM resource pool 102 (step S723).

Next, based on the calculated information, the unassigned map tasks are assigned to the task trackers 131 of the slave node 130 and the VM nodes 130e to 130g, statically at compile time and dynamically at run time, and are executed (step S724).

After that, whether the map tasks have completed is checked periodically. If they have not, the process from step S721 to step S724, in which the available resources of the slave node are analyzed, the VM resource pool information is received, and the job is assigned to the slave node and the VM nodes, is repeated (step S725).
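The repetition described in step S725 can be pictured with the following Python sketch, which simulates assignment rounds until no map task remains. It is only an illustration under assumptions: `run_map_phase` and `vm_slots_for_round` are hypothetical names, each round is assumed to complete the tasks it was given, and the capacity is simply re-read at the start of every round to stand in for steps S721 and S722.

```python
from collections import deque
from typing import Callable, Deque, List


def run_map_phase(
    tasks: List[str],
    slave_slots: int,
    vm_slots_for_round: Callable[[int], int],
) -> List[List[str]]:
    """Assign map tasks round by round until none are left (steps S721 to S725).

    vm_slots_for_round(i) is a hypothetical callback giving the number of task
    slots the VM resource pool can offer in round i.
    """
    remaining: Deque[str] = deque(tasks)
    rounds: List[List[str]] = []
    while remaining:  # step S725: map tasks are not all complete yet
        # Steps S721 and S722: re-evaluate capacity before each assignment.
        capacity = slave_slots + vm_slots_for_round(len(rounds))
        if capacity <= 0:
            raise RuntimeError("no task slots available on slave or VM nodes")
        # Step S724: assign as many unassigned tasks as the capacity allows.
        batch = [remaining.popleft() for _ in range(min(capacity, len(remaining)))]
        rounds.append(batch)
    return rounds
```

Because the VM capacity is queried again in every round, the number of tasks per round grows or shrinks with the pool, which is the auto-scaling behaviour the loop is meant to convey.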

When the process of assigning the job to the slave node 130 and the VM nodes 130e to 130g has been completed as described above, the task tracker 131 of the slave node 130 takes the generated input splits 602a to 602d as input, assigns a map task to each of the task trackers 131 of the slave node 130 and the VM nodes 130e to 130g, and delivers the generated intermediate output to the reduce tasks (step S730).

The delivery of the generated intermediate output to the reduce tasks will now be described in more detail with reference to FIG. 10.

FIG. 10 is a flowchart showing the process of delivering the generated intermediate output to the reduce tasks in the cloud driving method for supporting the auto-scaled Hadoop distributed parallel processing system according to the present invention.

Referring to FIG. 10, the task trackers 131 of the slave node 130 and the VM nodes 130e to 130g first execute the map tasks, applying the user-defined job in accordance with the key-value input format, to generate intermediate output (step S731).

Then, the partition management unit 131b of the task tracker 131 of each slave node 130 and VM node 130e to 130g exchanges and merges the intermediate output of the running map tasks according to key value (step S732).

Next, the intermediate output generated by executing the map tasks is distributed and delivered to the reduce tasks running on the task trackers 131 of the slave node 130 and the VM nodes 130e to 130g (step S733).
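Steps S731 to S733 can be illustrated with the Python sketch below. It is a sketch under assumptions rather than the method of the invention: the user-defined job is assumed to be a word count, `run_map_task`, `partition`, and `shuffle` are hypothetical names, and hashing the key with CRC32 is one assumed way of deciding which reduce task receives a key.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, Iterable, Iterator, List, Tuple


def run_map_task(records: Iterable[Tuple[int, str]]) -> Iterator[Tuple[str, int]]:
    """Step S731 (sketch): turn (offset, line) records into (word, 1) pairs."""
    for _offset, line in records:
        for word in line.split():
            yield word, 1


def partition(key: str, num_reducers: int) -> int:
    """Pick a reduce task for a key; a stable hash keeps equal keys together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(
    map_output: Iterable[Tuple[str, int]], num_reducers: int
) -> List[DefaultDict[str, List[int]]]:
    """Steps S732 and S733 (sketch): merge values by key, one bucket per reducer."""
    buckets: List[DefaultDict[str, List[int]]] = [
        defaultdict(list) for _ in range(num_reducers)
    ]
    for key, value in map_output:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets
```

The property that matters is that all values for a given key end up in the same bucket, so a single reduce task sees the complete list for that key regardless of which slave node or VM node produced the values.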

After delivery of the generated intermediate output to the reduce tasks is complete, when a map task assigned to the task tracker 131 of a VM node 130e to 130g finishes, the resources of the corresponding VM node 130e to 130g are returned to the VM resource pool 102 through the resource pool management unit 143 of the cloud management server 140 (step S740).

The task trackers 131 of the slave node 130 and the VM nodes 130e to 130g then execute the reduce tasks with the intermediate output as input and store the result data in the HDFS 133 of the slave node 130 (step S750).

The storing of the result data in the HDFS 133 of the slave node 130 will now be described in more detail with reference to FIG. 11.

FIG. 11 is a flowchart showing the process of storing the result data in the HDFS of the slave node in the cloud driving method for supporting the auto-scaled Hadoop distributed parallel processing system according to the present invention.

Referring to FIG. 11, the partition management units 131b of the task trackers 131 of the slave node 130 and the VM nodes 130e to 130g first sort and aggregate the intermediate output that has been distributed and delivered to them (step S751).

Then, the task trackers 131 of the slave node 130 and the VM nodes 130e to 130g execute the reduce tasks with the aggregated intermediate output as input and store the output files 603a and 603b in the HDFS 133 of the slave node 130 (step S752).
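A minimal Python sketch of steps S751 and S752 follows, again assuming a word-count job. The names `run_reduce_task` and `format_output` are hypothetical, and writing to the HDFS is replaced by building the text of an output file in memory, since the storage call itself is outside what the description specifies.

```python
from typing import Dict, List, Tuple


def run_reduce_task(bucket: Dict[str, List[int]]) -> List[Tuple[str, int]]:
    """Steps S751 and S752 (sketch): sort the keys, then sum the values per key."""
    return [(key, sum(values)) for key, values in sorted(bucket.items())]


def format_output(pairs: List[Tuple[str, int]]) -> str:
    """Render the result as tab-separated lines, standing in for an output file."""
    return "".join(f"{key}\t{value}\n" for key, value in pairs)
```

Each reduce task would produce one such output file, which corresponds to the output files 603a and 603b in the description.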

After the result data has been stored in the HDFS 133 of the slave node 130 in this way, when a reduce task assigned to the task tracker 131 of a VM node finishes, the resources of the corresponding VM node 130e to 130g are returned to the VM resource pool 102 through the resource pool management unit 143 of the cloud management server 140 (step S760).
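The return of VM node resources in steps S740 and S760 can be pictured as a simple lease-and-release pool, sketched below in Python. The class `VMResourcePool` and its methods are hypothetical and model only the bookkeeping; how the cloud management server actually creates or destroys virtual machines is not covered.

```python
from typing import Iterable, Optional, Set


class VMResourcePool:
    """Hypothetical bookkeeping for the VM resource pool 102."""

    def __init__(self, vm_ids: Iterable[str]) -> None:
        self._free: Set[str] = set(vm_ids)
        self._leased: Set[str] = set()

    @property
    def free_count(self) -> int:
        return len(self._free)

    def acquire(self) -> Optional[str]:
        """Lease one VM node for a task, or return None if the pool is empty."""
        if not self._free:
            return None
        vm_id = min(self._free)  # deterministic choice, for the example only
        self._free.remove(vm_id)
        self._leased.add(vm_id)
        return vm_id

    def release(self, vm_id: str) -> None:
        """Return a VM node to the pool when its map or reduce task finishes."""
        if vm_id not in self._leased:
            raise KeyError(f"{vm_id} is not currently leased")
        self._leased.remove(vm_id)
        self._free.add(vm_id)
```

Releasing a node as soon as its task ends, rather than at the end of the whole job, is what lets other jobs pick up the freed capacity in their next assignment round.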

As described above, the cloud system supporting the auto-scaled Hadoop distributed parallel processing system according to the present invention analyzes, in real time, the computing resources required for the data to be processed and increases or decreases the computing resources in real time, which has the advantage that computing resources can be used efficiently.

In addition, the cloud system supporting the auto-scaled Hadoop distributed parallel processing system according to the present invention configures, through the cloud system, a separate job cluster for each MapReduce job and thereby provides effective multi-job scheduling, which has the advantage that MapReduce performance can be improved.

The present invention has been described in detail above through preferred embodiments. However, the invention is not limited to these embodiments, and it will be apparent to those of ordinary skill in the art that various changes and applications are possible without departing from the technical spirit of the invention. Accordingly, the true scope of protection of the present invention should be interpreted according to the following claims, and all technical ideas within a scope equivalent to the claims should be construed as falling within the scope of the rights of the present invention.

102: VM resource pool 104: physical server farm
110: client 111: job client
112: communication interface 120: master node
121: job tracker 121a: job management unit
121b: resource analysis unit 121c: job analysis unit
122: master node management unit 123: communication interface
130, 130a to 130c: slave nodes 130e to 130g: VM nodes
131: task tracker 131a: task management unit
131b: partition management unit 132: resource monitor unit
133: HDFS 134: slave node management unit
135: communication interface 140: cloud management server
141: VM node connection unit 142: cluster management unit
143: resource pool management unit 144: communication interface
150, 150a to 150c: cloud host servers 151: VM node management unit
152: virtualization management unit 153: cluster agent unit
154: resource monitor unit 155: communication interface
160: network 601: input file
602a to 602d: input splits 603a, 603b: output files

Claims

A cloud driving method for supporting an auto-scaled Hadoop distributed parallel processing system, the method supporting a Hadoop system for big data processing and being based on a cloud system comprising a client, a master node, a slave node, a cloud management server, and a cloud host server, wherein the workload of a job requested by the client is analyzed in real time at compile time or run time, spare virtual machines (VMs) required for the workload are calculated from a virtual machine (VM) resource pool of the cloud system, and VM nodes are then automatically increased or decreased and the job is reassigned and run, the method comprising:
a) loading, by the client, an input file to generate an input split;
b) analyzing, by the master node, available resources of the slave node, receiving VM resource pool information from the cloud management server, and assigning a job to the slave node and a VM node;
c) receiving, by the slave node, the generated input split as input, assigning a map task to each of the task trackers of the slave node and the VM node, and delivering the generated intermediate output to a reduce task;
d) returning, when the assigned map task is finished, the resources of the corresponding VM node to the VM resource pool through the cloud management server;
e) executing, by the task trackers of the slave node and the VM node, the reduce task with the intermediate output as input, and storing result data in the HDFS of the slave node; and
f) returning, when the assigned reduce task is finished, the resources of the corresponding VM node to the VM resource pool through the cloud management server.

The method according to claim 1,
wherein the generating of the input split in step a) comprises:
a-1) reading, by a job client of the client, all of the input file stored in the HDFS of the slave node, and analyzing the file using an analysis method selected according to the file format;
a-2) creating input split data specifying in detail, according to the file format, how to divide the input file stored in chunk units; and
a-3) generating, from the input split data and according to the file format, input split information consisting of key-value pairs that the map task can easily access.

The method according to claim 1,
wherein the assigning of the job to the slave node and the VM node in step b) comprises:
b-1) analyzing, by a job tracker of the master node, information on which slave node physically holds each chunk of the input file in its HDFS, together with the available resources of the slave node, and calculating the number of map tasks and reduce tasks that can be assigned to the slave node;
b-2) receiving, from a resource pool management unit of the cloud management server, VM resource pool information from which the VM node can be allocated, and recalculating the number of usable VM nodes and the map tasks and reduce tasks that can be assigned to the VM nodes;
b-3) analyzing, from a data or processing perspective, the size of each job that is registered in a job queue of the master node but not yet assigned, and rearranging the jobs in the queue using the information on the available VM resource pool;
b-4) assigning, based on the calculated information, the unassigned map tasks to the task trackers of the slave node and the VM node, statically at compile time and dynamically at run time, and executing the map tasks; and
b-5) periodically checking whether the map tasks have completed and, if they have not, repeating the process from step b-1) to step b-4) of analyzing the available resources of the slave node, receiving the VM resource pool information, and assigning the job to the slave node and the VM node.

The method according to claim 1,
wherein the delivering of the generated intermediate output to the reduce task in step c) comprises:
c-1) executing, by the task trackers of the slave node and the VM node, the map task in accordance with a key-value input format to generate intermediate output;
c-2) exchanging and merging, by a partition management unit of each of the slave node and the VM node, the intermediate output of the running map tasks according to key value; and
c-3) distributing and delivering the intermediate output generated by executing the map task to the reduce tasks running on the task trackers of the slave node and the VM node.

The method according to claim 1,
wherein the storing of the result data in the HDFS of the slave node in step e) comprises:
e-1) sorting and aggregating, by the partition management units of the slave node and the VM node, the intermediate output distributed and delivered to them; and
e-2) executing, by the task trackers of the slave node and the VM node, the reduce task with the aggregated intermediate output as input, and storing an output file in the HDFS of the slave node.